Kokoros: Rust-Powered TTS That Breaks the Real-Time Barrier

Hook

Most text-to-speech systems take 5-15 seconds to generate audio. Kokoros reports a time to first audio of under 2 seconds, without a GPU, by ditching Python entirely and rebuilding the TTS infrastructure in Rust.

Context

Text-to-speech has been stuck in a performance paradox for years. Modern neural TTS models like Tacotron 2, FastSpeech, and their descendants produce remarkably human-like speech, but they're agonizingly slow. The bottleneck isn't just model size—it's the entire Python-based inference stack. Loading dependencies, initializing runtimes, and managing the GIL create latency that makes real-time applications nearly impossible without expensive GPU acceleration.

The problem compounds when you need TTS in production systems: voice assistants that feel responsive, digital avatars that don't make users wait, or accessibility tools that keep pace with reading speed. Cloud APIs like ElevenLabs solve the latency problem by throwing server farms at it, but introduce network delays, usage costs, and privacy concerns. Self-hosted Python solutions like Coqui TTS work offline but struggle to hit real-time performance on CPU-only systems. Kokoros emerged from this gap: a Rust inference engine for the Kokoro-82M model, designed specifically to make high-quality TTS fast enough for interactive applications without requiring cloud connectivity or GPU hardware.

Technical Insight

Kokoros's architecture strips TTS down to three core components: phonemization, neural inference, and audio encoding. The pipeline is driven from Rust, using ONNX Runtime for model execution and a built-in espeak-ng wrapper for text-to-phoneme conversion. The original Kokoro model is an 82-million-parameter neural network trained to turn phoneme sequences into speech audio. By keeping the model relatively small and relying on ONNX Runtime's optimized execution, Kokoros achieves fast inference even on CPU.

The library exposes both synchronous and streaming APIs. Here's how you'd generate speech programmatically:

use kokoros::{Kokoros, KokorosConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize with default configuration
    let config = KokorosConfig::default();
    let tts = Kokoros::new(config)?;
    
    // Generate speech from text
    let audio_data = tts.synthesize(
        "Hello, this is Kokoros speaking at incredible speed.",
        "af_heart" // Voice style identifier
    )?;
    
    // audio_data is raw PCM at 24kHz
    std::fs::write("output.pcm", audio_data)?;
    Ok(())
}
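The example above writes headerless PCM, which most audio players will not open directly. The following is a minimal sketch of wrapping such a buffer in a WAV container using only the standard library. It assumes the samples are 16-bit signed little-endian mono; the article only states the 24 kHz rate, so check the actual sample format before relying on this. `wav_from_pcm` is a hypothetical helper, not part of Kokoros.

```rust
/// Wrap raw PCM bytes in a minimal 44-byte RIFF/WAVE header.
/// Assumption: samples are 16-bit signed little-endian integers.
fn wav_from_pcm(pcm: &[u8], sample_rate: u32, channels: u16) -> Vec<u8> {
    let bits_per_sample: u16 = 16;
    let block_align = channels * bits_per_sample / 8; // bytes per sample frame
    let byte_rate = sample_rate * u32::from(block_align); // bytes per second
    let data_len = pcm.len() as u32;

    let mut wav = Vec::with_capacity(44 + pcm.len());
    wav.extend_from_slice(b"RIFF");
    wav.extend_from_slice(&(36 + data_len).to_le_bytes()); // file size minus 8
    wav.extend_from_slice(b"WAVE");
    wav.extend_from_slice(b"fmt ");
    wav.extend_from_slice(&16u32.to_le_bytes()); // size of the fmt chunk
    wav.extend_from_slice(&1u16.to_le_bytes()); // format tag 1 = integer PCM
    wav.extend_from_slice(&channels.to_le_bytes());
    wav.extend_from_slice(&sample_rate.to_le_bytes());
    wav.extend_from_slice(&byte_rate.to_le_bytes());
    wav.extend_from_slice(&block_align.to_le_bytes());
    wav.extend_from_slice(&bits_per_sample.to_le_bytes());
    wav.extend_from_slice(b"data");
    wav.extend_from_slice(&data_len.to_le_bytes());
    wav.extend_from_slice(pcm);
    wav
}

fn main() {
    // One second of silence at 24 kHz, 16-bit mono: 24_000 samples * 2 bytes.
    let pcm = vec![0u8; 48_000];
    let wav = wav_from_pcm(&pcm, 24_000, 1);
    println!("wrote {} bytes including the header", wav.len());
}
```

Under the same assumptions, the clip length in seconds is the PCM byte count divided by 48,000 (24,000 samples per second times 2 bytes per sample).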

The real innovation shows up in streaming mode, where Kokoros generates audio chunks as the model processes text, enabling time-to-first-audio under 2 seconds. The streaming implementation uses Rust's async ecosystem to pipeline phonemization, inference, and audio encoding:

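// StreamExt provides .next(); this assumes the returned stream implements futures::Stream
use futures::StreamExt;
use kokoros::{Kokoros, KokorosConfig, StreamConfig};
use tokio::io::AsyncWriteExt;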

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tts = Kokoros::new(KokorosConfig::default())?;
    let mut stream = tts.synthesize_stream(
        "This sentence will start playing before it finishes generating.",
        "af_heart",
        StreamConfig::default()
    )?;
    
    let mut output = tokio::fs::File::create("streaming.opus").await?;
    
    while let Some(chunk) = stream.next().await {
        let audio_chunk = chunk?;
        output.write_all(&audio_chunk).await?;
        // Audio can be played immediately
    }
    
    Ok(())
}

The phonemizer integration is particularly clever. Rather than spawning external espeak-ng processes (which would destroy performance), Kokoros embeds espeak-ng bindings directly into the Rust binary. This eliminates process-spawning overhead and enables the end-to-end (E2E) inference mode, where text goes in and audio comes out without touching disk or external executables. The phonemization step also normalizes text variations ("Dr." becomes "doctor", "1990" becomes "nineteen ninety"), ensuring consistent model input.
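To make the normalization idea concrete, here is a toy sketch of the two rewrites mentioned above. It is not how espeak-ng or Kokoros implement normalization; the abbreviation table, the year range, and the function names are all assumptions chosen for illustration.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
];
const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
];

/// Spell out a number in 0..=99.
fn two_digits(n: u32) -> String {
    match n {
        0..=19 => ONES[n as usize].to_string(),
        _ if n % 10 == 0 => TENS[(n / 10) as usize].to_string(),
        _ => format!("{}-{}", TENS[(n / 10) as usize], ONES[(n % 10) as usize]),
    }
}

/// Read a year in 1100..=1999 as two pairs of digits ("nineteen ninety").
fn year_words(year: u32) -> String {
    let (hi, lo) = (year / 100, year % 100);
    match lo {
        0 => format!("{} hundred", two_digits(hi)),
        1..=9 => format!("{} oh {}", two_digits(hi), two_digits(lo)),
        _ => format!("{} {}", two_digits(hi), two_digits(lo)),
    }
}

/// Expand a few abbreviations and bare four-digit years, token by token.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .map(|token| match token {
            "Dr." => "doctor".to_string(),
            "Mr." => "mister".to_string(),
            _ => match token.parse::<u32>() {
                Ok(y) if (1100..=1999).contains(&y) => year_words(y),
                _ => token.to_string(),
            },
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    println!("{}", normalize("Dr. Smith was born in 1990"));
}
```

A real normalizer has to handle far more than this sketch does: tokens with attached punctuation ("1990."), ambiguous abbreviations ("Dr." as "Drive"), and numbers that are not years.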

Kokoros also implements style mixing, letting you blend multiple voice characteristics by interpolating between style embeddings. If you want 70% of voice A's warmth with 30% of voice B's clarity, you can specify weighted style combinations. This happens at the embedding level before inference, adding negligible overhead while dramatically expanding voice variety from a single model checkpoint.
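The blending described here amounts to a weighted average of embedding vectors. The sketch below shows that arithmetic in isolation, assuming each voice style is a flat `f32` vector of equal length. The `mix_styles` function and the toy two-dimensional embeddings are hypothetical; the article does not show how Kokoros stores or combines its style data.

```rust
/// Blend style embeddings as a weighted average.
/// Weights are normalized by their sum, so (0.7, 0.3) and (7.0, 3.0) give the same mix.
/// Returns None for an empty list, mismatched dimensions, or a non-positive weight sum.
fn mix_styles(styles: &[(&[f32], f32)]) -> Option<Vec<f32>> {
    let (first, _) = styles.first()?;
    let dim = first.len();
    let total: f32 = styles.iter().map(|(_, w)| *w).sum();
    if total <= 0.0 || styles.iter().any(|(e, _)| e.len() != dim) {
        return None;
    }

    let mut mixed = vec![0.0f32; dim];
    for (embedding, weight) in styles {
        for (out, value) in mixed.iter_mut().zip(embedding.iter()) {
            *out += value * (weight / total);
        }
    }
    Some(mixed)
}

fn main() {
    // Toy 2-dimensional "embeddings"; real style vectors are much larger.
    let warm = [1.0f32, 0.0];
    let clear = [0.0f32, 1.0];
    let mixed = mix_styles(&[(&warm[..], 0.7), (&clear[..], 0.3)]);
    println!("{:?}", mixed);
}
```

Because the mix is computed once per request on a small vector, its cost is trivial next to model inference, which is consistent with the claim of negligible overhead.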

The OpenAI-compatible server mode deserves special attention. It implements the /v1/audio/speech endpoint, making Kokoros a drop-in replacement for applications already using OpenAI's TTS API. This means you can self-host voice generation for existing tools without changing client code—just point the API base URL at your Kokoros instance. The server supports concurrent requests with per-request isolation, though parallel processing on CPU shows diminishing returns beyond 4-6 concurrent streams on typical hardware.
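As a rough illustration of what a client sends to such an endpoint, the sketch below builds a raw HTTP request for `/v1/audio/speech` with the standard library only. The `model`, `input`, `voice`, and `response_format` fields come from OpenAI's speech API. Everything else is an assumption: the host and port are placeholders, and whether a Kokoros server inspects the `model` value or supports the `wav` format is not confirmed by the article.

```rust
/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a complete HTTP/1.1 POST request for the speech endpoint.
/// `host` is a placeholder such as "localhost:3000" (hypothetical address).
fn speech_request(host: &str, input: &str, voice: &str) -> String {
    let body = format!(
        r#"{{"model":"tts-1","input":"{}","voice":"{}","response_format":"wav"}}"#,
        json_escape(input),
        json_escape(voice)
    );
    format!(
        "POST /v1/audio/speech HTTP/1.1\r\nHost: {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        host,
        body.len(),
        body
    )
}

fn main() {
    let request = speech_request("localhost:3000", "Hello from a self-hosted server.", "af_heart");
    println!("{request}");
}
```

To actually send it, the string can be written to a `std::net::TcpStream` connected to the server, with the audio read back from the response body. In practice an HTTP client crate or an existing OpenAI SDK pointed at the new base URL is the more robust choice.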

Word-level timestamps are generated via TSV sidecar files, mapping each word to its millisecond-precise position in the output audio. This enables applications like karaoke-style highlighting, subtitle generation, or precise audio editing without manual alignment. The timestamp generation adds minimal overhead because it's derived from the model's attention mechanism during inference rather than requiring separate alignment passes.
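Consuming a sidecar like this is mostly a parsing exercise. The sketch below assumes a three-column layout (word, start, end) with times in seconds and an optional header row. That layout is a guess for illustration; check the real file before relying on it. `WordTiming`, `parse_timestamps`, and `word_at` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
struct WordTiming {
    word: String,
    start_sec: f64,
    end_sec: f64,
}

/// Parse tab-separated rows of `word<TAB>start<TAB>end`.
/// Rows whose time columns are missing or non-numeric (such as a header) are skipped.
fn parse_timestamps(tsv: &str) -> Vec<WordTiming> {
    tsv.lines()
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let word = cols.next()?.trim();
            let start_sec = cols.next()?.trim().parse::<f64>().ok()?;
            let end_sec = cols.next()?.trim().parse::<f64>().ok()?;
            if word.is_empty() {
                return None;
            }
            Some(WordTiming { word: word.to_string(), start_sec, end_sec })
        })
        .collect()
}

/// Find the word being spoken at playback time `t`, e.g. for karaoke-style highlighting.
fn word_at(timings: &[WordTiming], t: f64) -> Option<&str> {
    timings
        .iter()
        .find(|w| t >= w.start_sec && t < w.end_sec)
        .map(|w| w.word.as_str())
}

fn main() {
    let tsv = "word\tstart_sec\tend_sec\nHello\t0.00\t0.42\nworld\t0.42\t0.90\n";
    let timings = parse_timestamps(tsv);
    println!("{:?}", word_at(&timings, 0.5));
}
```

For long transcripts, a binary search over the start times would replace the linear scan in `word_at`.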

Gotcha

The multilingual story is messier than the documentation suggests. While Kokoros claims support for Chinese, Japanese, and German, the implementation is marked "partly" complete, and testing reveals inconsistent quality outside English. The phonemizer relies on espeak-ng language data, which varies widely in accuracy across languages. Chinese pinyin conversion works reasonably well, but tonal accuracy suffers. Japanese pronunciation is hit-or-miss with katakana loan words. If your application needs production-quality non-English TTS, you'll likely need to fall back to language-specific models or services like Azure's multilingual offerings.

Parallel processing performance is disappointing on CPU systems. The benchmarks on M2 Mac show that splitting inference across multiple threads provides minimal speedup, sometimes even slowing things down due to thread coordination overhead and memory bandwidth saturation. The ONNX Runtime's CPU execution provider doesn't parallelize well for this model architecture, which appears to be memory-bandwidth-bound rather than compute-bound. GPU support would help, but isn't currently prioritized. This limits horizontal scaling: you can't just throw more cores at the problem to handle higher request volumes.

The project is also extremely young, with the initial commit in January 2025. Expect API changes, incomplete error handling in edge cases, and sparse documentation for advanced features. Early adopters should pin to specific commit hashes and be prepared to update integration code as the project matures.

Verdict

Use if: You're building Rust applications that need real-time voice synthesis (voice assistants, game NPCs, accessibility tools) where sub-2-second latency matters and English is your primary language. The OpenAI API compatibility makes it perfect for self-hosting cost-sensitive projects currently using cloud TTS. The streaming mode and built-in phonemizer are game-changers for responsive UIs.

Skip if: You need robust multilingual support beyond English, require API stability for enterprise deployment, or work primarily in Python where ecosystem integration matters more than raw speed. For multilingual production work, stick with Piper TTS or cloud services. For Python projects prioritizing ease of use over performance, Coqui TTS remains more mature. And if you need state-of-the-art quality with emotion control and don't care about latency, Bark or ElevenLabs are better choices despite their speed/cost trade-offs.