Kokoros: An 82M-Parameter TTS Engine in Rust That Achieves 1-2 Second Streaming Latency
Hook
Many production TTS systems measure time-to-first-audio in seconds, sometimes tens of seconds. Kokoros delivers first audio in under two seconds while running an 82-million-parameter neural vocoder entirely in Rust with zero Python runtime overhead.
Context
Text-to-speech has traditionally forced an uncomfortable trade-off: either accept heavyweight Python-based solutions with superior quality but glacial startup times, or settle for fast but robotic-sounding C++ engines. The Kokoro model from hexgrad changed this calculus by delivering impressive voice quality from just 82M parameters, but the reference implementation still carried Python baggage. Kokoros bridges this gap by reimplementing the entire inference pipeline in Rust, eliminating the espeak-ng system dependency through a self-contained phonemizer, and adding production-ready features like OpenAI-compatible streaming APIs. The project targets developers building real-time applications—voice assistants, ASMR generators, digital humans—where low latency matters more than supporting fifty languages. For English speech synthesis, Kokoros offers a truly self-contained, dependency-light option that doesn’t sacrifice audio quality for speed.
Technical Insight
The architecture revolves around three components working in concert: a custom phonemizer, ONNX Runtime for model execution, and Opus encoding for efficient audio delivery. Unlike traditional TTS pipelines that shell out to espeak-ng for text preprocessing, Kokoros implements phonemization directly in Rust. This eliminates subprocess overhead and makes the entire pipeline embeddable in environments where installing system packages isn’t an option. The phonemizer converts raw text into phonetic representations that the neural vocoder expects, handling language-specific quirks without external tooling.
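The repository doesn’t document the phonemizer’s internals, but conceptually it is a grapheme-to-phoneme mapping applied before inference. A minimal std-only Rust sketch of that idea follows; the lexicon entries and IPA strings here are illustrative assumptions, not Kokoros’ actual tables or API:

```rust
use std::collections::HashMap;

/// Minimal grapheme-to-phoneme lookup: dictionary hit first, with a
/// pass-through fallback for unknown words. The entries used with it
/// are illustrative IPA, not Kokoros' real phoneme inventory.
fn phonemize(text: &str, lexicon: &HashMap<&str, &str>) -> String {
    text.split_whitespace()
        .map(|word| {
            // Strip punctuation and normalize case before lookup.
            let key = word
                .trim_matches(|c: char| !c.is_alphanumeric())
                .to_lowercase();
            lexicon
                .get(key.as_str())
                .map(|p| p.to_string())
                .unwrap_or(key) // fallback: pass the word through unchanged
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let lexicon = HashMap::from([("hello", "həˈloʊ"), ("world", "ˈwɝld")]);
    println!("{}", phonemize("Hello, world!", &lexicon));
}
```

A real implementation layers letter-to-sound rules and stress assignment on top of the dictionary, which is where the language-specific complexity lives.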
The model itself is served as a standard ONNX file (kokoro-v1.0.onnx), which you download via the provided scripts. Inference happens through ONNX Runtime’s Rust bindings, generating 24kHz audio samples. Installation is straightforward—on macOS you need pkg-config and opus via Homebrew, then bash download_all.sh pulls the model and voice data, and cargo build --release produces the koko binary. The CLI interface exposes both single-text and batch-file modes:
./target/release/koko text "Hello, this is a TTS test"
./target/release/koko file poem.txt
For production deployments, Kokoros provides an OpenAI-compatible HTTP server with streaming support. The streaming implementation chunks text intelligently to maintain natural prosody while achieving 1-2 second time-to-first-audio. Clients can request "stream": true in their API calls and receive audio chunks progressively, identical to how OpenAI’s TTS API behaves. This compatibility means you can drop Kokoros into existing infrastructure that already integrates with OpenAI’s endpoints with minimal code changes.
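The README doesn’t spell out the chunking heuristic, but the core idea, splitting at sentence boundaries so the first chunk can be synthesized and shipped while later chunks are still in flight, can be sketched as follows. This is a simplified assumption, not the project’s actual algorithm:

```rust
/// Split text into sentence-sized chunks so the first chunk can be
/// vocoded and streamed immediately. A simplified sketch; Kokoros'
/// real chunker likely also respects commas and maximum chunk length
/// to preserve prosody.
fn chunk_for_streaming(text: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        current.push(ch);
        // Sentence-final punctuation closes the current chunk.
        if matches!(ch, '.' | '!' | '?') {
            let chunk = current.trim().to_string();
            if !chunk.is_empty() {
                chunks.push(chunk);
            }
            current.clear();
        }
    }
    // Flush any trailing text without final punctuation.
    let tail = current.trim();
    if !tail.is_empty() {
        chunks.push(tail.to_string());
    }
    chunks
}

fn main() {
    for c in chunk_for_streaming("First audio ships fast. The rest follows!") {
        println!("chunk: {c}");
    }
}
```

Smaller first chunks reduce time-to-first-audio at the cost of more synthesis calls; the 1-2 second figure reflects that balance.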
One particularly clever feature is style mixing, specified through the voice parameter: af_sky.4+af_nicole.5 blends two voice styles with weighted coefficients. This works because Kokoros represents voices as embeddings loaded from a binary file (voices-v1.0.bin). The mixing happens in embedding space before vocoding, allowing arbitrary combinations without retraining models. The resulting ASMR-like qualities shown in the repository’s demo videos come from these mixed embeddings creating novel prosodic characteristics.
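Weighted blending of style embeddings is plain vector arithmetic. A hedged sketch of what a spec like af_sky.4+af_nicole.5 plausibly computes; the 3-element vectors stand in for the real embedding dimensionality, which isn’t documented here:

```rust
/// Blend voice style embeddings by weighted sum, the operation implied
/// by a voice spec like "af_sky.4+af_nicole.5". Vectors here are
/// toy-sized; the real embeddings from voices-v1.0.bin are much larger.
fn mix_styles(styles: &[(Vec<f32>, f32)]) -> Vec<f32> {
    let dim = styles[0].0.len();
    let mut out = vec![0.0f32; dim];
    for (embedding, weight) in styles {
        for (o, v) in out.iter_mut().zip(embedding) {
            *o += weight * v;
        }
    }
    out
}

fn main() {
    let af_sky = vec![1.0, 0.0, 2.0];
    let af_nicole = vec![0.0, 2.0, 2.0];
    // 0.4 * af_sky + 0.5 * af_nicole
    let mixed = mix_styles(&[(af_sky, 0.4), (af_nicole, 0.5)]);
    println!("{mixed:?}");
}
```

Because the blend happens before the vocoder runs, any weighted combination costs nothing extra at inference time.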
The timestamps feature uses a specialized ONNX model variant that outputs word-level timing information alongside audio. Enabling --timestamps generates a TSV sidecar file with three columns: word, start_sec, end_sec. This capability is critical for applications needing lip-sync or captioning synchronization:
./target/release/koko text \
--output tmp/output.wav \
--timestamps \
"Hello from the timestamped model"
This produces both tmp/output.wav and tmp/output.tsv, with timings precise to milliseconds at the 24kHz sample rate. The timestamp model is a separate download from Hugging Face’s ONNX Community space, demonstrating Kokoros’ modularity in supporting different model variants.
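Consuming the sidecar file is a few lines of parsing. A small sketch that turns the three-column TSV into (word, start, end) tuples; the column layout follows the README, and error handling is deliberately minimal:

```rust
/// Parse the word-level timestamp TSV emitted alongside the WAV:
/// one row per word, columns word \t start_sec \t end_sec.
/// Rows that fail to parse (e.g. a header line) are skipped.
fn parse_timestamps(tsv: &str) -> Vec<(String, f64, f64)> {
    tsv.lines()
        .filter_map(|line| {
            let mut cols = line.split('\t');
            let word = cols.next()?.to_string();
            let start: f64 = cols.next()?.parse().ok()?;
            let end: f64 = cols.next()?.parse().ok()?;
            Some((word, start, end))
        })
        .collect()
}

fn main() {
    // Illustrative rows, not actual model output.
    let tsv = "Hello\t0.000\t0.410\nfrom\t0.410\t0.620";
    for (word, start, end) in parse_timestamps(tsv) {
        println!("{word}: {start}-{end}");
    }
}
```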
Parallel processing infrastructure exists via the --instances flag on the HTTP server, spawning multiple inference instances to handle concurrent requests. However, benchmarks on Mac M2 show diminishing returns beyond two instances when running CPU-only. The 82M parameter count keeps single-instance throughput high enough that most deployments won’t saturate a modern CPU core. GPU acceleration isn’t yet exposed through the Rust bindings, though ONNX Runtime supports it in principle—this is where Python-based alternatives still hold an edge for batch processing workloads.
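The --instances fan-out can be pictured as a worker pool pulling jobs from a shared channel. A std-only sketch of that pattern, not Kokoros’ actual server code; the "inference" is a stand-in string operation:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Fan jobs out to a fixed pool of worker threads over a shared
/// channel, the pattern behind an `--instances`-style flag. Results
/// arrive in completion order, not submission order.
fn run_pool(instances: usize, jobs: Vec<String>) -> Vec<String> {
    let (job_tx, job_rx) = mpsc::channel::<String>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (out_tx, out_rx) = mpsc::channel::<String>();

    let mut handles = Vec::new();
    for _ in 0..instances {
        let rx = Arc::clone(&job_rx);
        let tx = out_tx.clone();
        handles.push(thread::spawn(move || loop {
            // Hold the lock only long enough to take one job.
            let job = { rx.lock().unwrap().recv() };
            match job {
                Ok(text) => {
                    // Stand-in for one inference call.
                    tx.send(format!("synthesized: {text}")).unwrap();
                }
                Err(_) => break, // channel closed: no more jobs
            }
        }));
    }
    drop(out_tx);

    let n = jobs.len();
    for j in jobs {
        job_tx.send(j).unwrap();
    }
    drop(job_tx); // closing the channel lets idle workers exit

    let results: Vec<String> = out_rx.iter().take(n).collect();
    for h in handles {
        h.join().unwrap();
    }
    results
}

fn main() {
    let out = run_pool(2, vec!["hello".into(), "world".into()]);
    println!("{out:?}");
}
```

On a CPU-bound workload like this, adding workers beyond the number of cores the model can saturate yields little, which matches the M2 benchmark observation above.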
The codebase architecture separates concerns cleanly: phonemization logic is isolated in its own module, ONNX inference happens in a dedicated wrapper, and the HTTP server layer is OpenAI-protocol-aware but model-agnostic. This design makes it straightforward to swap in alternative models as they emerge—the roadmap mentions Zonos, Spark-TTS, and Orpheus-TTS integration, suggesting the maintainers view Kokoros as a general TTS runtime rather than a single-model implementation.
Gotcha
The multilingual story is weaker than the repository topics suggest. While English works impressively well, Chinese, Japanese, and German support is explicitly marked as “partly” in the README. The phonemizer was designed with English phonetics as the primary target, and extending it to handle non-Indo-European languages’ phonological systems is non-trivial. If you need production-grade Mandarin or Japanese synthesis, Coqui TTS or commercial options remain safer bets despite their performance overhead.
The project launched in January 2025, making it relatively new. The roadmap teasing Zonos, Spark-TTS, and Orpheus-TTS integration suggests active development but also hints at potential API instability. The OpenAI compatibility layer is marked as “still under polish” in the updates log, meaning breaking changes could arrive in minor version bumps. Early adopters should pin versions and expect to adjust integration code.
Parallel processing shows promise, but the benchmarks reveal CPU-bound limitations: scaling beyond two instances on M2 silicon yields minimal throughput gains, suggesting the current implementation doesn’t fully exploit modern multicore architectures. GPU inference support is conspicuously absent from the feature list, a significant gap for high-throughput batch processing where Python-based solutions can leverage CUDA or Metal acceleration.
Verdict
Use Kokoros if you’re building latency-sensitive applications in Rust where dependency minimization matters—embedded systems, CLI tools, or services that can’t tolerate Python runtime overhead. The self-contained phonemizer and 82M parameter efficiency make it ideal for edge deployment, and the OpenAI-compatible API provides a migration path from cloud TTS services. English voice quality punches well above its model size, and the streaming implementation’s sub-2-second latency enables real-time use cases that were previously hard to reach with open-source TTS. Skip it if you need battle-tested multilingual support beyond English, require GPU acceleration for batch processing, or can’t tolerate API churn in a young project. The partial Chinese/Japanese/German support isn’t ready for production, and organizations needing stability over bleeding-edge performance should stick with Coqui TTS or Piper despite their architectural trade-offs. Also pass if your infrastructure is Python-native anyway—the Rust rewrite’s benefits only materialize when the deployment environment makes dependency management painful.