Kyutai STT: Real-Time Transcription with Low Latency Streaming

[Figure: Kyutai STT architecture showing Mimi and Moshi components for real-time transcription]

Introduction

Speech‑to‑text (STT) technology is undergoing a revolution with the emergence of true streaming systems. Kyutai STT, an open‑source offering by Kyutai Labs, pioneers this shift using a novel “delayed‑streams modeling” approach—delivering simultaneous audio and text streams with built‑in semantic voice activity detection (VAD). In this blog post, we’ll explore what makes Kyutai STT revolutionary: its architecture, model variants, performance, and real-world applications.

What Is Kyutai STT?

Streaming Speech‑to‑Text Architecture
Unlike offline models, Kyutai STT transcribes audio as it’s spoken, chunk by chunk. This streaming capability enables real-time transcription with minimal buffering.

Delayed‑Streams Modeling
At the heart lies delayed‑streams modeling: audio and text run as parallel streams, with the text stream deliberately delayed relative to the audio (e.g., 0.5 s for the 1B model, 2.5 s for the 2.6B model). The delay gives the model a short window of audio lookahead at each step, enabling fast, continuous transcription.
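As a rough illustration, the advertised delays translate into a small, fixed number of audio-token frames of lookahead. The 12.5 Hz frame rate used below is an assumption drawn from the Mimi codec, not stated in this post:

```python
# Toy arithmetic: how many audio-token frames of lookahead each text-stream
# delay corresponds to. The 12.5 Hz frame rate is an assumption about Mimi.
FRAME_RATE_HZ = 12.5

def delay_in_frames(delay_seconds: float) -> int:
    """Convert a text-stream delay in seconds to whole audio frames."""
    return round(delay_seconds * FRAME_RATE_HZ)

for delay in (0.5, 2.5):
    print(f"{delay:>4} s delay -> {delay_in_frames(delay)} frames of lookahead")
```

So the lookahead is only a handful of frames, which is why the model can stay responsive while still seeing a little of the future.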

The model is available in variants like:

  • stt-1b-en_fr: bilingual English/French model
  • stt-2.6b-en: English-only, higher-capacity model

Kyutai also introduces a new audio codec, Mimi, along with Moshi, the language model component. Together, they reduce model latency and improve streaming accuracy.

How Kyutai STT Works

Kyutai’s streaming model relies on two core components:

Mimi (Audio Encoder)

Mimi is a low-latency neural audio codec that encodes raw waveforms into compressed audio tokens. It processes 24 kHz audio in real time while maintaining high fidelity, framing the waveform into a low-rate token stream that keeps the downstream model's work per step small.
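The framing arithmetic is worth making concrete. The 24 kHz sample rate is stated above; the 12.5 Hz token frame rate is an assumption drawn from the Mimi codec paper, not from this post:

```python
# Back-of-the-envelope framing for Mimi: how many waveform samples make up
# one audio-token frame. 12.5 Hz is an assumed frame rate for the codec.
SAMPLE_RATE = 24_000  # Hz, Mimi's input rate
FRAME_RATE = 12.5     # audio-token frames per second (assumed)

frame_size = int(SAMPLE_RATE / FRAME_RATE)  # samples per audio-token frame

def complete_frames(num_samples: int) -> int:
    """How many whole frames fit in a waveform of num_samples samples."""
    return num_samples // frame_size

print(f"samples per frame: {frame_size}")  # 1920 samples, i.e. 80 ms of audio
print(f"frames in 10 s of audio: {complete_frames(10 * SAMPLE_RATE)}")
```

Under these assumptions, each token frame covers 80 ms of audio, so ten seconds of speech becomes just 125 frames for the language model to consume.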

Moshi (Language Model)

Moshi is a transformer-based language model that decodes Mimi’s tokenized outputs into natural language text. It is fine-tuned to handle noisy, partially spoken inputs with minimal delay, using a custom decoding strategy called “delayed streams modeling.”

This setup enables seamless end-to-end transcription with a near-human level of responsiveness.

Technical Implementation and Architecture

Delayed Streams Modeling Explained

Kyutai STT’s capabilities rest on its innovative delayed streams modeling (DSM) approach. This technique represents a fundamental departure from the traditional encoder-decoder architectures used in models like Whisper. Instead of consuming a complete audio sequence and then generating text, DSM treats audio and text as parallel, time-aligned streams.

In this architecture, audio and text streams exist “next to” each other rather than in sequence. The text stream is strategically padded and delayed by a few frames, providing the model with sufficient lookahead to make accurate predictions while maintaining real-time performance. During training, the model learns to predict both streams simultaneously, and during inference, the audio stream remains fixed while the model predicts the text stream in real-time.
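The padding-and-delay alignment can be pictured with a toy sketch. The tokens below are stand-ins, not the real vocabulary, and the padding symbols are invented for illustration:

```python
# Toy illustration of delayed-streams alignment: audio and text run as
# parallel, time-aligned streams, with the text stream shifted right by
# `delay` frames. At step t the model hears audio up to t but only has to
# emit the text for step t - delay. Tokens are stand-ins, not real vocab.
PAD = "<pad>"

def align_streams(audio_frames, text_tokens, delay):
    """Pair each audio frame with the (delayed) text token predicted at
    that step. The first `delay` steps predict only padding."""
    delayed_text = [PAD] * delay + text_tokens
    # Extend audio with silence-like padding so the tail of the text fits.
    padded_audio = audio_frames + ["<silence>"] * delay
    return list(zip(padded_audio, delayed_text))

audio = ["a0", "a1", "a2", "a3"]
text = ["t0", "t1", "t2", "t3"]
for step, (a, t) in enumerate(align_streams(audio, text, delay=2)):
    print(f"step {step}: hear {a:>9}  ->  emit {t}")
```

The delay is what buys the model its lookahead: by the time it must commit to token `t0`, it has already heard two extra frames of audio.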

This approach offers remarkable symmetry, with the potential for text-to-speech generation by simply reversing the delay pattern, keeping text fixed, and predicting audio. This architectural elegance suggests future possibilities for unified multimodal AI systems.

Multiple Implementation Options

Kyutai STT’s versatility is demonstrated through its multiple implementation options, each optimized for different use cases and environments:

PyTorch Implementation: Ideal for researchers and developers conducting experiments or integrating the model into Python-based applications. This implementation provides full access to the model’s capabilities with the flexibility of the PyTorch ecosystem.

Rust Server Implementation: Designed for production environments requiring maximum performance and reliability. The Rust server provides robust WebSocket-based streaming access and is the same implementation used in Unmute, Kyutai’s conversational AI application.

MLX Implementation: Optimized for Apple Silicon hardware, allowing Mac and iOS developers to leverage hardware acceleration for efficient on-device processing. This implementation opens possibilities for privacy-focused applications that keep speech processing entirely local.

Step-by-Step Tutorial: Run Kyutai STT Locally

This tutorial demonstrates how to run Kyutai’s stt-1b-en_fr model locally using Python. The code supports English and French audio transcription in real time.

1: Install Moshi

!pip install moshi

2: Download Sample Audio

!wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3

This is a French sample voice recording used for testing the transcription pipeline.

3: Import Required Libraries

import torch, time, textwrap, sentencepiece, sphn
from moshi.models import loaders, MimiModel, LMModel, LMGen

4: Define the Inference Pipeline

from dataclasses import dataclass

@dataclass
class InferenceState:
    mimi: MimiModel
    text_tokenizer: sentencepiece.SentencePieceProcessor
    lm_gen: LMGen

    def __init__(self, mimi, text_tokenizer, lm, batch_size, device):
        ...  # store the components and wrap the LM in an LMGen generator
        # Put both the codec and the generator into persistent streaming mode.
        self.mimi.streaming_forever(batch_size)
        self.lm_gen.streaming_forever(batch_size)

    def run(self, in_pcms):
        ...  # frame by frame: encode each audio chunk with Mimi, step the
        #     LM on the resulting codes, and decode the emitted text tokens
        #     into the all_text list
        return "".join(all_text)

5: Load Models and Audio

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_info = loaders.CheckpointInfo.from_hf_repo("kyutai/stt-1b-en_fr")

mimi = checkpoint_info.get_mimi(device=device)
text_tokenizer = checkpoint_info.get_text_tokenizer()
lm = checkpoint_info.get_moshi(device=device)

in_pcms, _ = sphn.read("sample_fr_hibiki_crepes.mp3", sample_rate=mimi.sample_rate)
in_pcms = torch.from_numpy(in_pcms).to(device=device)

6: Preprocess Audio

stt_config = checkpoint_info.stt_config
# 24000 = Mimi's 24 kHz sample rate; the paddings are expressed in samples.
pad_left = int(stt_config.get("audio_silence_prefix_seconds", 0.0) * 24000)
pad_right = int((stt_config.get("audio_delay_seconds", 0.0) + 1.0) * 24000)

in_pcms = torch.nn.functional.pad(in_pcms, (pad_left, pad_right), mode="constant")
# Keep the first channel only and add a batch dimension: (batch=1, channels=1, samples).
in_pcms = in_pcms[None, 0:1].expand(1, -1, -1)

7: Run the Inference

state = InferenceState(mimi, text_tokenizer, lm, batch_size=1, device=device)
text = state.run(in_pcms)
print(textwrap.fill(text, width=100))

This will output the transcribed French text.

8: Listen to the Audio (Optional)

from IPython.display import Audio
Audio("sample_fr_hibiki_crepes.mp3")

Key Features of Kyutai STT

  • Real-Time Streaming: Outputs transcriptions token by token
  • Multilingual: Supports both French and English
  • Open Source: Freely available for research and production
  • Transformer-Based: High accuracy with minimal delay

Performance Optimization and Scalability of Kyutai STT

The “Flush Trick” Innovation

Kyutai STT incorporates an innovative technique called the “flush trick” to further reduce response latency in interactive applications. When the voice activity detector identifies the end of speech, instead of waiting for the full model delay period, the system exploits the model’s ability to process audio faster than real-time.

By requesting accelerated processing of already-received audio, the system effectively “warps time,” reducing the typical 500ms delay to approximately 125ms (500ms/4x processing speed). This optimization makes interactions feel more natural and responsive, crucial for applications like Unmute where conversation flow is paramount.
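The flush-trick arithmetic described above can be sketched in a few lines, using the numbers from this post (500 ms delay, roughly 4x faster-than-real-time processing):

```python
# Sketch of the "flush trick" arithmetic: once VAD detects end of speech,
# the remaining `delay_s` seconds of already-buffered audio are processed
# at `rtf`x real-time speed instead of being played out in real time.
def flushed_latency(delay_s: float, rtf: float) -> float:
    """Perceived latency after flushing: the model still has to consume
    delay_s worth of audio, but does so at rtf-times real-time speed."""
    return delay_s / rtf

# Numbers from the post: 500 ms delay, ~4x faster-than-real-time model.
print(f"{flushed_latency(0.5, 4.0) * 1000:.0f} ms")  # 125 ms
```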

Scalability Advantages

Traditional streaming speech-to-text solutions often face significant scalability challenges. Converting models like Whisper to streaming operation requires complex additional systems and doesn’t support efficient batching. Kyutai STT’s architecture elegantly solves these problems, allowing a single GPU to handle hundreds of concurrent streams without additional orchestration complexity.
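The batching advantage can be sketched abstractly: because every stream advances exactly one fixed-size frame per step, many independent streams can be stacked and advanced in lockstep by a single batched forward pass. The following is a toy simulation with a stand-in model function, not the real Kyutai API:

```python
# Toy simulation of batched streaming: B independent audio streams are
# stepped in lockstep, one frame each per batched "forward pass". The
# model here is a stand-in function, not the real Kyutai API.
def batched_step(frames_batch):
    """Pretend forward pass: one call serves the whole batch at once."""
    return [f"text({frame})" for frame in frames_batch]

streams = [[f"s{i}_f{t}" for t in range(3)] for i in range(4)]  # 4 streams, 3 frames each
outputs = [[] for _ in streams]
for t in range(3):                    # one batched call per time step,
    batch = [s[t] for s in streams]   # regardless of how many streams
    for i, token in enumerate(batched_step(batch)):
        outputs[i].append(token)

print(outputs[0])  # ['text(s0_f0)', 'text(s0_f1)', 'text(s0_f2)']
```

The cost per time step is one forward pass for the whole batch, which is why a single GPU can serve many concurrent streams without per-stream orchestration.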


Conclusion

Kyutai STT is a breakthrough in live speech transcription, combining powerful encoding and decoding mechanisms for fast, multilingual performance. The open-source release empowers developers to build voice assistants, live captioning tools, and real-time transcription engines with minimal latency.

With its innovative architecture and strong benchmarks, Kyutai STT is poised to become a leading ASR solution in the era of streaming AI.



Md Monsur Ali is a tech writer and researcher specializing in AI, LLMs, and automation. He shares tutorials, reviews, and real-world insights on cutting-edge technology to help developers and tech enthusiasts stay ahead.