
What is Unmute?

Unmute is a complete system that allows text LLMs to listen and speak by wrapping them in Kyutai’s state-of-the-art speech models. Experience natural voice conversations with any text LLM you like.

Low Latency

Optimized STT and TTS models deliver ~450ms response time in production

Any LLM

Works with any OpenAI-compatible LLM server: vLLM, Ollama, OpenAI, Mistral

Real-time Streaming

Stream audio and text bidirectionally over WebSocket connections

Custom Voices

Clone voices from audio samples and customize character personalities

How It Works

Unmute orchestrates multiple AI services to create seamless voice conversations:
1. User Speaks: Audio is captured in the browser and streamed to the backend over a WebSocket
2. Speech to Text: Kyutai STT transcribes the speech in real time with a 6-token delay for low latency
3. LLM Responds: Your chosen text LLM generates a reply based on the conversation history
4. Text to Speech: Kyutai TTS converts the reply to speech with customizable voices
5. Audio Playback: The generated audio streams back to the browser for immediate playback
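
As a rough sketch of step 1, an audio chunk can be framed as an OpenAI-Realtime-style client event, the protocol the backend's WebSocket endpoint is compatible with. The exact audio encoding Unmute expects (e.g. Opus from the browser vs. raw PCM) is not specified here, so treat the PCM packing below as an illustrative assumption.

```python
import base64
import json
import struct

def make_audio_append_event(samples):
    """Frame 16-bit PCM samples as a Realtime-API-style client event.

    The event name follows the OpenAI Realtime API; the audio format
    Unmute actually accepts may differ (assumption for illustration).
    """
    pcm = struct.pack(f"<{len(samples)}h", *samples)  # little-endian int16
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })

# One tiny chunk of audio, framed as a JSON text message
event = make_audio_append_event([0, 1000, -1000, 0])
```

The browser would send each such JSON frame over the open WebSocket as the microphone produces audio.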

Key Components

Backend Orchestrator

The core orchestration layer that manages:
  • WebSocket connections with OpenAI Realtime API compatibility
  • Conversation state and chat history
  • Service coordination between STT, LLM, and TTS
  • Voice management and cloning
Located in unmute/main_websocket.py
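
The service coordination can be pictured as three chained asynchronous streams: audio tokens flow into STT, transcribed text into the LLM, and reply text into TTS. The sketch below uses hypothetical stand-in services; Unmute's real interfaces in unmute/main_websocket.py differ.

```python
import asyncio

# Hypothetical stand-ins for the real STT, LLM, and TTS services.
async def stt(audio_chunks):
    async for chunk in audio_chunks:
        yield f"word{chunk}"          # pretend each chunk transcribes to one word

async def llm(words):
    text = " ".join([w async for w in words])
    yield f"reply to: {text}"         # one streamed reply sentence

async def tts(sentences):
    async for s in sentences:
        yield s.encode()              # pretend the bytes are synthesized audio

async def pipeline(audio_chunks):
    # The orchestrator's job, reduced to its essence: chain the streams.
    return [audio async for audio in tts(llm(stt(audio_chunks)))]

async def mic():
    for i in range(3):                # three fake microphone chunks
        yield i

audio_out = asyncio.run(pipeline(mic()))
```

The real orchestrator additionally interleaves these streams with interruption handling and conversation state, which this sketch omits.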

Speech-to-Text (STT)

Low-latency speech recognition:
  • Model: Kyutai STT 1B (English/French)
  • Memory: ~2.5GB VRAM
  • Latency: 6-token delay for real-time transcription
  • Architecture: Transformer with 16 layers, 2048 d_model

Text-to-Speech (TTS)

Natural voice synthesis:
  • Model: Kyutai TTS 1.6B (English/French)
  • Memory: ~5.3GB VRAM
  • Features: Voice cloning from audio samples
  • Voices: 100+ community-donated voices available

LLM

Any OpenAI-compatible text model:
  • Default: Llama 3.2 1B Instruct (16GB config)
  • Recommended: Mistral Small 3.2 24B, Gemma 3 12B
  • Memory: 6.1GB VRAM minimum (model dependent)
  • Hosted via vLLM, Ollama, or external APIs
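
"OpenAI-compatible" means any server that accepts the standard Chat Completions request shape will work; switching from vLLM to Ollama to a hosted API is essentially a base-URL and model-name change. A minimal sketch of that request shape (the model id and the `stream` choice here are illustrative, not Unmute's actual configuration):

```python
import json

def chat_request(model, history, user_text):
    """Build a standard Chat Completions request body.

    vLLM, Ollama, and OpenAI all accept this shape; streaming keeps
    tokens flowing to TTS instead of waiting for the full reply.
    """
    return {
        "model": model,
        "messages": history + [{"role": "user", "content": user_text}],
        "stream": True,
    }

req = chat_request(
    "mistral-small-3.2-24b",  # hypothetical model id
    [{"role": "system", "content": "You are a voice assistant."}],
    "Hello!",
)
body = json.dumps(req)  # POSTed to the server's /v1/chat/completions
```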

Frontend

Modern web interface:
  • Real-time audio capture and playback
  • WebSocket communication
  • Subtitles and debug mode (press ‘S’ and ‘D’)
  • Character selection and voice customization

Deployment Options

Docker Compose

Recommended - Single GPU, single machine, very easy setup

Dockerless

Manual service management for 1-3 GPUs across 1-5 machines

Docker Swarm

Production scaling for 1-100 GPUs (used by unmute.sh)

Model Architecture

STT Model Configuration

The speech-to-text model uses a transformer architecture optimized for streaming:
[modules.asr.model.transformer]
d_model = 2048
num_heads = 16
num_layers = 16
dim_feedforward = 8192
causal = true
max_seq_len = 40960
asr_delay_in_tokens = 6  # Low latency configuration
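
The asr_delay_in_tokens value translates directly into audio lookahead. Assuming the model consumes audio tokens at 12.5 Hz (the frame rate of Kyutai's Mimi codec; check the model card for the exact rate), six tokens correspond to roughly half a second:

```python
# Back-of-the-envelope STT lookahead from the config above.
# Assumption: audio tokens arrive at 12.5 Hz (Mimi codec frame rate).
FRAME_RATE_HZ = 12.5
asr_delay_in_tokens = 6

delay_ms = asr_delay_in_tokens / FRAME_RATE_HZ * 1000  # 480.0 ms
```

Under that assumption the 6-token delay is about 480 ms of lookahead, in the same ballpark as the ~450 ms response times quoted on this page.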

TTS Model Configuration

The text-to-speech model supports voice cloning and multiple languages:
[modules.tts_py.py]
cfg_coef = 2.0       # Classifier-free guidance coefficient
n_q = 24             # Number of quantization levels
padding_between = 1  # Token padding for prosody
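
cfg_coef controls classifier-free guidance. The usual recipe, sketched below, runs the model with and without text conditioning and blends the two sets of logits; whether Kyutai TTS applies the coefficient in exactly this form is an assumption.

```python
# Classifier-free guidance blend, as commonly applied at sampling time
# (an assumption about Kyutai TTS, not its confirmed implementation).
def apply_cfg(cond_logits, uncond_logits, cfg_coef=2.0):
    return [
        u + cfg_coef * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]

logits = apply_cfg([1.0, 3.0], [0.0, 1.0], cfg_coef=2.0)
```

With cfg_coef = 1.0 this reduces to the conditional logits alone; values above 1, like the 2.0 configured here, push sampling harder toward the text-conditioned distribution.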

Performance Metrics

On unmute.sh with separate GPUs for each service:
  • TTS Latency: ~450ms (vs. ~750ms on single L40S GPU)
  • Max Concurrent Users: 4 per backend instance (limited by Python's GIL)
  • Model Memory: 16GB VRAM total (STT: 2.5GB, TTS: 5.3GB, LLM: 6.1GB+)

Try It Now

Experience Unmute live at unmute.sh or deploy your own instance:

Quick Start

Get Unmute running locally in 5 minutes with Docker Compose

Requirements

Check hardware, software, and configuration prerequisites

Research & Development

Research Paper

Read the academic paper on delayed streams modeling

Kyutai Models

Use Kyutai STT or TTS independently in your projects

Unmute requires:
  • GPU: CUDA-capable with 16GB+ VRAM
  • Architecture: x86_64 only (no aarch64 support)
  • OS: Linux or Windows with WSL (no native Windows or macOS)

Community

Unmute includes 100+ voices donated by the community through the Unmute Voice Donation Project (June 2025 - February 2026). These voices are available for use with Kyutai TTS and other open-source TTS models. Browse available voices in the voice repository.
