What is Unmute?
Unmute is a complete system that allows text LLMs to listen and speak by wrapping them in Kyutai’s state-of-the-art speech models. Experience natural voice conversations with any text LLM you like.

Low Latency
Optimized STT and TTS models deliver ~450ms response time in production
Any LLM
Works with any OpenAI-compatible LLM server: vLLM, Ollama, OpenAI, Mistral
Real-time Streaming
Stream audio and text bidirectionally over WebSocket connections
Custom Voices
Clone voices from audio samples and customize character personalities
How It Works
Unmute orchestrates multiple AI services to create seamless voice conversations.

Key Components
Backend (Python/FastAPI)
The core orchestration layer that manages:
- WebSocket connections with OpenAI Realtime API compatibility
- Conversation state and chat history
- Service coordination between STT, LLM, and TTS
- Voice management and cloning
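Since the backend speaks the OpenAI Realtime protocol, a client exchanges JSON events over the WebSocket. A minimal sketch of building one such event follows; the field names mirror the Realtime API's `input_audio_buffer.append` event and are shown for illustration, not as Unmute's full event set:

```python
import base64
import json

def audio_append_event(pcm_bytes: bytes) -> str:
    """Build an OpenAI Realtime-style event carrying one chunk of audio.

    Unmute's WebSocket endpoint is compatible with this protocol; treat
    the fields here as illustrative rather than an exhaustive spec.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })
```

A client would send such events over the open socket (for example with the `websockets` library) and receive transcription and audio events back the same way.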
unmute/main_websocket.py

Speech-to-Text (Kyutai STT 1B)
Low-latency speech recognition:
- Model: Kyutai STT 1B (English/French)
- Memory: ~2.5GB VRAM
- Latency: 6-token delay for real-time transcription
- Architecture: Transformer with 16 layers, 2048 d_model
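The specs above can be collected into a small config object, handy for sanity checks; the field names here are illustrative, not Kyutai's actual config schema:

```python
from dataclasses import dataclass

@dataclass
class STTConfig:
    # Values from the spec list above; names are hypothetical.
    num_layers: int = 16
    d_model: int = 2048
    delay_tokens: int = 6            # lookahead before text is emitted
    languages: tuple = ("en", "fr")
    vram_gb: float = 2.5

cfg = STTConfig()
# Rough parameter count for the transformer trunk alone, using the
# standard 12 * d_model^2 per layer estimate (attention + MLP weights,
# ignoring embeddings and norms) -- consistent with a ~1B model:
approx_params = cfg.num_layers * 12 * cfg.d_model ** 2
```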
Text-to-Speech (Kyutai TTS 1.6B)
Natural voice synthesis:
- Model: Kyutai TTS 1.6B (English/French)
- Memory: ~5.3GB VRAM
- Features: Voice cloning from audio samples
- Voices: 100+ community-donated voices available
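Conceptually, a synthesis request picks either one of the donated voices by name or an audio sample to clone from. A sketch of that choice as a request builder; the function and field names are hypothetical, not Kyutai's actual API:

```python
from typing import Optional

def tts_request(text: str, voice: Optional[str] = None,
                clone_sample_path: Optional[str] = None) -> dict:
    """Sketch of a TTS request: name one of the 100+ donated voices,
    or point at an audio sample to clone. Field names are hypothetical."""
    if (voice is None) == (clone_sample_path is None):
        raise ValueError("specify exactly one of voice or clone_sample_path")
    req = {"text": text}
    if voice is not None:
        req["voice"] = voice
    else:
        req["clone_from"] = clone_sample_path
    return req
```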
LLM (Configurable)
Any OpenAI-compatible text model:
- Default: Llama 3.2 1B Instruct (16GB config)
- Recommended: Mistral Small 3.2 24B, Gemma 3 12B
- Memory: 6.1GB VRAM minimum (model dependent)
- Hosted via vLLM, Ollama, or external APIs
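Because every listed backend exposes the same OpenAI-compatible chat-completions endpoint, swapping vLLM for Ollama or a hosted API only changes the base URL. A minimal sketch of the request body sent to `POST {base_url}/v1/chat/completions` (the Hugging Face-style model id is an example spelling of the default listed above):

```python
import json

def chat_request(model: str, user_text: str, stream: bool = True) -> bytes:
    """Body for the OpenAI-compatible chat-completions endpoint that
    vLLM, Ollama, and hosted APIs all accept."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": stream,  # Unmute streams tokens straight into TTS
    }).encode()

body = chat_request("meta-llama/Llama-3.2-1B-Instruct", "Say hello.")
```

Streaming matters here: the TTS can start speaking as soon as the first tokens arrive, rather than waiting for the full completion.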
Frontend (Next.js)
Modern web interface:
- Real-time audio capture and playback
- WebSocket communication
- Subtitles and debug mode (press ‘S’ and ‘D’)
- Character selection and voice customization
Deployment Options
Docker Compose
Recommended: single GPU, single machine, very easy setup
Dockerless
Manual service management for 1-3 GPUs across 1-5 machines
Docker Swarm
Production scaling for 1-100 GPUs (used by unmute.sh)
Model Architecture
STT Model Configuration
The speech-to-text model uses a transformer architecture optimized for streaming.

TTS Model Configuration
The text-to-speech model supports voice cloning and multiple languages.

Performance Metrics
On unmute.sh with separate GPUs for each service:
- TTS Latency: ~450ms (vs. ~750ms on a single L40S GPU)
- Max Concurrent Users: 4 per backend instance (GIL constraint)
- Model Memory: 16GB VRAM total (STT: 2.5GB, TTS: 5.3GB, LLM: 6.1GB+)
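A quick sanity check that the three services fit the 16GB single-GPU configuration, using the per-service figures listed above (real usage adds activation and framework overhead on top of the weights, which is where the remaining headroom goes):

```python
# Per-service model memory in GB, from the list above (LLM is "6.1GB+",
# so this is a lower bound for the LLM).
vram = {"stt": 2.5, "tts": 5.3, "llm": 6.1}

total = sum(vram.values())   # 13.9 GB of model weights
headroom = 16.0 - total      # ~2.1 GB left for overhead on a 16GB GPU
```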
Try It Now
Experience Unmute live at unmute.sh or deploy your own instance.

Quick Start
Get Unmute running locally in 5 minutes with Docker Compose
Requirements
Check hardware, software, and configuration prerequisites
Research & Development
Research Paper
Read the academic paper on delayed streams modeling
Kyutai Models
Use Kyutai STT or TTS independently in your projects