What is Unmute?
Unmute is a complete system that allows text LLMs to listen and speak by wrapping them in Kyutai’s state-of-the-art speech models. Experience natural voice conversations with any text LLM you like.

Low Latency
Optimized STT and TTS models deliver ~450ms response time in production
Any LLM
Works with any OpenAI-compatible LLM server: vLLM, Ollama, OpenAI, Mistral
Real-time Streaming
Stream audio and text bidirectionally over WebSocket connections
Custom Voices
Clone voices from audio samples and customize character personalities
How It Works
Unmute orchestrates multiple AI services to create seamless voice conversations.

Key Components
Backend (Python/FastAPI)
The core orchestration layer that manages:
- WebSocket connections with OpenAI Realtime API compatibility
- Conversation state and chat history
- Service coordination between STT, LLM, and TTS
- Voice management and cloning
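Since the backend speaks the OpenAI Realtime protocol, a client exchanges JSON events over the WebSocket. A minimal sketch of building one such event follows; the field names mirror the Realtime API's `input_audio_buffer.append` event and are shown for illustration, not as Unmute's full event set:

```python
import base64
import json

def audio_append_event(pcm_bytes: bytes) -> str:
    """Build an OpenAI Realtime-style event carrying one chunk of audio.

    Unmute's WebSocket endpoint is compatible with this protocol; treat
    the fields here as illustrative rather than an exhaustive spec.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })
```

A client would send such events over the open socket (for example with the `websockets` library) and receive transcription and audio events back the same way.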
unmute/main_websocket.py

Speech-to-Text (Kyutai STT 1B)
Low-latency speech recognition:
- Model: Kyutai STT 1B (English/French)
- Memory: ~2.5GB VRAM
- Latency: 6-token delay for real-time transcription
- Architecture: Transformer with 16 layers, 2048 d_model
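The specs above can be collected into a small config object, handy for sanity checks; the field names here are illustrative, not Kyutai's actual config schema:

```python
from dataclasses import dataclass

@dataclass
class STTConfig:
    # Values from the spec list above; names are hypothetical.
    num_layers: int = 16
    d_model: int = 2048
    delay_tokens: int = 6            # lookahead before text is emitted
    languages: tuple = ("en", "fr")
    vram_gb: float = 2.5

cfg = STTConfig()
# Rough parameter count for the transformer trunk alone, using the
# standard 12 * d_model^2 per layer estimate (attention + MLP weights,
# ignoring embeddings and norms) -- consistent with a ~1B model:
approx_params = cfg.num_layers * 12 * cfg.d_model ** 2
```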
Text-to-Speech (Kyutai TTS 1.6B)
Natural voice synthesis:
- Model: Kyutai TTS 1.6B (English/French)
- Memory: ~5.3GB VRAM
- Features: Voice cloning from audio samples
- Voices: 100+ community-donated voices available
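Conceptually, a synthesis request picks either one of the donated voices by name or an audio sample to clone from. A sketch of that choice as a request builder; the function and field names are hypothetical, not Kyutai's actual API:

```python
from typing import Optional

def tts_request(text: str, voice: Optional[str] = None,
                clone_sample_path: Optional[str] = None) -> dict:
    """Sketch of a TTS request: name one of the 100+ donated voices,
    or point at an audio sample to clone. Field names are hypothetical."""
    if (voice is None) == (clone_sample_path is None):
        raise ValueError("specify exactly one of voice or clone_sample_path")
    req = {"text": text}
    if voice is not None:
        req["voice"] = voice
    else:
        req["clone_from"] = clone_sample_path
    return req
```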
LLM (Configurable)
Any OpenAI-compatible text model:
- Default: Llama 3.2 1B Instruct (16GB config)
- Recommended: Mistral Small 3.2 24B, Gemma 3 12B
- Memory: 6.1GB VRAM minimum (model dependent)
- Hosted via vLLM, Ollama, or external APIs
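Because every listed backend exposes the same OpenAI-compatible chat-completions endpoint, swapping vLLM for Ollama or a hosted API only changes the base URL. A minimal sketch of the request body sent to `POST {base_url}/v1/chat/completions` (the Hugging Face-style model id is an example spelling of the default listed above):

```python
import json

def chat_request(model: str, user_text: str, stream: bool = True) -> bytes:
    """Body for the OpenAI-compatible chat-completions endpoint that
    vLLM, Ollama, and hosted APIs all accept."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": stream,  # Unmute streams tokens straight into TTS
    }).encode()

body = chat_request("meta-llama/Llama-3.2-1B-Instruct", "Say hello.")
```

Streaming matters here: the TTS can start speaking as soon as the first tokens arrive, rather than waiting for the full completion.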
Frontend (Next.js)
Modern web interface:
- Real-time audio capture and playback
- WebSocket communication
- Subtitles and debug mode (press ‘S’ and ‘D’)
- Character selection and voice customization
Deployment Options
Docker Compose
Recommended: single GPU, single machine, very easy setup
Dockerless
Manual service management for 1-3 GPUs across 1-5 machines
Docker Swarm
Production scaling for 1-100 GPUs (used by unmute.sh)
Model Architecture
STT Model Configuration
The speech-to-text model uses a transformer architecture optimized for streaming.

TTS Model Configuration
The text-to-speech model supports voice cloning and multiple languages.

Performance Metrics
On unmute.sh with separate GPUs for each service:
- TTS Latency: ~450ms (vs. ~750ms on a single L40S GPU)
- Max Concurrent Users: 4 per backend instance (GIL constraint)
- Model Memory: 16GB VRAM total (STT: 2.5GB, TTS: 5.3GB, LLM: 6.1GB+)
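A quick sanity check that the three services fit the 16GB single-GPU configuration, using the per-service figures listed above (real usage adds activation and framework overhead on top of the weights, which is where the remaining headroom goes):

```python
# Per-service model memory in GB, from the list above (LLM is "6.1GB+",
# so this is a lower bound for the LLM).
vram = {"stt": 2.5, "tts": 5.3, "llm": 6.1}

total = sum(vram.values())   # 13.9 GB of model weights
headroom = 16.0 - total      # ~2.1 GB left for overhead on a 16GB GPU
```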
Try It Now
Experience Unmute live at unmute.sh or deploy your own instance.

Quick Start
Get Unmute running locally in 5 minutes with Docker Compose
Requirements
Check hardware, software, and configuration prerequisites
Research & Development
Research Paper
Read the academic paper on delayed streams modeling
Kyutai Models
Use Kyutai STT or TTS independently in your projects