System Overview
Unmute is a real-time voice conversation system that allows text-based Large Language Models (LLMs) to listen and speak by wrapping them with Kyutai’s speech-to-text (STT) and text-to-speech (TTS) models. The architecture is designed for low latency and high throughput, enabling natural voice conversations.

Architecture Diagram
Core Components
Frontend (Next.js)
- Technology: React, TypeScript, Next.js
- Location: frontend/
- Responsibilities:
- User interface and voice selection
- Microphone capture and Opus encoding
- WebSocket communication with backend
- Real-time audio playback and visualization
- Subtitle rendering
Backend (FastAPI)
- Technology: Python, FastAPI, asyncio
- Location: unmute/main_websocket.py, unmute/unmute_handler.py
- Responsibilities:
- WebSocket orchestration
- Audio routing between STT, LLM, and TTS
- Conversation state management
- Turn-taking logic and interruption handling
- Metrics collection (Prometheus)
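The backend's role as an audio router can be sketched with asyncio primitives. This is a minimal illustration of the orchestration pattern, not Unmute's actual code: the stage functions and queue wiring here are hypothetical, standing in for the real STT/LLM/TTS clients.

```python
import asyncio

async def stt_stage(audio_q: asyncio.Queue, text_q: asyncio.Queue):
    # Hypothetical STT: "transcribe" each incoming frame into a word.
    while (frame := await audio_q.get()) is not None:
        await text_q.put(f"word{frame}")
    await text_q.put(None)  # signal end of speech

async def llm_tts_stage(text_q: asyncio.Queue, replies: list):
    # Hypothetical LLM+TTS: respond to each transcribed word.
    while (word := await text_q.get()) is not None:
        replies.append(f"reply-to-{word}")

async def handle_connection(frames):
    audio_q, text_q, replies = asyncio.Queue(), asyncio.Queue(), []
    for f in frames:
        await audio_q.put(f)
    await audio_q.put(None)
    # Both stages make progress on one event loop, without threads.
    await asyncio.gather(stt_stage(audio_q, text_q),
                         llm_tts_stage(text_q, replies))
    return replies

print(asyncio.run(handle_connection([0, 1, 2])))
# → ['reply-to-word0', 'reply-to-word1', 'reply-to-word2']
```

The same queue-between-stages shape generalizes to any number of concurrent connections, which is what makes the asyncio design a good fit here.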
Speech-to-Text Service
- Technology: Rust, WebSocket, Kyutai STT models
- Location: services/moshi-server/
- Responsibilities:
- Real-time speech transcription
- Voice Activity Detection (VAD)
- Pause prediction
LLM Server
- Technology: vLLM (default), OpenAI-compatible API
- Default Model: Llama 3.2 1B Instruct (configurable)
- Responsibilities:
- Text response generation
- Streaming completions
- Character personality implementation
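Because the LLM server speaks an OpenAI-compatible streaming API, its responses arrive as Server-Sent Events whose `data:` payloads carry incremental text deltas. The parser below is a sketch of consuming such a stream, assuming the standard `choices[0].delta.content` shape; real responses carry additional fields.

```python
import json

def parse_sse_stream(lines):
    """Parse OpenAI-style Server-Sent Event lines into text deltas.

    Minimal sketch: reads only the `choices[0].delta.content` path
    used by streaming chat completions.
    """
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":   # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Fake stream shaped like a vLLM/OpenAI streaming response:
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse_stream(stream)))  # → Hello
```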
Text-to-Speech Service
- Technology: Rust, WebSocket, Kyutai TTS models
- Location: services/moshi-server/
- Responsibilities:
- Real-time speech synthesis
- Voice cloning support
- Audio streaming with timing synchronization
Communication Protocol
Unmute uses a WebSocket-based protocol inspired by the OpenAI Realtime API:
- Frontend ↔ Backend: JSON messages over WebSocket (port 80/443)
- Backend ↔ STT: MessagePack binary protocol over WebSocket
- Backend ↔ LLM: HTTP streaming (Server-Sent Events)
- Backend ↔ TTS: MessagePack binary protocol over WebSocket
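For the frontend ↔ backend leg, a typed-JSON envelope in the Realtime-API style is enough. The helpers below sketch that framing; the event name `input_audio_buffer.append` is borrowed from the OpenAI Realtime API the protocol is modeled on, and Unmute's actual event types may differ.

```python
import json

def make_event(event_type: str, **payload) -> str:
    """Serialize a protocol event as a typed JSON envelope."""
    return json.dumps({"type": event_type, **payload})

def parse_event(raw: str):
    """Split an incoming message into its type and payload."""
    msg = json.loads(raw)
    return msg["type"], {k: v for k, v in msg.items() if k != "type"}

# Hypothetical example: the client appending an Opus-encoded audio chunk.
raw = make_event("input_audio_buffer.append", audio="base64-opus-bytes")
etype, body = parse_event(raw)
print(etype)  # → input_audio_buffer.append
```

The backend ↔ STT/TTS legs use the same idea but with MessagePack instead of JSON, trading readability for compact binary audio frames.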
Audio Format
- Sample Rate: 24 kHz
- Channels: Mono
- Frame Size: 480 samples (20ms)
- Codec: Opus (for browser ↔ backend)
- Internal Format: PCM float32 (for processing)
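The frame size and sample rate are consistent: 480 samples at 24 kHz is exactly 20 ms. A quick sketch of the arithmetic (the helper name is illustrative, not from the codebase):

```python
SAMPLE_RATE = 24_000   # Hz, mono
FRAME_SAMPLES = 480    # samples per frame

# 480 / 24000 s = 0.02 s per frame
frame_ms = FRAME_SAMPLES / SAMPLE_RATE * 1000
print(frame_ms)  # → 20.0

def frames_for(seconds: float) -> int:
    """Number of whole 20 ms frames covering `seconds` of audio."""
    return int(seconds * SAMPLE_RATE) // FRAME_SAMPLES

print(frames_for(1.0))  # → 50
```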
Deployment Architectures
Single-GPU Docker Compose
- All services on one GPU
- Simplest deployment
- TTS latency: ~750ms on L40S
Multi-GPU Docker Compose
- STT, TTS, LLM on separate GPUs
- Improved latency
- TTS latency: ~450ms on L40S
Docker Swarm (Production)
- Multi-node deployment
- Service discovery
- Load balancing with Traefik
- Auto-scaling support
Key Design Decisions
Asynchronous Architecture
The entire backend is built on Python’s asyncio to handle multiple concurrent connections efficiently without blocking:
- Concurrent STT/TTS/LLM operations
- Non-blocking I/O for WebSocket connections
- Quest-based service management
Low Latency Optimizations
- Streaming: All services stream data incrementally
- Word-level chunking: LLM output rechunked to word boundaries for TTS
- Real-time queuing: TTS audio released at precise timestamps
- Frame-based processing: 20ms audio frames for minimal buffering
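The word-level chunking step can be sketched as follows: LLM tokens may split words arbitrarily, so the text is buffered until a whitespace boundary before being handed to TTS. This is an illustration of the idea, not the actual rechunking code.

```python
def rechunk_to_words(token_stream):
    """Re-chunk incremental LLM text deltas at word boundaries."""
    buf = ""
    for token in token_stream:
        buf += token
        # Emit every complete word; keep the trailing partial word buffered.
        while " " in buf:
            word, buf = buf.split(" ", 1)
            if word:
                yield word
    if buf:  # flush the final word at end of stream
        yield buf

print(list(rechunk_to_words(["Hel", "lo wor", "ld, ho", "w are you"])))
# → ['Hello', 'world,', 'how', 'are', 'you']
```

Emitting at word boundaries keeps TTS latency low (no waiting for full sentences) while never feeding the synthesizer a half-word.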
Interruption Handling
Users can interrupt the bot mid-sentence:
- VAD-based detection (pause prediction < 0.4)
- STT word detection during bot speech
- Queue clearing and service cancellation
- Clean conversation state transition
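The cancel-and-clear step maps naturally onto asyncio task cancellation. The sketch below is a hypothetical demo of that pattern, assuming the bot's playback runs as its own task draining an audio queue:

```python
import asyncio

async def speak(audio_q: asyncio.Queue, played: list):
    # Hypothetical TTS playback loop: emit frames until cancelled.
    while True:
        played.append(await audio_q.get())

async def interrupt_demo() -> bool:
    audio_q, played = asyncio.Queue(), []
    for frame in range(5):
        audio_q.put_nowait(frame)
    bot_turn = asyncio.create_task(speak(audio_q, played))
    await asyncio.sleep(0)  # bot is mid-sentence
    # User interrupts: cancel synthesis, then drop any queued audio.
    bot_turn.cancel()
    try:
        await bot_turn
    except asyncio.CancelledError:
        pass
    while not audio_q.empty():
        audio_q.get_nowait()
    return audio_q.empty()  # conversation state is clean again

print(asyncio.run(interrupt_demo()))  # → True
```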
Service Resilience
- Connection retry with exponential backoff
- Health checks for all services
- Graceful degradation
- Prometheus metrics for monitoring
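A retry loop with exponential backoff, as used for service connections, can be sketched like this; the function name, delay values, and attempt cap are illustrative rather than the services' actual policy:

```python
import time

def connect_with_backoff(connect, max_attempts=5, base_delay=0.1):
    """Retry `connect` with exponentially growing delays between attempts."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...

# Example: a fake service that refuses the first two connections.
calls = {"n": 0}
def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service not ready")
    return "connected"

print(connect_with_backoff(flaky_connect, base_delay=0.001))  # → connected
```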
Scalability
The architecture is designed to scale horizontally:
- Backend: Multiple instances behind a load balancer (4 concurrent clients per instance)
- STT/TTS: Service discovery with capacity management
- LLM: vLLM supports batching and KV caching
- Frontend: Static assets via CDN
Next Steps
- System Design - Detailed data flow and timing
- Core Components - Deep dive into each component
- Speech-to-Text - STT implementation details
- Text-to-Speech - TTS implementation details
- LLM Integration - LLM integration patterns