Overview
The Text-to-Speech (TTS) component uses Kyutai’s streaming TTS models to synthesize natural-sounding speech from LLM-generated text in real time, with voice cloning support.
Key Features:
- Real-time streaming synthesis
- Voice cloning from 30-second audio samples
- Word-level timing synchronization
- WebSocket-based binary protocol (MessagePack)
- Configurable voice parameters (temperature, CFG scale)
Architecture
TTS Service
Technology: Rust (moshi-server)
Location: services/moshi-server/
Model: Kyutai TTS 1.6B
Deployment: Docker container with GPU access
Service Configuration
Docker Compose (docker-compose.yml:55):
- VRAM: ~5.3 GB
- Concurrent streams: Limited by capacity management
Python Client
File:unmute/tts/text_to_speech.py
TextToSpeech Class
Connection Flow
Startup Sequence
File:text_to_speech.py:206
Sending Text
File:text_to_speech.py:181
Text Preprocessing
File:text_to_speech.py:97
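Text coming from an LLM often carries markup and irregular whitespace that a speech model cannot pronounce. The following is a minimal sketch of that kind of cleanup, not the code at the referenced location: the function name `clean_text_for_tts` and the exact set of stripped characters are assumptions made for illustration.

```python
import re


def clean_text_for_tts(text: str) -> str:
    """Hypothetical cleanup step; the real preprocessing rules may differ."""
    # Normalize typographic quotes to their ASCII equivalents.
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    # Drop common Markdown markers that have no spoken form.
    text = re.sub(r"[*_`#]", "", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

For example, `clean_text_for_tts("  **Hello**,   `world`!\n")` returns `"Hello, world!"`.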
Receiving Audio
File:text_to_speech.py:267
The TTS client is an async iterator with built-in timing control:
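As an illustration of the idea, here is a small sketch of an async iterator that holds each item back until its scheduled offset has passed. It is a simplified stand-in, assuming items arrive as `(offset_seconds, payload)` pairs; the names `release_on_schedule` and `collect` are hypothetical and do not come from the client code.

```python
import asyncio
import time
from typing import AsyncIterator, Iterable, List, Tuple


async def release_on_schedule(
    items: Iterable[Tuple[float, str]],
) -> AsyncIterator[str]:
    """Yield each payload no earlier than its offset (in seconds) from the start."""
    start = time.monotonic()
    for offset_s, payload in items:
        delay = start + offset_s - time.monotonic()
        if delay > 0:
            # The item was produced ahead of schedule, so wait before releasing it.
            await asyncio.sleep(delay)
        yield payload


async def collect(items: Iterable[Tuple[float, str]]) -> List[str]:
    """Drain the iterator; a real consumer would forward each item as it arrives."""
    return [payload async for payload in release_on_schedule(items)]
```

A consumer would typically write `async for message in ...` and forward each message to the audio output as soon as it is released.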
Message Types
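The message types listed in this section can be pictured as small dictionaries tagged by a `type` field. The sketch below works on payloads that have already been decoded from MessagePack, so it needs no third-party package. All field names (`type`, `text`, `start_s`, `pcm`, `message`) and type tags are assumptions for illustration and should be checked against the actual protocol definitions.

```python
from typing import Any, Dict


def text_message(text: str) -> Dict[str, Any]:
    """Client -> server: a chunk of text to synthesize (assumed shape)."""
    return {"type": "Text", "text": text}


def eos_message() -> Dict[str, Any]:
    """Client -> server: no more text will follow (assumed shape)."""
    return {"type": "Eos"}


def describe_server_message(msg: Dict[str, Any]) -> str:
    """Dispatch on the assumed `type` tag of a decoded server message."""
    kind = msg.get("type")
    if kind == "Ready":
        return "ready"
    if kind == "Text":
        return f"word {msg['text']!r} at {msg['start_s']:.2f}s"
    if kind == "Audio":
        return f"audio chunk with {len(msg['pcm'])} samples"
    if kind == "Error":
        return f"error: {msg['message']}"
    raise ValueError(f"unknown message type: {kind!r}")
```

Raising on unknown tags, rather than ignoring them, makes protocol mismatches visible early.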
Client → Server
Text Message
Voice Message
End-of-Stream Message
Server → Client
Ready Message
Text Message
Audio Message
Error Message
Real-Time Queue
File:unmute/tts/realtime_queue.py
Manages timed release of audio/text messages.
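One common way to implement timed release is a min-heap keyed by release time. The sketch below shows that technique under stated assumptions: the class name `TimedReleaseQueue`, its methods, and the injectable clock are hypothetical and are not taken from `realtime_queue.py`.

```python
import heapq
import itertools
from typing import Any, Callable, Iterator, List, Tuple


class TimedReleaseQueue:
    """Hypothetical queue that releases items once their release time is reached."""

    def __init__(self, get_time: Callable[[], float]) -> None:
        self._get_time = get_time
        self._heap: List[Tuple[float, int, Any]] = []
        # The counter breaks ties so items with equal times keep insertion order.
        self._counter = itertools.count()

    def put(self, item: Any, release_at: float) -> None:
        heapq.heappush(self._heap, (release_at, next(self._counter), item))

    def get_ready(self) -> Iterator[Any]:
        """Yield every item whose release time is at or before the current time."""
        now = self._get_time()
        while self._heap and self._heap[0][0] <= now:
            _, _, item = heapq.heappop(self._heap)
            yield item


# Usage with a fake clock, which keeps the example deterministic.
clock = [0.0]
queue = TimedReleaseQueue(lambda: clock[0])
queue.put("audio-2", release_at=0.08)
queue.put("audio-1", release_at=0.0)
queue.put("word", release_at=0.0)
first_batch = list(queue.get_ready())   # items due at t=0.0
clock[0] = 0.1
second_batch = list(queue.get_ready())  # items due by t=0.1
```

Passing the clock in as a callable lets tests advance time manually instead of sleeping.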
Used in text_to_speech.py:94.
Voice Configuration
Query Parameters
File:text_to_speech.py:111
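Voice parameters such as temperature and CFG scale are passed as query parameters on the WebSocket URL. The sketch below shows how such a URL could be assembled with the standard library. The base URL, the parameter names `voice`, `temperature`, and `cfg_alpha`, and the example values are all assumptions, not the confirmed server interface.

```python
from urllib.parse import urlencode


def build_tts_url(base_url: str, voice: str, temperature: float, cfg_alpha: float) -> str:
    """Append hypothetical voice parameters to a TTS endpoint URL."""
    params = {
        "voice": voice,              # assumed name for the voice selector
        "temperature": temperature,  # assumed name for sampling temperature
        "cfg_alpha": cfg_alpha,      # assumed name for the CFG scale
    }
    # urlencode percent-escapes reserved characters such as ":" in "custom:xyz123".
    return f"{base_url}?{urlencode(params)}"


url = build_tts_url("ws://localhost:8080/tts", "custom:xyz123", 0.8, 1.5)
```

Here `url` is `ws://localhost:8080/tts?voice=custom%3Axyz123&temperature=0.8&cfg_alpha=1.5`.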
Voice Sources
File:voices.yaml
- Pre-defined: Voices from voices.yaml (loaded from HuggingFace)
- Custom: User-uploaded voice cloning (custom:xyz123)
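The two voice sources above can be told apart by the `custom:` prefix. A minimal sketch of that distinction follows; the function name `classify_voice` and the idea of passing the pre-defined names as a set are assumptions for illustration.

```python
from typing import Set, Tuple

CUSTOM_PREFIX = "custom:"


def classify_voice(voice: str, predefined: Set[str]) -> Tuple[str, str]:
    """Return ("custom", id) or ("predefined", name); hypothetical helper."""
    if voice.startswith(CUSTOM_PREFIX):
        # Strip the prefix to recover the identifier of the uploaded voice.
        return ("custom", voice[len(CUSTOM_PREFIX):])
    if voice in predefined:
        return ("predefined", voice)
    raise ValueError(f"unknown voice: {voice!r}")
```

For example, `classify_voice("custom:xyz123", set())` returns `("custom", "xyz123")`.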
Voice Cloning
File:unmute/tts/voice_cloning.py
Integration with UnmuteHandler
Startup
File:unmute/unmute_handler.py:470
Message Loop
File:unmute/unmute_handler.py:508
LLM → TTS Pipeline
File:unmute/unmute_handler.py:184
Timing & Latency
Time to First Token (TTFT)
File:text_to_speech.py:286
- Single GPU: 750ms
- Dedicated GPU: 450ms
- Depends on: GPU model, model size, queue depth
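Figures like these come from measuring the delay between sending the first text and receiving the first audio. The sketch below shows one way to take that measurement; the class `FirstAudioTimer` and its method names are hypothetical and are not the code at the referenced location.

```python
import time
from typing import Callable, Optional


class FirstAudioTimer:
    """Hypothetical helper that records time from first text sent to first audio."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._sent_at: Optional[float] = None
        self.ttft_s: Optional[float] = None

    def on_text_sent(self) -> None:
        # Only the first text message starts the timer.
        if self._sent_at is None:
            self._sent_at = self._clock()

    def on_audio_received(self) -> None:
        # Only the first audio message stops it; later chunks are ignored.
        if self._sent_at is not None and self.ttft_s is None:
            self.ttft_s = self._clock() - self._sent_at
```

Using a monotonic clock avoids skew from wall-clock adjustments during a session.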
Real-Time Factor
TTS generates audio faster than real-time.
Synchronization
File:unmute/unmute_handler.py:523
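When audio is generated faster than it is played, the consumer has to track how far generation has run ahead of the wall clock. The arithmetic can be sketched as follows. The function names are hypothetical, and the 24 kHz sample rate used in the example is an assumption rather than a value stated in this document.

```python
def seconds_ahead(samples_generated: int, sample_rate: int, elapsed_s: float) -> float:
    """How much generated audio (in seconds) has not yet been due for playback."""
    return samples_generated / sample_rate - elapsed_s


def realtime_speedup(audio_s: float, generation_s: float) -> float:
    """Seconds of audio produced per second of generation; above 1 is faster than real time."""
    return audio_s / generation_s


# Example: 48000 samples at an assumed 24 kHz is 2.0 s of audio.
# If 1.5 s have elapsed, generation is 0.5 s ahead of playback.
lead = seconds_ahead(48000, 24000, 1.5)
speedup = realtime_speedup(2.0, 1.0)
```

A positive lead means messages should be held back before release; a negative lead means playback is at risk of underrunning.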
Metrics
File:unmute/metrics.py
TTS-Specific Metrics
Error Handling
Connection Failures
Race Condition Handling
File:text_to_speech.py:226
Voice Donation
File:unmute/tts/voice_donation.py
System for collecting voice donations:
Next Steps
- LLM Integration - LLM text generation
- Speech-to-Text - STT component
- Frontend - Audio playback