Conversation Flow
A typical conversation in Unmute follows this sequence:

Detailed Message Flow
1. Connection Setup
Timing: Connection setup typically takes 100-300ms

2. User Speaking
Key Timing:
- Audio frame: 20ms (480 samples @ 24kHz)
- STT delay: ~2.5s (configurable via STT_DELAY_SEC)
- Time to first word: ~50-100ms after first audio frame
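The delayed-stream behavior above can be sketched in a few lines. This is an illustration only: STT_DELAY_SEC is a plain constant here, and word_emission_time is a hypothetical helper, not Unmute's API.

```python
# Sketch of the delayed STT: a word spoken at time t is surfaced roughly
# STT_DELAY_SEC later. Names here are illustrative, not Unmute's API.
STT_DELAY_SEC = 2.5


def word_emission_time(spoken_at_s: float) -> float:
    """Approximate wall-clock time at which the STT emits a word."""
    return spoken_at_s + STT_DELAY_SEC
```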
3. LLM Response Generation
Key Timing:
- Time to first token (TTFT): 100-500ms (depends on model and GPU)
- LLM streaming: ~50-200ms per word
- Temperature: 0.7 (first message), 0.3 (subsequent messages)
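The two-tier temperature policy can be expressed as a one-line helper. llm_temperature is a hypothetical name for illustration, not the actual Unmute function.

```python
# Sketch of the temperature policy described above: higher temperature for
# the bot's first reply, lower for all subsequent ones.
def llm_temperature(assistant_turns_so_far: int) -> float:
    """Return 0.7 for the first assistant message, 0.3 afterwards."""
    return 0.7 if assistant_turns_so_far == 0 else 0.3
```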
4. TTS Audio Generation
Key Timing:
- TTS TTFT: 200-750ms (depends on GPU and setup)
- Audio buffer: 80ms ahead of real-time (AUDIO_BUFFER_SEC = 4 * 20ms)
- Text/audio synchronization: text released at the start_s timestamp
5. User Interruption
Key Features:
- Interruption cooldown: first 3s (VAD-based only)
- VAD threshold: pause_prediction < 0.4
- Interruption character: em-dash (—) stripped from the LLM context
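The VAD gate above reduces to a simple threshold check. This is a minimal sketch: the constant values come from the list above, but the function and variable names are illustrative, not Unmute's actual API.

```python
# The user counts as speaking (and may interrupt the bot) once the VAD's
# pause probability drops below the 0.4 threshold.
PAUSE_PREDICTION_THRESHOLD = 0.4
INTERRUPTION_COOLDOWN_SEC = 3.0  # first 3s: interruption is VAD-based only


def vad_says_user_speaking(pause_prediction: float) -> bool:
    """True when the VAD considers the user to be speaking."""
    return pause_prediction < PAUSE_PREDICTION_THRESHOLD
```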
Conversation States
The backend manages conversation flow through three states.

State Logic (unmute/llm/chatbot.py:21):
- waiting_for_user: last message is an empty user message
- user_speaking: last message is a non-empty user message
- bot_speaking: last message is an assistant message
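The three-state logic can be sketched as a pure function over the message history. The real implementation lives in unmute/llm/chatbot.py; the Message dataclass here is a stand-in type, not Unmute's.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str


def conversation_state(messages: list[Message]) -> str:
    """Derive the conversation state from the last message, per the rules above."""
    if messages and messages[-1].role == "assistant":
        return "bot_speaking"
    if messages and messages[-1].content:
        return "user_speaking"
    return "waiting_for_user"
```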
Timing Characteristics
Latency Breakdown (Typical)
| Stage | Duration | Notes |
|---|---|---|
| User finishes speaking | 0ms | Baseline |
| Pause detection | 100-500ms | VAD threshold crossing |
| STT flush | 2500ms | Configurable delay |
| LLM TTFT | 100-500ms | First word generated |
| TTS TTFT | 200-750ms | First audio chunk |
| Total to first audio | ~3-4s | End-to-end latency |
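Summing the per-stage ranges confirms the ~3-4s total. The values below are copied directly from the table; the dictionary is just a convenient way to do the arithmetic.

```python
# Per-stage latency ranges in milliseconds, taken from the table above.
stages_ms = {
    "pause_detection": (100, 500),
    "stt_flush": (2500, 2500),
    "llm_ttft": (100, 500),
    "tts_ttft": (200, 750),
}

best_case_ms = sum(lo for lo, _ in stages_ms.values())   # 2900 ms
worst_case_ms = sum(hi for _, hi in stages_ms.values())  # 4250 ms
# i.e. roughly 3-4s from end of user speech to first audio
```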
Throughput
- Backend: 4 concurrent sessions per instance (configurable via MAX_CLIENTS)
- STT: Multiple concurrent streams (capacity-based)
- TTS: Multiple concurrent streams (capacity-based)
- LLM: Batch processing via vLLM
Memory Usage
- LLM: 6.1 GB VRAM (Llama 3.2 1B)
- STT: 2.5 GB VRAM
- TTS: 5.3 GB VRAM
- Total: ~14 GB VRAM (single-GPU setup)
Real-Time Constraints
Audio Frames
Every 20ms:
- Frontend captures 480 samples
- Encodes to Opus
- Sends over WebSocket
- Backend decodes and forwards to STT
- STT processes and returns word/VAD
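The 20ms frame time follows directly from the sample rate and frame size listed above; a quick sanity check:

```python
# Frame math: 480 samples at 24kHz is exactly 20ms of audio.
SAMPLE_RATE_HZ = 24_000
SAMPLES_PER_FRAME = 480

FRAME_TIME_SEC = SAMPLES_PER_FRAME / SAMPLE_RATE_HZ       # 0.02s = 20ms
FRAMES_PER_SECOND = SAMPLE_RATE_HZ // SAMPLES_PER_FRAME   # 50 frames/s
```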
Output Synchronization
The backend carefully manages audio release timing:
- TTS generates audio faster than real-time (RTF > 1.0)
- Audio is queued with timestamps (RealtimeQueue)
- Chunks are released exactly at start_s - AUDIO_BUFFER_SEC
- This prevents stuttering and maintains sync
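The release rule can be sketched as a timestamp-ordered queue. This is a toy model, not the real RealtimeQueue: it only shows the "release at start_s - AUDIO_BUFFER_SEC" behavior described above.

```python
import heapq

FRAME_TIME_SEC = 0.02                   # 20ms per audio frame
AUDIO_BUFFER_SEC = 4 * FRAME_TIME_SEC   # 80ms lead over real time


class RealtimeQueueSketch:
    """Toy timestamp-ordered queue illustrating the release rule above."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, bytes]] = []

    def push(self, start_s: float, chunk: bytes) -> None:
        heapq.heappush(self._heap, (start_s, chunk))

    def pop_ready(self, now_s: float) -> list[bytes]:
        # A chunk is released once now_s >= start_s - AUDIO_BUFFER_SEC,
        # i.e. 80ms ahead of its nominal playback time.
        ready = []
        while self._heap and self._heap[0][0] - AUDIO_BUFFER_SEC <= now_s:
            ready.append(heapq.heappop(self._heap)[1])
        return ready
```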
Frame Time Constant

Defined in unmute/kyutai_constants.py:19-21.
Error Handling
Connection Failures
- Retry with exponential backoff (50ms → 75ms → 112ms…)
- Max retries: 5
- User notification after exhaustion
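The retry schedule above can be generated with a small helper. The 1.5x growth factor is an assumption inferred from the 50ms → 75ms → ~112ms sequence; the function name and signature are illustrative, not Unmute's API.

```python
# Exponential backoff sketch with an assumed 1.5x factor, matching the
# 50ms -> 75ms -> ~112ms schedule described above.
def backoff_delays_ms(base: float = 50.0, factor: float = 1.5,
                      max_retries: int = 5) -> list[float]:
    """Delay before each retry attempt, in milliseconds."""
    return [base * factor**attempt for attempt in range(max_retries)]
```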
Service Capacity
- STT/TTS return an Error message when at capacity
- Backend catches the MissingServiceAtCapacity exception
- Returns an error to the frontend with a retry suggestion
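The capacity error path can be sketched as follows. MissingServiceAtCapacity is named in the source; the exception body, the handle_service_connect helper, and the returned message shape are illustrative assumptions only.

```python
class MissingServiceAtCapacity(Exception):
    """Raised when an STT/TTS instance reports it cannot accept another stream."""


def handle_service_connect(at_capacity: bool) -> dict:
    """Sketch: catch the capacity exception and return a retriable error."""
    try:
        if at_capacity:
            raise MissingServiceAtCapacity("no free STT/TTS slots")
        return {"type": "connected"}
    except MissingServiceAtCapacity as exc:
        # Forwarded to the frontend with a suggestion to retry shortly.
        return {"type": "error", "message": str(exc), "retry": True}
```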
WebSocket Disconnects
- Clean shutdown via CloseStream
- Metrics recorded before disconnect
- Resources released (STT, TTS, LLM connections)
Metrics Collection
Prometheus metrics tracked throughout the flow:

Session Metrics:
- unmute_sessions_total
- unmute_active_sessions
- unmute_session_duration_seconds

STT Metrics:
- unmute_stt_ttft_seconds (time to first token)
- unmute_stt_sent_frames_total
- unmute_stt_recv_words_total

LLM Metrics:
- unmute_vllm_ttft_seconds
- unmute_vllm_request_length_words
- unmute_vllm_reply_length_words

TTS Metrics:
- unmute_tts_ttft_seconds
- unmute_tts_audio_duration_seconds
- unmute_tts_gen_duration_seconds

See unmute/metrics.py for the complete list.