Conversation Flow

A typical conversation in Unmute follows the sequence detailed below.

Detailed Message Flow

1. Connection Setup

Timing: Connection setup typically takes 100-300ms

2. User Speaking

Key Timing:
  • Audio frame: 20ms (480 samples @ 24kHz)
  • STT delay: ~2.5s (configurable via STT_DELAY_SEC)
  • Time to first word: ~50-100ms after first audio frame
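As a sanity check on the numbers above, the frame and delay arithmetic works out as follows (constant names mirror the ones quoted in this document; this is standalone arithmetic, not the real code):

```python
import math

SAMPLE_RATE = 24_000   # Hz
FRAME_TIME_SEC = 0.02  # 20 ms per audio frame
STT_DELAY_SEC = 2.5    # default STT flush delay quoted above

def samples_per_frame(sample_rate: int = SAMPLE_RATE,
                      frame_time_sec: float = FRAME_TIME_SEC) -> int:
    """PCM samples contained in one audio frame."""
    return round(sample_rate * frame_time_sec)

def frames_for_delay(delay_sec: float = STT_DELAY_SEC,
                     frame_time_sec: float = FRAME_TIME_SEC) -> int:
    """How many 20 ms frames span the STT delay window."""
    return math.ceil(delay_sec / frame_time_sec)

print(samples_per_frame())  # 480
print(frames_for_delay())   # 125
```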

3. LLM Response Generation

Key Timing:
  • Time to first token (TTFT): 100-500ms (depends on model and GPU)
  • LLM streaming: ~50-200ms per word
  • Temperature: 0.7 (first message), 0.3 (subsequent)
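The temperature schedule above can be expressed as a one-line helper (the function name is hypothetical; only the 0.7/0.3 values come from this document):

```python
def temperature_for_reply(assistant_replies_so_far: int) -> float:
    # The first assistant message is sampled more creatively (0.7);
    # subsequent replies stay more focused (0.3).
    return 0.7 if assistant_replies_so_far == 0 else 0.3
```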

4. TTS Audio Generation

Key Timing:
  • TTS TTFT: 200-750ms (depends on GPU and setup)
  • Audio buffer: 80ms ahead of real-time (AUDIO_BUFFER_SEC = 4 * 20ms)
  • Text/audio synchronization: Text released at start_s timestamp

5. User Interruption

Key Features:
  • Interruption cooldown: First 3s (VAD-based only)
  • VAD threshold: pause_prediction < 0.4
  • Interruption character: em-dash (—) stripped from LLM context
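A minimal sketch of the interruption gate described above. The `pause_prediction` threshold and the 3 s cooldown come from this document; the idea that a transcribed word can also trigger an interruption after the cooldown is an assumption added for illustration.

```python
VAD_THRESHOLD = 0.4           # pause_prediction below this => user is speaking
INTERRUPT_COOLDOWN_SEC = 3.0  # first 3 s of bot speech: VAD-based only

def may_interrupt(pause_prediction: float,
                  stt_emitted_word: bool,
                  seconds_since_bot_started: float) -> bool:
    vad_says_speaking = pause_prediction < VAD_THRESHOLD
    if seconds_since_bot_started < INTERRUPT_COOLDOWN_SEC:
        # Cooldown window: interruption is VAD-based only.
        return vad_says_speaking
    # After the cooldown, a transcribed word also counts (assumed).
    return vad_says_speaking or stt_emitted_word
```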

Conversation States

The backend manages conversation flow through three states, derived from the last message in the chat history (unmute/llm/chatbot.py:21):
  • waiting_for_user: the last message is an empty user message
  • user_speaking: the last message is a non-empty user message
  • bot_speaking: the last message is an assistant message
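The state derivation above can be sketched as a small function over the message list (the enum and function names here are illustrative, not the real code's):

```python
from enum import Enum

class ConversationState(Enum):
    WAITING_FOR_USER = "waiting_for_user"
    USER_SPEAKING = "user_speaking"
    BOT_SPEAKING = "bot_speaking"

def conversation_state(messages: list[dict]) -> ConversationState:
    """Derive the state from the last chat message, per the rules above."""
    last = messages[-1]
    if last["role"] == "assistant":
        return ConversationState.BOT_SPEAKING
    if last["content"]:
        return ConversationState.USER_SPEAKING
    return ConversationState.WAITING_FOR_USER
```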

Timing Characteristics

Latency Breakdown (Typical)

| Stage | Duration | Notes |
|---|---|---|
| User finishes speaking | 0 ms | Baseline |
| Pause detection | 100-500 ms | VAD threshold crossing |
| STT flush | 2500 ms | Configurable delay |
| LLM TTFT | 100-500 ms | First word generated |
| TTS TTFT | 200-750 ms | First audio chunk |
| Total to first audio | ~3-4 s | End-to-end latency |
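The total in the last row can be checked by summing the stage ranges (values from the table above; pure arithmetic, no project code):

```python
stages_ms = {
    "pause_detection": (100, 500),
    "stt_flush": (2500, 2500),
    "llm_ttft": (100, 500),
    "tts_ttft": (200, 750),
}
low_ms = sum(lo for lo, _ in stages_ms.values())
high_ms = sum(hi for _, hi in stages_ms.values())
print(f"Total to first audio: {low_ms / 1000:.1f}-{high_ms / 1000:.1f} s")
```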

Throughput

  • Backend: 4 concurrent sessions per instance (configurable via MAX_CLIENTS)
  • STT: Multiple concurrent streams (capacity-based)
  • TTS: Multiple concurrent streams (capacity-based)
  • LLM: Batch processing via VLLM

Memory Usage

  • LLM: 6.1 GB VRAM (Llama 3.2 1B)
  • STT: 2.5 GB VRAM
  • TTS: 5.3 GB VRAM
  • Total: ~14 GB VRAM (single-GPU setup)

Real-Time Constraints

Audio Frames

Every 20ms:
  1. Frontend captures 480 samples
  2. Encodes to Opus
  3. Sends over WebSocket
  4. Backend decodes and forwards to STT
  5. STT processes and returns word/VAD

Output Synchronization

The backend carefully manages audio release timing:
  • TTS generates audio faster than real-time (RTF > 1.0)
  • Audio queued with timestamps (RealtimeQueue)
  • Released exactly at start_s - AUDIO_BUFFER_SEC
  • Prevents stuttering and maintains sync
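A minimal sketch of that release logic, assuming a simple timestamp-ordered queue; `RealtimeQueue` here is a stand-in and does not reproduce the real class's API.

```python
import heapq

AUDIO_BUFFER_SEC = 4 * 0.02  # 80 ms, per the constants above

class RealtimeQueue:
    """Holds (start_s, chunk) pairs; releases each chunk 80 ms early."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, bytes]] = []

    def push(self, start_s: float, chunk: bytes) -> None:
        heapq.heappush(self._heap, (start_s, chunk))

    def pop_ready(self, now_s: float) -> list[bytes]:
        """Release chunks whose start_s - AUDIO_BUFFER_SEC has passed."""
        ready = []
        while self._heap and self._heap[0][0] - AUDIO_BUFFER_SEC <= now_s:
            ready.append(heapq.heappop(self._heap)[1])
        return ready
```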

Frame Time Constant

```python
FRAME_TIME_SEC = 0.02    # 20 ms
SAMPLE_RATE = 24000      # 24 kHz
SAMPLES_PER_FRAME = 480  # 20 ms * 24 kHz
```

Defined in unmute/kyutai_constants.py:19-21.

Error Handling

Connection Failures

  • Retry with exponential backoff (50ms → 75ms → 112ms…)
  • Max retries: 5
  • User notification after exhaustion
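The retry schedule above looks like exponential backoff with a 1.5x multiplier (50 → 75 → 112 ms); that factor is inferred from the numbers, not taken from the source code.

```python
def backoff_delays_ms(base_ms: float = 50.0, factor: float = 1.5,
                      max_retries: int = 5) -> list[int]:
    """Delay before each retry attempt, in milliseconds."""
    return [int(base_ms * factor ** i) for i in range(max_retries)]

print(backoff_delays_ms())  # [50, 75, 112, 168, 253]
```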

Service Capacity

  • STT/TTS return Error message when at capacity
  • Backend catches MissingServiceAtCapacity exception
  • Returns error to frontend with retry suggestion
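In code, the capacity path might look like this sketch. The exception name appears above; the error payload shape and helper names are assumptions for illustration.

```python
class MissingServiceAtCapacity(Exception):
    """Raised when STT or TTS reports it cannot take another stream."""

    def __init__(self, service: str):
        super().__init__(f"{service} is at capacity")
        self.service = service

def start_session(acquire_stt) -> dict:
    try:
        acquire_stt()
    except MissingServiceAtCapacity as exc:
        # Surface a retryable error to the frontend instead of crashing.
        return {"type": "error",
                "message": f"{exc.service} is at capacity; please retry in a moment"}
    return {"type": "session_started"}
```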

WebSocket Disconnects

  • Clean shutdown via CloseStream
  • Metrics recorded before disconnect
  • Resources released (STT, TTS, LLM connections)

Metrics Collection

Prometheus metrics are tracked throughout the flow.

Session Metrics:
  • unmute_sessions_total
  • unmute_active_sessions
  • unmute_session_duration_seconds
STT Metrics:
  • unmute_stt_ttft_seconds (time to first token)
  • unmute_stt_sent_frames_total
  • unmute_stt_recv_words_total
LLM Metrics:
  • unmute_vllm_ttft_seconds
  • unmute_vllm_request_length_words
  • unmute_vllm_reply_length_words
TTS Metrics:
  • unmute_tts_ttft_seconds
  • unmute_tts_audio_duration_seconds
  • unmute_tts_gen_duration_seconds
See unmute/metrics.py for complete list.
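For illustration, recording two of these metrics might look like the following dependency-free sketch (the real project presumably uses a Prometheus client library; the metric names come from the list above, but the classes here are minimal stand-ins):

```python
class Counter:
    """Minimal stand-in for a Prometheus counter."""

    def __init__(self, name: str, help_text: str):
        self.name, self.help_text, self.value = name, help_text, 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

class Histogram:
    """Minimal stand-in for a Prometheus histogram (keeps raw observations)."""

    def __init__(self, name: str, help_text: str):
        self.name, self.help_text = name, help_text
        self.observations: list[float] = []

    def observe(self, value: float) -> None:
        self.observations.append(value)

sessions_total = Counter("unmute_sessions_total", "Sessions started")
stt_ttft = Histogram("unmute_stt_ttft_seconds", "STT time to first token")

# Recording points in the flow:
sessions_total.inc()    # on session start
stt_ttft.observe(0.07)  # on first STT token, ~70 ms after first audio frame
```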
