Conversation Flow

A typical conversation in Unmute follows the sequence detailed below.

Detailed Message Flow

1. Connection Setup

Timing: Connection setup typically takes 100-300ms

2. User Speaking

Key Timing:
  • Audio frame: 20ms (480 samples @ 24kHz)
  • STT delay: ~2.5s (configurable via STT_DELAY_SEC)
  • Time to first word: ~50-100ms after first audio frame
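As a sanity check on the numbers above, the frame and delay arithmetic works out as follows (constant names mirror the ones quoted in this document; this is standalone arithmetic, not the real code):

```python
import math

SAMPLE_RATE = 24_000   # Hz
FRAME_TIME_SEC = 0.02  # 20 ms per audio frame
STT_DELAY_SEC = 2.5    # default STT flush delay quoted above

def samples_per_frame(sample_rate: int = SAMPLE_RATE,
                      frame_time_sec: float = FRAME_TIME_SEC) -> int:
    """PCM samples contained in one audio frame."""
    return round(sample_rate * frame_time_sec)

def frames_for_delay(delay_sec: float = STT_DELAY_SEC,
                     frame_time_sec: float = FRAME_TIME_SEC) -> int:
    """How many 20 ms frames span the STT delay window."""
    return math.ceil(delay_sec / frame_time_sec)

print(samples_per_frame())  # 480
print(frames_for_delay())   # 125
```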

3. LLM Response Generation

Key Timing:
  • Time to first token (TTFT): 100-500ms (depends on model and GPU)
  • LLM streaming: ~50-200ms per word
  • Temperature: 0.7 (first message), 0.3 (subsequent)
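The temperature schedule above can be expressed as a one-line helper (the function name is hypothetical; only the 0.7/0.3 values come from this document):

```python
def temperature_for_reply(assistant_replies_so_far: int) -> float:
    # The first assistant message is sampled more creatively (0.7);
    # subsequent replies stay more focused (0.3).
    return 0.7 if assistant_replies_so_far == 0 else 0.3
```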

4. TTS Audio Generation

Key Timing:
  • TTS TTFT: 200-750ms (depends on GPU and setup)
  • Audio buffer: 80ms ahead of real-time (AUDIO_BUFFER_SEC = 4 * 20ms)
  • Text/audio synchronization: Text released at start_s timestamp

5. User Interruption

Key Features:
  • Interruption cooldown: First 3s (VAD-based only)
  • VAD threshold: pause_prediction < 0.4
  • Interruption character: em-dash (—) stripped from LLM context
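A minimal sketch of the interruption gate described above. The `pause_prediction` threshold and the 3 s cooldown come from this document; the idea that a transcribed word can also trigger an interruption after the cooldown is an assumption added for illustration.

```python
VAD_THRESHOLD = 0.4           # pause_prediction below this => user is speaking
INTERRUPT_COOLDOWN_SEC = 3.0  # first 3 s of bot speech: VAD-based only

def may_interrupt(pause_prediction: float,
                  stt_emitted_word: bool,
                  seconds_since_bot_started: float) -> bool:
    vad_says_speaking = pause_prediction < VAD_THRESHOLD
    if seconds_since_bot_started < INTERRUPT_COOLDOWN_SEC:
        # Cooldown window: interruption is VAD-based only.
        return vad_says_speaking
    # After the cooldown, a transcribed word also counts (assumed).
    return vad_says_speaking or stt_emitted_word
```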

Conversation States

The backend manages conversation flow through three states, derived from the last message in the chat history (unmute/llm/chatbot.py:21):
  • waiting_for_user: the last message is an empty user message
  • user_speaking: the last message is a non-empty user message
  • bot_speaking: the last message is an assistant message
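The state derivation above can be sketched as a small function over the message list (the enum and function names here are illustrative, not the real code's):

```python
from enum import Enum

class ConversationState(Enum):
    WAITING_FOR_USER = "waiting_for_user"
    USER_SPEAKING = "user_speaking"
    BOT_SPEAKING = "bot_speaking"

def conversation_state(messages: list[dict]) -> ConversationState:
    """Derive the state from the last chat message, per the rules above."""
    last = messages[-1]
    if last["role"] == "assistant":
        return ConversationState.BOT_SPEAKING
    if last["content"]:
        return ConversationState.USER_SPEAKING
    return ConversationState.WAITING_FOR_USER
```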

Timing Characteristics

Latency Breakdown (Typical)

| Stage | Duration | Notes |
|---|---|---|
| User finishes speaking | 0 ms | Baseline |
| Pause detection | 100-500 ms | VAD threshold crossing |
| STT flush | 2500 ms | Configurable delay |
| LLM TTFT | 100-500 ms | First word generated |
| TTS TTFT | 200-750 ms | First audio chunk |
| Total to first audio | ~3-4 s | End-to-end latency |
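The total in the last row can be checked by summing the stage ranges (values from the table above; pure arithmetic, no project code):

```python
stages_ms = {
    "pause_detection": (100, 500),
    "stt_flush": (2500, 2500),
    "llm_ttft": (100, 500),
    "tts_ttft": (200, 750),
}
low_ms = sum(lo for lo, _ in stages_ms.values())
high_ms = sum(hi for _, hi in stages_ms.values())
print(f"Total to first audio: {low_ms / 1000:.1f}-{high_ms / 1000:.1f} s")
```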

Throughput

  • Backend: 4 concurrent sessions per instance (configurable via MAX_CLIENTS)
  • STT: Multiple concurrent streams (capacity-based)
  • TTS: Multiple concurrent streams (capacity-based)
  • LLM: Batch processing via VLLM

Memory Usage

  • LLM: 6.1 GB VRAM (Llama 3.2 1B)
  • STT: 2.5 GB VRAM
  • TTS: 5.3 GB VRAM
  • Total: ~14 GB VRAM (single-GPU setup)

Real-Time Constraints

Audio Frames

Every 20ms:
  1. Frontend captures 480 samples
  2. Encodes to Opus
  3. Sends over WebSocket
  4. Backend decodes and forwards to STT
  5. STT processes and returns word/VAD

Output Synchronization

The backend carefully manages audio release timing:
  • TTS generates audio faster than real-time (RTF > 1.0)
  • Audio queued with timestamps (RealtimeQueue)
  • Released exactly at start_s - AUDIO_BUFFER_SEC
  • Prevents stuttering and maintains sync
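A minimal sketch of that release logic, assuming a simple timestamp-ordered queue; `RealtimeQueue` here is a stand-in and does not reproduce the real class's API.

```python
import heapq

AUDIO_BUFFER_SEC = 4 * 0.02  # 80 ms, per the constants above

class RealtimeQueue:
    """Holds (start_s, chunk) pairs; releases each chunk 80 ms early."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, bytes]] = []

    def push(self, start_s: float, chunk: bytes) -> None:
        heapq.heappush(self._heap, (start_s, chunk))

    def pop_ready(self, now_s: float) -> list[bytes]:
        """Release chunks whose start_s - AUDIO_BUFFER_SEC has passed."""
        ready = []
        while self._heap and self._heap[0][0] - AUDIO_BUFFER_SEC <= now_s:
            ready.append(heapq.heappop(self._heap)[1])
        return ready
```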

Frame Time Constant

```python
FRAME_TIME_SEC = 0.02    # 20 ms
SAMPLE_RATE = 24000      # 24 kHz
SAMPLES_PER_FRAME = 480  # 20 ms * 24 kHz
```

Defined in unmute/kyutai_constants.py:19-21.

Error Handling

Connection Failures

  • Retry with exponential backoff (50ms → 75ms → 112ms…)
  • Max retries: 5
  • User notification after exhaustion
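The retry schedule above looks like exponential backoff with a 1.5x multiplier (50 → 75 → 112 ms); that factor is inferred from the numbers, not taken from the source code.

```python
def backoff_delays_ms(base_ms: float = 50.0, factor: float = 1.5,
                      max_retries: int = 5) -> list[int]:
    """Delay before each retry attempt, in milliseconds."""
    return [int(base_ms * factor ** i) for i in range(max_retries)]

print(backoff_delays_ms())  # [50, 75, 112, 168, 253]
```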

Service Capacity

  • STT/TTS return Error message when at capacity
  • Backend catches MissingServiceAtCapacity exception
  • Returns error to frontend with retry suggestion
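In code, the capacity path might look like this sketch. The exception name appears above; the error payload shape and helper names are assumptions for illustration.

```python
class MissingServiceAtCapacity(Exception):
    """Raised when STT or TTS reports it cannot take another stream."""

    def __init__(self, service: str):
        super().__init__(f"{service} is at capacity")
        self.service = service

def start_session(acquire_stt) -> dict:
    try:
        acquire_stt()
    except MissingServiceAtCapacity as exc:
        # Surface a retryable error to the frontend instead of crashing.
        return {"type": "error",
                "message": f"{exc.service} is at capacity; please retry in a moment"}
    return {"type": "session_started"}
```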

WebSocket Disconnects

  • Clean shutdown via CloseStream
  • Metrics recorded before disconnect
  • Resources released (STT, TTS, LLM connections)

Metrics Collection

Prometheus metrics are tracked throughout the flow.

Session Metrics:
  • unmute_sessions_total
  • unmute_active_sessions
  • unmute_session_duration_seconds
STT Metrics:
  • unmute_stt_ttft_seconds (time to first token)
  • unmute_stt_sent_frames_total
  • unmute_stt_recv_words_total
LLM Metrics:
  • unmute_vllm_ttft_seconds
  • unmute_vllm_request_length_words
  • unmute_vllm_reply_length_words
TTS Metrics:
  • unmute_tts_ttft_seconds
  • unmute_tts_audio_duration_seconds
  • unmute_tts_gen_duration_seconds
See unmute/metrics.py for complete list.
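For illustration, recording two of these metrics might look like the following dependency-free sketch (the real project presumably uses a Prometheus client library; the metric names come from the list above, but the classes here are minimal stand-ins):

```python
class Counter:
    """Minimal stand-in for a Prometheus counter."""

    def __init__(self, name: str, help_text: str):
        self.name, self.help_text, self.value = name, help_text, 0.0

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

class Histogram:
    """Minimal stand-in for a Prometheus histogram (keeps raw observations)."""

    def __init__(self, name: str, help_text: str):
        self.name, self.help_text = name, help_text
        self.observations: list[float] = []

    def observe(self, value: float) -> None:
        self.observations.append(value)

sessions_total = Counter("unmute_sessions_total", "Sessions started")
stt_ttft = Histogram("unmute_stt_ttft_seconds", "STT time to first token")

# Recording points in the flow:
sessions_total.inc()    # on session start
stt_ttft.observe(0.07)  # on first STT token, ~70 ms after first audio frame
```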
