Overview

Unmute’s performance can be optimized across multiple dimensions: Time-to-First-Token (TTFT) latency, overall throughput, and resource utilization. This guide covers tuning strategies based on the production deployment at unmute.sh.

Key Metrics

Unmute tracks several critical performance metrics (defined in unmute/metrics.py):

Latency Metrics

  • STT TTFT: Time to first token from Speech-to-Text (target: less than 50ms)
  • LLM TTFT: Time to first token from Language Model (target: less than 200ms)
  • TTS TTFT: Time to first audio from Text-to-Speech (target: less than 450ms)
  • Ping Time: WebSocket round-trip latency (target: less than 100ms)

Throughput Metrics

  • Active Sessions: Concurrent user connections
  • Words per Second: STT/TTS/LLM processing rates
  • Realtime Factor: TTS generation time relative to playback duration (target: less than 1.0, i.e. audio is generated faster than it plays back)
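The realtime factor is the ratio of the time spent generating audio to the audio's playback length. A minimal sketch (the function name is illustrative, not taken from the Unmute codebase):

```python
def realtime_factor(generation_time_sec: float, audio_duration_sec: float) -> float:
    """Ratio of time spent generating audio to the audio's playback length.

    Values below 1.0 mean the TTS produces audio faster than it plays back,
    which is required for glitch-free streaming.
    """
    return generation_time_sec / audio_duration_sec

# Generating 10 s of audio in 7.3 s gives a factor of ~0.73: comfortably realtime.
print(realtime_factor(7.3, 10.0))
```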

LLM Optimization

Model Selection

Default model: meta-llama/Llama-3.2-1B-Instruct (6.1GB VRAM)

Recommended alternatives:
llm:
  command:
    [
      # Smaller, faster (lower latency)
      "--model=meta-llama/Llama-3.2-1B-Instruct",
      
      # Better quality, higher latency
      # "--model=mistralai/Mistral-Small-3.2-24B-Instruct-2506",
      # "--model=google/gemma-3-12b-it",
    ]
Trade-offs:
  • Smaller models: Lower latency, less memory, reduced quality
  • Larger models: Better responses, higher latency, more memory

Memory Configuration

llm:
  command:
    [
      "--model=meta-llama/Llama-3.2-1B-Instruct",
      
      # Context window size - affects conversation length
      "--max-model-len=1536",  # Reduce for lower memory usage
      
      # Precision setting
      "--dtype=bfloat16",  # Best balance of speed and accuracy
      
      # GPU memory allocation
      "--gpu-memory-utilization=0.4",  # Increase for better throughput
    ]
Tuning guidelines:
  • --gpu-memory-utilization: Increase from 0.4 to 0.7-0.9 if GPU is dedicated to LLM
  • --max-model-len: Reduce from 1536 to 1024 for shorter conversations, lower memory
  • --dtype: Use bfloat16 for best performance (requires Ampere+ GPUs)
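A back-of-the-envelope way to reason about --max-model-len: the KV cache grows linearly with context length. The sketch below assumes the published Llama-3.2-1B dimensions (16 layers, 8 KV heads, head dim 64) and bfloat16 storage; these numbers are assumptions, not taken from the Unmute configs:

```python
def kv_cache_bytes(seq_len: int, layers: int = 16, kv_heads: int = 8,
                   head_dim: int = 64, dtype_bytes: int = 2) -> int:
    """Approximate per-sequence KV-cache size: two tensors (K and V) per
    layer, each of shape [kv_heads, seq_len, head_dim], in bfloat16."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

# ~48 MiB per sequence at --max-model-len=1536, ~32 MiB at 1024
print(kv_cache_bytes(1536) / 2**20)
print(kv_cache_bytes(1024) / 2**20)
```

This is why lowering --max-model-len frees memory that --gpu-memory-utilization can then devote to batching more concurrent sequences.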

Temperature Settings

Unmute uses different temperatures for varied responses (from unmute_handler.py):
FIRST_MESSAGE_TEMPERATURE = 0.7   # More creative conversation starters
FURTHER_MESSAGES_TEMPERATURE = 0.3  # More consistent follow-up responses
Lower temperature = more deterministic, more consistent responses.
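The selection between the two temperatures can be sketched as a simple function of how many assistant turns have already happened (the function name is illustrative; the actual logic in unmute_handler.py may differ):

```python
FIRST_MESSAGE_TEMPERATURE = 0.7    # More creative conversation starters
FURTHER_MESSAGES_TEMPERATURE = 0.3  # More consistent follow-up responses

def temperature_for_turn(assistant_turns_so_far: int) -> float:
    """Higher temperature for the opening message, lower for follow-ups."""
    if assistant_turns_so_far == 0:
        return FIRST_MESSAGE_TEMPERATURE
    return FURTHER_MESSAGES_TEMPERATURE
```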

TTS/STT Optimization

Service Configuration

TTS and STT services are configured via TOML files in services/moshi-server/configs/:
tts:
  command: ["worker", "--config", "configs/tts.toml"]
  
stt:
  command: ["worker", "--config", "configs/stt.toml"]

Volume Caching

Critical for fast startup and inference:
tts:
  volumes:
    - ./volumes/hf-cache:/root/.cache/huggingface
    - ./volumes/tts-target:/app/target  # Rust build cache
    - ./volumes/uv-cache:/root/.cache/uv  # Python package cache
    - /tmp/models/:/models  # Pre-downloaded models
Optimization: Pre-download models to /tmp/models/ to avoid startup delays.

Voice Cloning

Voice selection affects TTS latency. Voices are defined in voices.yaml:
- name: "Friendly AI"
  source:
    path_on_server: "voice-donations/Haku.wav"
  system_prompt: "You are a helpful AI assistant."
Use simpler voice samples for slightly faster processing.

Network Optimization

Audio Frame Size

From unmute_handler.py:
OUTPUT_FRAME_SIZE = 1920  # samples per frame
FRAME_TIME_SEC = 0.04     # 40ms frames
Note: Increasing OUTPUT_FRAME_SIZE may reduce per-frame overhead but increases latency. The default of 1920 samples is chosen to prevent choppy audio.
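Frame size, sample rate, and frame duration are linked, so they must be tuned together. A quick check (the 48 kHz figure is an assumption that makes 1920 samples come out to 40 ms; at the 24 kHz Opus sample rate the same frame would be 80 ms):

```python
def frame_duration_ms(frame_size_samples: int, sample_rate_hz: int) -> float:
    """Playback duration of one audio frame, in milliseconds."""
    return 1000.0 * frame_size_samples / sample_rate_hz

# 1920 samples is 40 ms at 48 kHz, but 80 ms at 24 kHz
print(frame_duration_ms(1920, 48_000))
print(frame_duration_ms(1920, 24_000))
```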

WebSocket Configuration

Unmute uses Opus encoding for efficient audio streaming:
writer = sphn.OpusStreamWriter(SAMPLE_RATE)  # 24kHz
reader = sphn.OpusStreamReader(SAMPLE_RATE)
Opus is a lossy codec, but it provides strong compression with minimal perceptible quality loss at speech bitrates.

Scaling Configuration

Docker Compose (Single Machine)

For development/small deployments:
services:
  backend:
    # Single instance
  
  tts:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1

Docker Swarm (Production)

For high-load production deployments:
backend:
  deploy:
    replicas: 16  # Horizontal scaling
    resources:
      limits:
        cpus: "1.5"
        memory: 1G

tts:
  deploy:
    replicas: 8  # Multiple TTS workers
    placement:
      max_replicas_per_node: 1  # One per GPU

llm:
  deploy:
    replicas: 8
    placement:
      max_replicas_per_node: 1
Service discovery: Backend uses ws://tasks.tts:8080 to discover and load-balance across TTS replicas.

Load Testing

Use the built-in load testing tool to measure performance:
uv run unmute/loadtest/loadtest_client.py \
  --server-url ws://localhost:8000 \
  --n-workers 16 \
  --n-conversations 100
Metrics collected:
  • STT, VAD, LLM, and TTS latencies
  • Realtime factors (generation speed vs playback)
  • Success/failure rates
  • Percentile distributions (p90, p95)

Example Output

{
  "stt_latencies": {
    "mean": 0.032,
    "median": 0.028,
    "p90": 0.045,
    "p95": 0.052
  },
  "tts_start_latencies": {
    "mean": 0.412,
    "median": 0.398,
    "p90": 0.487,
    "p95": 0.521
  },
  "tts_realtime_factors": {
    "mean": 0.73,
    "median": 0.71,
    "p90": 0.89,
    "p95": 0.94
  }
}
Target values:
  • STT latency p95: less than 100ms
  • TTS start latency p95: less than 500ms
  • TTS realtime factor p95: less than 1.0 (faster than realtime)

Interrupt Handling

From unmute_handler.py:
UNINTERRUPTIBLE_BY_VAD_TIME_SEC = 3  # Prevents echo cancellation issues
USER_SILENCE_TIMEOUT = 7.0  # User inactivity timeout
Tuning:
  • Reduce UNINTERRUPTIBLE_BY_VAD_TIME_SEC for more responsive interrupts (may cause echo issues)
  • Adjust USER_SILENCE_TIMEOUT based on expected conversation pacing

Monitoring Performance

Unmute exposes Prometheus metrics on the backend service:
backend:
  labels:
    - "prometheus-port=80"
Key metrics to monitor:
  • worker_active_sessions: Current load
  • worker_stt_ttft: STT time-to-first-token distribution
  • worker_tts_ttft: TTS time-to-first-token distribution
  • worker_vllm_ttft: LLM time-to-first-token distribution
  • worker_tts_interrupt: Interrupt frequency
See Monitoring for full Prometheus/Grafana setup.

Production Optimizations

Based on unmute.sh deployment:

1. Multi-GPU Setup

  • Separate GPUs for STT, TTS, and LLM
  • Result: 40% latency reduction (750ms → 450ms)

2. Horizontal Scaling

  • 16 backend replicas for WebSocket handling
  • 8 TTS replicas for audio generation
  • 8 LLM replicas for text generation

3. Caching Strategy

  • Persistent volumes for model caches
  • Redis for session state (optional)
  • Pre-warmed model instances

4. Resource Limits

backend:
  deploy:
    resources:
      limits:
        cpus: "1.5"  # Prevents runaway containers
        memory: 1G

Common Bottlenecks

High TTS Latency

Causes:
  • Shared GPU with LLM
  • Large model contexts
  • Network latency
Solutions:
  • Dedicate GPU to TTS service
  • Reduce LLM --max-model-len
  • Use multi-GPU configuration

LLM Timeout

Causes:
  • Large context window
  • Complex prompts
  • GPU memory pressure
Solutions:
  • Reduce --max-model-len
  • Increase --gpu-memory-utilization
  • Use smaller/faster model

Poor Throughput

Causes:
  • Single backend instance
  • Insufficient GPU replicas
  • CPU bottlenecks
Solutions:
  • Scale backend replicas
  • Add more GPU nodes
  • Use Docker Swarm for horizontal scaling
