Overview

Unmute’s performance can be optimized across multiple dimensions: Time-to-First-Token (TTFT) latency, overall throughput, and resource utilization. This guide covers tuning strategies based on the production deployment at unmute.sh.

Key Metrics

Unmute tracks several critical performance metrics (defined in unmute/metrics.py):

Latency Metrics

  • STT TTFT: Time to first token from Speech-to-Text (target: less than 50ms)
  • LLM TTFT: Time to first token from Language Model (target: less than 200ms)
  • TTS TTFT: Time to first audio from Text-to-Speech (target: less than 450ms)
  • Ping Time: WebSocket round-trip latency (target: less than 100ms)

Throughput Metrics

  • Active Sessions: Concurrent user connections
  • Words per Second: STT/TTS/LLM processing rates
  • Realtime Factor: TTS generation time relative to playback duration (target: less than 1.0, i.e. audio is generated faster than it plays back)
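The realtime factor is the ratio of the time spent generating audio to the audio's playback length. A minimal sketch (the function name is illustrative, not taken from the Unmute codebase):

```python
def realtime_factor(generation_time_sec: float, audio_duration_sec: float) -> float:
    """Ratio of time spent generating audio to the audio's playback length.

    Values below 1.0 mean the TTS produces audio faster than it plays back,
    which is required for glitch-free streaming.
    """
    return generation_time_sec / audio_duration_sec

# Generating 10 s of audio in 7.3 s gives a factor of ~0.73: comfortably realtime.
print(realtime_factor(7.3, 10.0))
```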

LLM Optimization

Model Selection

Default model: meta-llama/Llama-3.2-1B-Instruct (6.1GB VRAM)

Recommended alternatives:
llm:
  command:
    [
      # Smaller, faster (lower latency)
      "--model=meta-llama/Llama-3.2-1B-Instruct",
      
      # Better quality, higher latency
      # "--model=mistralai/Mistral-Small-3.2-24B-Instruct-2506",
      # "--model=google/gemma-3-12b-it",
    ]
Trade-offs:
  • Smaller models: Lower latency, less memory, reduced quality
  • Larger models: Better responses, higher latency, more memory

Memory Configuration

llm:
  command:
    [
      "--model=meta-llama/Llama-3.2-1B-Instruct",
      
      # Context window size - affects conversation length
      "--max-model-len=1536",  # Reduce for lower memory usage
      
      # Precision setting
      "--dtype=bfloat16",  # Best balance of speed and accuracy
      
      # GPU memory allocation
      "--gpu-memory-utilization=0.4",  # Increase for better throughput
    ]
Tuning guidelines:
  • --gpu-memory-utilization: Increase from 0.4 to 0.7-0.9 if GPU is dedicated to LLM
  • --max-model-len: Reduce from 1536 to 1024 for shorter conversations, lower memory
  • --dtype: Use bfloat16 for best performance (requires Ampere+ GPUs)
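A back-of-the-envelope way to reason about --max-model-len: the KV cache grows linearly with context length. The sketch below assumes the published Llama-3.2-1B dimensions (16 layers, 8 KV heads, head dim 64) and bfloat16 storage; these numbers are assumptions, not taken from the Unmute configs:

```python
def kv_cache_bytes(seq_len: int, layers: int = 16, kv_heads: int = 8,
                   head_dim: int = 64, dtype_bytes: int = 2) -> int:
    """Approximate per-sequence KV-cache size: two tensors (K and V) per
    layer, each of shape [kv_heads, seq_len, head_dim], in bfloat16."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

# ~48 MiB per sequence at --max-model-len=1536, ~32 MiB at 1024
print(kv_cache_bytes(1536) / 2**20)
print(kv_cache_bytes(1024) / 2**20)
```

This is why lowering --max-model-len frees memory that --gpu-memory-utilization can then devote to batching more concurrent sequences.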

Temperature Settings

Unmute uses different temperatures for varied responses (from unmute_handler.py):
FIRST_MESSAGE_TEMPERATURE = 0.7   # More creative conversation starters
FURTHER_MESSAGES_TEMPERATURE = 0.3  # More consistent follow-up responses
Lower temperature = more deterministic, more consistent responses.
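The selection between the two temperatures can be sketched as a simple function of how many assistant turns have already happened (the function name is illustrative; the actual logic in unmute_handler.py may differ):

```python
FIRST_MESSAGE_TEMPERATURE = 0.7    # More creative conversation starters
FURTHER_MESSAGES_TEMPERATURE = 0.3  # More consistent follow-up responses

def temperature_for_turn(assistant_turns_so_far: int) -> float:
    """Higher temperature for the opening message, lower for follow-ups."""
    if assistant_turns_so_far == 0:
        return FIRST_MESSAGE_TEMPERATURE
    return FURTHER_MESSAGES_TEMPERATURE
```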

TTS/STT Optimization

Service Configuration

TTS and STT services are configured via TOML files in services/moshi-server/configs/:
tts:
  command: ["worker", "--config", "configs/tts.toml"]
  
stt:
  command: ["worker", "--config", "configs/stt.toml"]

Volume Caching

Critical for fast startup and inference:
tts:
  volumes:
    - ./volumes/hf-cache:/root/.cache/huggingface
    - ./volumes/tts-target:/app/target  # Rust build cache
    - ./volumes/uv-cache:/root/.cache/uv  # Python package cache
    - /tmp/models/:/models  # Pre-downloaded models
Optimization: Pre-download models to /tmp/models/ to avoid startup delays.

Voice Cloning

Voice selection affects TTS latency. Voices are defined in voices.yaml:
- name: "Friendly AI"
  source:
    path_on_server: "voice-donations/Haku.wav"
  system_prompt: "You are a helpful AI assistant."
Use simpler voice samples for slightly faster processing.

Network Optimization

Audio Frame Size

From unmute_handler.py:
OUTPUT_FRAME_SIZE = 1920  # samples per frame
FRAME_TIME_SEC = 0.04     # 40ms frames
Note: Increasing OUTPUT_FRAME_SIZE may reduce per-frame overhead but increases latency. The default of 1920 samples is chosen to prevent choppy audio.
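Frame size, sample rate, and frame duration are linked, so they must be tuned together. A quick check (the 48 kHz figure is an assumption that makes 1920 samples come out to 40 ms; at the 24 kHz Opus sample rate the same frame would be 80 ms):

```python
def frame_duration_ms(frame_size_samples: int, sample_rate_hz: int) -> float:
    """Playback duration of one audio frame, in milliseconds."""
    return 1000.0 * frame_size_samples / sample_rate_hz

# 1920 samples is 40 ms at 48 kHz, but 80 ms at 24 kHz
print(frame_duration_ms(1920, 48_000))
print(frame_duration_ms(1920, 24_000))
```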

WebSocket Configuration

Unmute uses Opus encoding for efficient audio streaming:
writer = sphn.OpusStreamWriter(SAMPLE_RATE)  # 24kHz
reader = sphn.OpusStreamReader(SAMPLE_RATE)
Opus is a lossy codec, but it provides strong compression with minimal perceptible quality loss at speech bitrates.

Scaling Configuration

Docker Compose (Single Machine)

For development/small deployments:
services:
  backend:
    # Single instance
  
  tts:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1

Docker Swarm (Production)

For high-load production deployments:
backend:
  deploy:
    replicas: 16  # Horizontal scaling
    resources:
      limits:
        cpus: "1.5"
        memory: 1G

tts:
  deploy:
    replicas: 8  # Multiple TTS workers
    placement:
      max_replicas_per_node: 1  # One per GPU

llm:
  deploy:
    replicas: 8
    placement:
      max_replicas_per_node: 1
Service discovery: Backend uses ws://tasks.tts:8080 to discover and load-balance across TTS replicas.

Load Testing

Use the built-in load testing tool to measure performance:
uv run unmute/loadtest/loadtest_client.py \
  --server-url ws://localhost:8000 \
  --n-workers 16 \
  --n-conversations 100
Metrics collected:
  • STT, VAD, LLM, and TTS latencies
  • Realtime factors (generation speed vs playback)
  • Success/failure rates
  • Percentile distributions (p90, p95)

Example Output

{
  "stt_latencies": {
    "mean": 0.032,
    "median": 0.028,
    "p90": 0.045,
    "p95": 0.052
  },
  "tts_start_latencies": {
    "mean": 0.412,
    "median": 0.398,
    "p90": 0.487,
    "p95": 0.521
  },
  "tts_realtime_factors": {
    "mean": 0.73,
    "median": 0.71,
    "p90": 0.89,
    "p95": 0.94
  }
}
Target values:
  • STT latency p95: less than 100ms
  • TTS start latency p95: less than 500ms
  • TTS realtime factor p95: less than 1.0 (faster than realtime)

Interrupt Handling

From unmute_handler.py:
UNINTERRUPTIBLE_BY_VAD_TIME_SEC = 3  # Prevents echo cancellation issues
USER_SILENCE_TIMEOUT = 7.0  # User inactivity timeout
Tuning:
  • Reduce UNINTERRUPTIBLE_BY_VAD_TIME_SEC for more responsive interrupts (may cause echo issues)
  • Adjust USER_SILENCE_TIMEOUT based on expected conversation pacing

Monitoring Performance

Unmute exposes Prometheus metrics on the backend service:
backend:
  labels:
    - "prometheus-port=80"
Key metrics to monitor:
  • worker_active_sessions: Current load
  • worker_stt_ttft: STT time-to-first-token distribution
  • worker_tts_ttft: TTS time-to-first-token distribution
  • worker_vllm_ttft: LLM time-to-first-token distribution
  • worker_tts_interrupt: Interrupt frequency
See Monitoring for full Prometheus/Grafana setup.

Production Optimizations

Based on unmute.sh deployment:

1. Multi-GPU Setup

  • Separate GPUs for STT, TTS, and LLM
  • Result: 40% latency reduction (750ms → 450ms)

2. Horizontal Scaling

  • 16 backend replicas for WebSocket handling
  • 8 TTS replicas for audio generation
  • 8 LLM replicas for text generation

3. Caching Strategy

  • Persistent volumes for model caches
  • Redis for session state (optional)
  • Pre-warmed model instances

4. Resource Limits

backend:
  deploy:
    resources:
      limits:
        cpus: "1.5"  # Prevents runaway containers
        memory: 1G

Common Bottlenecks

High TTS Latency

Causes:
  • Shared GPU with LLM
  • Large model contexts
  • Network latency
Solutions:
  • Dedicate GPU to TTS service
  • Reduce LLM --max-model-len
  • Use multi-GPU configuration

LLM Timeout

Causes:
  • Large context window
  • Complex prompts
  • GPU memory pressure
Solutions:
  • Reduce --max-model-len
  • Increase --gpu-memory-utilization
  • Use smaller/faster model

Poor Throughput

Causes:
  • Single backend instance
  • Insufficient GPU replicas
  • CPU bottlenecks
Solutions:
  • Scale backend replicas
  • Add more GPU nodes
  • Use Docker Swarm for horizontal scaling
