Overview
Unmute’s performance can be optimized across multiple dimensions: Time-to-First-Token (TTFT) latency, overall throughput, and resource utilization. This guide covers tuning strategies based on the production deployment at unmute.sh.

Key Metrics
Unmute tracks several critical performance metrics (defined in unmute/metrics.py):
Latency Metrics
- STT TTFT: Time to first token from Speech-to-Text (target: less than 50ms)
- LLM TTFT: Time to first token from Language Model (target: less than 200ms)
- TTS TTFT: Time to first audio from Text-to-Speech (target: less than 450ms)
- Ping Time: WebSocket round-trip latency (target: less than 100ms)
Throughput Metrics
- Active Sessions: Concurrent user connections
- Words per Second: STT/TTS/LLM processing rates
- Realtime Factor: TTS generation speed vs playback speed (target: less than 1.0)
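The realtime factor is simply generation time divided by playback time. As a quick sketch (not Unmute's actual code):

```python
def realtime_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Ratio of TTS generation time to the playback duration of the audio.

    Below 1.0 means audio is produced faster than it plays back,
    which is what gapless streaming requires.
    """
    return generation_seconds / audio_seconds

# Generating 3.0 s of audio in 1.2 s of wall time is comfortably realtime.
factor = realtime_factor(generation_seconds=1.2, audio_seconds=3.0)
```

A factor climbing toward 1.0 under load is an early warning that TTS capacity is exhausted, before users hear gaps.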
LLM Optimization
Model Selection
Default model: meta-llama/Llama-3.2-1B-Instruct (6.1GB VRAM)
Recommended alternatives:
- Smaller models: Lower latency, less memory, reduced quality
- Larger models: Better responses, higher latency, more memory
Memory Configuration
- --gpu-memory-utilization: Increase from 0.4 to 0.7-0.9 if the GPU is dedicated to the LLM
- --max-model-len: Reduce from 1536 to 1024 for shorter conversations and lower memory use
- --dtype: Use bfloat16 for best performance (requires Ampere+ GPUs)
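Putting those flags together, a vLLM launch for a GPU dedicated to the LLM might look like this (the flag values are illustrative, not the deployment's actual settings):

```shell
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --gpu-memory-utilization 0.8 \
  --max-model-len 1024 \
  --dtype bfloat16
```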
Temperature Settings
Unmute uses different sampling temperatures for different kinds of responses (see unmute_handler.py).
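The idea can be sketched as a lookup from response type to temperature. The mode names and values below are hypothetical; the real settings live in unmute_handler.py:

```python
# Hypothetical response types and temperatures, for illustration only.
TEMPERATURE_BY_RESPONSE_TYPE = {
    "chat_reply": 0.7,  # some variety in conversational answers
    "greeting": 1.0,    # more randomness so openings don't repeat
}

def temperature_for(response_type: str, default: float = 0.7) -> float:
    """Look up the sampling temperature for a response type."""
    return TEMPERATURE_BY_RESPONSE_TYPE.get(response_type, default)
```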
TTS/STT Optimization
Service Configuration
TTS and STT services are configured via TOML files in services/moshi-server/configs/.
Volume Caching
Model caching is critical for fast startup and inference: mount a persistent volume at /tmp/models/ so model weights are reused across restarts and startup delays are avoided.
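A docker-compose sketch of the cache (the service and volume names are illustrative):

```yaml
services:
  tts:
    volumes:
      - models:/tmp/models   # reuse downloaded weights across restarts
volumes:
  models:                    # named volume persists beyond container lifetime
```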
Voice Cloning
Voice selection affects TTS latency. Voices are defined in voices.yaml.
Network Optimization
Audio Frame Size
From unmute_handler.py: increasing OUTPUT_FRAME_SIZE may reduce per-frame overhead but increases latency. The default of 480 samples is optimized to prevent choppy audio.
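To see the trade-off, one frame's playback duration is frame size divided by sample rate. A 24 kHz output rate is assumed here; verify against the actual audio configuration:

```python
SAMPLE_RATE_HZ = 24_000  # assumed output sample rate; check the audio config
OUTPUT_FRAME_SIZE = 480  # the documented default

def frame_duration_ms(frame_size: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Playback duration of one audio frame, in milliseconds."""
    return frame_size / sample_rate * 1000

# 480 samples at 24 kHz is a 20 ms frame; doubling the frame size halves the
# per-frame overhead but adds another 20 ms of buffering latency.
duration = frame_duration_ms(OUTPUT_FRAME_SIZE)
```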
WebSocket Configuration
Unmute uses Opus encoding for efficient audio streaming.

Scaling Configuration
Docker Compose (Single Machine)
For development and small deployments, Docker Compose runs all services on a single machine.

Docker Swarm (Production)
For high-load production deployments, use Docker Swarm. The backend connects to ws://tasks.tts:8080; Swarm's tasks DNS name resolves to every replica, so it can discover and load-balance across TTS replicas.
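A Swarm stack sketch of that pattern (the environment variable name is an assumption; check the backend's actual configuration):

```yaml
services:
  tts:
    deploy:
      replicas: 8            # Swarm resolves tasks.tts to all 8 task IPs
  backend:
    environment:
      # Hypothetical variable name; the backend opens WebSocket connections
      # to this URL, and DNS spreads them across the TTS replicas.
      - KYUTAI_TTS_URL=ws://tasks.tts:8080
```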
Load Testing
Use the built-in load testing tool to measure performance. It reports:
- STT, VAD, LLM, and TTS latencies
- Realtime factors (generation speed vs playback)
- Success/failure rates
- Percentile distributions (p90, p95)
Example Output
- STT latency p95: less than 100ms
- TTS start latency p95: less than 500ms
- TTS realtime factor p95: less than 1.0 (faster than realtime)
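Percentiles like the p90/p95 above can be computed from raw latency samples with the standard library; a minimal sketch:

```python
from statistics import quantiles

def p90_p95(samples_ms: list[float]) -> tuple[float, float]:
    """p90 and p95 of a batch of latency samples, in milliseconds."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return cuts[89], cuts[94]  # 90th and 95th percentiles

latencies = [40, 42, 45, 50, 55, 60, 70, 80, 95, 120]
p90, p95 = p90_p95(latencies)
```

Tail percentiles, not averages, are what matter here: a fast mean with a slow p95 still means one user in twenty gets a laggy response.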
Interrupt Handling
From unmute_handler.py:
- Reduce UNINTERRUPTIBLE_BY_VAD_TIME_SEC for more responsive interrupts (may cause echo issues)
- Adjust USER_SILENCE_TIMEOUT based on expected conversation pacing
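A minimal sketch of the gating idea, with illustrative constant values (the real ones are defined in unmute_handler.py):

```python
# Illustrative values; the real constants live in unmute_handler.py.
UNINTERRUPTIBLE_BY_VAD_TIME_SEC = 1.0  # ignore VAD interrupts early in a TTS turn

def should_interrupt(seconds_into_tts_turn: float, vad_detected_speech: bool) -> bool:
    """Allow a VAD-triggered interrupt only after the uninterruptible window.

    Early in a turn the VAD may pick up the assistant's own audio (echo)
    and misread it as the user speaking, so interrupts are suppressed.
    """
    return vad_detected_speech and seconds_into_tts_turn >= UNINTERRUPTIBLE_BY_VAD_TIME_SEC
```

Shrinking the window makes barge-in snappier but widens the echo false-positive risk the source warns about.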
Monitoring Performance
Unmute exposes Prometheus metrics on the backend service:
- worker_active_sessions: Current load
- worker_stt_ttft: STT time-to-first-token distribution
- worker_tts_ttft: TTS time-to-first-token distribution
- worker_vllm_ttft: LLM time-to-first-token distribution
- worker_tts_interrupt: Interrupt frequency
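Assuming the TTFT metrics are exported as Prometheus histograms, a p95 dashboard panel for TTS TTFT could use a query along these lines:

```
histogram_quantile(0.95, sum(rate(worker_tts_ttft_bucket[5m])) by (le))
```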
Production Optimizations
Based on the unmute.sh deployment:

1. Multi-GPU Setup
- Separate GPUs for STT, TTS, and LLM
- Result: 40% latency reduction (750ms → 450ms)
2. Horizontal Scaling
- 16 backend replicas for WebSocket handling
- 8 TTS replicas for audio generation
- 8 LLM replicas for text generation
3. Caching Strategy
- Persistent volumes for model caches
- Redis for session state (optional)
- Pre-warmed model instances
4. Resource Limits
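A docker-compose sketch of per-service limits (the values are illustrative; size them to your hardware):

```yaml
services:
  backend:
    deploy:
      resources:
        limits:
          cpus: "2.0"      # cap CPU so one service can't starve the others
          memory: 4G
        reservations:
          memory: 2G       # guarantee a floor so the scheduler places it safely
```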
Common Bottlenecks
High TTS Latency
Causes:
- Shared GPU with the LLM
- Large model contexts
- Network latency

Solutions:
- Dedicate a GPU to the TTS service
- Reduce the LLM's --max-model-len
- Use a multi-GPU configuration
LLM Timeout
Causes:
- Large context window
- Complex prompts
- GPU memory pressure

Solutions:
- Reduce --max-model-len
- Increase --gpu-memory-utilization
- Use a smaller/faster model
Poor Throughput
Causes:
- Single backend instance
- Insufficient GPU replicas
- CPU bottlenecks

Solutions:
- Scale backend replicas
- Add more GPU nodes
- Use Docker Swarm for horizontal scaling
Next Steps
- Multi-GPU Setup - Configure multiple GPUs
- Monitoring - Track performance metrics
- Debugging - Troubleshoot performance issues