## Overview
Unmute can leverage multiple GPUs by distributing its workload across the Speech-to-Text (STT), Text-to-Speech (TTS), and Language Model (LLM) services. Running these services on separate GPUs significantly improves latency compared to a single-GPU setup.

**Performance improvement:** On production deployments such as unmute.sh, running the services on separate GPUs reduces TTS latency from ~750ms (single L40S GPU) to ~450ms (multi-GPU setup).

## GPU Memory Requirements
Each service requires a specific amount of GPU memory:

- STT (Speech-to-Text): 2.5GB VRAM
- TTS (Text-to-Speech): 5.3GB VRAM
- LLM (Language Model): 6.1GB VRAM (for Llama-3.2-1B)
## Docker Compose Configuration
### Default Configuration (All GPUs)
By default, `docker-compose.yml` allocates all available GPUs to each service:
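A sketch of what such a service definition typically looks like in Compose (the service name and surrounding fields are illustrative; check the actual `docker-compose.yml` for exact values):

```yaml
services:
  tts:
    # ... image, ports, environment, etc. ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all        # every service sees every GPU
              capabilities: [gpu]
```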
### Multi-GPU Configuration (Recommended)
For systems with 3+ GPUs, configure each service to use a dedicated GPU. Modify the `stt`, `tts`, and `llm` service definitions in `docker-compose.yml`:
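A hedged sketch of the relevant `deploy` block, shown for the `stt` service (repeat the same block for `tts` and `llm`; adjust to match the actual file):

```yaml
services:
  stt:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1          # reserve exactly one GPU for this service
              capabilities: [gpu]
  # tts and llm get the same deploy block, so each claims its own GPU
```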
With `count: 1`, Docker Swarm automatically distributes the services across the available GPUs.
## GPU Memory Optimization
If you run into memory issues on a single GPU, adjust these parameters in `docker-compose.yml`:
### LLM Memory Settings
- `--max-model-len`: maximum conversation length in tokens. Lower values use less memory but allow shorter conversations.
- `--gpu-memory-utilization`: fraction of GPU memory to use (0.4 = 40%). Lower values leave room for other services.
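Both flags are passed on the LLM service's command line. A sketch of how they might appear in the `llm` service definition (the model name and flag values below are illustrative, not taken from the repository):

```yaml
services:
  llm:
    # Keep your existing image and model; only the two memory flags matter here
    command: >
      --model meta-llama/Llama-3.2-1B-Instruct
      --max-model-len 8192
      --gpu-memory-utilization 0.4
```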
## Verifying GPU Usage
Check GPU allocation with `nvidia-smi`:
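For example, to see a per-GPU memory summary and which processes are pinned to which device (both are standard `nvidia-smi` invocations):

```shell
# Compact per-GPU summary: index, name, and memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv

# Full view, including the process list mapped to each GPU
nvidia-smi
```

Each of the three services should show up against a different GPU index in the process list.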
## Docker Swarm Multi-Node Setup
For production deployments across multiple machines, Docker Swarm provides advanced GPU scheduling. See the `swarm-deploy.yml` configuration, which supports:

- Horizontal scaling across multiple GPU nodes
- Automatic failover and load balancing
- Independent scaling of each service
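The specifics live in `swarm-deploy.yml`; as a generic sketch, Swarm-mode GPU reservation is usually expressed with `generic_resources` per replica (the node label and resource kind below are assumptions, not values from the repository):

```yaml
deploy:
  replicas: 2
  placement:
    constraints:
      - node.labels.gpu == true   # assumed label marking GPU nodes
  resources:
    reservations:
      generic_resources:
        - discrete_resource_spec:
            kind: "NVIDIA-GPU"    # must match the kind advertised by the node's Docker daemon
            value: 1
```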
## Troubleshooting
### Service fails to start
**Error:** `could not select device driver "nvidia" with capabilities: [[gpu]]`
**Solution:** Install the NVIDIA Container Toolkit:
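A typical install on Ubuntu/Debian, following NVIDIA's documented steps (see NVIDIA's Container Toolkit guide for other distributions):

```shell
# Add NVIDIA's package repository and signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit and register it with Docker
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```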
### Out of memory errors
**Symptoms:** container crashes or `CUDA out of memory` errors
**Solutions:**

- Use a smaller LLM model (e.g., Llama-3.2-1B instead of larger models)
- Reduce `--gpu-memory-utilization` for the LLM service
- Lower `--max-model-len` to reduce the context window
- Add more GPUs and use the multi-GPU configuration
### Poor performance despite multiple GPUs

**Check:**

- Verify each service is on a different GPU with `nvidia-smi`
- Ensure `count: 1` is set in the deploy configuration (not `count: all`)
- Monitor GPU utilization; underutilized GPUs may indicate a bottleneck elsewhere
## Next Steps
- Performance Tuning - Optimize latency and throughput
- Monitoring - Track GPU metrics with Prometheus and Grafana
- Debugging - Troubleshoot GPU-related issues