
Overview

Unmute can leverage multiple GPUs by distributing the workload across its Speech-to-Text (STT), Text-to-Speech (TTS), and Language Model (LLM) services. Running these services on separate GPUs significantly reduces latency compared to a single-GPU setup.

Performance improvement: on production deployments like unmute.sh, moving the services onto separate GPUs reduces TTS latency from ~750ms (single L40S GPU) to ~450ms (multi-GPU setup).

GPU Memory Requirements

Each service requires specific GPU memory:
  • STT (Speech-to-Text): 2.5GB VRAM
  • TTS (Text-to-Speech): 5.3GB VRAM
  • LLM (Language Model): 6.1GB VRAM (for Llama-3.2-1B)
Total recommended: at least 16GB of VRAM (the three services can share a single GPU), or 3+ GPUs for optimal performance.
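Before choosing a layout, it helps to confirm what each card actually has free. nvidia-smi's query mode prints one line per GPU (requires the NVIDIA driver on the host, so this is only runnable on a GPU machine):

```shell
# Per-GPU memory summary: index, model, total and currently used VRAM
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv
```

If the free memory on your largest card is below ~16GB, plan for the multi-GPU configuration described below.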

Docker Compose Configuration

Default Configuration (All GPUs)

By default, docker-compose.yml allocates all available GPUs to each service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
This works well for single-GPU setups but doesn't take advantage of multi-GPU systems.

Dedicated GPU per Service

For systems with 3+ GPUs, configure each service to use a dedicated GPU. Modify the stt, tts, and llm service definitions in docker-compose.yml:
stt:
  image: moshi-server:latest
  command: ["worker", "--config", "configs/stt.toml"]
  # ... other configuration ...
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

tts:
  image: moshi-server:latest
  command: ["worker", "--config", "configs/tts.toml"]
  # ... other configuration ...
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

llm:
  image: vllm/vllm-openai:v0.11.0
  command:
    [
      "--model=meta-llama/Llama-3.2-1B-Instruct",
      "--max-model-len=1536",
      "--dtype=bfloat16",
      "--gpu-memory-utilization=0.4",
    ]
  # ... other configuration ...
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
With count: 1, each service reserves a single GPU. Under Docker Swarm, the scheduler distributes services across the available GPUs automatically; plain Docker Compose does not guarantee that each service lands on a distinct device.
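If you run plain docker compose rather than Swarm, the Compose specification's device_ids field pins a service to a specific device instead of letting the runtime choose. A sketch for the tts service (the GPU indices are illustrative; note that count and device_ids are mutually exclusive):

```yaml
tts:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["1"]   # pin TTS to GPU 1; e.g. "0" for stt, "2" for llm
            capabilities: [gpu]
```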

GPU Memory Optimization

If you run into memory issues on a single GPU, adjust these parameters in docker-compose.yml:

LLM Memory Settings

llm:
  command:
    [
      "--model=meta-llama/Llama-3.2-1B-Instruct",
      # Reduce max conversation length
      "--max-model-len=1536",  # Lower this value (default: 1536)
      "--dtype=bfloat16",
      # Reduce GPU memory allocation
      "--gpu-memory-utilization=0.4",  # Lower this value (range: 0.0-1.0)
    ]
Parameters:
  • --max-model-len: Maximum conversation length in tokens. Lower values = less memory but shorter conversations
  • --gpu-memory-utilization: Percentage of GPU memory to use (0.4 = 40%). Lower values leave room for other services
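As a rough illustration of how --gpu-memory-utilization translates into actual VRAM (the 24GB card size here is an assumption for the example, not a requirement):

```shell
# Estimate how much VRAM vLLM will claim for itself
total_mib=24576      # illustrative: a 24 GB card
utilization=0.4      # the --gpu-memory-utilization value
reserved=$(awk -v t="$total_mib" -v u="$utilization" 'BEGIN { printf "%.0f", t * u }')
echo "vLLM reserves ~${reserved} MiB"   # ~9830 MiB
```

Whatever vLLM reserves is unavailable to the STT and TTS services on the same card, which is why 0.4 rather than the vLLM default of 0.9 is used here.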

Verifying GPU Usage

Check GPU allocation with nvidia-smi:
# On the host machine
nvidia-smi

# Inside a container
docker exec -it <container_name> nvidia-smi
You should see separate processes for STT, TTS, and LLM services on different GPUs (if configured for multi-GPU).
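For a per-process view that maps each service directly to a GPU, nvidia-smi's compute-apps query is more compact than the full table (again, this needs the NVIDIA driver, so run it on the GPU host):

```shell
# One CSV line per GPU compute process: GPU UUID, PID, binary, VRAM used
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv
```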

Docker Swarm Multi-Node Setup

For production deployments across multiple machines, Docker Swarm provides advanced GPU scheduling. See the swarm-deploy.yml configuration:
tts:
  deploy:
    replicas: 8  # Multiple replicas for load balancing
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
    placement:
      constraints:
        - node.labels.gpu==true  # Only nodes with GPU label
      max_replicas_per_node: 1   # One replica per node
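The node.labels.gpu==true constraint only matches nodes that have been labeled ahead of time. On a Swarm manager (the node name gpu-node-1 is a placeholder; substitute your own hostnames):

```shell
# Label a worker so the placement constraint can match it
docker node update --label-add gpu=true gpu-node-1

# Confirm the label took effect
docker node inspect gpu-node-1 --format '{{ .Spec.Labels }}'
```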
Benefits:
  • Horizontal scaling across multiple GPU nodes
  • Automatic failover and load balancing
  • Independent scaling of each service
For full swarm deployment instructions, see SWARM.md.

Troubleshooting

Service fails to start

Error: could not select device driver "nvidia" with capabilities: [[gpu]]

Solution: Install the NVIDIA Container Toolkit, then verify that containers can see the GPU:
# Verify installation
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
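If the verification command fails, the toolkit is likely missing. On Debian/Ubuntu hosts it is typically installed like this (assumes NVIDIA's apt repository is already configured; see NVIDIA's install guide for other distributions):

```shell
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and reload
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```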

Out of memory errors

Symptoms: Container crashes or CUDA out-of-memory errors

Solutions:
  1. Use a smaller LLM model (e.g., Llama-3.2-1B instead of larger models)
  2. Reduce --gpu-memory-utilization for the LLM service
  3. Lower --max-model-len to reduce context window
  4. Add more GPUs and use the multi-GPU configuration

Poor performance despite multiple GPUs

Check:
  1. Verify each service is on a different GPU with nvidia-smi
  2. Ensure count: 1 is set in deploy configuration (not count: all)
  3. Monitor GPU utilization; underutilized GPUs may indicate bottlenecks elsewhere
