## Overview
Unmute can leverage multiple GPUs by distributing its workload across the Speech-to-Text (STT), Text-to-Speech (TTS), and Language Model (LLM) services. Running these services on separate GPUs significantly improves latency compared to a single-GPU setup.

**Performance improvement:** On production deployments such as unmute.sh, running the services on separate GPUs reduces TTS latency from ~750ms (single L40S GPU) to ~450ms (multi-GPU setup).

## GPU Memory Requirements
Each service requires a specific amount of GPU memory:

- STT (Speech-to-Text): 2.5GB VRAM
- TTS (Text-to-Speech): 5.3GB VRAM
- LLM (Language Model): 6.1GB VRAM (for Llama-3.2-1B)
## Docker Compose Configuration
### Default Configuration (All GPUs)
By default, `docker-compose.yml` allocates all available GPUs to each service:
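A sketch of what such a service definition typically looks like in Compose (the service name and surrounding fields are illustrative; check the actual `docker-compose.yml` for exact values):

```yaml
services:
  tts:
    # ... image, ports, environment, etc. ...
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all        # every service sees every GPU
              capabilities: [gpu]
```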
### Multi-GPU Configuration (Recommended)
For systems with 3+ GPUs, configure each service to use a dedicated GPU. Modify the `stt`, `tts`, and `llm` service definitions in `docker-compose.yml`:
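A hedged sketch of the relevant `deploy` block, shown for the `stt` service (repeat the same block for `tts` and `llm`; adjust to match the actual file):

```yaml
services:
  stt:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1          # reserve exactly one GPU for this service
              capabilities: [gpu]
  # tts and llm get the same deploy block, so each claims its own GPU
```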
With `count: 1`, Docker Swarm automatically distributes the services across the available GPUs.
## GPU Memory Optimization
If you run into memory issues on a single GPU, adjust these parameters in `docker-compose.yml`:
### LLM Memory Settings
- `--max-model-len`: maximum conversation length in tokens. Lower values use less memory but allow shorter conversations.
- `--gpu-memory-utilization`: fraction of GPU memory to use (0.4 = 40%). Lower values leave room for other services.
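Both flags are passed on the LLM service's command line. A sketch of how they might appear in the `llm` service definition (the model name and flag values below are illustrative, not taken from the repository):

```yaml
services:
  llm:
    # Keep your existing image and model; only the two memory flags matter here
    command: >
      --model meta-llama/Llama-3.2-1B-Instruct
      --max-model-len 8192
      --gpu-memory-utilization 0.4
```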
## Verifying GPU Usage
Check GPU allocation with `nvidia-smi`:
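For example, to see a per-GPU memory summary and which processes are pinned to which device (both are standard `nvidia-smi` invocations):

```shell
# Compact per-GPU summary: index, name, and memory usage
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv

# Full view, including the process list mapped to each GPU
nvidia-smi
```

Each of the three services should show up against a different GPU index in the process list.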
## Docker Swarm Multi-Node Setup
For production deployments across multiple machines, Docker Swarm provides advanced GPU scheduling. See the `swarm-deploy.yml` configuration, which supports:

- Horizontal scaling across multiple GPU nodes
- Automatic failover and load balancing
- Independent scaling of each service
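The specifics live in `swarm-deploy.yml`; as a generic sketch, Swarm-mode GPU reservation is usually expressed with `generic_resources` per replica (the node label and resource kind below are assumptions, not values from the repository):

```yaml
deploy:
  replicas: 2
  placement:
    constraints:
      - node.labels.gpu == true   # assumed label marking GPU nodes
  resources:
    reservations:
      generic_resources:
        - discrete_resource_spec:
            kind: "NVIDIA-GPU"    # must match the kind advertised by the node's Docker daemon
            value: 1
```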
## Troubleshooting
### Service fails to start
**Error:** `could not select device driver "nvidia" with capabilities: [[gpu]]`
**Solution:** Install the NVIDIA Container Toolkit:
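A typical install on Ubuntu/Debian, following NVIDIA's documented steps (see NVIDIA's Container Toolkit guide for other distributions):

```shell
# Add NVIDIA's package repository and signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit and register it with Docker
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```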
### Out of memory errors
**Symptoms:** container crashes or `CUDA out of memory` errors
**Solutions:**

- Use a smaller LLM model (e.g., Llama-3.2-1B instead of larger models)
- Reduce `--gpu-memory-utilization` for the LLM service
- Lower `--max-model-len` to reduce the context window
- Add more GPUs and use the multi-GPU configuration
### Poor performance despite multiple GPUs

**Check:**

- Verify each service is on a different GPU with `nvidia-smi`
- Ensure `count: 1` is set in the deploy configuration (not `count: all`)
- Monitor GPU utilization; underutilized GPUs may indicate a bottleneck elsewhere
## Next Steps
- Performance Tuning - Optimize latency and throughput
- Monitoring - Track GPU metrics with Prometheus and Grafana
- Debugging - Troubleshoot GPU-related issues