Unmute requires GPU acceleration to run the speech-to-text, text-to-speech, and LLM models with acceptable latency. This guide covers GPU requirements and configuration.

Hardware Requirements

Minimum Requirements

  • GPU: NVIDIA GPU with CUDA support
  • VRAM: At least 16 GB
  • Architecture: x86_64 (aarch64 is not supported)

Memory Usage by Service

When running all services, approximate VRAM usage:
Service                 VRAM Required
Speech-to-text (STT)    2.5 GB
Text-to-speech (TTS)    5.3 GB
LLM (Llama 3.2 1B)      6.1 GB
Total                   ~14 GB
The default docker-compose.yml uses Llama 3.2 1B, which fits in 16 GB of VRAM. Larger models, such as Mistral Small 3.2 24B, require correspondingly more VRAM.
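As a rough sanity check, the VRAM budget can be verified with a short script. The per-service figures are the approximate numbers from the table above; the headroom value is an illustrative assumption, not something Unmute enforces:

```python
# Rough VRAM budgeting for the default Unmute stack.
# Per-service figures are the approximate numbers from the table above.
SERVICES_GB = {
    "stt": 2.5,  # Speech-to-text
    "tts": 5.3,  # Text-to-speech
    "llm": 6.1,  # Llama 3.2 1B
}

def fits(available_gb: float, headroom_gb: float = 1.0) -> bool:
    """Return True if all services plus some headroom fit in VRAM."""
    needed = sum(SERVICES_GB.values()) + headroom_gb
    return needed <= available_gb

print(f"Total needed: {sum(SERVICES_GB.values()):.1f} GB")  # Total needed: 13.9 GB
print(fits(16.0))  # True on a 16 GB card
print(fits(12.0))  # False: too little VRAM for all three services
```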

Operating System Support

Windows: Native Windows is not supported (#84). Use WSL (Windows Subsystem for Linux) instead.
macOS: Not supported (#74). macOS does not have NVIDIA GPU support.
Supported platforms:
  • Linux (native)
  • Windows with WSL 2

Docker Setup

Install NVIDIA Container Toolkit

The NVIDIA Container Toolkit allows Docker containers to access your GPU.
1. Install the Container Toolkit

Follow the official NVIDIA Container Toolkit installation guide. For Ubuntu/Debian (note: apt-key is deprecated on recent Ubuntu/Debian releases; if the commands below fail, use the keyring-based instructions from the official guide):
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
2. Verify the Installation

Test that Docker can access your GPU:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
You should see output showing your GPU(s), similar to:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L40S         Off  | 00000000:00:05.0 Off |                    0 |
| N/A   28C    P0    32W / 350W |      0MiB / 46068MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Configure GPU Access in Docker Compose

The docker-compose.yml file configures GPU access for the AI services:
services:
  tts:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all  # Use all available GPUs
              capabilities: [gpu]
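To confirm this stanza works before starting the full stack, one option is a throwaway one-off service. The `gpu-test` name and `ubuntu` image below are illustrative choices, not part of Unmute's shipped compose file:

```yaml
services:
  gpu-test:              # illustrative one-off service, not part of Unmute
    image: ubuntu
    command: nvidia-smi  # the toolkit injects nvidia-smi into the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Running `docker compose run --rm gpu-test` should print the same nvidia-smi table as the earlier `docker run` check.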

Multi-GPU Configuration

Running services on separate GPUs significantly improves latency. On unmute.sh, TTS latency decreases from ~750ms (single L40S GPU) to ~450ms (multi-GPU setup).

Single GPU Setup (Default)

By default, all services share GPU(s) using count: all:
tts:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

Dedicated GPU per Service

If you have 3+ GPUs, assign one GPU to each service for optimal performance:
1. Check Available GPUs

List your GPUs:
nvidia-smi -L
Example output:
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
GPU 2: NVIDIA L40S
2. Update docker-compose.yml

Modify the stt, tts, and llm services to use dedicated GPUs:
stt:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1  # Changed from 'all' to '1'
            capabilities: [gpu]

tts:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

llm:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
3. Restart Services

Apply the changes:
docker compose down
docker compose up --build
Docker will automatically distribute services across available GPUs when using count: 1. You don’t need to manually specify device IDs.
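If you do want explicit pinning, Docker Compose supports selecting a specific card with `device_ids` in place of `count` (the two options are mutually exclusive). The GPU index below assumes the three-GPU layout shown by `nvidia-smi -L` above:

```yaml
stt:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["0"]   # pin STT to GPU 0; use device_ids OR count, not both
            capabilities: [gpu]
```

Repeat with `"1"` and `"2"` for the tts and llm services to fix the assignment across restarts.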

Memory Optimization

If you’re running out of GPU memory, adjust these settings in docker-compose.yml:

LLM Memory Settings

llm:
  command:
    - "--model=meta-llama/Llama-3.2-1B-Instruct"
    # Reduce max context length to save memory
    - "--max-model-len=1536"  # Lower this value
    - "--dtype=bfloat16"
    # Reduce GPU memory usage percentage
    - "--gpu-memory-utilization=0.4"  # Lower this value (e.g., 0.3)
--max-model-len (integer, default: 1536)
Maximum context length for the LLM. Lower values use less memory but support shorter conversations.

--gpu-memory-utilization (float, default: 0.4)
Fraction of GPU memory to allocate (0.0-1.0). Lower values leave more memory for other services.
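To see what `--gpu-memory-utilization` means in absolute terms, a quick calculation helps. The 46068 MiB figure is the L40S total from the sample nvidia-smi output above; substitute your card's total:

```python
# What --gpu-memory-utilization allocates on a given card.
# 46068 MiB is the NVIDIA L40S total from the sample nvidia-smi output above.
def llm_allocation_mib(total_mib: int, utilization: float) -> int:
    """MiB of VRAM the LLM service will reserve at the given utilization."""
    return int(total_mib * utilization)

print(llm_allocation_mib(46068, 0.4))  # 18427 MiB reserved for the LLM
print(llm_allocation_mib(46068, 0.3))  # 13820 MiB: lowering it frees ~4.6 GB
```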

Switch to a Smaller Model

Use a smaller LLM model:
llm:
  command:
    - "--model=meta-llama/Llama-3.2-1B-Instruct"  # Smaller model
Available models (by size):
  • meta-llama/Llama-3.2-1B-Instruct - ~6 GB VRAM
  • google/gemma-3-1b-it - ~6 GB VRAM (note: slower on vLLM)
  • google/gemma-3-12b-it - ~12 GB VRAM
  • mistralai/Mistral-Small-3.2-24B-Instruct-2506 - ~24 GB VRAM
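As a rule of thumb, pick the largest model from the list above that still leaves room for the STT and TTS services. A small sketch of that check, using the approximate VRAM estimates from this page (they are estimates, not guarantees):

```python
# Which of the listed models fit alongside STT (~2.5 GB) and TTS (~5.3 GB)?
# VRAM figures are the approximate estimates from the list above.
MODELS = [
    ("meta-llama/Llama-3.2-1B-Instruct", 6),
    ("google/gemma-3-1b-it", 6),
    ("google/gemma-3-12b-it", 12),
    ("mistralai/Mistral-Small-3.2-24B-Instruct-2506", 24),
]
OTHER_SERVICES_GB = 2.5 + 5.3  # STT + TTS

def models_fitting(total_vram_gb: float) -> list[str]:
    """Return the listed models that fit next to STT and TTS."""
    budget = total_vram_gb - OTHER_SERVICES_GB
    return [name for name, gb in MODELS if gb <= budget]

print(models_fitting(16))  # only the two 1B models fit on a 16 GB card
print(models_fitting(48))  # all four fit on a 48 GB card
```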

Dockerless Setup

For dockerless deployment, ensure CUDA 12.1+ is installed:
1. Install CUDA

Install CUDA 12.1 or later:
  • Via conda: conda install cuda -c nvidia/label/cuda-12.1.0
  • Or download from NVIDIA’s website
2. Verify Installation

nvcc --version
nvidia-smi
3. Run Services

The dockerless scripts automatically detect and use available GPUs:
./dockerless/start_stt.sh   # Uses 2.5GB VRAM
./dockerless/start_tts.sh   # Uses 5.3GB VRAM
./dockerless/start_llm.sh   # Uses 6.1GB VRAM

Troubleshooting

If nvidia-smi works but Docker can’t access the GPU:
  1. Verify NVIDIA Container Toolkit is installed
  2. Restart Docker: sudo systemctl restart docker
  3. Check Docker runtime: docker info | grep -i runtime
  4. Try the verification command again:
    sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
    
If services crash with OOM errors:
  1. Check VRAM usage: nvidia-smi
  2. Reduce --gpu-memory-utilization for the LLM
  3. Lower --max-model-len for shorter conversations
  4. Use a smaller LLM model
  5. Stop other GPU-intensive applications
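When diagnosing OOM, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` gives machine-readable numbers. A small sketch that parses that output to flag nearly-full cards; the sample string mimics one busy and one idle L40S and is purely illustrative:

```python
# Parse `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`
# output and flag GPUs that are nearly full.
def nearly_full_gpus(csv_output: str, threshold: float = 0.9) -> list[int]:
    """Return indices of GPUs whose used/total VRAM ratio exceeds threshold."""
    full = []
    for idx, line in enumerate(csv_output.strip().splitlines()):
        used, total = (int(x) for x in line.split(","))
        if used / total > threshold:
            full.append(idx)
    return full

sample = "44000, 46068\n1024, 46068"  # GPU 0 nearly full, GPU 1 mostly free
print(nearly_full_gpus(sample))  # [0]
```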
For Windows WSL users:
  1. Ensure you’re using WSL 2 (not WSL 1)
  2. Update to the latest NVIDIA driver for Windows
  3. Install NVIDIA CUDA on WSL following Microsoft’s guide
  4. Don’t install NVIDIA drivers inside WSL - use the Windows driver
