Unmute requires GPU acceleration to run the speech-to-text, text-to-speech, and LLM models with acceptable latency. This guide covers GPU requirements and configuration.

Hardware Requirements

Minimum Requirements

  • GPU: NVIDIA GPU with CUDA support
  • VRAM: At least 16 GB
  • Architecture: x86_64 (aarch64 is not supported)

Memory Usage by Service

When running all services, approximate VRAM usage:
Service                 VRAM Required
Speech-to-text (STT)    2.5 GB
Text-to-speech (TTS)    5.3 GB
LLM (Llama 3.2 1B)      6.1 GB
Total                   ~14 GB
The default docker-compose.yml uses Llama 3.2 1B, which fits in 16 GB of VRAM. Larger models, such as Mistral Small 3.2 24B, require correspondingly more VRAM.
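As a rough sanity check, the VRAM budget can be verified with a short script. The per-service figures are the approximate numbers from the table above; the headroom value is an illustrative assumption, not something Unmute enforces:

```python
# Rough VRAM budgeting for the default Unmute stack.
# Per-service figures are the approximate numbers from the table above.
SERVICES_GB = {
    "stt": 2.5,  # Speech-to-text
    "tts": 5.3,  # Text-to-speech
    "llm": 6.1,  # Llama 3.2 1B
}

def fits(available_gb: float, headroom_gb: float = 1.0) -> bool:
    """Return True if all services plus some headroom fit in VRAM."""
    needed = sum(SERVICES_GB.values()) + headroom_gb
    return needed <= available_gb

print(f"Total needed: {sum(SERVICES_GB.values()):.1f} GB")  # Total needed: 13.9 GB
print(fits(16.0))  # True on a 16 GB card
print(fits(12.0))  # False: too little VRAM for all three services
```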

Operating System Support

Windows: Native Windows is not supported (#84). Use WSL (Windows Subsystem for Linux) instead.
macOS: Not supported (#74). macOS does not have NVIDIA GPU support.
Supported platforms:
  • Linux (native)
  • Windows with WSL 2

Docker Setup

Install NVIDIA Container Toolkit

The NVIDIA Container Toolkit allows Docker containers to access your GPU.
1. Install the Container Toolkit

Follow the official NVIDIA Container Toolkit installation guide. For Ubuntu/Debian (note: apt-key is deprecated on recent Ubuntu/Debian releases; if the commands below fail, use the keyring-based instructions from the official guide):
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
2. Verify the Installation

Test that Docker can access your GPU:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
You should see output showing your GPU(s), similar to:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L40S         Off  | 00000000:00:05.0 Off |                    0 |
| N/A   28C    P0    32W / 350W |      0MiB / 46068MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Configure GPU Access in Docker Compose

The docker-compose.yml file configures GPU access for the AI services:
services:
  tts:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all  # Use all available GPUs
              capabilities: [gpu]
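To confirm this stanza works before starting the full stack, one option is a throwaway one-off service. The `gpu-test` name and `ubuntu` image below are illustrative choices, not part of Unmute's shipped compose file:

```yaml
services:
  gpu-test:              # illustrative one-off service, not part of Unmute
    image: ubuntu
    command: nvidia-smi  # the toolkit injects nvidia-smi into the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Running `docker compose run --rm gpu-test` should print the same nvidia-smi table as the earlier `docker run` check.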

Multi-GPU Configuration

Running services on separate GPUs significantly improves latency. On unmute.sh, TTS latency decreases from ~750ms (single L40S GPU) to ~450ms (multi-GPU setup).

Single GPU Setup (Default)

By default, all services share GPU(s) using count: all:
tts:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

Dedicated GPU per Service

If you have 3+ GPUs, assign one GPU to each service for optimal performance:
1. Check Available GPUs

List your GPUs:
nvidia-smi -L
Example output:
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
GPU 2: NVIDIA L40S
2. Update docker-compose.yml

Modify the stt, tts, and llm services to use dedicated GPUs:
stt:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1  # Changed from 'all' to '1'
            capabilities: [gpu]

tts:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

llm:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
3. Restart Services

Apply the changes:
docker compose down
docker compose up --build
Docker will automatically distribute services across available GPUs when using count: 1. You don’t need to manually specify device IDs.
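If you do want explicit pinning, Docker Compose supports selecting a specific card with `device_ids` in place of `count` (the two options are mutually exclusive). The GPU index below assumes the three-GPU layout shown by `nvidia-smi -L` above:

```yaml
stt:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["0"]   # pin STT to GPU 0; use device_ids OR count, not both
            capabilities: [gpu]
```

Repeat with `"1"` and `"2"` for the tts and llm services to fix the assignment across restarts.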

Memory Optimization

If you’re running out of GPU memory, adjust these settings in docker-compose.yml:

LLM Memory Settings

llm:
  command:
    - "--model=meta-llama/Llama-3.2-1B-Instruct"
    # Reduce max context length to save memory
    - "--max-model-len=1536"  # Lower this value
    - "--dtype=bfloat16"
    # Reduce GPU memory usage percentage
    - "--gpu-memory-utilization=0.4"  # Lower this value (e.g., 0.3)
--max-model-len (integer, default: 1536)
Maximum context length for the LLM. Lower values use less memory but support shorter conversations.

--gpu-memory-utilization (float, default: 0.4)
Fraction of GPU memory to allocate (0.0-1.0). Lower values leave more memory for other services.
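To see what `--gpu-memory-utilization` means in absolute terms, a quick calculation helps. The 46068 MiB figure is the L40S total from the sample nvidia-smi output above; substitute your card's total:

```python
# What --gpu-memory-utilization allocates on a given card.
# 46068 MiB is the NVIDIA L40S total from the sample nvidia-smi output above.
def llm_allocation_mib(total_mib: int, utilization: float) -> int:
    """MiB of VRAM the LLM service will reserve at the given utilization."""
    return int(total_mib * utilization)

print(llm_allocation_mib(46068, 0.4))  # 18427 MiB reserved for the LLM
print(llm_allocation_mib(46068, 0.3))  # 13820 MiB: lowering it frees ~4.6 GB
```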

Switch to a Smaller Model

Use a smaller LLM model:
llm:
  command:
    - "--model=meta-llama/Llama-3.2-1B-Instruct"  # Smaller model
Available models (by size):
  • meta-llama/Llama-3.2-1B-Instruct - ~6 GB VRAM
  • google/gemma-3-1b-it - ~6 GB VRAM (note: slower on vLLM)
  • google/gemma-3-12b-it - ~12 GB VRAM
  • mistralai/Mistral-Small-3.2-24B-Instruct-2506 - ~24 GB VRAM
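As a rule of thumb, pick the largest model from the list above that still leaves room for the STT and TTS services. A small sketch of that check, using the approximate VRAM estimates from this page (they are estimates, not guarantees):

```python
# Which of the listed models fit alongside STT (~2.5 GB) and TTS (~5.3 GB)?
# VRAM figures are the approximate estimates from the list above.
MODELS = [
    ("meta-llama/Llama-3.2-1B-Instruct", 6),
    ("google/gemma-3-1b-it", 6),
    ("google/gemma-3-12b-it", 12),
    ("mistralai/Mistral-Small-3.2-24B-Instruct-2506", 24),
]
OTHER_SERVICES_GB = 2.5 + 5.3  # STT + TTS

def models_fitting(total_vram_gb: float) -> list[str]:
    """Return the listed models that fit next to STT and TTS."""
    budget = total_vram_gb - OTHER_SERVICES_GB
    return [name for name, gb in MODELS if gb <= budget]

print(models_fitting(16))  # only the two 1B models fit on a 16 GB card
print(models_fitting(48))  # all four fit on a 48 GB card
```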

Dockerless Setup

For dockerless deployment, ensure CUDA 12.1+ is installed:
1. Install CUDA

Install CUDA 12.1 or later:
  • Via conda: conda install cuda -c nvidia/label/cuda-12.1.0
  • Or download from NVIDIA’s website
2. Verify Installation

nvcc --version
nvidia-smi
3. Run Services

The dockerless scripts automatically detect and use available GPUs:
./dockerless/start_stt.sh   # Uses 2.5GB VRAM
./dockerless/start_tts.sh   # Uses 5.3GB VRAM
./dockerless/start_llm.sh   # Uses 6.1GB VRAM

Troubleshooting

If nvidia-smi works but Docker can’t access the GPU:
  1. Verify NVIDIA Container Toolkit is installed
  2. Restart Docker: sudo systemctl restart docker
  3. Check Docker runtime: docker info | grep -i runtime
  4. Try the verification command again:
    sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
    
If services crash with OOM errors:
  1. Check VRAM usage: nvidia-smi
  2. Reduce --gpu-memory-utilization for the LLM
  3. Lower --max-model-len for shorter conversations
  4. Use a smaller LLM model
  5. Stop other GPU-intensive applications
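When diagnosing OOM, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` gives machine-readable numbers. A small sketch that parses that output to flag nearly-full cards; the sample string mimics one busy and one idle L40S and is purely illustrative:

```python
# Parse `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`
# output and flag GPUs that are nearly full.
def nearly_full_gpus(csv_output: str, threshold: float = 0.9) -> list[int]:
    """Return indices of GPUs whose used/total VRAM ratio exceeds threshold."""
    full = []
    for idx, line in enumerate(csv_output.strip().splitlines()):
        used, total = (int(x) for x in line.split(","))
        if used / total > threshold:
            full.append(idx)
    return full

sample = "44000, 46068\n1024, 46068"  # GPU 0 nearly full, GPU 1 mostly free
print(nearly_full_gpus(sample))  # [0]
```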
For Windows WSL users:
  1. Ensure you’re using WSL 2 (not WSL 1)
  2. Update to the latest NVIDIA driver for Windows
  3. Install NVIDIA CUDA on WSL following Microsoft’s guide
  4. Don’t install NVIDIA drivers inside WSL - use the Windows driver
