Requirements
Hardware
- GPU: CUDA-compatible GPU with at least 16 GB VRAM
- Architecture: x86_64 (aarch64 not supported)
- OS: Linux, or Windows with WSL (see Microsoft's WSL installation instructions)
Software
Setup Instructions
Verify NVIDIA Container Toolkit
Test that Docker can access your GPU:

This should display your GPU information. If it fails, install the NVIDIA Container Toolkit first.
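A typical check looks like the following (the CUDA image tag is just an example; any image that ships nvidia-smi works):

```shell
# Should print the same table as running nvidia-smi on the host.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```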
Get Hugging Face Access Token
Unmute uses gated models from Hugging Face that require authentication.
- Create a Hugging Face account
- Accept the conditions on the Mistral Small 3.2 24B model page
- Create an access token with “Read access to contents of all public gated repos you can access”
- Add the token to your environment:
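For example (the token below is a placeholder; use your real token from huggingface.co/settings/tokens):

```shell
# Placeholder value -- replace with your actual Hugging Face token
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```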
Add the export line to your ~/.bashrc or equivalent to persist it across sessions.

Verify Environment Variable
Confirm the token is set, e.g. with echo $HF_TOKEN. This should print your token (starting with hf_).

Configuration
Adjusting GPU Memory
The default configuration uses Llama-3.2-1B-Instruct, which requires about 16 GB of GPU memory. If you’re running into memory issues, check the NOTE: comments in docker-compose.yml for adjustable parameters:
docker-compose.yml
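A sketch of the kind of parameters those comments point at (values are illustrative, not the file's actual defaults):

```yaml
services:
  llm:
    command:
      - "--model=meta-llama/Llama-3.2-1B-Instruct"
      - "--max-model-len=8192"          # NOTE: lower this on smaller GPUs
      - "--gpu-memory-utilization=0.3"  # NOTE: fraction of VRAM the LLM may claim
```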
Using Multiple GPUs
On Unmute.sh, services run on separate GPUs, improving TTS latency from ~750ms (single L40S GPU) to ~450ms (multi-GPU setup). If you have at least three GPUs available, add this configuration to the stt, tts, and llm services:
docker-compose.yml
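One way to pin each service to its own GPU is Compose device reservations; a sketch (device indices are illustrative):

```yaml
services:
  stt:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]   # pin STT to GPU 0
              capabilities: [gpu]
  # Repeat for tts (device_ids: ["1"]) and llm (device_ids: ["2"])
```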
Changing the LLM
To use a different model, modify the --model parameter:
docker-compose.yml
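For example, switching the llm service to Gemma 3 12B might look like this (the command layout is a sketch; google/gemma-3-12b-it is the model's Hugging Face ID):

```yaml
services:
  llm:
    command:
      - "--model=google/gemma-3-12b-it"   # swap in the model of your choice
```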
- Gemma 3 12B - Better quality, requires more memory
- Mistral Small 3.2 24B - Production quality
Service Architecture
The Docker Compose setup includes these services:

- traefik: Reverse proxy routing traffic between frontend and backend
- frontend: Next.js web interface (port 3000)
- backend: FastAPI server handling WebSocket connections (port 80)
- stt: Speech-to-text service (WebSocket on port 8080)
- tts: Text-to-speech service (WebSocket on port 8080)
- llm: vLLM server providing the language model (HTTP on port 8000)
Using External LLM Servers
You can configure Unmute to use external LLM providers instead of the local vLLM server.

OpenAI
Modify the backend environment variables:
docker-compose.yml
You can then remove the llm service section entirely.
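The change above might look roughly like this; the variable names here are assumptions, so check the backend's documented configuration for the real ones:

```yaml
services:
  backend:
    environment:
      # Hypothetical variable names -- verify against the backend's config
      - KYUTAI_LLM_URL=https://api.openai.com/v1
      - KYUTAI_LLM_MODEL=gpt-4o-mini
      - OPENAI_API_KEY=${OPENAI_API_KEY}
```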
Ollama
For a local Ollama instance, point the backend at Ollama's OpenAI-compatible API:
docker-compose.yml
As with OpenAI, you can then remove the llm service section.
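A sketch, with the same caveat that the variable names are assumptions:

```yaml
services:
  backend:
    environment:
      # Hypothetical variable names -- verify against the backend's config
      - KYUTAI_LLM_URL=http://host.docker.internal:11434/v1
      - KYUTAI_LLM_MODEL=llama3.2
    extra_hosts:
      - "host.docker.internal:host-gateway"  # needed on Linux to reach the host
```

On Docker Desktop, host.docker.internal resolves out of the box; on plain Linux the extra_hosts entry shown above maps it to the host gateway.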
Stopping Unmute
To stop all services:

Volumes are used to cache models and build artifacts. The first run after removing volumes will be slow, as models are re-downloaded.
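The stop commands, sketched with the standard Docker Compose CLI:

```shell
docker compose down      # stop and remove containers, keep cached volumes
docker compose down -v   # also remove volumes (next start re-downloads models)
```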
Troubleshooting
GPU Not Detected
If services can’t access the GPU:

- Verify NVIDIA drivers are installed: nvidia-smi
- Verify the NVIDIA Container Toolkit is installed
- Check the Docker daemon configuration in /etc/docker/daemon.json
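On a correctly configured host, /etc/docker/daemon.json typically registers the NVIDIA runtime along these lines:

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```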
Out of Memory Errors
Adjust these parameters in docker-compose.yml:

- Reduce --max-model-len for the LLM
- Lower --gpu-memory-utilization
- Use a smaller model
Port Already in Use
If port 80 is already in use, modify the traefik service to publish a different host port:
docker-compose.yml
Then access the interface at http://localhost:3333.
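A minimal sketch of that port change:

```yaml
services:
  traefik:
    ports:
      - "3333:80"   # publish on host port 3333 instead of 80
```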
Next Steps
- Learn about remote access to connect from another machine
- Set up HTTPS for production deployments
- Explore Docker Swarm for multi-node scaling