Docker Compose is the recommended way to deploy Unmute. It allows you to start or stop all services using a single command, and since the services are Docker containers, you get a reproducible environment without worrying about dependencies.

Requirements

Hardware

  • GPU: CUDA-compatible GPU with at least 16 GB VRAM
  • Architecture: x86_64 (aarch64 not supported)
  • OS: Linux, or Windows with WSL (installation instructions)
Windows native deployment is not supported. Running on Mac is also not supported.

Software

  • Docker Engine with the Docker Compose plugin
  • NVIDIA Container Toolkit (so containers can access the GPU; verified in the setup steps below)
  • A Hugging Face account and access token (covered in the setup steps below)

Setup Instructions

1. Verify NVIDIA Container Toolkit

Test that Docker can access your GPU:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
This should display your GPU information. If it fails, install the NVIDIA Container Toolkit first.
2. Get a Hugging Face Access Token

Unmute uses gated models from Hugging Face that require authentication.
  1. Create a Hugging Face account
  2. Accept the conditions on the gated model's page; the default docker-compose.yml uses meta-llama/Llama-3.2-1B-Instruct
  3. Create an access token with “Read access to contents of all public gated repos you can access”
Do not use tokens with write access when deploying publicly. If the server were compromised, an attacker would gain write access to your Hugging Face resources.
  4. Add the token to your environment:
export HUGGING_FACE_HUB_TOKEN=hf_...your token here...
Add this line to your ~/.bashrc or equivalent to persist it across sessions.
3. Verify Environment Variable

Confirm the token is set:
echo $HUGGING_FACE_HUB_TOKEN
This should print your token (starting with hf_).
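Echoing the variable only confirms that something is set. A small helper (a sketch; it only checks the hf_ prefix mentioned above, not whether the token is actually valid on Hugging Face) can catch copy-paste mistakes early:

```shell
# Sanity-check a token's shape; the hf_ prefix is assumed from the step above.
# Usage: check_hf_token "$HUGGING_FACE_HUB_TOKEN"
check_hf_token() {
  case "$1" in
    hf_*) echo "token format looks OK" ;;
    *)    echo "token missing or malformed" >&2; return 1 ;;
  esac
}
```

Run `check_hf_token "$HUGGING_FACE_HUB_TOKEN"` after exporting the token; a non-zero exit status means the value needs fixing before you start the stack.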
4. Start Unmute

From the repository root, run:
docker compose up --build
The first run will take several minutes to:
  • Build Docker images
  • Download models from Hugging Face
  • Initialize all services
Once complete, access Unmute at http://localhost (port 80).

Configuration

Adjusting GPU Memory

The default configuration uses Llama-3.2-1B-Instruct as the LLM; together with the STT and TTS services, the full stack needs about 16 GB of GPU memory. If you're running into memory issues, check the NOTE: comments in docker-compose.yml for adjustable parameters:
docker-compose.yml
llm:
  image: vllm/vllm-openai:v0.11.0
  command:
    [
      "--model=meta-llama/Llama-3.2-1B-Instruct",
      # Adapt this based on your GPU memory
      "--max-model-len=1536",
      "--dtype=bfloat16",
      # Lower this if you're running out of memory
      "--gpu-memory-utilization=0.4",
    ]

Using Multiple GPUs

On Unmute.sh, services run on separate GPUs, which reduces TTS latency from ~750 ms (single L40S GPU) to ~450 ms (multi-GPU setup). If you have at least three GPUs available, add this configuration to the stt, tts, and llm services:
docker-compose.yml
stt:  # Similarly for `tts` and `llm`
  # ...other configuration
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
This reserves one GPU per service. Note that count: 1 lets Docker choose which device to attach, so services can end up sharing a GPU; if that happens, pin each service to a specific device with device_ids instead.
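For deterministic placement, the Compose specification also supports pinning by device ID. A sketch (the indices 0/1/2 are assumptions about your machine; check nvidia-smi for the actual ordering):

```yaml
stt:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["0"]  # use "1" for tts, "2" for llm
            capabilities: [gpu]
```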

Changing the LLM

To use a different model, modify the --model parameter:
docker-compose.yml
llm:
  command:
    [
      # Change this to your preferred model
      "--model=google/gemma-3-1b-it",
      # ...
    ]

Service Architecture

The Docker Compose setup includes these services:
  • traefik: Reverse proxy routing traffic between frontend and backend
  • frontend: Next.js web interface (port 3000)
  • backend: FastAPI server handling WebSocket connections (port 80)
  • stt: Speech-to-text service (WebSocket on port 8080)
  • tts: Text-to-speech service (WebSocket on port 8080)
  • llm: vLLM server providing the language model (HTTP on port 8000)
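With the stack running, the HTTP-facing services can be smoke-tested individually. A sketch (ports are taken from the list above; /v1/models is part of vLLM's OpenAI-compatible API):

```shell
# Report whether a service answers over HTTP.
# Usage: probe <url> <service-name>
probe() {
  if curl -fsS -o /dev/null "$1"; then
    echo "$2: ok"
  else
    echo "$2: unreachable"
  fi
}

probe http://localhost:8000/v1/models llm
probe http://localhost:3000 frontend
```

The stt and tts services speak WebSocket rather than plain HTTP, so a simple GET probe does not apply to them.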

Using External LLM Servers

You can configure Unmute to use external LLM providers instead of the local VLLM server.

OpenAI

Modify the backend environment variables:
docker-compose.yml
backend:
  environment:
    - KYUTAI_LLM_URL=https://api.openai.com/v1
    - KYUTAI_LLM_MODEL=gpt-4o
    - KYUTAI_LLM_API_KEY=sk-...
Then remove the llm service section entirely.

Ollama

For a local Ollama instance:
docker-compose.yml
backend:
  environment:
    - KYUTAI_LLM_URL=http://host.docker.internal:11434
    - KYUTAI_LLM_MODEL=gemma3
    - KYUTAI_LLM_API_KEY=ollama
  extra_hosts:
    - "host.docker.internal:host-gateway"
Then remove the llm service section.

Stopping Unmute

To stop all services:
docker compose down
To stop and remove all data (including downloaded models):
docker compose down -v
Volumes are used to cache models and build artifacts. The first run after removing volumes will be slow as models are re-downloaded.

Troubleshooting

GPU Not Detected

If services can’t access the GPU:
  1. Verify NVIDIA drivers are installed: nvidia-smi
  2. Verify NVIDIA Container Toolkit is installed
  3. Check Docker daemon configuration in /etc/docker/daemon.json

Out of Memory Errors

Adjust these parameters in docker-compose.yml:
  • Reduce --max-model-len for the LLM
  • Lower --gpu-memory-utilization
  • Use a smaller model

Port Already in Use

If port 80 is already in use, modify the traefik service:
docker-compose.yml
traefik:
  ports:
    - "3333:80"  # Use port 3333 instead
Then access Unmute at http://localhost:3333.
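To find out whether something is already listening on port 80 before remapping (assumes iproute2's ss, standard on modern Linux; sudo lsof -i :80 can then identify the process):

```shell
# Check for TCP listeners on port 80
if ss -ltn 'sport = :80' | grep -q LISTEN; then
  echo "port 80 is in use"
else
  echo "port 80 is free"
fi
```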
