
Prerequisites

Before you begin, ensure you have:
  • GPU: CUDA-capable GPU with 16GB+ VRAM
  • Architecture: x86_64 only (aarch64 is not supported)
  • GPU count: a single GPU is sufficient for a basic setup; multiple GPUs are recommended for production (see Multi-GPU Configuration below)

Step 1: Verify NVIDIA Container Toolkit

Confirm your GPU is accessible to Docker:
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
You should see your GPU information displayed, including:
  • GPU name and driver version
  • Memory usage and total VRAM
  • CUDA version
If this fails, install the NVIDIA Container Toolkit before proceeding.

Step 2: Get Hugging Face Access

Unmute uses open-weight models from Hugging Face. You’ll need a token to download them.
1. Create Hugging Face Account

Sign up at huggingface.co if you don’t have an account
2. Accept Model License

By default, Unmute uses Llama 3.2 1B Instruct. Visit the model page and accept the license terms.
If you have more VRAM available, use Mistral Small 3.2 24B or Gemma 3 12B for better quality.
3. Generate Access Token

Create a token with these settings:
  • Type: Fine-grained
  • Permission: Read access to contents of all public gated repos you can access
Never use tokens with write access when deploying publicly. If compromised, attackers could modify your Hugging Face content.
4. Set Environment Variable

Add your token to your shell configuration:
echo 'export HUGGING_FACE_HUB_TOKEN=hf_your_token_here' >> ~/.bashrc
source ~/.bashrc
Verify it’s set:
echo $HUGGING_FACE_HUB_TOKEN
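As a quick sanity check, a small shell helper (hypothetical, not part of Unmute) can confirm the value at least looks like a Hugging Face token — they start with hf_:

```shell
# Hypothetical helper: succeeds (exit 0) if the value looks like an HF token
hf_token_looks_valid() {
  case "$1" in
    hf_*) return 0 ;;
    *)    return 1 ;;
  esac
}

# ${VAR:-} avoids an error if the variable is unset
if hf_token_looks_valid "${HUGGING_FACE_HUB_TOKEN:-}"; then
  echo "HUGGING_FACE_HUB_TOKEN format looks OK"
else
  echo "HUGGING_FACE_HUB_TOKEN is missing or does not start with hf_"
fi
```

This only checks the prefix, not whether the token is actually valid — successfully downloading a gated model is the real test.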

Step 3: Clone Repository

git clone https://github.com/kyutai-labs/unmute.git
cd unmute

Step 4: Configure Memory (Optional)

The default docker-compose.yml uses Llama 3.2 1B, which requires 16GB of VRAM. If you run into memory issues, adjust these settings:
llm:
  image: vllm/vllm-openai:v0.11.0
  command:
    [
      "--model=meta-llama/Llama-3.2-1B-Instruct",
      "--max-model-len=1536",
      "--dtype=bfloat16",
      "--gpu-memory-utilization=0.4",
    ]
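To see what the 0.4 buys you: vLLM pre-allocates roughly total VRAM × --gpu-memory-utilization, and the rest stays free for the STT and TTS services sharing the same GPU. A rough back-of-the-envelope sketch, assuming a 16GB card:

```shell
# Rough estimate: vLLM claims about total_vram × gpu-memory-utilization
total_vram_mb=16384   # 16 GB card
utilization_pct=40    # --gpu-memory-utilization=0.4, as a percentage
claimed_mb=$(( total_vram_mb * utilization_pct / 100 ))
leftover_mb=$(( total_vram_mb - claimed_mb ))
echo "vLLM claims ~${claimed_mb} MiB, leaving ~${leftover_mb} MiB for STT/TTS"
```

Lowering the utilization (or --max-model-len) shrinks vLLM's share at the cost of less room for KV cache.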

Step 5: Launch Unmute

Start all services with a single command:
docker compose up --build
1. First Run (10-15 minutes)

Docker will:
  • Build container images
  • Download models from Hugging Face (~8GB)
  • Initialize services
Models are cached in ./volumes/hf-cache/, so subsequent starts are much faster (30-60 seconds).
2. Wait for Services

Monitor the logs. Services are ready when you see:
unmute-backend-1  | INFO:     Application startup complete.
unmute-frontend-1 | ✓ Ready in 3.2s
unmute-llm-1      | INFO:     Uvicorn running on http://0.0.0.0:8000
unmute-stt-1      | Listening on 0.0.0.0:8080
unmute-tts-1      | Listening on 0.0.0.0:8080
3. Access Unmute

Open your browser to http://localhost:80
If port 80 is in use, edit docker-compose.yml and change "80:80" to "3000:80" under the traefik service, then access via http://localhost:3000
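For reference, the edited section of docker-compose.yml would look roughly like this (assuming the service is named traefik, as the tip above suggests):

```yaml
traefik:
  ports:
    - "3000:80"  # host port 3000 -> container port 80
```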

Step 6: Start Talking

1. Grant Microphone Access

Your browser will request microphone permission. Click “Allow”.
2. Select a Character

Choose from the available voices and personalities:
  • Watercooler: Casual small talk
  • Quiz show: Interactive trivia
  • Gertrude: Life advice and sympathy
  • More voices available in voices.yaml
3. Click Connect

The system establishes WebSocket connections and initializes the conversation.
4. Speak Naturally

Start talking! The bot will:
  • Transcribe your speech in real-time
  • Generate contextual responses
  • Speak back to you with the selected voice
Keyboard Shortcuts:
  • Press S to toggle subtitles for both user and bot
  • Press D for debug mode (requires enabling ALLOW_DEV_MODE in useKeyboardShortcuts.ts)

Multi-GPU Configuration

Running STT, TTS, and LLM on separate GPUs reduces TTS latency from ~750ms to ~450ms.
If you have 3+ GPUs, edit docker-compose.yml to assign dedicated GPUs:
docker-compose.yml
stt:
  # ...existing config...
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

tts:
  # ...existing config...
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

llm:
  # ...existing config...
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
By default, all services share available GPUs. This configuration ensures each service gets its own GPU.
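Note that count: 1 lets Docker assign any free GPU. To pin a service to a particular card, Compose also supports device_ids in place of count — a sketch (the GPU index here is an example):

```yaml
tts:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ["1"]   # pin this service to GPU 1 (example index)
            capabilities: [gpu]
```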

Remote Access via SSH

Forward port 80 from remote to local port 3333:
ssh -N -L 3333:localhost:80 unmute-box
Then access via http://localhost:3333
Modern browsers block microphone access over plain HTTP unless the origin is localhost or 127.0.0.1 (or the connection uses HTTPS), so direct access to http://unmute-box:80 won’t work. SSH port forwarding makes the remote server appear as localhost to your browser.

Common Issues

If you hit CUDA out-of-memory errors, reduce memory usage in docker-compose.yml:
llm:
  command:
    [
      "--model=meta-llama/Llama-3.2-1B-Instruct",
      "--max-model-len=1024",           # Lower from 1536
      "--gpu-memory-utilization=0.3",   # Lower from 0.4
    ]
Or increase batch sizes for TTS/STT to share memory better (higher latency).
If model downloads fail with authorization errors, check your Hugging Face token:
echo $HUGGING_FACE_HUB_TOKEN
Verify you accepted the model license on Hugging Face.
If port 80 is already taken, change the port in docker-compose.yml:
traefik:
  ports:
    - "8080:80"  # Use port 8080 instead
If something else isn’t working, ensure all services are running:
docker compose ps
Check backend logs:
docker compose logs backend

Next Steps

Customize Voices

Add custom voices and modify character personalities

Use External LLMs

Connect to OpenAI, Ollama, or other LLM providers

Production Deployment

Scale Unmute with Docker Swarm for production workloads

Development Guide

Contribute to Unmute or build custom frontends

Stop Unmute

To stop all services:
docker compose down
To also remove downloaded models and caches:
docker compose down -v
rm -rf volumes/
Need help? Open an issue on GitHub - the Kyutai team actively supports Docker Compose deployments.
