Requirements
Hardware
- GPU: CUDA-compatible GPU with at least 16 GB VRAM
- Architecture: x86_64 (aarch64 not supported)
- OS: Linux, or Windows with WSL (see Microsoft's WSL installation instructions)
Software
Setup Instructions
Verify NVIDIA Container Toolkit
Test that Docker can access your GPU:

This should display your GPU information. If it fails, install the NVIDIA Container Toolkit first.
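A typical check looks like the following (the CUDA image tag is just an example; any image that ships nvidia-smi works):

```shell
# Should print the same table as running nvidia-smi on the host.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```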
Get Hugging Face Access Token
Unmute uses gated models from Hugging Face that require authentication.
- Create a Hugging Face account
- Accept the conditions on the Mistral Small 3.2 24B model page
- Create an access token with “Read access to contents of all public gated repos you can access”
- Add the token to your environment:
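For example (the token below is a placeholder; use your real token from huggingface.co/settings/tokens):

```shell
# Placeholder value -- replace with your actual Hugging Face token
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```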
Add the export line to your ~/.bashrc or equivalent to persist it across sessions.

Verify Environment Variable
Confirm the token is set, e.g. with echo $HF_TOKEN. This should print your token (starting with hf_).

Configuration
Adjusting GPU Memory
The default configuration uses Llama-3.2-1B-Instruct, which requires about 16 GB of GPU memory. If you’re running into memory issues, check the NOTE: comments in docker-compose.yml for adjustable parameters:
docker-compose.yml
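A sketch of the kind of parameters those comments point at (values are illustrative, not the file's actual defaults):

```yaml
services:
  llm:
    command:
      - "--model=meta-llama/Llama-3.2-1B-Instruct"
      - "--max-model-len=8192"          # NOTE: lower this on smaller GPUs
      - "--gpu-memory-utilization=0.3"  # NOTE: fraction of VRAM the LLM may claim
```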
Using Multiple GPUs
On Unmute.sh, services run on separate GPUs, improving TTS latency from ~750ms (single L40S GPU) to ~450ms (multi-GPU setup). If you have at least three GPUs available, add this configuration to the stt, tts, and llm services:
docker-compose.yml
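One way to pin each service to its own GPU is Compose device reservations; a sketch (device indices are illustrative):

```yaml
services:
  stt:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]   # pin STT to GPU 0
              capabilities: [gpu]
  # Repeat for tts (device_ids: ["1"]) and llm (device_ids: ["2"])
```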
Changing the LLM
To use a different model, modify the --model parameter:
docker-compose.yml
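For example, switching the llm service to Gemma 3 12B might look like this (the command layout is a sketch; google/gemma-3-12b-it is the model's Hugging Face ID):

```yaml
services:
  llm:
    command:
      - "--model=google/gemma-3-12b-it"   # swap in the model of your choice
```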
- Gemma 3 12B - Better quality, requires more memory
- Mistral Small 3.2 24B - Production quality
Service Architecture
The Docker Compose setup includes these services:

- traefik: Reverse proxy routing traffic between frontend and backend
- frontend: Next.js web interface (port 3000)
- backend: FastAPI server handling WebSocket connections (port 80)
- stt: Speech-to-text service (WebSocket on port 8080)
- tts: Text-to-speech service (WebSocket on port 8080)
- llm: vLLM server providing the language model (HTTP on port 8000)
Using External LLM Servers
You can configure Unmute to use external LLM providers instead of the local vLLM server.

OpenAI
Modify the backend environment variables:
docker-compose.yml
You can then remove the llm service section entirely.
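The change above might look roughly like this; the variable names here are assumptions, so check the backend's documented configuration for the real ones:

```yaml
services:
  backend:
    environment:
      # Hypothetical variable names -- verify against the backend's config
      - KYUTAI_LLM_URL=https://api.openai.com/v1
      - KYUTAI_LLM_MODEL=gpt-4o-mini
      - OPENAI_API_KEY=${OPENAI_API_KEY}
```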
Ollama
For a local Ollama instance, point the backend at Ollama's OpenAI-compatible API:
docker-compose.yml
As with OpenAI, you can then remove the llm service section.
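A sketch, with the same caveat that the variable names are assumptions:

```yaml
services:
  backend:
    environment:
      # Hypothetical variable names -- verify against the backend's config
      - KYUTAI_LLM_URL=http://host.docker.internal:11434/v1
      - KYUTAI_LLM_MODEL=llama3.2
    extra_hosts:
      - "host.docker.internal:host-gateway"  # needed on Linux to reach the host
```

On Docker Desktop, host.docker.internal resolves out of the box; on plain Linux the extra_hosts entry shown above maps it to the host gateway.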
Stopping Unmute
To stop all services:

Volumes are used to cache models and build artifacts. The first run after removing volumes will be slow, as models are re-downloaded.
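The stop commands, sketched with the standard Docker Compose CLI:

```shell
docker compose down      # stop and remove containers, keep cached volumes
docker compose down -v   # also remove volumes (next start re-downloads models)
```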
Troubleshooting
GPU Not Detected
If services can’t access the GPU:

- Verify NVIDIA drivers are installed: nvidia-smi
- Verify the NVIDIA Container Toolkit is installed
- Check the Docker daemon configuration in /etc/docker/daemon.json
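On a correctly configured host, /etc/docker/daemon.json typically registers the NVIDIA runtime along these lines:

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```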
Out of Memory Errors
Adjust these parameters in docker-compose.yml:

- Reduce --max-model-len for the LLM
- Lower --gpu-memory-utilization
- Use a smaller model
Port Already in Use
If port 80 is already in use, modify the traefik service to publish a different host port:
docker-compose.yml
Then access the interface at http://localhost:3333.
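A minimal sketch of that port change:

```yaml
services:
  traefik:
    ports:
      - "3333:80"   # publish on host port 3333 instead of 80
```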
Next Steps
- Learn about remote access to connect from another machine
- Set up HTTPS for production deployments
- Explore Docker Swarm for multi-node scaling