How It Works
The Unmute backend communicates with the LLM server using the OpenAI-compatible API format. The LLM generates text responses based on the conversation history, which are then passed to the TTS system for speech synthesis.

Configuration

LLM configuration is set through environment variables in the backend service:
Environment Variables
- The base URL of the OpenAI-compatible API endpoint
- The model identifier to use for text generation
- The API key for authentication (if required by the provider)
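Concretely, these three values map onto a standard OpenAI-style chat request. A minimal sketch of how they fit together (the environment variable names here are placeholders, not Unmute's actual ones):

```python
import os

# Placeholder variable names for illustration; substitute the names your
# deployment actually uses.
base_url = os.environ.get("LLM_BASE_URL", "http://localhost:8000/v1")
model = os.environ.get("LLM_MODEL", "meta-llama/Llama-3.2-1B-Instruct")
api_key = os.environ.get("LLM_API_KEY", "")

# Any server that accepts this OpenAI-style request shape will work:
request = {
    "url": f"{base_url}/chat/completions",
    "headers": {"Authorization": f"Bearer {api_key}"} if api_key else {},
    "json": {
        "model": model,
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,  # Unmute streams tokens for low latency
    },
}
```

Because the request shape is the only contract, any OpenAI-compatible provider can be swapped in by changing these three values.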
Provider Examples
OpenAI
To use OpenAI’s GPT models:

Get an API Key
Create an API key at platform.openai.com/api-keys
Remove VLLM Service (Optional)
Since you’re using an external LLM, you can remove the llm service from docker-compose.yml to save resources.

Ollama (Local)

To use a locally running Ollama instance:

Install and Start Ollama
Install Ollama from ollama.ai and pull a model:
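For example (the model name here is only an illustration):

```shell
# Install Ollama (see ollama.ai for platform-specific instructions), then:
ollama pull llama3   # download a model
ollama serve         # serves an API on localhost:11434 by default
```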
Update docker-compose.yml
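A sketch of the relevant docker-compose.yml change (the service layout and the environment variable name are assumptions, not verbatim from the repo):

```yaml
services:
  backend:
    # LLM_URL is a placeholder for the backend's actual variable name
    environment:
      - LLM_URL=http://host.docker.internal:11434/v1
    extra_hosts:
      - "host.docker.internal:host-gateway"  # maps the host into the container
```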
Configure the backend to connect to your host machine. The extra_hosts section allows Docker containers to access services running on the host machine.

Remove VLLM Service (Optional)
Remove the llm service definition from docker-compose.yml to free up GPU memory.

Mistral AI
To use Mistral’s API:

Custom VLLM Server
If you’re running your own VLLM server elsewhere:

Dockerless Configuration
For dockerless deployments, set the environment variables before starting the backend, or edit the dockerless/start_backend.sh script to include these variables.
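For example (the variable names here are placeholders; use whatever names the backend actually reads):

```shell
# Placeholder variable names for illustration
export LLM_URL="https://api.openai.com/v1"
export LLM_MODEL="gpt-4o"
export LLM_API_KEY="sk-..."
./dockerless/start_backend.sh
```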
Model Selection
The LLM model name is included in conversation system prompts. The backend automatically formats the model name for readability (e.g., meta-llama/Llama-3.2-1B-Instruct becomes “meta llama Llama 3.2 1B Instruct”).
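The formatting amounts to replacing the separator characters with spaces. A sketch that reproduces the documented example (Unmute's actual implementation may differ):

```python
def prettify_model_name(model: str) -> str:
    """Make a model identifier readable by replacing the path and hyphen
    separators with spaces. Illustrative sketch, not Unmute's real code."""
    return model.replace("/", " ").replace("-", " ")

print(prettify_model_name("meta-llama/Llama-3.2-1B-Instruct"))
# meta llama Llama 3.2 1B Instruct
```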
Recommended Models
For local VLLM deployment:

- meta-llama/Llama-3.2-1B-Instruct - Lightweight, 16GB VRAM (default)
- mistralai/Mistral-Small-3.2-24B-Instruct-2506 - Higher quality, requires more VRAM
- google/gemma-3-12b-it - Good balance of quality and performance
- OpenAI: gpt-4o, gpt-4.1, or gpt-3.5-turbo
- Mistral: mistral-small-latest or mistral-large-latest
- Ollama: gemma3, llama3, mistral, etc.
LLM Requirements for Unmute
Unmute works best with models that:

- Support streaming text generation for low-latency responses
- Are optimized for conversational tasks
- Follow instruction formats well
- Generate concise responses suitable for speech
System Prompt Integration
The LLM receives a detailed system prompt that includes:

- Base instructions for voice conversation format
- Character-specific personality traits (from voices.yaml)
- Language instructions (English/French support)
- Context about the Unmute system itself
- Current date, time, and timezone (for relevant character types)
The prompt is built in unmute/llm/system_prompt.py and automatically includes the LLM model name.
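The assembly of those pieces can be sketched roughly as follows (the function, field names, and wording are hypothetical; the real logic in unmute/llm/system_prompt.py differs in detail):

```python
from datetime import datetime

def build_system_prompt(character: dict, model_name: str, language: str = "English") -> str:
    """Illustrative sketch of system-prompt assembly, not Unmute's real code."""
    parts = [
        "You are a voice assistant; keep answers short and speakable.",
        f"Personality: {character.get('personality', 'neutral')}",  # from voices.yaml
        f"Respond in {language}.",
        f"You are part of Unmute, a real-time voice chat system using {model_name}.",
        f"Current date and time: {datetime.now():%Y-%m-%d %H:%M}",
    ]
    return "\n".join(parts)

prompt = build_system_prompt({"personality": "cheerful"}, "meta llama Llama 3.2 1B Instruct")
```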
Performance Considerations
Latency Optimization Tips
- Use streaming: Unmute starts TTS generation before the LLM finishes the full response
- Choose faster models: Smaller models (1-12B parameters) respond faster than large models
- Local deployment: Running VLLM locally eliminates network latency
- Adjust max tokens: Configure --max-model-len in VLLM to balance context vs. speed
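The streaming point is the key one: the backend can hand each sentence to TTS as soon as it is complete, rather than waiting for the whole response. A toy sketch of that chunking idea (not Unmute's actual code):

```python
import re

def sentences_from_stream(token_stream):
    """Yield complete sentences as soon as they appear in a token stream,
    so TTS can start speaking before the LLM finishes. Illustrative only."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # A sentence ends at punctuation followed by whitespace.
        while (m := re.search(r"[.!?]\s", buffer)):
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()

tokens = ["Hello", " there", ". How", " can I", " help?", " Bye."]
print(list(sentences_from_stream(tokens)))
# ['Hello there.', 'How can I help?', 'Bye.']
```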
VLLM Configuration
If you’re running your own VLLM instance, optimize these parameters:

Testing Your Configuration

After configuring your external LLM:

Check Backend Logs
Verify that the backend connects successfully by checking its logs, and look for successful LLM initialization messages.
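For a Docker Compose deployment, that might look like this (assuming the service is named backend, as above):

```shell
docker compose logs -f backend
```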
Test a Conversation
Open Unmute in your browser and start a conversation. The LLM model name will be mentioned in the character’s system prompt.
Troubleshooting
Connection refused errors
- For localhost Ollama: Ensure extra_hosts is configured correctly
- For external APIs: Check your network can reach the API URL
- Verify the API endpoint is correct (usually ends with /v1)
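A quick way to test reachability from your machine (the model-listing path follows the OpenAI API convention; the host and port here are examples for a local Ollama):

```shell
# A JSON response means the endpoint is reachable and OpenAI-compatible.
curl http://localhost:11434/v1/models
```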
Authentication errors
- Verify your API key is correct and has not expired
- Check that the API key has proper permissions
- For Ollama, any string works as the API key (e.g., “ollama”)
Model not found errors
- Verify the model name is correct for your provider
- For Ollama: Ensure you’ve pulled the model with ollama pull <model>
- For OpenAI: Check available models at platform.openai.com/docs/models
Slow response times
- Consider using a smaller/faster model
- For cloud APIs: Check your network latency
- For local VLLM: Increase --gpu-memory-utilization if you have spare GPU memory
- Reduce --max-model-len to decrease memory usage
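These flags are passed when launching VLLM's OpenAI-compatible server; for example (the model name and values are illustrative):

```shell
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```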