Unmute is designed to work with any OpenAI-compatible LLM server. By default, it uses a local vLLM instance, but you can easily switch to external providers like OpenAI, Mistral, Ollama, or any other compatible API.

How It Works

The Unmute backend communicates with the LLM server using the OpenAI-compatible API format. The LLM generates text responses based on the conversation history, which are then passed to the TTS system for speech synthesis.
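Concretely, each turn is a request against the OpenAI-compatible /chat/completions route. The sketch below only builds the request body; the model name is the default from this page, and the helper itself is illustrative, not Unmute's actual code:

```python
import json

def build_chat_request(model: str, history: list, stream: bool = True) -> dict:
    """Build an OpenAI-compatible /chat/completions request body.

    `history` is the running conversation: a list of
    {"role": "system" | "user" | "assistant", "content": ...} messages.
    """
    return {"model": model, "messages": history, "stream": stream}

body = build_chat_request(
    "meta-llama/Llama-3.2-1B-Instruct",
    [
        {"role": "system", "content": "You are a concise voice assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
# The backend POSTs this as JSON to {KYUTAI_LLM_URL}/chat/completions and,
# with stream=True, reads server-sent events whose text deltas are forwarded
# to the TTS system as they arrive.
print(json.dumps(body)[:60])
```

Because every provider on this page speaks the same request shape, switching providers only changes the URL, model, and key.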

Configuration

LLM configuration is set through environment variables in the backend service:
backend:
  environment:
    - KYUTAI_LLM_URL=http://llm:8000
    - KYUTAI_LLM_MODEL=meta-llama/Llama-3.2-1B-Instruct
    - KYUTAI_LLM_API_KEY=your-api-key

Environment Variables

  • KYUTAI_LLM_URL (string, required): The base URL of the OpenAI-compatible API endpoint
  • KYUTAI_LLM_MODEL (string, required): The model identifier to use for text generation
  • KYUTAI_LLM_API_KEY (string, optional): API key for authentication (if required by the provider)
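In a dockerless or test setup, the same variables can be read straight from the environment. A minimal sketch (variable names from the table above; the defaults and the helper itself are illustrative):

```python
import os

def load_llm_config() -> dict:
    """Read Unmute's LLM settings from the environment.

    URL and model are required; the API key is optional because local
    servers such as vLLM or Ollama typically accept any value.
    """
    url = os.environ.get("KYUTAI_LLM_URL")
    model = os.environ.get("KYUTAI_LLM_MODEL")
    if not url or not model:
        raise RuntimeError("KYUTAI_LLM_URL and KYUTAI_LLM_MODEL must be set")
    return {
        "url": url.rstrip("/"),
        "model": model,
        "api_key": os.environ.get("KYUTAI_LLM_API_KEY", ""),
    }

# Example values matching the docker-compose defaults above:
os.environ.setdefault("KYUTAI_LLM_URL", "http://llm:8000")
os.environ.setdefault("KYUTAI_LLM_MODEL", "meta-llama/Llama-3.2-1B-Instruct")
print(load_llm_config()["model"])
```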

Provider Examples

OpenAI

To use OpenAI’s GPT models:
1. Get an API Key

Create an API key at platform.openai.com/api-keys
2. Update docker-compose.yml

Modify the backend service configuration:
backend:
  image: unmute-backend:latest
  environment:
    - KYUTAI_STT_URL=ws://stt:8080
    - KYUTAI_TTS_URL=ws://tts:8080
    - KYUTAI_LLM_URL=https://api.openai.com/v1
    - KYUTAI_LLM_MODEL=gpt-4.1
    - KYUTAI_LLM_API_KEY=sk-...
    - NEWSAPI_API_KEY=$NEWSAPI_API_KEY
3. Remove vLLM Service (Optional)

Since you’re using an external LLM, you can remove the llm service from docker-compose.yml to save resources:
# Remove or comment out this entire section:
# llm:
#   image: vllm/vllm-openai:v0.11.0
#   ...
4. Restart Services

Apply the changes:
docker compose up --build

Ollama (Local)

To use a locally running Ollama instance:
1. Install and Start Ollama

Install Ollama from ollama.ai and pull a model:
ollama pull gemma3
2. Update docker-compose.yml

Configure the backend to connect to your host machine:
backend:
  image: unmute-backend:latest
  environment:
    - KYUTAI_STT_URL=ws://stt:8080
    - KYUTAI_TTS_URL=ws://tts:8080
    - KYUTAI_LLM_URL=http://host.docker.internal:11434
    - KYUTAI_LLM_MODEL=gemma3
    - KYUTAI_LLM_API_KEY=ollama
    - NEWSAPI_API_KEY=$NEWSAPI_API_KEY
  extra_hosts:
    - "host.docker.internal:host-gateway"
The extra_hosts section allows Docker containers to access services running on the host machine.
3. Remove vLLM Service (Optional)

Remove the llm service definition from docker-compose.yml to free up GPU memory.
4. Restart Services

docker compose up --build

Mistral AI

To use Mistral’s API:
backend:
  environment:
    - KYUTAI_LLM_URL=https://api.mistral.ai/v1
    - KYUTAI_LLM_MODEL=mistral-small-latest
    - KYUTAI_LLM_API_KEY=your-mistral-api-key

Custom vLLM Server

If you’re running your own vLLM server elsewhere:
backend:
  environment:
    - KYUTAI_LLM_URL=http://your-vllm-server:8000
    - KYUTAI_LLM_MODEL=meta-llama/Llama-3.2-1B-Instruct
    # API key may not be needed for private VLLM instances

Dockerless Configuration

For dockerless deployments, set the environment variables before starting the backend:
export KYUTAI_LLM_URL=https://api.openai.com/v1
export KYUTAI_LLM_MODEL=gpt-4.1
export KYUTAI_LLM_API_KEY=sk-...

./dockerless/start_backend.sh
Or modify the dockerless/start_backend.sh script to include these variables.

Model Selection

The LLM model name is included in each conversation’s system prompt. The backend automatically formats the name for readability (e.g., meta-llama/Llama-3.2-1B-Instruct becomes “meta llama Llama 3.2 1B Instruct”). Recommended models for local vLLM deployment:
  • meta-llama/Llama-3.2-1B-Instruct - Lightweight, 16GB VRAM (default)
  • mistralai/Mistral-Small-3.2-24B-Instruct-2506 - Higher quality, requires more VRAM
  • google/gemma-3-12b-it - Good balance of quality and performance
For external APIs:
  • OpenAI: gpt-4o, gpt-4.1, or gpt-3.5-turbo
  • Mistral: mistral-small-latest or mistral-large-latest
  • Ollama: gemma3, llama3, mistral, etc.
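The readability formatting mentioned above (slashes and dashes replaced with spaces) can be approximated in one line. This is a sketch that reproduces the documented example, not necessarily the backend’s exact implementation:

```python
def humanize_model_name(model: str) -> str:
    """Turn a model ID into a speakable name, e.g.
    'meta-llama/Llama-3.2-1B-Instruct' -> 'meta llama Llama 3.2 1B Instruct'."""
    return model.replace("/", " ").replace("-", " ")

print(humanize_model_name("meta-llama/Llama-3.2-1B-Instruct"))
# meta llama Llama 3.2 1B Instruct
```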

LLM Requirements for Unmute

Unmute works best with models that:
  • Support streaming text generation for low-latency responses
  • Are optimized for conversational tasks
  • Follow instruction formats well
  • Generate concise responses suitable for speech
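Streaming matters because synthesis can begin before the full reply exists. A hedged sketch of how a stream of text deltas could be cut into sentence-sized chunks for TTS (the function and its boundary rules are illustrative, not Unmute’s actual logic):

```python
from collections.abc import Iterable, Iterator

def sentence_chunks(deltas: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed text deltas and yield a chunk at each
    sentence boundary, so TTS can start speaking early."""
    buf = ""
    for delta in deltas:
        buf += delta
        while True:
            # Find the last complete sentence boundary in the buffer.
            cut = max(buf.rfind(". "), buf.rfind("! "), buf.rfind("? "))
            if cut == -1:
                break
            yield buf[: cut + 1]
            buf = buf[cut + 2 :]
    if buf.strip():
        yield buf  # flush whatever remains at end of stream

print(list(sentence_chunks(["Hi the", "re! How are ", "you?"])))
# ['Hi there!', 'How are you?']
```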

System Prompt Integration

The LLM receives a detailed system prompt that includes:
  • Base instructions for voice conversation format
  • Character-specific personality traits (from voices.yaml)
  • Language instructions (English/French support)
  • Context about the Unmute system itself
  • Current date, time, and timezone (for relevant character types)
The system prompt is generated by unmute/llm/system_prompt.py and automatically includes the LLM model name.
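A simplified sketch of that assembly (the real logic lives in unmute/llm/system_prompt.py; the function name, parameters, and wording here are all illustrative):

```python
from datetime import datetime

def build_system_prompt(
    base_instructions: str,
    personality: str,            # character traits, from voices.yaml
    language: str,               # "English" or "French"
    model_name: str,
    include_time: bool = False,  # only for relevant character types
) -> str:
    """Compose the pieces listed above into one system prompt."""
    parts = [
        base_instructions,
        f"Personality: {personality}",
        f"Respond in {language}.",
        f"You are running on the {model_name} model, inside Unmute, "
        "a voice-conversation system.",
    ]
    if include_time:
        parts.append(f"Current date and time: {datetime.now():%Y-%m-%d %H:%M}")
    return "\n\n".join(parts)

prompt = build_system_prompt(
    "Keep replies short and speakable.",
    "cheerful", "English", "meta llama Llama 3.2 1B Instruct",
)
print(prompt.splitlines()[0])
```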

Performance Considerations

Latency matters: Voice conversation requires low-latency LLM responses. A local vLLM instance or a nearby server will provide a better user experience than a distant cloud API.

Latency Optimization Tips

  1. Use streaming: Unmute starts TTS generation before the LLM finishes the full response
  2. Choose faster models: Smaller models (1-12B parameters) respond faster than large models
  3. Local deployment: Running vLLM locally eliminates network latency
  4. Limit context length: Configure --max-model-len in vLLM to balance context window size against memory use and speed
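Time-to-first-token is the latency number that matters most for voice. A provider-agnostic sketch that works on any iterator of streamed text deltas (the fake stream stands in for a real API response):

```python
import time
from collections.abc import Iterable

def time_to_first_token(deltas: Iterable[str]) -> tuple:
    """Return (seconds until the first non-empty delta, full text).
    Wrap this around the streamed response of any OpenAI-compatible API."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for delta in deltas:
        if ttft is None and delta:
            ttft = time.monotonic() - start
        chunks.append(delta)
    return (ttft if ttft is not None else float("nan"), "".join(chunks))

# Works equally on a real stream or a stand-in generator:
def fake_stream():
    time.sleep(0.05)  # simulated model warm-up before the first token
    yield "Hello"
    yield " there"

ttft, text = time_to_first_token(fake_stream())
print(f"first token after {ttft:.3f}s; text={text!r}")
```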

vLLM Configuration

If you’re running your own vLLM instance, optimize these parameters:
llm:
  image: vllm/vllm-openai:v0.11.0
  command:
    - "--model=meta-llama/Llama-3.2-1B-Instruct"
    - "--max-model-len=1536"  # Shorter conversations = less memory
    - "--dtype=bfloat16"      # Faster than float32
    - "--gpu-memory-utilization=0.4"  # Adjust based on your GPU

Testing Your Configuration

After configuring your external LLM:
1. Check Backend Logs

Verify the backend connects successfully:
docker compose logs backend
Look for successful LLM initialization messages.
2. Test a Conversation

Open Unmute in your browser and start a conversation. The LLM model name will be mentioned in the character’s system prompt.
3. Monitor Response Times

Use the dev mode (press D after enabling in useKeyboardShortcuts.ts) to see timing information.
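You can also probe the endpoint directly, before involving the backend at all. A standard-library sketch; the /models route is part of the OpenAI-compatible API, and the fallback URL below is just an example:

```python
import json
import os
import urllib.request

def models_url(base_url: str) -> str:
    """KYUTAI_LLM_URL usually already ends with /v1, so only append /models."""
    return base_url.rstrip("/") + "/models"

def list_models(base_url: str, api_key: str = "") -> list:
    """Query GET {base_url}/models and return the available model IDs."""
    req = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"} if api_key else {},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = json.load(resp)
    return [m["id"] for m in data.get("data", [])]

if __name__ == "__main__":
    base = os.environ.get("KYUTAI_LLM_URL", "http://localhost:8000/v1")
    print(list_models(base, os.environ.get("KYUTAI_LLM_API_KEY", "")))
```

If this prints your model's ID, the endpoint, key, and network path are all working, and any remaining problem is on the Unmute side.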

Troubleshooting

Connection errors:
  • For localhost Ollama: Ensure extra_hosts is configured correctly
  • For external APIs: Check that your network can reach the API URL
  • Verify the API endpoint is correct (usually ends with /v1)
Authentication errors:
  • Verify your API key is correct and has not expired
  • Check that the API key has proper permissions
  • For Ollama, any string works as the API key (e.g., “ollama”)
Model errors:
  • Verify the model name is correct for your provider
  • For Ollama: Ensure you’ve pulled the model with ollama pull <model>
  • For OpenAI: Check available models at platform.openai.com/docs/models
Slow responses:
  • Consider using a smaller/faster model
  • For cloud APIs: Check your network latency
Out-of-memory errors (local vLLM):
  • Increase --gpu-memory-utilization if you have spare GPU memory
  • Reduce --max-model-len to decrease memory usage
