How It Works
The Unmute backend communicates with the LLM server using the OpenAI-compatible API format. The LLM generates text responses based on the conversation history, which are then passed to the TTS system for speech synthesis.

Configuration

LLM configuration is set through environment variables in the backend service:
Environment Variables
- The base URL of the OpenAI-compatible API endpoint
- The model identifier to use for text generation
- The API key for authentication (if required by the provider)
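Concretely, these three values map onto a standard OpenAI-style chat request. A minimal sketch of how they fit together (the environment variable names here are placeholders, not Unmute's actual ones):

```python
import os

# Placeholder variable names for illustration; substitute the names your
# deployment actually uses.
base_url = os.environ.get("LLM_BASE_URL", "http://localhost:8000/v1")
model = os.environ.get("LLM_MODEL", "meta-llama/Llama-3.2-1B-Instruct")
api_key = os.environ.get("LLM_API_KEY", "")

# Any server that accepts this OpenAI-style request shape will work:
request = {
    "url": f"{base_url}/chat/completions",
    "headers": {"Authorization": f"Bearer {api_key}"} if api_key else {},
    "json": {
        "model": model,
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,  # Unmute streams tokens for low latency
    },
}
```

Because the request shape is the only contract, any OpenAI-compatible provider can be swapped in by changing these three values.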
Provider Examples
OpenAI
To use OpenAI’s GPT models:

Get an API Key
Create an API key at platform.openai.com/api-keys
Remove VLLM Service (Optional)
Since you’re using an external LLM, you can remove the llm service from docker-compose.yml to save resources.

Ollama (Local)

To use a locally running Ollama instance:

Install and Start Ollama
Install Ollama from ollama.ai and pull a model:
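For example (the model name here is only an illustration):

```shell
# Install Ollama (see ollama.ai for platform-specific instructions), then:
ollama pull llama3   # download a model
ollama serve         # serves an API on localhost:11434 by default
```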
Update docker-compose.yml
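A sketch of the relevant docker-compose.yml change (the service layout and the environment variable name are assumptions, not verbatim from the repo):

```yaml
services:
  backend:
    # LLM_URL is a placeholder for the backend's actual variable name
    environment:
      - LLM_URL=http://host.docker.internal:11434/v1
    extra_hosts:
      - "host.docker.internal:host-gateway"  # maps the host into the container
```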
Configure the backend to connect to your host machine. The extra_hosts section allows Docker containers to access services running on the host machine.

Remove VLLM Service (Optional)
Remove the llm service definition from docker-compose.yml to free up GPU memory.

Mistral AI
To use Mistral’s API:

Custom VLLM Server
If you’re running your own VLLM server elsewhere:

Dockerless Configuration
For dockerless deployments, set the environment variables before starting the backend, or edit the dockerless/start_backend.sh script to include these variables.
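For example (the variable names here are placeholders; use whatever names the backend actually reads):

```shell
# Placeholder variable names for illustration
export LLM_URL="https://api.openai.com/v1"
export LLM_MODEL="gpt-4o"
export LLM_API_KEY="sk-..."
./dockerless/start_backend.sh
```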
Model Selection
The LLM model name is included in conversation system prompts. The backend automatically formats the model name for readability (e.g., meta-llama/Llama-3.2-1B-Instruct becomes “meta llama Llama 3.2 1B Instruct”).
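The formatting amounts to replacing the separator characters with spaces. A sketch that reproduces the documented example (Unmute's actual implementation may differ):

```python
def prettify_model_name(model: str) -> str:
    """Make a model identifier readable by replacing the path and hyphen
    separators with spaces. Illustrative sketch, not Unmute's real code."""
    return model.replace("/", " ").replace("-", " ")

print(prettify_model_name("meta-llama/Llama-3.2-1B-Instruct"))
# meta llama Llama 3.2 1B Instruct
```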
Recommended Models
For local VLLM deployment:

- meta-llama/Llama-3.2-1B-Instruct - Lightweight, 16GB VRAM (default)
- mistralai/Mistral-Small-3.2-24B-Instruct-2506 - Higher quality, requires more VRAM
- google/gemma-3-12b-it - Good balance of quality and performance
- OpenAI: gpt-4o, gpt-4.1, or gpt-3.5-turbo
- Mistral: mistral-small-latest or mistral-large-latest
- Ollama: gemma3, llama3, mistral, etc.
LLM Requirements for Unmute
Unmute works best with models that:

- Support streaming text generation for low-latency responses
- Are optimized for conversational tasks
- Follow instruction formats well
- Generate concise responses suitable for speech
System Prompt Integration
The LLM receives a detailed system prompt that includes:

- Base instructions for voice conversation format
- Character-specific personality traits (from voices.yaml)
- Language instructions (English/French support)
- Context about the Unmute system itself
- Current date, time, and timezone (for relevant character types)
The prompt is built in unmute/llm/system_prompt.py and automatically includes the LLM model name.
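The assembly of those pieces can be sketched roughly as follows (the function, field names, and wording are hypothetical; the real logic in unmute/llm/system_prompt.py differs in detail):

```python
from datetime import datetime

def build_system_prompt(character: dict, model_name: str, language: str = "English") -> str:
    """Illustrative sketch of system-prompt assembly, not Unmute's real code."""
    parts = [
        "You are a voice assistant; keep answers short and speakable.",
        f"Personality: {character.get('personality', 'neutral')}",  # from voices.yaml
        f"Respond in {language}.",
        f"You are part of Unmute, a real-time voice chat system using {model_name}.",
        f"Current date and time: {datetime.now():%Y-%m-%d %H:%M}",
    ]
    return "\n".join(parts)

prompt = build_system_prompt({"personality": "cheerful"}, "meta llama Llama 3.2 1B Instruct")
```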
Performance Considerations
Latency Optimization Tips
- Use streaming: Unmute starts TTS generation before the LLM finishes the full response
- Choose faster models: Smaller models (1-12B parameters) respond faster than large models
- Local deployment: Running VLLM locally eliminates network latency
- Adjust max tokens: Configure --max-model-len in VLLM to balance context vs. speed
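The streaming point is the key one: the backend can hand each sentence to TTS as soon as it is complete, rather than waiting for the whole response. A toy sketch of that chunking idea (not Unmute's actual code):

```python
import re

def sentences_from_stream(token_stream):
    """Yield complete sentences as soon as they appear in a token stream,
    so TTS can start speaking before the LLM finishes. Illustrative only."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # A sentence ends at punctuation followed by whitespace.
        while (m := re.search(r"[.!?]\s", buffer)):
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()

tokens = ["Hello", " there", ". How", " can I", " help?", " Bye."]
print(list(sentences_from_stream(tokens)))
# ['Hello there.', 'How can I help?', 'Bye.']
```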
VLLM Configuration
If you’re running your own VLLM instance, optimize these parameters:

Testing Your Configuration

After configuring your external LLM:

Check Backend Logs
Verify that the backend connects successfully by checking its logs, and look for successful LLM initialization messages.
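For a Docker Compose deployment, that might look like this (assuming the service is named backend, as above):

```shell
docker compose logs -f backend
```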
Test a Conversation
Open Unmute in your browser and start a conversation. The LLM model name will be mentioned in the character’s system prompt.
Troubleshooting
Connection refused errors
- For localhost Ollama: Ensure extra_hosts is configured correctly
- For external APIs: Check your network can reach the API URL
- Verify the API endpoint is correct (usually ends with /v1)
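A quick way to test reachability from your machine (the model-listing path follows the OpenAI API convention; the host and port here are examples for a local Ollama):

```shell
# A JSON response means the endpoint is reachable and OpenAI-compatible.
curl http://localhost:11434/v1/models
```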
Authentication errors
- Verify your API key is correct and has not expired
- Check that the API key has proper permissions
- For Ollama, any string works as the API key (e.g., “ollama”)
Model not found errors
- Verify the model name is correct for your provider
- For Ollama: Ensure you’ve pulled the model with ollama pull <model>
- For OpenAI: Check available models at platform.openai.com/docs/models
Slow response times
- Consider using a smaller/faster model
- For cloud APIs: Check your network latency
- For local VLLM: Increase --gpu-memory-utilization if you have spare GPU memory
- Reduce --max-model-len to decrease memory usage
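These flags are passed when launching VLLM's OpenAI-compatible server; for example (the model name and values are illustrative):

```shell
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```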