The Dockerless deployment allows you to run Unmute by manually starting each service without Docker. This approach is useful for development, debugging, or when you need more control over the environment.
This is more difficult to set up than Docker Compose due to various dependencies. Consider using Docker Compose unless you specifically need a Dockerless setup.

Requirements

Hardware

  • GPU: CUDA-compatible GPU with at least 16 GB VRAM
  • Architecture: x86_64
  • OS: Linux or Windows with WSL

VRAM Usage by Service

  • LLM: 6.1 GB
  • TTS: 5.3 GB
  • STT: 2.5 GB
  • Total: ~14 GB minimum

Software Dependencies

1. Install uv (Python package manager)

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Install cargo (Rust toolchain)

curl https://sh.rustup.rs -sSf | sh

3. Install pnpm (Node.js package manager)

curl -fsSL https://get.pnpm.io/install.sh | sh -

4. Install CUDA 12.1

Install CUDA 12.1 via conda or from the NVIDIA website. This is required for the Rust servers (TTS and STT).
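Before moving on, it can save time to confirm that all four tools ended up on your PATH. A quick check script (the `require` helper is just for illustration; it is not part of the repository):

```shell
#!/bin/bash
# Report whether each required tool is installed (illustrative helper)
require() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

for tool in uv cargo pnpm nvcc; do
  require "$tool"
done
```

If `nvcc` is missing but CUDA was installed via conda, make sure the conda environment is activated in the shell you use to start the STT and TTS servers.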

Starting Services

Each service must be started in a separate terminal session or tmux window. The repository includes helper scripts in the dockerless/ directory.

Service Startup Order

While services can be started in any order, here’s the recommended sequence:

1. Start Frontend

./dockerless/start_frontend.sh
This script:
  • Installs Node.js dependencies with pnpm install
  • Ensures the correct Node.js LTS version
  • Starts the Next.js development server on port 3000
Script contents:
start_frontend.sh
#!/bin/bash
set -ex
cd "$(dirname "$0")/.."

cd frontend
pnpm install
pnpm env use --global lts
pnpm dev

2. Start LLM Server

./dockerless/start_llm.sh
Launches vLLM with these settings:
  • Model: google/gemma-3-1b-it
  • Port: 8091
  • Max model length: 8192 tokens
  • GPU memory utilization: 30%
  • VRAM usage: ~6.1 GB
Script contents:
start_llm.sh
#!/bin/bash
set -ex
cd "$(dirname "$0")/.."

uv tool run [email protected] serve \
  --model=google/gemma-3-1b-it \
  --max-model-len=8192 \
  --dtype=bfloat16 \
  --gpu-memory-utilization=0.3 \
  --port=8091

3. Start STT Server

./dockerless/start_stt.sh
Compiles and runs the speech-to-text Rust server:
  • Port: 8090
  • VRAM usage: ~2.5 GB
Key steps:
  • Creates Python virtual environment for libpython dependency
  • Sets LD_LIBRARY_PATH for Python library linking
  • Installs moshi-server with CUDA features
  • Runs with STT-specific config
The first run will take several minutes to compile the Rust binary.
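The script contents are not reproduced here, but based on the key steps above they correspond roughly to the following sketch. The config file path is an assumption; consult dockerless/start_stt.sh in the repository for the authoritative version:

```shell
#!/bin/bash
set -ex
cd "$(dirname "$0")/.."

# Create a virtual environment so the Rust binary can link against libpython
uv venv
source .venv/bin/activate

# Point the linker at the Python shared library (same trick as the TTS script)
export LD_LIBRARY_PATH=$(python -c 'import sysconfig; print(sysconfig.get_config_var("LIBDIR"))')

# Build and install the Rust moshi-server with CUDA support (slow on first run)
cargo install --features cuda [email protected]

# Run with the STT-specific config; port 8090 is assumed to come from the config
moshi-server worker --config configs/config-stt.toml
```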

4. Start TTS Server

./dockerless/start_tts.sh
Compiles and runs the text-to-speech Rust server:
  • Port: 8089
  • VRAM usage: ~5.3 GB
Important environment setup:
export LD_LIBRARY_PATH=$(python -c 'import sysconfig; print(sysconfig.get_config_var("LIBDIR"))')
This must be set before running cargo install to ensure the Rust binary can find Python libraries.
If you see errors like no module named 'huggingface_hub', the LD_LIBRARY_PATH wasn’t set correctly before compilation. Recompile with cargo install --force.

5. Start Backend

./dockerless/start_backend.sh
Starts the FastAPI backend server:
  • Port: 8000
  • WebSocket per-message deflate disabled for better performance
  • Auto-reload enabled for development
Script contents:
start_backend.sh
#!/bin/bash
set -ex
cd "$(dirname "$0")/.."

uv run uvicorn unmute.main_websocket:app --reload --host 0.0.0.0 --port 8000 --ws-per-message-deflate=false

Accessing Unmute

Once all services are running, access the web interface at:
http://localhost:3000
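If the page doesn't load, you can probe each port from this guide to see which service failed to come up. The `port_open` helper below is illustrative and relies on bash's `/dev/tcp` feature:

```shell
#!/bin/bash
# Illustrative check: ports are the defaults used throughout this guide
port_open() {
  # bash-only: opening /dev/tcp/<host>/<port> fails if nothing is listening
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

for p in 3000 8000 8089 8090 8091; do
  if port_open "$p"; then
    echo "port $p: listening"
  else
    echo "port $p: not listening"
  fi
done
```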

Environment Variables

The backend expects these default service URLs. If you’ve changed the ports, set these environment variables before starting the backend:
export KYUTAI_STT_URL=ws://localhost:8090
export KYUTAI_TTS_URL=ws://localhost:8089
export KYUTAI_LLM_URL=http://localhost:8091
export HUGGING_FACE_HUB_TOKEN=hf_...
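Assuming the backend falls back to the defaults above when the variables are unset (as the text implies), the lookup follows the usual environment-with-default pattern, sketched here in shell (the backend itself is Python):

```shell
# Use the exported value if present, otherwise fall back to the default
stt_url="${KYUTAI_STT_URL:-ws://localhost:8090}"
tts_url="${KYUTAI_TTS_URL:-ws://localhost:8089}"
llm_url="${KYUTAI_LLM_URL:-http://localhost:8091}"
echo "STT=$stt_url TTS=$tts_url LLM=$llm_url"
```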

Using tmux for Session Management

To manage multiple services easily, use tmux:
# Create a new tmux session
tmux new -s unmute

# Start first service
./dockerless/start_frontend.sh

# Create new pane (Ctrl+B then ")
# Start second service
./dockerless/start_llm.sh

# Continue creating panes and starting services
# Ctrl+B then arrow keys to navigate between panes
# Ctrl+B then d to detach from session
# tmux attach -t unmute to reattach
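The manual pane-by-pane flow above can also be scripted so one command brings everything up. This is a sketch, not part of the repository; the window names are arbitrary:

```shell
#!/bin/bash
# Launch every service in its own tmux window, then attach
tmux new-session -d -s unmute -n frontend './dockerless/start_frontend.sh'
tmux new-window  -t unmute    -n llm      './dockerless/start_llm.sh'
tmux new-window  -t unmute    -n stt      './dockerless/start_stt.sh'
tmux new-window  -t unmute    -n tts      './dockerless/start_tts.sh'
tmux new-window  -t unmute    -n backend  './dockerless/start_backend.sh'
tmux attach -t unmute
```

Use Ctrl+B then a window number (0-4) to jump between services.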

Customizing Services

Change LLM Model

Edit dockerless/start_llm.sh:
uv tool run [email protected] serve \
  --model=mistralai/Mistral-7B-Instruct-v0.3 \
  --max-model-len=4096 \
  --dtype=bfloat16 \
  --gpu-memory-utilization=0.5 \
  --port=8091

Adjust GPU Memory Usage

Modify --gpu-memory-utilization in the LLM script:
  • Lower values: reserve less VRAM, leaving room for the TTS and STT servers, but shrink the KV cache, which can limit conversation length
  • Higher values: a larger KV cache supports longer conversations, but more VRAM is reserved and less is left for the other services
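As a back-of-the-envelope check, the flag is a fraction of total GPU memory. For example, on a hypothetical 24 GB card:

```shell
# Rough VRAM budget for vLLM at a given utilization fraction
# (values in MiB; the 24 GiB card is an assumption, not a requirement)
total_mib=24576
util_pct=30            # --gpu-memory-utilization=0.3, as a percentage
budget=$(( total_mib * util_pct / 100 ))
echo "vLLM may use up to ${budget} MiB"   # 7372 MiB on this example card
```

Whatever remains must cover the TTS (~5.3 GB) and STT (~2.5 GB) servers plus anything else on the GPU.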

Use Multiple GPUs

Set CUDA_VISIBLE_DEVICES before starting each service:
# Terminal 1 - STT on GPU 0
CUDA_VISIBLE_DEVICES=0 ./dockerless/start_stt.sh

# Terminal 2 - TTS on GPU 1
CUDA_VISIBLE_DEVICES=1 ./dockerless/start_tts.sh

# Terminal 3 - LLM on GPU 2
CUDA_VISIBLE_DEVICES=2 ./dockerless/start_llm.sh

Troubleshooting

Compilation Errors

Issue: Sentencepiece build fails
Solution: Set the CXXFLAGS environment variable:
export CXXFLAGS="-include cstdint"
This force-includes cstdint, working around a Sentencepiece build failure on GCC 15.

Missing Python Dependencies

Issue: no module named 'huggingface_hub' when running TTS/STT
Solution:
  1. Activate the virtual environment: source .venv/bin/activate
  2. Set LD_LIBRARY_PATH before running cargo install
  3. Force rebuild: cargo install --force --features cuda [email protected]

Wrong moshi-server Binary

Issue: moshi-server: error: unrecognized arguments: worker
Solution: You're using the binary from the Python package instead of the Rust package. Update the Python package:
uv pip install moshi --upgrade  # Must be >=0.2.8

Port Already in Use

Change the port in the respective start script and update the environment variables in the backend start script.

Stopping Services

To stop a service:
  1. Switch to its terminal/tmux pane
  2. Press Ctrl+C
To stop all services at once with tmux:
tmux kill-session -t unmute

Development Workflow

The Dockerless setup is ideal for development because:
  • Frontend hot-reloading: Changes to frontend code reload automatically
  • Backend auto-reload: The --reload flag restarts the backend on code changes
  • Easy debugging: Direct access to logs in each terminal
  • Fast iteration: No Docker image rebuilding needed
