System Overview

Unmute is a real-time voice conversation system that allows text-based Large Language Models (LLMs) to listen and speak by wrapping them with Kyutai’s speech-to-text (STT) and text-to-speech (TTS) models. The architecture is designed for low latency and high throughput, enabling natural voice conversations.

Architecture Diagram

(Diagram: the browser frontend talks to the FastAPI backend over WebSocket; the backend fans out to the STT, LLM, and TTS services.)

Core Components

Frontend (Next.js)

  • Technology: React, TypeScript, Next.js
  • Location: frontend/
  • Responsibilities:
    • User interface and voice selection
    • Microphone capture and Opus encoding
    • WebSocket communication with backend
    • Real-time audio playback and visualization
    • Subtitle rendering

Backend (FastAPI)

  • Technology: Python, FastAPI, asyncio
  • Location: unmute/main_websocket.py, unmute/unmute_handler.py
  • Responsibilities:
    • WebSocket orchestration
    • Audio routing between STT, LLM, and TTS
    • Conversation state management
    • Turn-taking logic and interruption handling
    • Metrics collection (Prometheus)

Speech-to-Text Service

  • Technology: Rust, WebSocket, Kyutai STT models
  • Location: services/moshi-server/
  • Responsibilities:
    • Real-time speech transcription
    • Voice Activity Detection (VAD)
    • Pause prediction
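The STT service's VAD and pause prediction come from Kyutai's learned models; purely as an illustration of the concept, a toy energy-gate VAD over 20 ms frames might look like this (function names and the threshold are illustrative, not the real implementation):

```python
import math

def frame_energy_db(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame, in dBFS."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-10))

def is_speech(frame: list[float], threshold_db: float = -40.0) -> bool:
    """Crude energy gate: treat frames above the threshold as speech."""
    return frame_energy_db(frame) > threshold_db
```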

LLM Server

  • Technology: vLLM (default); any OpenAI-compatible server works
  • Default Model: Llama 3.2 1B Instruct (configurable)
  • Responsibilities:
    • Text response generation
    • Streaming completions
    • Character personality implementation
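Because the LLM server speaks an OpenAI-compatible streaming API, the backend consumes Server-Sent Events and extracts text deltas incrementally. A minimal sketch of that parsing step (the SSE line format is the standard OpenAI chat-completion stream; `iter_sse_tokens` is an illustrative helper, not a function from the codebase):

```python
import json

def iter_sse_tokens(lines):
    """Yield text deltas from an OpenAI-compatible chat-completion stream.

    `lines` is an iterable of decoded SSE lines such as
    'data: {"choices":[{"delta":{"content":"Hi"}}]}'.
    """
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```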

Text-to-Speech Service

  • Technology: Rust, WebSocket, Kyutai TTS models
  • Location: services/moshi-server/
  • Responsibilities:
    • Real-time speech synthesis
    • Voice cloning support
    • Audio streaming with timing synchronization

Communication Protocol

Unmute uses a WebSocket-based protocol inspired by the OpenAI Realtime API:
  • Frontend ↔ Backend: JSON messages over WebSocket (port 80/443)
  • Backend ↔ STT: MessagePack binary protocol over WebSocket
  • Backend ↔ LLM: HTTP streaming (Server-Sent Events)
  • Backend ↔ TTS: MessagePack binary protocol over WebSocket
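On the frontend ↔ backend leg, events are JSON objects over the WebSocket. As a sketch, an audio-append event might be built like this; the event name follows the OpenAI Realtime API the protocol is inspired by, and Unmute's actual event names may differ:

```python
import base64
import json

def audio_append_event(opus_bytes: bytes) -> str:
    """Build a frontend -> backend audio event, modelled on the OpenAI
    Realtime API's `input_audio_buffer.append` message (illustrative)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        # Binary Opus data is base64-encoded to travel inside JSON.
        "audio": base64.b64encode(opus_bytes).decode("ascii"),
    })
```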

Audio Format

  • Sample Rate: 24 kHz
  • Channels: Mono
  • Frame Size: 480 samples (20ms)
  • Codec: Opus (for browser ↔ backend)
  • Internal Format: PCM float32 (for processing)
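These numbers are consistent with each other, which is worth checking when tuning buffers:

```python
SAMPLE_RATE = 24_000    # Hz
FRAME_SAMPLES = 480     # samples per frame
BYTES_PER_SAMPLE = 4    # PCM float32

frame_ms = 1000 * FRAME_SAMPLES / SAMPLE_RATE     # duration of one frame
frame_bytes = FRAME_SAMPLES * BYTES_PER_SAMPLE    # internal PCM size per frame
frames_per_second = SAMPLE_RATE // FRAME_SAMPLES  # frames the backend handles per second
```

So each 20 ms frame is 1920 bytes of internal float32 PCM, and the pipeline processes 50 frames per second per direction.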

Deployment Architectures

Single-GPU Docker Compose

  • All services on one GPU
  • Simplest deployment
  • TTS latency: ~750ms on L40S

Multi-GPU Docker Compose

  • STT, TTS, LLM on separate GPUs
  • Improved latency
  • TTS latency: ~450ms on L40S

Docker Swarm (Production)

  • Multi-node deployment
  • Service discovery
  • Load balancing with Traefik
  • Auto-scaling support
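A multi-GPU Compose layout can be sketched by pinning each heavy service to its own device. The service names, image, and config paths below are illustrative placeholders, not the project's actual files:

```yaml
# Hypothetical sketch: one GPU each for STT and TTS via the Compose device API.
services:
  stt:
    image: moshi-server
    command: ["worker", "--config", "stt.toml"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  tts:
    image: moshi-server
    command: ["worker", "--config", "tts.toml"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
```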

Key Design Decisions

Asynchronous Architecture

The entire backend is built on Python’s asyncio to handle multiple concurrent connections efficiently without blocking:
  • Concurrent STT/TTS/LLM operations
  • Non-blocking I/O for WebSocket connections
  • Quest-based service management
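As a sketch of the concurrency model (stub coroutines standing in for the real STT/LLM clients, which are not shown here), independent awaits overlap on one event loop rather than blocking each other:

```python
import asyncio

async def stt_stub() -> str:
    # Stand-in for streaming transcription over a WebSocket.
    await asyncio.sleep(0.01)
    return "hello"

async def llm_stub(prompt: str) -> str:
    # Stand-in for a streaming LLM completion.
    await asyncio.sleep(0.01)
    return prompt.upper()

async def handle_connection() -> str:
    # Unrelated work (e.g. warming up the TTS connection) overlaps the
    # STT await instead of blocking on it.
    transcript, _ = await asyncio.gather(stt_stub(), asyncio.sleep(0.01))
    return await llm_stub(transcript)

async def main() -> list[str]:
    # Several client connections are served concurrently on one event loop.
    return await asyncio.gather(*(handle_connection() for _ in range(4)))
```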

Low Latency Optimizations

  1. Streaming: All services stream data incrementally
  2. Word-level chunking: LLM output rechunked to word boundaries for TTS
  3. Real-time queuing: TTS audio released at precise timestamps
  4. Frame-based processing: 20ms audio frames for minimal buffering
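The word-level chunking step (2) can be sketched as a small generator: LLM tokens rarely align with word boundaries, so the stream is re-cut at spaces before being fed to TTS. The function name is illustrative, not the codebase's:

```python
def words_from_tokens(tokens):
    """Re-chunk an incremental token stream at word boundaries so TTS
    can start synthesizing as soon as each whole word is available."""
    buffer = ""
    for token in tokens:
        buffer += token
        while " " in buffer:
            word, buffer = buffer.split(" ", 1)
            if word:
                yield word
    if buffer:
        yield buffer  # flush the final word when the stream ends
```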

Interruption Handling

Users can interrupt the bot mid-sentence:
  • VAD-based detection (pause prediction < 0.4)
  • STT word detection during bot speech
  • Queue clearing and service cancellation
  • Clean conversation state transition
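The cancellation-and-clearing step can be sketched with asyncio primitives; the helper below is illustrative, with a sleep standing in for the real TTS stream:

```python
import asyncio

async def interrupt(tts_task: asyncio.Task, audio_queue: asyncio.Queue) -> None:
    """On a detected interruption, cancel synthesis and drop queued audio."""
    tts_task.cancel()
    try:
        await tts_task
    except asyncio.CancelledError:
        pass  # expected: the task was cancelled mid-synthesis
    while not audio_queue.empty():
        audio_queue.get_nowait()  # discard audio the user will never hear

async def demo() -> int:
    queue: asyncio.Queue = asyncio.Queue()
    for frame in (b"a", b"b", b"c"):
        queue.put_nowait(frame)
    task = asyncio.create_task(asyncio.sleep(10))  # stand-in for TTS
    await interrupt(task, queue)
    return queue.qsize()
```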

Service Resilience

  • Connection retry with exponential backoff
  • Health checks for all services
  • Graceful degradation
  • Prometheus metrics for monitoring

Scalability

The architecture is designed to scale horizontally:
  • Backend: Multiple instances behind load balancer (4 concurrent clients per instance)
  • STT/TTS: Service discovery with capacity management
  • LLM: vLLM supports continuous batching and efficient KV-cache management
  • Frontend: Static assets via CDN
