System Overview

Unmute is a real-time voice conversation system that allows text-based Large Language Models (LLMs) to listen and speak by wrapping them with Kyutai’s speech-to-text (STT) and text-to-speech (TTS) models. The architecture is designed for low latency and high throughput, enabling natural voice conversations.

Architecture Diagram

(Diagram: the browser frontend talks to the FastAPI backend over WebSocket; the backend fans out to the STT, LLM, and TTS services.)

Core Components

Frontend (Next.js)

  • Technology: React, TypeScript, Next.js
  • Location: frontend/
  • Responsibilities:
    • User interface and voice selection
    • Microphone capture and Opus encoding
    • WebSocket communication with backend
    • Real-time audio playback and visualization
    • Subtitle rendering

Backend (FastAPI)

  • Technology: Python, FastAPI, asyncio
  • Location: unmute/main_websocket.py, unmute/unmute_handler.py
  • Responsibilities:
    • WebSocket orchestration
    • Audio routing between STT, LLM, and TTS
    • Conversation state management
    • Turn-taking logic and interruption handling
    • Metrics collection (Prometheus)

Speech-to-Text Service

  • Technology: Rust, WebSocket, Kyutai STT models
  • Location: services/moshi-server/
  • Responsibilities:
    • Real-time speech transcription
    • Voice Activity Detection (VAD)
    • Pause prediction
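The STT service's VAD and pause prediction come from Kyutai's learned models; purely as an illustration of the concept, a toy energy-gate VAD over 20 ms frames might look like this (function names and the threshold are illustrative, not the real implementation):

```python
import math

def frame_energy_db(frame: list[float]) -> float:
    """Root-mean-square energy of one audio frame, in dBFS."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20 * math.log10(max(rms, 1e-10))

def is_speech(frame: list[float], threshold_db: float = -40.0) -> bool:
    """Crude energy gate: treat frames above the threshold as speech."""
    return frame_energy_db(frame) > threshold_db
```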

LLM Server

  • Technology: vLLM (default); any OpenAI-compatible server works
  • Default Model: Llama 3.2 1B Instruct (configurable)
  • Responsibilities:
    • Text response generation
    • Streaming completions
    • Character personality implementation
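Because the LLM server speaks an OpenAI-compatible streaming API, the backend consumes Server-Sent Events and extracts text deltas incrementally. A minimal sketch of that parsing step (the SSE line format is the standard OpenAI chat-completion stream; `iter_sse_tokens` is an illustrative helper, not a function from the codebase):

```python
import json

def iter_sse_tokens(lines):
    """Yield text deltas from an OpenAI-compatible chat-completion stream.

    `lines` is an iterable of decoded SSE lines such as
    'data: {"choices":[{"delta":{"content":"Hi"}}]}'.
    """
    for line in lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```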

Text-to-Speech Service

  • Technology: Rust, WebSocket, Kyutai TTS models
  • Location: services/moshi-server/
  • Responsibilities:
    • Real-time speech synthesis
    • Voice cloning support
    • Audio streaming with timing synchronization

Communication Protocol

Unmute uses a WebSocket-based protocol inspired by the OpenAI Realtime API:
  • Frontend ↔ Backend: JSON messages over WebSocket (port 80/443)
  • Backend ↔ STT: MessagePack binary protocol over WebSocket
  • Backend ↔ LLM: HTTP streaming (Server-Sent Events)
  • Backend ↔ TTS: MessagePack binary protocol over WebSocket
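On the frontend ↔ backend leg, events are JSON objects over the WebSocket. As a sketch, an audio-append event might be built like this; the event name follows the OpenAI Realtime API the protocol is inspired by, and Unmute's actual event names may differ:

```python
import base64
import json

def audio_append_event(opus_bytes: bytes) -> str:
    """Build a frontend -> backend audio event, modelled on the OpenAI
    Realtime API's `input_audio_buffer.append` message (illustrative)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        # Binary Opus data is base64-encoded to travel inside JSON.
        "audio": base64.b64encode(opus_bytes).decode("ascii"),
    })
```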

Audio Format

  • Sample Rate: 24 kHz
  • Channels: Mono
  • Frame Size: 480 samples (20ms)
  • Codec: Opus (for browser ↔ backend)
  • Internal Format: PCM float32 (for processing)
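These numbers are consistent with each other, which is worth checking when tuning buffers:

```python
SAMPLE_RATE = 24_000    # Hz
FRAME_SAMPLES = 480     # samples per frame
BYTES_PER_SAMPLE = 4    # PCM float32

frame_ms = 1000 * FRAME_SAMPLES / SAMPLE_RATE     # duration of one frame
frame_bytes = FRAME_SAMPLES * BYTES_PER_SAMPLE    # internal PCM size per frame
frames_per_second = SAMPLE_RATE // FRAME_SAMPLES  # frames the backend handles per second
```

So each 20 ms frame is 1920 bytes of internal float32 PCM, and the pipeline processes 50 frames per second per direction.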

Deployment Architectures

Single-GPU Docker Compose

  • All services on one GPU
  • Simplest deployment
  • TTS latency: ~750ms on L40S

Multi-GPU Docker Compose

  • STT, TTS, LLM on separate GPUs
  • Improved latency
  • TTS latency: ~450ms on L40S

Docker Swarm (Production)

  • Multi-node deployment
  • Service discovery
  • Load balancing with Traefik
  • Auto-scaling support
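A multi-GPU Compose layout can be sketched by pinning each heavy service to its own device. The service names, image, and config paths below are illustrative placeholders, not the project's actual files:

```yaml
# Hypothetical sketch: one GPU each for STT and TTS via the Compose device API.
services:
  stt:
    image: moshi-server
    command: ["worker", "--config", "stt.toml"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  tts:
    image: moshi-server
    command: ["worker", "--config", "tts.toml"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
```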

Key Design Decisions

Asynchronous Architecture

The entire backend is built on Python’s asyncio to handle multiple concurrent connections efficiently without blocking:
  • Concurrent STT/TTS/LLM operations
  • Non-blocking I/O for WebSocket connections
  • Quest-based service management
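As a sketch of the concurrency model (stub coroutines standing in for the real STT/LLM clients, which are not shown here), independent awaits overlap on one event loop rather than blocking each other:

```python
import asyncio

async def stt_stub() -> str:
    # Stand-in for streaming transcription over a WebSocket.
    await asyncio.sleep(0.01)
    return "hello"

async def llm_stub(prompt: str) -> str:
    # Stand-in for a streaming LLM completion.
    await asyncio.sleep(0.01)
    return prompt.upper()

async def handle_connection() -> str:
    # Unrelated work (e.g. warming up the TTS connection) overlaps the
    # STT await instead of blocking on it.
    transcript, _ = await asyncio.gather(stt_stub(), asyncio.sleep(0.01))
    return await llm_stub(transcript)

async def main() -> list[str]:
    # Several client connections are served concurrently on one event loop.
    return await asyncio.gather(*(handle_connection() for _ in range(4)))
```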

Low Latency Optimizations

  1. Streaming: All services stream data incrementally
  2. Word-level chunking: LLM output rechunked to word boundaries for TTS
  3. Real-time queuing: TTS audio released at precise timestamps
  4. Frame-based processing: 20ms audio frames for minimal buffering
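The word-level chunking step (2) can be sketched as a small generator: LLM tokens rarely align with word boundaries, so the stream is re-cut at spaces before being fed to TTS. The function name is illustrative, not the codebase's:

```python
def words_from_tokens(tokens):
    """Re-chunk an incremental token stream at word boundaries so TTS
    can start synthesizing as soon as each whole word is available."""
    buffer = ""
    for token in tokens:
        buffer += token
        while " " in buffer:
            word, buffer = buffer.split(" ", 1)
            if word:
                yield word
    if buffer:
        yield buffer  # flush the final word when the stream ends
```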

Interruption Handling

Users can interrupt the bot mid-sentence:
  • VAD-based detection (pause prediction < 0.4)
  • STT word detection during bot speech
  • Queue clearing and service cancellation
  • Clean conversation state transition
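The cancellation-and-clearing step can be sketched with asyncio primitives; the helper below is illustrative, with a sleep standing in for the real TTS stream:

```python
import asyncio

async def interrupt(tts_task: asyncio.Task, audio_queue: asyncio.Queue) -> None:
    """On a detected interruption, cancel synthesis and drop queued audio."""
    tts_task.cancel()
    try:
        await tts_task
    except asyncio.CancelledError:
        pass  # expected: the task was cancelled mid-synthesis
    while not audio_queue.empty():
        audio_queue.get_nowait()  # discard audio the user will never hear

async def demo() -> int:
    queue: asyncio.Queue = asyncio.Queue()
    for frame in (b"a", b"b", b"c"):
        queue.put_nowait(frame)
    task = asyncio.create_task(asyncio.sleep(10))  # stand-in for TTS
    await interrupt(task, queue)
    return queue.qsize()
```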

Service Resilience

  • Connection retry with exponential backoff
  • Health checks for all services
  • Graceful degradation
  • Prometheus metrics for monitoring

Scalability

The architecture is designed to scale horizontally:
  • Backend: Multiple instances behind load balancer (4 concurrent clients per instance)
  • STT/TTS: Service discovery with capacity management
  • LLM: vLLM supports continuous batching and efficient KV-cache management
  • Frontend: Static assets via CDN
