Unmute’s backend uses a WebSocket protocol based on the OpenAI Realtime API, making it possible to build custom frontends or integrate Unmute into your own applications.

Protocol Overview

The Unmute backend communicates over WebSocket using a JSON-based event protocol. The protocol handles:
  • Real-time bidirectional audio streaming
  • Speech transcription events
  • Session configuration
  • Response generation status
  • Error handling

WebSocket Connection

Endpoint Details

  • URL (string, required): /v1/realtime
  • Protocol (string, required): realtime (WebSocket subprotocol)
  • Port (number): 8000 (development), 80 (production via Traefik)

Establishing a Connection

Connect to the Unmute backend using the WebSocket API with the realtime subprotocol:
const ws = new WebSocket('ws://localhost:8000/v1/realtime', 'realtime');

ws.onopen = () => {
  console.log('Connected to Unmute backend');
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  handleServerEvent(message);
};

Message Structure

All messages follow a common event structure defined in unmute/openai_realtime_api_events.py:
{
  "type": "event.type",
  "event_id": "event_BJhGUIswO2u7vA2Cxw3Jy",
  // ... additional fields specific to event type
}

Client to Server Events

Messages your frontend sends to the backend.

Session Configuration

Required: Send this before the backend will start processing audio.
{
  "type": "session.update",
  "session": {
    "instructions": {
      "type": "smalltalk",
      "language": "en"
    },
    "voice": "unmute-prod-website/p329_022.wav",
    "allow_recording": false
  }
}
session.instructions (object, required)
Defines the character’s conversation behavior. Can be:
  • {"type": "smalltalk", "language": "en"} - General conversation
  • {"type": "constant", "text": "Custom instructions"} - Custom personality
  • {"type": "quiz_show"} - Quiz game mode
  • {"type": "news"} - Tech news discussion
  • {"type": "guess_animal"} - Guessing game
  • {"type": "unmute_explanation"} - Unmute Q&A
session.voice (string, required)
Path to the voice file on the server (e.g., from voices.yaml)
session.allow_recording (boolean, required)
Whether to allow conversation recording
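The session.update message can be built programmatically before sending. A minimal sketch (the `buildSessionUpdate` helper is hypothetical; the voice path is the example value from this page):

```javascript
// Hypothetical helper: assemble a session.update payload.
// Shown with a custom "constant" personality; swap in any instructions
// variant from the list above.
function buildSessionUpdate(instructions, voice, allowRecording) {
  return {
    type: 'session.update',
    session: {
      instructions,
      voice,
      allow_recording: allowRecording,
    },
  };
}

const msg = buildSessionUpdate(
  { type: 'constant', text: 'You are a laconic pirate.' },
  'unmute-prod-website/p329_022.wav',
  false
);
// ws.send(JSON.stringify(msg)) once the socket is open
```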

Audio Input Streaming

Send user microphone audio to the backend:
{
  "type": "input_audio_buffer.append",
  "audio": "base64-encoded-opus-data"
}
Audio Format Requirements:
  • Codec: Opus
  • Sample Rate: 24kHz
  • Channels: Mono
  • Encoding: Base64-encoded bytes

Example: Capturing and Sending Audio

// Request microphone access
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

// Create audio context
const audioContext = new AudioContext({ sampleRate: 24000 });
const source = audioContext.createMediaStreamSource(stream);

// Process audio chunks and encode to Opus
// (implementation depends on your Opus encoder)
// Note: ScriptProcessorNode is deprecated; prefer AudioWorklet in production
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  const audioData = e.inputBuffer.getChannelData(0);
  const opusEncoded = encodeToOpus(audioData); // Your Opus encoder
  // Build the binary string in a loop: spreading a large array into
  // String.fromCharCode can exceed the argument limit
  let binary = '';
  for (const byte of opusEncoded) binary += String.fromCharCode(byte);
  const base64Audio = btoa(binary);
  
  ws.send(JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: base64Audio
  }));
};

source.connect(processor);
processor.connect(audioContext.destination);

Server to Client Events

Messages the backend sends to your frontend.

Session Updated

Confirms session configuration was applied:
{
  "type": "session.updated",
  "event_id": "event_abc123",
  "session": {
    "instructions": { "type": "smalltalk" },
    "voice": "unmute-prod-website/p329_022.wav",
    "allow_recording": false
  }
}

Response Created

Indicates the assistant has started generating a response:
{
  "type": "response.created",
  "event_id": "event_xyz789",
  "response": {
    "object": "realtime.response",
    "status": "in_progress",
    "voice": "unmute-prod-website/p329_022.wav",
    "chat_history": []
  }
}

Audio Response Streaming

Receive generated speech audio:
{
  "type": "response.audio.delta",
  "event_id": "event_audio123",
  "delta": "base64-encoded-opus-audio"
}
Audio Format: Same as input (Opus, 24kHz, mono, base64-encoded)

Example: Playing Audio Response

let audioQueue = [];
let isPlaying = false;

function handleServerEvent(message) {
  if (message.type === 'response.audio.delta') {
    const binary = atob(message.delta); // Decode base64 to a binary string
    const opusData = Uint8Array.from(binary, (c) => c.charCodeAt(0));
    const audioBuffer = decodeOpus(opusData); // Your Opus decoder
    audioQueue.push(audioBuffer);
    
    if (!isPlaying) {
      playNextAudioChunk();
    }
  }
}

function playNextAudioChunk() {
  if (audioQueue.length === 0) {
    isPlaying = false;
    return;
  }
  
  isPlaying = true;
  const buffer = audioQueue.shift();
  
  // Play buffer using the Web Audio API
  // (assumes an AudioContext created earlier, as in the capture example)
  const source = audioContext.createBufferSource();
  source.buffer = buffer;
  source.connect(audioContext.destination);
  source.onended = playNextAudioChunk;
  source.start();
}

Audio Response Complete

{
  "type": "response.audio.done",
  "event_id": "event_done123"
}

Text Response Streaming

Receive the text being generated (useful for subtitles/debugging):
{
  "type": "response.text.delta",
  "event_id": "event_text123",
  "delta": "Hello, how can I "
}

Text Response Complete

{
  "type": "response.text.done",
  "event_id": "event_text_done",
  "text": "Hello, how can I help you today?"
}
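Text deltas can be accumulated incrementally for subtitles, then reconciled against the authoritative response.text.done event. A minimal sketch using the event shapes shown above:

```javascript
// Accumulate response.text.delta events; response.text.done carries the
// complete text, so use it as the final source of truth.
let currentText = '';

function handleTextEvent(event) {
  if (event.type === 'response.text.delta') {
    currentText += event.delta;
  } else if (event.type === 'response.text.done') {
    currentText = event.text;
  }
  return currentText;
}

handleTextEvent({ type: 'response.text.delta', delta: 'Hello, how can I ' });
const finalText = handleTextEvent({
  type: 'response.text.done',
  text: 'Hello, how can I help you today?',
});
```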

Transcription Streaming

Real-time transcription of user speech:
{
  "type": "conversation.item.input_audio_transcription.delta",
  "event_id": "event_trans123",
  "delta": "what's the weather ",
  "start_time": 1234567890.123
}
The start_time field is an Unmute extension not present in the standard OpenAI Realtime API.
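Because each delta carries a start_time, a client can keep a timestamped transcript for word-aligned subtitles. A sketch (the transcript structure is an illustrative assumption, not part of the protocol):

```javascript
// Collect transcription deltas with their start_time so the UI can render
// time-aligned subtitles. start_time is Unmute's extension to the event.
const transcript = [];

function handleTranscriptionDelta(event) {
  if (event.type === 'conversation.item.input_audio_transcription.delta') {
    transcript.push({ text: event.delta, startTime: event.start_time });
  }
}

handleTranscriptionDelta({
  type: 'conversation.item.input_audio_transcription.delta',
  delta: "what's the weather ",
  start_time: 1234567890.123,
});
const fullTranscript = transcript.map((t) => t.text).join('');
```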

Speech Detection Events

// User started speaking (based on STT detection)
{
  "type": "input_audio_buffer.speech_started",
  "event_id": "event_speech_start"
}

// User paused (based on VAD detection)
{
  "type": "input_audio_buffer.speech_stopped",
  "event_id": "event_speech_stop"
}
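These two events pair naturally into a "user is speaking" flag that a frontend can use for a mic indicator or to duck playback. A minimal sketch:

```javascript
// Drive a simple speaking flag from the speech detection events.
let userSpeaking = false;

function handleSpeechEvent(event) {
  if (event.type === 'input_audio_buffer.speech_started') {
    userSpeaking = true;
  } else if (event.type === 'input_audio_buffer.speech_stopped') {
    userSpeaking = false;
  }
  return userSpeaking;
}

const speakingNow = handleSpeechEvent({
  type: 'input_audio_buffer.speech_started',
  event_id: 'event_speech_start',
});
const speakingAfter = handleSpeechEvent({
  type: 'input_audio_buffer.speech_stopped',
  event_id: 'event_speech_stop',
});
```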

VAD Interruption

Indicates the user interrupted the assistant’s response:
{
  "type": "unmute.interrupted_by_vad",
  "event_id": "event_interrupt"
}
This is an Unmute-specific event not in the OpenAI Realtime API.
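On interruption, a client should drop any queued assistant audio so playback stops promptly. A sketch assuming a playback queue like the audioQueue in the earlier playback example:

```javascript
// When the user barges in, discard unplayed assistant audio.
let audioQueue = ['chunk1', 'chunk2']; // stand-ins for decoded buffers
let isPlaying = true;

function handleInterruption(event) {
  if (event.type === 'unmute.interrupted_by_vad') {
    audioQueue.length = 0; // flush pending audio
    isPlaying = false;
  }
}

handleInterruption({ type: 'unmute.interrupted_by_vad', event_id: 'event_interrupt' });
```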

Error Events

{
  "type": "error",
  "event_id": "event_error",
  "error": {
    "type": "server_error",
    "code": "internal_error",
    "message": "Something went wrong",
    "param": null,
    "details": { /* additional error context */ }
  }
}
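A client should at minimum log the structured error fields and decide how to react. A sketch using the field names above (the retry/fatal policy is an illustrative assumption, not part of the protocol):

```javascript
// Defensive error handler: log the structured fields, then classify.
function handleError(event) {
  const { type, code, message } = event.error;
  console.error(`Unmute error [${type}/${code}]: ${message}`);
  // Assumption: treat server errors as retryable; surface the rest to the user
  return type === 'server_error' ? 'retry' : 'fatal';
}

const action = handleError({
  type: 'error',
  event_id: 'event_error',
  error: {
    type: 'server_error',
    code: 'internal_error',
    message: 'Something went wrong',
    param: null,
  },
});
```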

Connection Lifecycle

1. Health Check (Optional)

Before establishing WebSocket, verify backend is running:
const response = await fetch('http://localhost:8000/v1/health');
if (response.ok) {
  console.log('Backend is healthy');
}
2. Establish WebSocket Connection

Connect with the realtime subprotocol:
const ws = new WebSocket('ws://localhost:8000/v1/realtime', 'realtime');
3. Configure Session

Send session.update with character and voice settings. The backend will not process audio until this is sent.
4. Stream Audio

Begin sending microphone audio via input_audio_buffer.append events.
5. Handle Responses

Process incoming audio, text, and transcription events from the backend.
6. Graceful Shutdown

Close the WebSocket connection when done:
ws.close();

Reference Implementation

Unmute includes reference client implementations you can study:

Next.js Frontend

The official frontend implementation:
  • Location: frontend/src/app/Unmute.tsx
  • Framework: React with Next.js
  • Features: Full WebSocket handling, audio recording, playback, UI

Python Load Test Client

A simpler client for testing and benchmarking:
  • Location: unmute/loadtest/loadtest_client.py
  • Use case: Automated testing, latency measurement
  • Language: Python with asyncio
# Run the load test client
uv run unmute/loadtest/loadtest_client.py --server-url ws://localhost:8000 --n-workers 16

OpenAI Realtime API Compatibility

Unmute’s protocol is inspired by the OpenAI Realtime API but includes some differences:

Unmute Extensions

These event types are specific to Unmute:
  • unmute.interrupted_by_vad
  • unmute.response.text.delta.ready
  • unmute.response.audio.delta.ready
  • unmute.additional_outputs
  • unmute.input_audio_buffer.append_anonymized

Simplified Parameters

Some OpenAI parameters are simplified or omitted for Unmute’s specific use case. See unmute/openai_realtime_api_events.py for the complete event schema.

Future Compatibility

The goal is to make Unmute fully compatible with the OpenAI Realtime API so frontends can work with both backends interchangeably. Contributions to improve compatibility are welcome!

Example: Minimal Custom Client

class UnmuteClient {
  constructor(url) {
    this.ws = new WebSocket(url, 'realtime');
    this.setupEventHandlers();
  }
  
  setupEventHandlers() {
    this.ws.onopen = () => this.onConnect();
    this.ws.onmessage = (e) => this.onMessage(JSON.parse(e.data));
    this.ws.onerror = (e) => console.error('WebSocket error:', e);
    this.ws.onclose = () => console.log('Disconnected');
  }
  
  onConnect() {
    // Configure session
    this.send({
      type: 'session.update',
      session: {
        instructions: { type: 'smalltalk' },
        voice: 'unmute-prod-website/p329_022.wav',
        allow_recording: false
      }
    });
  }
  
  onMessage(event) {
    switch (event.type) {
      case 'response.audio.delta':
        this.playAudio(event.delta);
        break;
      case 'response.text.delta':
        this.displayText(event.delta);
        break;
      case 'conversation.item.input_audio_transcription.delta':
        this.showTranscription(event.delta);
        break;
      case 'error':
        console.error('Server error:', event.error);
        break;
    }
  }
  
  send(data) {
    this.ws.send(JSON.stringify(data));
  }
  
  sendAudio(base64OpusData) {
    this.send({
      type: 'input_audio_buffer.append',
      audio: base64OpusData
    });
  }
  
  // Implement these methods based on your needs:
  playAudio(base64Opus) { /* decode and play */ }
  displayText(text) { /* show in UI */ }
  showTranscription(text) { /* show user speech */ }
}

// Usage
const client = new UnmuteClient('ws://localhost:8000/v1/realtime');

Debugging Tips

Enable subtitles in the official frontend by pressing S to see real-time transcription and text responses.
Enable dev mode by setting ALLOW_DEV_MODE = true in frontend/src/hooks/useKeyboardShortcuts.ts, then press D to see detailed debug information.
Check backend logs for detailed WebSocket event information:
docker compose logs -f backend

Further Reading

  • Protocol Documentation: docs/browser_backend_communication.md in the source repository
  • Event Definitions: unmute/openai_realtime_api_events.py
  • OpenAI Realtime API: platform.openai.com/docs/guides/realtime
