Unmute’s backend uses a WebSocket protocol based on the OpenAI Realtime API, making it possible to build custom frontends or integrate Unmute into your own applications.
Protocol Overview
The Unmute backend communicates over WebSocket using a JSON-based event protocol. The protocol handles:
- Real-time bidirectional audio streaming
- Speech transcription events
- Session configuration
- Response generation status
- Error handling
WebSocket Connection
Endpoint Details
- Path: /v1/realtime
- Subprotocol: realtime
- Port: 8000 (development), 80 (production via Traefik)
Establishing a Connection
Connect to the Unmute backend using the WebSocket API with the realtime subprotocol:
const ws = new WebSocket('ws://localhost:8000/v1/realtime', 'realtime');
ws.onopen = () => {
console.log('Connected to Unmute backend');
};
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
handleServerEvent(message);
};
Message Structure
All messages follow a common event structure defined in unmute/openai_realtime_api_events.py:
{
"type": "event.type",
"event_id": "event_BJhGUIswO2u7vA2Cxw3Jy",
// ... additional fields specific to event type
}
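Since every event carries a type field, incoming messages can be routed with a small dispatcher keyed on that field. This is an illustrative sketch, not part of Unmute itself; the handler map and its contents are up to your application:

```javascript
// Route parsed server events to handlers keyed by event type.
// Unknown event types are logged rather than thrown, since the
// protocol may add event types over time.
function makeDispatcher(handlers) {
  return (message) => {
    const handler = handlers[message.type];
    if (handler) {
      handler(message);
    } else {
      console.warn('Unhandled event type:', message.type);
    }
  };
}

// Usage with a WebSocket:
// const dispatch = makeDispatcher({ 'response.text.delta': showSubtitle });
// ws.onmessage = (e) => dispatch(JSON.parse(e.data));
```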
Client to Server Events
Messages your frontend sends to the backend.
Session Configuration
Required: Send this before the backend will start processing audio.
{
"type": "session.update",
"session": {
"instructions": {
"type": "smalltalk",
"language": "en"
},
"voice": "unmute-prod-website/p329_022.wav",
"allow_recording": false
}
}
instructions: Defines the character’s conversation behavior. Can be:
- {"type": "smalltalk", "language": "en"} - General conversation
- {"type": "constant", "text": "Custom instructions"} - Custom personality
- {"type": "quiz_show"} - Quiz game mode
- {"type": "news"} - Tech news discussion
- {"type": "guess_animal"} - Guessing game
- {"type": "unmute_explanation"} - Unmute Q&A
voice: Path to the voice file on the server (e.g., from voices.yaml)
allow_recording: Whether to allow conversation recording
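Putting the fields together, a small helper (an illustrative sketch, not part of Unmute) can build the session.update payload using the example values documented above:

```javascript
// Build the required session.update payload.
// allowRecording defaults to false, matching the example above.
function buildSessionUpdate({ instructions, voice, allowRecording = false }) {
  return {
    type: 'session.update',
    session: {
      instructions,
      voice,
      allow_recording: allowRecording,
    },
  };
}

// Send it as soon as the connection opens:
// ws.onopen = () => ws.send(JSON.stringify(buildSessionUpdate({
//   instructions: { type: 'smalltalk', language: 'en' },
//   voice: 'unmute-prod-website/p329_022.wav',
// })));
```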
Send user microphone audio to the backend:
{
"type": "input_audio_buffer.append",
"audio": "base64-encoded-opus-data"
}
Audio Format Requirements:
- Codec: Opus
- Sample Rate: 24kHz
- Channels: Mono
- Encoding: Base64-encoded bytes
Example: Capturing and Sending Audio
// Request microphone access
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// Create audio context
const audioContext = new AudioContext({ sampleRate: 24000 });
const source = audioContext.createMediaStreamSource(stream);
// Process audio chunks and encode to Opus
// (implementation depends on your Opus encoder)
// Note: ScriptProcessorNode is deprecated; prefer an AudioWorklet in
// production code. It is used here to keep the example short.
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
const audioData = e.inputBuffer.getChannelData(0);
const opusEncoded = encodeToOpus(audioData); // Your Opus encoder
const base64Audio = btoa(String.fromCharCode(...opusEncoded));
ws.send(JSON.stringify({
type: 'input_audio_buffer.append',
audio: base64Audio
}));
};
source.connect(processor);
processor.connect(audioContext.destination);
Server to Client Events
Messages the backend sends to your frontend.
Session Updated
Confirms session configuration was applied:
{
"type": "session.updated",
"event_id": "event_abc123",
"session": {
"instructions": { "type": "smalltalk" },
"voice": "unmute-prod-website/p329_022.wav",
"allow_recording": false
}
}
Response Created
Indicates the assistant has started generating a response:
{
"type": "response.created",
"event_id": "event_xyz789",
"response": {
"object": "realtime.response",
"status": "in_progress",
"voice": "unmute-prod-website/p329_022.wav",
"chat_history": []
}
}
Audio Response Streaming
Receive generated speech audio:
{
"type": "response.audio.delta",
"event_id": "event_audio123",
"delta": "base64-encoded-opus-audio"
}
Audio Format: Same as input (Opus, 24kHz, mono, base64-encoded)
Example: Playing Audio Response
const audioContext = new AudioContext({ sampleRate: 24000 });
let audioQueue = [];
let isPlaying = false;
function handleServerEvent(message) {
if (message.type === 'response.audio.delta') {
const binary = atob(message.delta); // Decode base64 to a byte string
const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
const audioBuffer = decodeOpus(bytes); // Your Opus decoder, returning an AudioBuffer
audioQueue.push(audioBuffer);
if (!isPlaying) {
playNextAudioChunk();
}
}
}
function playNextAudioChunk() {
if (audioQueue.length === 0) {
isPlaying = false;
return;
}
isPlaying = true;
const buffer = audioQueue.shift();
// Play buffer using Web Audio API
const source = audioContext.createBufferSource();
source.buffer = buffer;
source.connect(audioContext.destination);
source.onended = playNextAudioChunk;
source.start();
}
Audio Response Complete
Signals that the assistant’s audio for the current response has finished streaming:
{
"type": "response.audio.done",
"event_id": "event_done123"
}
Text Response Streaming
Receive the text being generated (useful for subtitles/debugging):
{
"type": "response.text.delta",
"event_id": "event_text123",
"delta": "Hello, how can I "
}
Text Response Complete
Contains the complete text once generation has finished:
{
"type": "response.text.done",
"event_id": "event_text_done",
"text": "Hello, how can I help you today?"
}
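The deltas concatenate to the final text, so a small accumulator (a sketch; the names are illustrative) can build up subtitles incrementally and reset on each response.text.done:

```javascript
// Accumulate response.text.delta fragments; on response.text.done,
// return the completed utterance and reset for the next response.
function createTextAccumulator() {
  let text = '';
  return {
    handle(event) {
      if (event.type === 'response.text.delta') {
        text += event.delta;
        return null;
      }
      if (event.type === 'response.text.done') {
        const full = text;
        text = '';
        return full;
      }
      return null;
    },
  };
}
```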
Transcription Streaming
Real-time transcription of user speech:
{
"type": "conversation.item.input_audio_transcription.delta",
"event_id": "event_trans123",
"delta": "what's the weather ",
"start_time": 1234567890.123
}
The start_time field is an Unmute extension not present in the standard OpenAI Realtime API.
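Because each delta carries a start_time, user speech can be aligned to the audio timeline, e.g. for timestamped subtitles. A minimal sketch (the function name is illustrative):

```javascript
// Collect transcription deltas together with their start_time offsets
// (start_time being the Unmute extension described above).
function collectTranscript(events) {
  return events
    .filter((e) => e.type === 'conversation.item.input_audio_transcription.delta')
    .map((e) => ({ text: e.delta, at: e.start_time }));
}
```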
Speech Detection Events
// User started speaking (based on STT detection)
{
"type": "input_audio_buffer.speech_started",
"event_id": "event_speech_start"
}
// User paused (based on VAD detection)
{
"type": "input_audio_buffer.speech_stopped",
"event_id": "event_speech_stop"
}
VAD Interruption
Indicates the user interrupted the assistant’s response:
{
"type": "unmute.interrupted_by_vad",
"event_id": "event_interrupt"
}
This is an Unmute-specific event not in the OpenAI Realtime API.
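When this event arrives, a client should stop assistant playback promptly. A sketch building on the audioQueue playback example above (the currentSource argument, holding the AudioBufferSourceNode currently playing, is illustrative):

```javascript
// Drop any queued assistant audio and stop the chunk currently playing
// when the user interrupts via VAD.
function handleInterruption(event, audioQueue, currentSource) {
  if (event.type !== 'unmute.interrupted_by_vad') return false;
  audioQueue.length = 0; // discard unplayed chunks
  if (currentSource) {
    currentSource.stop(); // stop mid-chunk
  }
  return true;
}
```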
Error Events
{
"type": "error",
"event_id": "event_error",
"error": {
"type": "server_error",
"code": "internal_error",
"message": "Something went wrong",
"param": null,
"details": { /* additional error context */ }
}
}
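The structured error object can be flattened into a string for logging or display. The event itself does not indicate whether the error is fatal, so this sketch treats all errors as non-fatal:

```javascript
// Format the structured error payload for logging or display.
function formatServerError(event) {
  const { type, code, message } = event.error;
  return `[${type}/${code}] ${message}`;
}
```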
Connection Lifecycle
Health Check (Optional)
Before establishing the WebSocket connection, verify that the backend is running:
const response = await fetch('http://localhost:8000/v1/health');
if (response.ok) {
console.log('Backend is healthy');
}
Establish WebSocket Connection
Connect with the realtime subprotocol:
const ws = new WebSocket('ws://localhost:8000/v1/realtime', 'realtime');
Configure Session
Send session.update with character and voice settings. The backend will not process audio until this is sent.
Stream Audio
Begin sending microphone audio via input_audio_buffer.append events.
Handle Responses
Process incoming audio, text, and transcription events from the backend.
Graceful Shutdown
Close the WebSocket connection when done:
ws.close();
Reference Implementation
Unmute includes reference client implementations you can study:
Next.js Frontend
The official frontend implementation:
- Location:
frontend/src/app/Unmute.tsx
- Framework: React with Next.js
- Features: Full WebSocket handling, audio recording, playback, UI
Python Load Test Client
A simpler client for testing and benchmarking:
- Location:
unmute/loadtest/loadtest_client.py
- Use case: Automated testing, latency measurement
- Language: Python with asyncio
# Run the load test client
uv run unmute/loadtest/loadtest_client.py --server-url ws://localhost:8000 --n-workers 16
OpenAI Realtime API Compatibility
Unmute’s protocol is inspired by the OpenAI Realtime API but includes some differences:
Unmute Extensions
These event types are specific to Unmute:
- unmute.interrupted_by_vad
- unmute.response.text.delta.ready
- unmute.response.audio.delta.ready
- unmute.additional_outputs
- unmute.input_audio_buffer.append_anonymized
Simplified Parameters
Some OpenAI parameters are simplified or omitted for Unmute’s specific use case. See unmute/openai_realtime_api_events.py for the complete event schema.
Future Compatibility
The goal is to make Unmute fully compatible with the OpenAI Realtime API so frontends can work with both backends interchangeably. Contributions to improve compatibility are welcome!
Example: Minimal Custom Client
class UnmuteClient {
constructor(url) {
this.ws = new WebSocket(url, 'realtime');
this.setupEventHandlers();
}
setupEventHandlers() {
this.ws.onopen = () => this.onConnect();
this.ws.onmessage = (e) => this.onMessage(JSON.parse(e.data));
this.ws.onerror = (e) => console.error('WebSocket error:', e);
this.ws.onclose = () => console.log('Disconnected');
}
onConnect() {
// Configure session
this.send({
type: 'session.update',
session: {
instructions: { type: 'smalltalk' },
voice: 'unmute-prod-website/p329_022.wav',
allow_recording: false
}
});
}
onMessage(event) {
switch (event.type) {
case 'response.audio.delta':
this.playAudio(event.delta);
break;
case 'response.text.delta':
this.displayText(event.delta);
break;
case 'conversation.item.input_audio_transcription.delta':
this.showTranscription(event.delta);
break;
case 'error':
console.error('Server error:', event.error);
break;
}
}
send(data) {
this.ws.send(JSON.stringify(data));
}
sendAudio(base64OpusData) {
this.send({
type: 'input_audio_buffer.append',
audio: base64OpusData
});
}
// Implement these methods based on your needs:
playAudio(base64Opus) { /* decode and play */ }
displayText(text) { /* show in UI */ }
showTranscription(text) { /* show user speech */ }
}
// Usage
const client = new UnmuteClient('ws://localhost:8000/v1/realtime');
Debugging Tips
Enable subtitles in the official frontend by pressing S to see real-time transcription and text responses.
Enable dev mode by setting ALLOW_DEV_MODE = true in frontend/src/hooks/useKeyboardShortcuts.ts, then press D to see detailed debug information.
Check backend logs for detailed WebSocket event information:
docker compose logs -f backend
Further Reading
- Protocol Documentation:
docs/browser_backend_communication.md in the source repository
- Event Definitions:
unmute/openai_realtime_api_events.py
- OpenAI Realtime API: platform.openai.com/docs/guides/realtime