Unmute’s backend uses a WebSocket protocol based on the OpenAI Realtime API, making it possible to build custom frontends or integrate Unmute into your own applications.
Protocol Overview
The Unmute backend communicates over WebSocket using a JSON-based event protocol. The protocol handles:
- Real-time bidirectional audio streaming
- Speech transcription events
- Session configuration
- Response generation status
- Error handling
WebSocket Connection
Endpoint Details
- Path: /v1/realtime
- Subprotocol: realtime
- Port: 8000 (development), 80 (production via Traefik)
Establishing a Connection
Connect to the Unmute backend using the WebSocket API with the realtime subprotocol:
const ws = new WebSocket('ws://localhost:8000/v1/realtime', 'realtime');
ws.onopen = () => {
console.log('Connected to Unmute backend');
};
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
handleServerEvent(message);
};
Message Structure
All messages follow a common event structure defined in unmute/openai_realtime_api_events.py:
{
"type": "event.type",
"event_id": "event_BJhGUIswO2u7vA2Cxw3Jy",
// ... additional fields specific to event type
}
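Since every event carries a type field, incoming messages can be routed with a small dispatcher keyed on that field. This is an illustrative sketch, not part of Unmute itself; the handler map and its contents are up to your application:

```javascript
// Route parsed server events to handlers keyed by event type.
// Unknown event types are logged rather than thrown, since the
// protocol may add event types over time.
function makeDispatcher(handlers) {
  return (message) => {
    const handler = handlers[message.type];
    if (handler) {
      handler(message);
    } else {
      console.warn('Unhandled event type:', message.type);
    }
  };
}

// Usage with a WebSocket:
// const dispatch = makeDispatcher({ 'response.text.delta': showSubtitle });
// ws.onmessage = (e) => dispatch(JSON.parse(e.data));
```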
Client to Server Events
Messages your frontend sends to the backend.
Session Configuration
Required: Send this before the backend will start processing audio.
{
"type": "session.update",
"session": {
"instructions": {
"type": "smalltalk",
"language": "en"
},
"voice": "unmute-prod-website/p329_022.wav",
"allow_recording": false
}
}
instructions: Defines the character’s conversation behavior. Can be:
- {"type": "smalltalk", "language": "en"} - General conversation
- {"type": "constant", "text": "Custom instructions"} - Custom personality
- {"type": "quiz_show"} - Quiz game mode
- {"type": "news"} - Tech news discussion
- {"type": "guess_animal"} - Guessing game
- {"type": "unmute_explanation"} - Unmute Q&A
voice: Path to the voice file on the server (e.g., from voices.yaml)
allow_recording: Whether to allow conversation recording
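Putting the fields together, a small helper (an illustrative sketch, not part of Unmute) can build the session.update payload using the example values documented above:

```javascript
// Build the required session.update payload.
// allowRecording defaults to false, matching the example above.
function buildSessionUpdate({ instructions, voice, allowRecording = false }) {
  return {
    type: 'session.update',
    session: {
      instructions,
      voice,
      allow_recording: allowRecording,
    },
  };
}

// Send it as soon as the connection opens:
// ws.onopen = () => ws.send(JSON.stringify(buildSessionUpdate({
//   instructions: { type: 'smalltalk', language: 'en' },
//   voice: 'unmute-prod-website/p329_022.wav',
// })));
```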
Send user microphone audio to the backend:
{
"type": "input_audio_buffer.append",
"audio": "base64-encoded-opus-data"
}
Audio Format Requirements:
- Codec: Opus
- Sample Rate: 24kHz
- Channels: Mono
- Encoding: Base64-encoded bytes
Example: Capturing and Sending Audio
// Request microphone access
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// Create audio context
const audioContext = new AudioContext({ sampleRate: 24000 });
const source = audioContext.createMediaStreamSource(stream);
// Process audio chunks and encode to Opus
// (implementation depends on your Opus encoder)
// Note: ScriptProcessorNode is deprecated; prefer an AudioWorklet in
// production code. It is used here to keep the example short.
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
const audioData = e.inputBuffer.getChannelData(0);
const opusEncoded = encodeToOpus(audioData); // Your Opus encoder
const base64Audio = btoa(String.fromCharCode(...opusEncoded));
ws.send(JSON.stringify({
type: 'input_audio_buffer.append',
audio: base64Audio
}));
};
source.connect(processor);
processor.connect(audioContext.destination);
Server to Client Events
Messages the backend sends to your frontend.
Session Updated
Confirms session configuration was applied:
{
"type": "session.updated",
"event_id": "event_abc123",
"session": {
"instructions": { "type": "smalltalk" },
"voice": "unmute-prod-website/p329_022.wav",
"allow_recording": false
}
}
Response Created
Indicates the assistant has started generating a response:
{
"type": "response.created",
"event_id": "event_xyz789",
"response": {
"object": "realtime.response",
"status": "in_progress",
"voice": "unmute-prod-website/p329_022.wav",
"chat_history": []
}
}
Audio Response Streaming
Receive generated speech audio:
{
"type": "response.audio.delta",
"event_id": "event_audio123",
"delta": "base64-encoded-opus-audio"
}
Audio Format: Same as input (Opus, 24kHz, mono, base64-encoded)
Example: Playing Audio Response
const audioContext = new AudioContext({ sampleRate: 24000 });
let audioQueue = [];
let isPlaying = false;
function handleServerEvent(message) {
if (message.type === 'response.audio.delta') {
const binary = atob(message.delta); // Decode base64 to a byte string
const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
const audioBuffer = decodeOpus(bytes); // Your Opus decoder, returning an AudioBuffer
audioQueue.push(audioBuffer);
if (!isPlaying) {
playNextAudioChunk();
}
}
}
function playNextAudioChunk() {
if (audioQueue.length === 0) {
isPlaying = false;
return;
}
isPlaying = true;
const buffer = audioQueue.shift();
// Play buffer using Web Audio API
const source = audioContext.createBufferSource();
source.buffer = buffer;
source.connect(audioContext.destination);
source.onended = playNextAudioChunk;
source.start();
}
Audio Response Complete
Signals that the assistant’s audio for the current response has finished streaming:
{
"type": "response.audio.done",
"event_id": "event_done123"
}
Text Response Streaming
Receive the text being generated (useful for subtitles/debugging):
{
"type": "response.text.delta",
"event_id": "event_text123",
"delta": "Hello, how can I "
}
Text Response Complete
Contains the complete text once generation has finished:
{
"type": "response.text.done",
"event_id": "event_text_done",
"text": "Hello, how can I help you today?"
}
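The deltas concatenate to the final text, so a small accumulator (a sketch; the names are illustrative) can build up subtitles incrementally and reset on each response.text.done:

```javascript
// Accumulate response.text.delta fragments; on response.text.done,
// return the completed utterance and reset for the next response.
function createTextAccumulator() {
  let text = '';
  return {
    handle(event) {
      if (event.type === 'response.text.delta') {
        text += event.delta;
        return null;
      }
      if (event.type === 'response.text.done') {
        const full = text;
        text = '';
        return full;
      }
      return null;
    },
  };
}
```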
Transcription Streaming
Real-time transcription of user speech:
{
"type": "conversation.item.input_audio_transcription.delta",
"event_id": "event_trans123",
"delta": "what's the weather ",
"start_time": 1234567890.123
}
The start_time field is an Unmute extension not present in the standard OpenAI Realtime API.
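Because each delta carries a start_time, user speech can be aligned to the audio timeline, e.g. for timestamped subtitles. A minimal sketch (the function name is illustrative):

```javascript
// Collect transcription deltas together with their start_time offsets
// (start_time being the Unmute extension described above).
function collectTranscript(events) {
  return events
    .filter((e) => e.type === 'conversation.item.input_audio_transcription.delta')
    .map((e) => ({ text: e.delta, at: e.start_time }));
}
```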
Speech Detection Events
// User started speaking (based on STT detection)
{
"type": "input_audio_buffer.speech_started",
"event_id": "event_speech_start"
}
// User paused (based on VAD detection)
{
"type": "input_audio_buffer.speech_stopped",
"event_id": "event_speech_stop"
}
VAD Interruption
Indicates the user interrupted the assistant’s response:
{
"type": "unmute.interrupted_by_vad",
"event_id": "event_interrupt"
}
This is an Unmute-specific event not in the OpenAI Realtime API.
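When this event arrives, a client should stop assistant playback promptly. A sketch building on the audioQueue playback example above (the currentSource argument, holding the AudioBufferSourceNode currently playing, is illustrative):

```javascript
// Drop any queued assistant audio and stop the chunk currently playing
// when the user interrupts via VAD.
function handleInterruption(event, audioQueue, currentSource) {
  if (event.type !== 'unmute.interrupted_by_vad') return false;
  audioQueue.length = 0; // discard unplayed chunks
  if (currentSource) {
    currentSource.stop(); // stop mid-chunk
  }
  return true;
}
```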
Error Events
{
"type": "error",
"event_id": "event_error",
"error": {
"type": "server_error",
"code": "internal_error",
"message": "Something went wrong",
"param": null,
"details": { /* additional error context */ }
}
}
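The structured error object can be flattened into a string for logging or display. The event itself does not indicate whether the error is fatal, so this sketch treats all errors as non-fatal:

```javascript
// Format the structured error payload for logging or display.
function formatServerError(event) {
  const { type, code, message } = event.error;
  return `[${type}/${code}] ${message}`;
}
```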
Connection Lifecycle
Health Check (Optional)
Before establishing the WebSocket connection, verify that the backend is running:
const response = await fetch('http://localhost:8000/v1/health');
if (response.ok) {
console.log('Backend is healthy');
}
Establish WebSocket Connection
Connect with the realtime subprotocol:
const ws = new WebSocket('ws://localhost:8000/v1/realtime', 'realtime');
Configure Session
Send session.update with character and voice settings. The backend will not process audio until this is sent.
Stream Audio
Begin sending microphone audio via input_audio_buffer.append events.
Handle Responses
Process incoming audio, text, and transcription events from the backend.
Graceful Shutdown
Close the WebSocket connection when done:
ws.close();
Reference Implementation
Unmute includes reference client implementations you can study:
Next.js Frontend
The official frontend implementation:
- Location:
frontend/src/app/Unmute.tsx
- Framework: React with Next.js
- Features: Full WebSocket handling, audio recording, playback, UI
Python Load Test Client
A simpler client for testing and benchmarking:
- Location:
unmute/loadtest/loadtest_client.py
- Use case: Automated testing, latency measurement
- Language: Python with asyncio
# Run the load test client
uv run unmute/loadtest/loadtest_client.py --server-url ws://localhost:8000 --n-workers 16
OpenAI Realtime API Compatibility
Unmute’s protocol is inspired by the OpenAI Realtime API but includes some differences:
Unmute Extensions
These event types are specific to Unmute:
- unmute.interrupted_by_vad
- unmute.response.text.delta.ready
- unmute.response.audio.delta.ready
- unmute.additional_outputs
- unmute.input_audio_buffer.append_anonymized
Simplified Parameters
Some OpenAI parameters are simplified or omitted for Unmute’s specific use case. See unmute/openai_realtime_api_events.py for the complete event schema.
Future Compatibility
The goal is to make Unmute fully compatible with the OpenAI Realtime API so frontends can work with both backends interchangeably. Contributions to improve compatibility are welcome!
Example: Minimal Custom Client
class UnmuteClient {
constructor(url) {
this.ws = new WebSocket(url, 'realtime');
this.setupEventHandlers();
}
setupEventHandlers() {
this.ws.onopen = () => this.onConnect();
this.ws.onmessage = (e) => this.onMessage(JSON.parse(e.data));
this.ws.onerror = (e) => console.error('WebSocket error:', e);
this.ws.onclose = () => console.log('Disconnected');
}
onConnect() {
// Configure session
this.send({
type: 'session.update',
session: {
instructions: { type: 'smalltalk' },
voice: 'unmute-prod-website/p329_022.wav',
allow_recording: false
}
});
}
onMessage(event) {
switch (event.type) {
case 'response.audio.delta':
this.playAudio(event.delta);
break;
case 'response.text.delta':
this.displayText(event.delta);
break;
case 'conversation.item.input_audio_transcription.delta':
this.showTranscription(event.delta);
break;
case 'error':
console.error('Server error:', event.error);
break;
}
}
send(data) {
this.ws.send(JSON.stringify(data));
}
sendAudio(base64OpusData) {
this.send({
type: 'input_audio_buffer.append',
audio: base64OpusData
});
}
// Implement these methods based on your needs:
playAudio(base64Opus) { /* decode and play */ }
displayText(text) { /* show in UI */ }
showTranscription(text) { /* show user speech */ }
}
// Usage
const client = new UnmuteClient('ws://localhost:8000/v1/realtime');
Debugging Tips
Enable subtitles in the official frontend by pressing S to see real-time transcription and text responses.
Enable dev mode by setting ALLOW_DEV_MODE = true in frontend/src/hooks/useKeyboardShortcuts.ts, then press D to see detailed debug information.
Check backend logs for detailed WebSocket event information:
docker compose logs -f backend
Further Reading
- Protocol Documentation:
docs/browser_backend_communication.md in the source repository
- Event Definitions:
unmute/openai_realtime_api_events.py
- OpenAI Realtime API: platform.openai.com/docs/guides/realtime