
Overview

Unmute uses a WebSocket-based protocol inspired by the OpenAI Realtime API for real-time voice conversations. The protocol handles:
  • Real-time audio streaming (bidirectional)
  • Voice conversation transcription
  • Session configuration
  • Error handling and debugging

Connection Setup

Endpoint

  • URL (string, required): /v1/realtime
  • Protocol (string, required): realtime (WebSocket subprotocol)
  • Port (number):
    • Development: 8000
    • Production: routed through Traefik (HTTP port 80, HTTPS port 443)

Establishing Connection

The WebSocket connection is established using the realtime subprotocol. This subprotocol identifier is required for the client to connect properly.
Frontend implementation (frontend/src/app/Unmute.tsx:91-97):
const { sendMessage, lastMessage, readyState } = useWebSocket(
  webSocketUrl || null,
  {
    protocols: ["realtime"],
  },
  shouldConnect,
);
Backend implementation (unmute/main_websocket.py:310-314):
# The `subprotocol` argument is important because the client specifies what
# protocol(s) it supports and OpenAI uses "realtime" as the value.
await websocket.accept(subprotocol="realtime")
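Under the hood, offering the realtime subprotocol just adds a Sec-WebSocket-Protocol header to the HTTP upgrade request, which the server's accept(subprotocol="realtime") echoes back. A sketch of that request, using the header names from RFC 6455 (the path is the endpoint above; the key value is illustrative):

```python
import base64
import os


def upgrade_request(host: str, path: str = "/v1/realtime") -> str:
    """Build the HTTP upgrade request implied by the realtime handshake."""
    # Sec-WebSocket-Key is 16 random bytes, base64-encoded (RFC 6455).
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        # The client's offered subprotocol; the server must select it.
        "Sec-WebSocket-Protocol: realtime\r\n"
        "\r\n"
    )
```

Any WebSocket client library that lets you pass a subprotocol list (as the useWebSocket `protocols` option does above) produces an equivalent request.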

Message Structure

All messages are JSON-encoded with a common structure defined in unmute/openai_realtime_api_events.py. Every message inherits from BaseEvent which provides:
  • type (string, required): The event type identifier (e.g., "session.update", "response.audio.delta")
  • event_id (string): Unique identifier for the event (format: event_<21_random_chars>)
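The shape of BaseEvent can be sketched with a stdlib dataclass. This is illustrative only (the real definitions live in unmute/openai_realtime_api_events.py); the alphabet used for the 21 random characters is an assumption:

```python
import json
import secrets
import string
from dataclasses import dataclass, field

_ALPHABET = string.ascii_letters + string.digits


def _event_id() -> str:
    # "event_" plus 21 random characters, matching event_<21_random_chars>.
    return "event_" + "".join(secrets.choice(_ALPHABET) for _ in range(21))


@dataclass
class BaseEvent:
    """Common fields shared by every protocol message."""

    type: str
    event_id: str = field(default_factory=_event_id)

    def to_json(self) -> str:
        return json.dumps({"type": self.type, "event_id": self.event_id})
```

Concrete events add their own fields (audio, delta, session, ...) on top of these two.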

Client → Server Messages

Audio Input Streaming

Type: input_audio_buffer.append
Streams real-time audio data from the microphone to the backend.
  • type (string, required): "input_audio_buffer.append"
  • audio (string, required): Base64-encoded Opus audio data
Audio Format:
  • Codec: Opus
  • Sample Rate: 24kHz
  • Channels: Mono
  • Encoding: Base64-encoded bytes
Frontend Example (frontend/src/app/Unmute.tsx:100-110):
const onOpusRecorded = useCallback(
  (opus: Uint8Array) => {
    sendMessage(
      JSON.stringify({
        type: "input_audio_buffer.append",
        audio: base64EncodeOpus(opus),
      }),
    );
  },
  [sendMessage],
);
Backend Processing (unmute/main_websocket.py:460-477):
if isinstance(message, ora.InputAudioBufferAppend):
    opus_bytes = base64.b64decode(message.audio)
    if wait_for_first_opus:
        # Check for first packet bit
        if opus_bytes[5] & 2:
            wait_for_first_opus = False
        else:
            continue
    pcm = await asyncio.to_thread(opus_reader.append_bytes, opus_bytes)
    
    if pcm.size:
        await handler.receive((SAMPLE_RATE, pcm[np.newaxis, :]))
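The opus_bytes[5] & 2 test above reads the Ogg container framing: byte 5 of an Ogg page is the header_type field, and bit 0x02 is the beginning-of-stream (BOS) flag marking the first page of a logical stream. A minimal illustrative check (the byte values below are fake headers, not real Opus data):

```python
def is_first_ogg_page(page: bytes) -> bool:
    """True if `page` looks like the first (BOS) page of an Ogg stream."""
    # Every Ogg page starts with the capture pattern "OggS"; byte 5
    # carries the header_type flags, where 0x02 = beginning-of-stream.
    return page[:4] == b"OggS" and bool(page[5] & 0x02)


# Fake page headers, for illustration only:
bos_page = b"OggS" + bytes([0x00, 0x02])    # version 0, BOS flag set
later_page = b"OggS" + bytes([0x00, 0x00])  # continuation page
```

This is why the backend waits before feeding the decoder: sphn.OpusStreamReader needs the stream to begin at the BOS page, so packets arriving before it are dropped.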

Session Configuration

Type: session.update
Configures the voice character and conversation instructions. The backend will not start processing until it receives this message.
  • type (string, required): "session.update"
  • session (object, required): Session configuration object
  • session.instructions (object): Conversation instructions (Unmute extension)
  • session.voice (string): Voice identifier for TTS
  • session.allow_recording (boolean, required): Whether to allow recording of the conversation
Frontend Example (frontend/src/app/Unmute.tsx:232-240):
sendMessage(
  JSON.stringify({
    type: "session.update",
    session: {
      instructions: unmuteConfig.instructions,
      voice: unmuteConfig.voice,
      allow_recording: recordingConsent,
    },
  }),
);

Server → Client Messages

Audio Response Streaming

Type: response.audio.delta
Streams generated speech audio to the frontend.
  • type (string): "response.audio.delta"
  • event_id (string): Unique event identifier
  • delta (string): Base64-encoded Opus audio data chunk
Frontend Handling (frontend/src/app/Unmute.tsx:164-175):
if (data.type === "response.audio.delta") {
  const opus = base64DecodeOpus(data.delta);
  const ap = audioProcessor.current;
  if (!ap) return;

  ap.decoder.postMessage(
    {
      command: "decode",
      pages: opus,
    },
    [opus.buffer],
  );
}

Speech Transcription

Type: conversation.item.input_audio_transcription.delta
Real-time transcription of user speech.
  • type (string): "conversation.item.input_audio_transcription.delta"
  • delta (string): Transcribed text chunk
  • start_time (number): Start time of the transcription (Unmute extension)
Frontend Handling (frontend/src/app/Unmute.tsx:186-193):
else if (data.type === "conversation.item.input_audio_transcription.delta") {
  // Transcription of the user speech
  setRawChatHistory((prev) => [
    ...prev,
    { role: "user", content: data.delta },
  ]);
}

Text Response Streaming

Type: response.text.delta
Streams generated text responses for display or debugging.
  • type (string): "response.text.delta"
  • delta (string): Text chunk from the LLM response
Frontend Handling (frontend/src/app/Unmute.tsx:194-202):
else if (data.type === "response.text.delta") {
  setRawChatHistory((prev) => [
    ...prev,
    // The TTS doesn't include spaces in its messages, so add a leading space
    { role: "assistant", content: " " + data.delta },
  ]);
}

Speech Detection Events

Types:
  • input_audio_buffer.speech_started
  • input_audio_buffer.speech_stopped
Indicate when the user starts or stops speaking based on Voice Activity Detection (VAD).
  • type (string): "input_audio_buffer.speech_started" or "input_audio_buffer.speech_stopped"
These events are currently reported but not actively used in the Unmute frontend for UI feedback.

Response Status Updates

Type: response.created
Indicates when the assistant starts generating a response.
  • type (string): "response.created"
  • response (object): Response metadata object
  • response.object (string): "realtime.response"
  • response.status (string): One of "in_progress", "completed", "cancelled", "failed", "incomplete"
  • response.voice (string): Voice identifier being used
  • response.chat_history (array): Array of chat history objects

Error Handling

Type: error
Communicates errors and warnings to the client.
  • type (string): "error"
  • error (object): Error details object
  • error.type (string): Error type (e.g., "warning", "fatal", "invalid_request_error")
  • error.code (string, optional): Error code
  • error.message (string): Human-readable error message
  • error.param (string, optional): Parameter that caused the error
  • error.details (object): Additional error details (Unmute extension)
Frontend Handling (frontend/src/app/Unmute.tsx:178-185):
else if (data.type === "error") {
  if (data.error.type === "warning") {
    console.warn(`Warning from server: ${data.error.message}`, data);
  } else {
    console.error(`Error from server: ${data.error.message}`, data);
    setErrors((prev) => [...prev, makeErrorItem(data.error.message)]);
  }
}

Unmute-Specific Events

Additional Outputs

Type: unmute.additional_outputs
Provides debugging information and additional outputs.
  • type (string): "unmute.additional_outputs"
  • args (any): Debug dictionary or additional output data

Text Delta Ready

Type: unmute.response.text.delta.ready
Indicates that a text delta is ready to be sent.
  • type (string): "unmute.response.text.delta.ready"
  • delta (string): Text delta content

Audio Delta Ready

Type: unmute.response.audio.delta.ready
Indicates that audio samples are ready.
  • type (string): "unmute.response.audio.delta.ready"
  • number_of_samples (number): Number of audio samples ready

VAD Interruption

Type: unmute.interrupted_by_vad
Indicates that the VAD interrupted the response generation.
  • type (string): "unmute.interrupted_by_vad"

Connection Lifecycle

  1. Health Check: Frontend checks /v1/health endpoint before connecting
    const response = await fetch(`${backendServerUrl}/v1/health`);
    const data = await response.json();
    // Check data.ok, data.tts_up, data.stt_up, data.llm_up
    
  2. WebSocket Connection: Establish connection with realtime protocol
  3. Session Setup: Send session.update with voice and instructions
    • Backend will not process audio until this is received
  4. Audio Streaming: Bidirectional real-time audio communication
    • Client sends input_audio_buffer.append messages
    • Server sends response.audio.delta messages
    • Transcription and text deltas flow concurrently
  5. Graceful Shutdown: Handle disconnection and cleanup
    • Frontend stops audio processing
    • Backend cleans up resources via UnmuteHandler.cleanup()
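The ordering constraint in steps 3-4 can be sketched as a message builder: session.update must come first, because the backend ignores audio until the session is configured. The function and parameter names here are illustrative, not part of the codebase:

```python
import base64
import json


def make_setup_messages(
    voice: str,
    instructions,
    allow_recording: bool,
    opus_chunks: list[bytes],
) -> list[str]:
    """Build the JSON frames a client sends after connecting, in order."""
    # Step 3: configure the session before any audio is sent.
    messages = [
        json.dumps({
            "type": "session.update",
            "session": {
                "instructions": instructions,
                "voice": voice,
                "allow_recording": allow_recording,
            },
        })
    ]
    # Step 4: stream audio, one base64-encoded Opus chunk per frame.
    for chunk in opus_chunks:
        messages.append(
            json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            })
        )
    return messages
```

Sending the frames in this order reproduces what the frontend does across its session.update and onOpusRecorded code paths.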

Implementation Details

Backend Message Loop

The backend uses two concurrent loops (unmute/main_websocket.py:391-403):
async with asyncio.TaskGroup() as tg:
    tg.create_task(
        receive_loop(websocket, handler, emit_queue), name="receive_loop()"
    )
    tg.create_task(
        emit_loop(websocket, handler, emit_queue), name="emit_loop()"
    )
    tg.create_task(handler.quest_manager.wait(), name="quest_manager.wait()")
  • receive_loop: Receives messages from the WebSocket, processes audio, handles session updates
  • emit_loop: Sends messages to the WebSocket from the emit queue and handler
  • quest_manager: Manages processing quests and tasks

Audio Encoding/Decoding

Frontend:
  • Uses opus-recorder library for recording microphone input
  • Encodes to Opus at 24kHz sample rate
  • Uses Web Audio API decoder for playback
Backend:
  • Uses sphn.OpusStreamReader for decoding incoming audio
  • Uses sphn.OpusStreamWriter for encoding outgoing audio
  • Processes audio at 24kHz sample rate
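In both directions, the raw Opus bytes cross the wire as base64 text inside the JSON frames (the audio and delta fields above). A round-trip sketch of that framing; the packet bytes are placeholders, not a valid Opus packet:

```python
import base64


def encode_audio_field(opus_bytes: bytes) -> str:
    """Encode Opus bytes for the `audio` / `delta` JSON fields."""
    return base64.b64encode(opus_bytes).decode("ascii")


def decode_audio_field(audio: str) -> bytes:
    """Recover the raw Opus bytes from a received JSON field."""
    return base64.b64decode(audio)


packet = bytes([0x4F, 0x67, 0x67, 0x53, 0x00, 0x02])  # placeholder bytes
assert decode_audio_field(encode_audio_field(packet)) == packet
```

These helpers mirror base64EncodeOpus / base64DecodeOpus on the frontend and base64.b64decode in the backend's receive loop.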

Error Handling

The protocol includes comprehensive error handling:
  • Invalid JSON: Returns invalid_request_error with details
  • Validation Errors: Returns invalid_request_error with validation details
  • Service Unavailable: Returns fatal error and closes connection
  • Warnings: Logged but don’t disrupt the connection
