
Overview

Unmute uses a WebSocket-based protocol inspired by the OpenAI Realtime API for real-time voice conversations. The protocol handles:
  • Real-time audio streaming (bidirectional)
  • Voice conversation transcription
  • Session configuration
  • Error handling and debugging

Connection Setup

Endpoint

  • URL (string, required): /v1/realtime
  • Protocol (string, required): realtime (WebSocket subprotocol)
  • Port (number):
    • Development: 8000
    • Production: routed through Traefik (HTTP port 80, HTTPS port 443)

Establishing Connection

The WebSocket connection is established using the realtime subprotocol. This subprotocol identifier is required for the client to connect properly.
Frontend implementation (frontend/src/app/Unmute.tsx:91-97):
const { sendMessage, lastMessage, readyState } = useWebSocket(
  webSocketUrl || null,
  {
    protocols: ["realtime"],
  },
  shouldConnect,
);
Backend implementation (unmute/main_websocket.py:310-314):
# The `subprotocol` argument is important because the client specifies what
# protocol(s) it supports and OpenAI uses "realtime" as the value.
await websocket.accept(subprotocol="realtime")
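Under the hood, offering the realtime subprotocol just adds a Sec-WebSocket-Protocol header to the HTTP upgrade request, which the server's accept(subprotocol="realtime") echoes back. A sketch of that request, using the header names from RFC 6455 (the path is the endpoint above; the key value is illustrative):

```python
import base64
import os


def upgrade_request(host: str, path: str = "/v1/realtime") -> str:
    """Build the HTTP upgrade request implied by the realtime handshake."""
    # Sec-WebSocket-Key is 16 random bytes, base64-encoded (RFC 6455).
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        # The client's offered subprotocol; the server must select it.
        "Sec-WebSocket-Protocol: realtime\r\n"
        "\r\n"
    )
```

Any WebSocket client library that lets you pass a subprotocol list (as the useWebSocket `protocols` option does above) produces an equivalent request.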

Message Structure

All messages are JSON-encoded with a common structure defined in unmute/openai_realtime_api_events.py. Every message inherits from BaseEvent which provides:
  • type (string, required): The event type identifier (e.g., "session.update", "response.audio.delta")
  • event_id (string): Unique identifier for the event (format: event_<21_random_chars>)
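The shape of BaseEvent can be sketched with a stdlib dataclass. This is illustrative only (the real definitions live in unmute/openai_realtime_api_events.py); the alphabet used for the 21 random characters is an assumption:

```python
import json
import secrets
import string
from dataclasses import dataclass, field

_ALPHABET = string.ascii_letters + string.digits


def _event_id() -> str:
    # "event_" plus 21 random characters, matching event_<21_random_chars>.
    return "event_" + "".join(secrets.choice(_ALPHABET) for _ in range(21))


@dataclass
class BaseEvent:
    """Common fields shared by every protocol message."""

    type: str
    event_id: str = field(default_factory=_event_id)

    def to_json(self) -> str:
        return json.dumps({"type": self.type, "event_id": self.event_id})
```

Concrete events add their own fields (audio, delta, session, ...) on top of these two.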

Client → Server Messages

Audio Input Streaming

Type: input_audio_buffer.append
Streams real-time audio data from the microphone to the backend.
  • type (string, required): "input_audio_buffer.append"
  • audio (string, required): Base64-encoded Opus audio data
Audio Format:
  • Codec: Opus
  • Sample Rate: 24kHz
  • Channels: Mono
  • Encoding: Base64-encoded bytes
Frontend Example (frontend/src/app/Unmute.tsx:100-110):
const onOpusRecorded = useCallback(
  (opus: Uint8Array) => {
    sendMessage(
      JSON.stringify({
        type: "input_audio_buffer.append",
        audio: base64EncodeOpus(opus),
      }),
    );
  },
  [sendMessage],
);
Backend Processing (unmute/main_websocket.py:460-477):
if isinstance(message, ora.InputAudioBufferAppend):
    opus_bytes = base64.b64decode(message.audio)
    if wait_for_first_opus:
        # Check for first packet bit
        if opus_bytes[5] & 2:
            wait_for_first_opus = False
        else:
            continue
    pcm = await asyncio.to_thread(opus_reader.append_bytes, opus_bytes)
    
    if pcm.size:
        await handler.receive((SAMPLE_RATE, pcm[np.newaxis, :]))
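The opus_bytes[5] & 2 test above reads the Ogg container framing: byte 5 of an Ogg page is the header_type field, and bit 0x02 is the beginning-of-stream (BOS) flag marking the first page of a logical stream. A minimal illustrative check (the byte values below are fake headers, not real Opus data):

```python
def is_first_ogg_page(page: bytes) -> bool:
    """True if `page` looks like the first (BOS) page of an Ogg stream."""
    # Every Ogg page starts with the capture pattern "OggS"; byte 5
    # carries the header_type flags, where 0x02 = beginning-of-stream.
    return page[:4] == b"OggS" and bool(page[5] & 0x02)


# Fake page headers, for illustration only:
bos_page = b"OggS" + bytes([0x00, 0x02])    # version 0, BOS flag set
later_page = b"OggS" + bytes([0x00, 0x00])  # continuation page
```

This is why the backend waits before feeding the decoder: sphn.OpusStreamReader needs the stream to begin at the BOS page, so packets arriving before it are dropped.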

Session Configuration

Type: session.update
Configures the voice character and conversation instructions. The backend will not start processing until it receives this message.
  • type (string, required): "session.update"
  • session (object, required): Session configuration object
  • session.instructions (object): Conversation instructions (Unmute extension)
  • session.voice (string): Voice identifier for TTS
  • session.allow_recording (boolean, required): Whether to allow recording of the conversation
Frontend Example (frontend/src/app/Unmute.tsx:232-240):
sendMessage(
  JSON.stringify({
    type: "session.update",
    session: {
      instructions: unmuteConfig.instructions,
      voice: unmuteConfig.voice,
      allow_recording: recordingConsent,
    },
  }),
);

Server → Client Messages

Audio Response Streaming

Type: response.audio.delta
Streams generated speech audio to the frontend.
  • type (string): "response.audio.delta"
  • event_id (string): Unique event identifier
  • delta (string): Base64-encoded Opus audio data chunk
Frontend Handling (frontend/src/app/Unmute.tsx:164-175):
if (data.type === "response.audio.delta") {
  const opus = base64DecodeOpus(data.delta);
  const ap = audioProcessor.current;
  if (!ap) return;

  ap.decoder.postMessage(
    {
      command: "decode",
      pages: opus,
    },
    [opus.buffer],
  );
}

Speech Transcription

Type: conversation.item.input_audio_transcription.delta
Real-time transcription of user speech.
  • type (string): "conversation.item.input_audio_transcription.delta"
  • delta (string): Transcribed text chunk
  • start_time (number): Start time of the transcription (Unmute extension)
Frontend Handling (frontend/src/app/Unmute.tsx:186-193):
else if (data.type === "conversation.item.input_audio_transcription.delta") {
  // Transcription of the user speech
  setRawChatHistory((prev) => [
    ...prev,
    { role: "user", content: data.delta },
  ]);
}

Text Response Streaming

Type: response.text.delta
Streams generated text responses for display or debugging.
  • type (string): "response.text.delta"
  • delta (string): Text chunk from the LLM response
Frontend Handling (frontend/src/app/Unmute.tsx:194-202):
else if (data.type === "response.text.delta") {
  setRawChatHistory((prev) => [
    ...prev,
    // The TTS doesn't include spaces in its messages, so add a leading space
    { role: "assistant", content: " " + data.delta },
  ]);
}

Speech Detection Events

Types:
  • input_audio_buffer.speech_started
  • input_audio_buffer.speech_stopped
Indicate when the user starts or stops speaking based on Voice Activity Detection (VAD).
  • type (string): "input_audio_buffer.speech_started" or "input_audio_buffer.speech_stopped"
These events are currently reported but not actively used in the Unmute frontend for UI feedback.

Response Status Updates

Type: response.created
Indicates when the assistant starts generating a response.
  • type (string): "response.created"
  • response (object): Response metadata object
  • response.object (string): "realtime.response"
  • response.status (string): One of "in_progress", "completed", "cancelled", "failed", "incomplete"
  • response.voice (string): Voice identifier being used
  • response.chat_history (array): Array of chat history objects

Error Handling

Type: error
Communicates errors and warnings to the client.
  • type (string): "error"
  • error (object): Error details object
  • error.type (string): Error type (e.g., "warning", "fatal", "invalid_request_error")
  • error.code (string, optional): Error code
  • error.message (string): Human-readable error message
  • error.param (string, optional): Parameter that caused the error
  • error.details (object): Additional error details (Unmute extension)
Frontend Handling (frontend/src/app/Unmute.tsx:178-185):
else if (data.type === "error") {
  if (data.error.type === "warning") {
    console.warn(`Warning from server: ${data.error.message}`, data);
  } else {
    console.error(`Error from server: ${data.error.message}`, data);
    setErrors((prev) => [...prev, makeErrorItem(data.error.message)]);
  }
}

Unmute-Specific Events

Additional Outputs

Type: unmute.additional_outputs
Provides debugging information and additional outputs.
  • type (string): "unmute.additional_outputs"
  • args (any): Debug dictionary or additional output data

Text Delta Ready

Type: unmute.response.text.delta.ready
Indicates that a text delta is ready to be sent.
  • type (string): "unmute.response.text.delta.ready"
  • delta (string): Text delta content

Audio Delta Ready

Type: unmute.response.audio.delta.ready
Indicates that audio samples are ready.
  • type (string): "unmute.response.audio.delta.ready"
  • number_of_samples (number): Number of audio samples ready

VAD Interruption

Type: unmute.interrupted_by_vad
Indicates that the VAD interrupted the response generation.
  • type (string): "unmute.interrupted_by_vad"

Connection Lifecycle

  1. Health Check: Frontend checks /v1/health endpoint before connecting
    const response = await fetch(`${backendServerUrl}/v1/health`);
    const data = await response.json();
    // Check data.ok, data.tts_up, data.stt_up, data.llm_up
    
  2. WebSocket Connection: Establish connection with realtime protocol
  3. Session Setup: Send session.update with voice and instructions
    • Backend will not process audio until this is received
  4. Audio Streaming: Bidirectional real-time audio communication
    • Client sends input_audio_buffer.append messages
    • Server sends response.audio.delta messages
    • Transcription and text deltas flow concurrently
  5. Graceful Shutdown: Handle disconnection and cleanup
    • Frontend stops audio processing
    • Backend cleans up resources via UnmuteHandler.cleanup()
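The ordering constraint in steps 3-4 can be sketched as a message builder: session.update must come first, because the backend ignores audio until the session is configured. The function and parameter names here are illustrative, not part of the codebase:

```python
import base64
import json


def make_setup_messages(
    voice: str,
    instructions,
    allow_recording: bool,
    opus_chunks: list[bytes],
) -> list[str]:
    """Build the JSON frames a client sends after connecting, in order."""
    # Step 3: configure the session before any audio is sent.
    messages = [
        json.dumps({
            "type": "session.update",
            "session": {
                "instructions": instructions,
                "voice": voice,
                "allow_recording": allow_recording,
            },
        })
    ]
    # Step 4: stream audio, one base64-encoded Opus chunk per frame.
    for chunk in opus_chunks:
        messages.append(
            json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            })
        )
    return messages
```

Sending the frames in this order reproduces what the frontend does across its session.update and onOpusRecorded code paths.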

Implementation Details

Backend Message Loop

The backend uses two concurrent loops (unmute/main_websocket.py:391-403):
async with asyncio.TaskGroup() as tg:
    tg.create_task(
        receive_loop(websocket, handler, emit_queue), name="receive_loop()"
    )
    tg.create_task(
        emit_loop(websocket, handler, emit_queue), name="emit_loop()"
    )
    tg.create_task(handler.quest_manager.wait(), name="quest_manager.wait()")
  • receive_loop: Receives messages from the WebSocket, processes audio, handles session updates
  • emit_loop: Sends messages to the WebSocket from the emit queue and handler
  • quest_manager: Manages processing quests and tasks

Audio Encoding/Decoding

Frontend:
  • Uses opus-recorder library for recording microphone input
  • Encodes to Opus at 24kHz sample rate
  • Uses Web Audio API decoder for playback
Backend:
  • Uses sphn.OpusStreamReader for decoding incoming audio
  • Uses sphn.OpusStreamWriter for encoding outgoing audio
  • Processes audio at 24kHz sample rate
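In both directions, the raw Opus bytes cross the wire as base64 text inside the JSON frames (the audio and delta fields above). A round-trip sketch of that framing; the packet bytes are placeholders, not a valid Opus packet:

```python
import base64


def encode_audio_field(opus_bytes: bytes) -> str:
    """Encode Opus bytes for the `audio` / `delta` JSON fields."""
    return base64.b64encode(opus_bytes).decode("ascii")


def decode_audio_field(audio: str) -> bytes:
    """Recover the raw Opus bytes from a received JSON field."""
    return base64.b64decode(audio)


packet = bytes([0x4F, 0x67, 0x67, 0x53, 0x00, 0x02])  # placeholder bytes
assert decode_audio_field(encode_audio_field(packet)) == packet
```

These helpers mirror base64EncodeOpus / base64DecodeOpus on the frontend and base64.b64decode in the backend's receive loop.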

Error Handling

The protocol includes comprehensive error handling:
  • Invalid JSON: Returns invalid_request_error with details
  • Validation Errors: Returns invalid_request_error with validation details
  • Service Unavailable: Returns fatal error and closes connection
  • Warnings: Logged but don’t disrupt the connection
