
Introduction

Unmute uses a WebSocket-based protocol inspired by the OpenAI Realtime API for real-time voice conversations. The protocol enables bidirectional streaming of audio, transcriptions, and conversation state.

Connection Details

Endpoint

ws://localhost:8000/v1/realtime
WebSocket Subprotocol: realtime
The realtime subprotocol is required. Clients must specify it when establishing the connection; otherwise the server rejects the connection.

Example Connection (JavaScript)

const ws = new WebSocket(
  'ws://localhost:8000/v1/realtime',
  'realtime'
);

ws.onopen = () => {
  console.log('Connected to Unmute');
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  console.log('Received:', message.type);
};

Message Format

All messages are JSON-encoded with a common structure:
{
  "type": "event.name",
  "event_id": "event_ABC123xyz",
  // ... additional fields specific to event type
}
type (string, required): The event type identifier (e.g., session.update, response.audio.delta).
event_id (string, required): Unique identifier for the event, generated automatically in the format event_ followed by 21 random alphanumeric characters.
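The envelope described above can be sketched in a few lines. This is a minimal illustration, not the server's own generator: the helper names makeEventId and makeEvent are hypothetical, and it assumes "alphanumeric" means the characters A-Z, a-z, and 0-9.

```javascript
// ASSUMPTION: "alphanumeric" means these 62 characters.
const EVENT_ID_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// Hypothetical helper: returns "event_" followed by 21 random alphanumeric characters.
function makeEventId() {
  let suffix = '';
  for (let i = 0; i < 21; i++) {
    const index = Math.floor(Math.random() * EVENT_ID_ALPHABET.length);
    suffix += EVENT_ID_ALPHABET[index];
  }
  return 'event_' + suffix;
}

// Hypothetical helper: wraps event-specific fields in the common envelope.
function makeEvent(type, fields = {}) {
  return { type, event_id: makeEventId(), ...fields };
}
```

Math.random is not cryptographically secure, which is acceptable here only because the identifier is used for correlation rather than as a secret.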

Connection Lifecycle

1. Health Check (Optional)

Before connecting, check server health:
curl http://localhost:8000/v1/health
Response:
{
  "tts_up": true,
  "stt_up": true,
  "llm_up": true,
  "voice_cloning_up": true,
  "ok": true
}
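A client may want to act on the individual flags rather than only on ok. The sketch below assumes the response has exactly the shape shown above; downServices and checkHealth are hypothetical helper names, and checkHealth relies on a runtime that provides a global fetch.

```javascript
// Hypothetical helper: lists the services whose "*_up" flag is false.
function downServices(health) {
  return Object.keys(health)
    .filter((key) => key.endsWith('_up') && health[key] === false)
    .map((key) => key.slice(0, -'_up'.length));
}

// Hypothetical helper: fetches the health endpoint and summarizes readiness.
// ASSUMPTION: the server is reachable at the base URL used in this document.
async function checkHealth(baseUrl = 'http://localhost:8000') {
  const response = await fetch(`${baseUrl}/v1/health`);
  const health = await response.json();
  return { ok: health.ok === true, down: downServices(health) };
}
```

A caller could await checkHealth() and open the WebSocket only when ok is true, logging the down list otherwise.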

2. Establish WebSocket Connection

Connect to /v1/realtime with the realtime subprotocol.

3. Configure Session

Send a session.update event to configure the voice and instructions. The backend will not start processing until it receives this message.
{
  "type": "session.update",
  "session": {
    "instructions": {
      "character": "helpful assistant",
      "scenario": "general conversation"
    },
    "voice": "default",
    "allow_recording": false
  }
}
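Because the backend waits for this message, it is convenient to build it in one place and send it as soon as the socket opens. The following is a sketch only: buildSessionUpdate is a hypothetical helper, and its defaults simply mirror the example values above.

```javascript
// Hypothetical helper: builds a session.update message with the fields shown above.
function buildSessionUpdate({
  character,
  scenario,
  voice = 'default',
  allowRecording = false,
}) {
  return {
    type: 'session.update',
    session: {
      instructions: { character, scenario },
      voice,
      allow_recording: allowRecording,
    },
  };
}

// Usage, assuming `ws` is a WebSocket opened with the realtime subprotocol:
// ws.onopen = () => {
//   ws.send(JSON.stringify(buildSessionUpdate({
//     character: 'helpful assistant',
//     scenario: 'general conversation',
//   })));
// };
```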

4. Stream Audio

Begin sending input_audio_buffer.append events with microphone audio and receive response.audio.delta events with generated speech.
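Once audio is flowing, the client receives a mix of event types on the same socket. One way to keep the handling readable is a small dispatch table keyed by type. This is a sketch under assumptions: createDispatcher is a hypothetical helper, and the handler bodies are placeholders.

```javascript
// Hypothetical helper: routes each incoming JSON message to a handler by its type.
function createDispatcher(handlers, fallback = () => {}) {
  return function dispatch(rawData) {
    const message = JSON.parse(rawData);
    // Only use handlers defined directly on the table, not inherited properties.
    const hasHandler = Object.prototype.hasOwnProperty.call(handlers, message.type);
    const handler = hasHandler ? handlers[message.type] : fallback;
    handler(message);
    return message.type;
  };
}

// Usage, assuming `ws` is an open WebSocket:
// const dispatch = createDispatcher({
//   'response.audio.delta': (message) => { /* queue audio for playback */ },
// }, (message) => console.log('Unhandled:', message.type));
// ws.onmessage = (event) => dispatch(event.data);
```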

5. Graceful Shutdown

Close the WebSocket connection when done. The server handles cleanup automatically.

Audio Format

All audio is encoded using the Opus codec with the following specifications:
  • Sample Rate: 24 kHz
  • Channels: Mono
  • Encoding: Base64-encoded Opus bytes
Both client audio (sent to server) and server audio (received from server) use this format.
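The Base64 step can be sketched as follows. Opus encoding itself is out of scope here; the helpers only convert already-encoded bytes to and from Base64. The helper names are hypothetical, and the audio field name in buildAudioAppend is an assumption borrowed from the OpenAI Realtime API, so check Client Events for the actual field.

```javascript
// Hypothetical helper: converts a Uint8Array of Opus bytes to a Base64 string.
function bytesToBase64(bytes) {
  let binary = '';
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// Hypothetical helper: converts a Base64 string back to a Uint8Array.
function base64ToBytes(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Hypothetical helper: wraps Opus bytes in an input_audio_buffer.append event.
function buildAudioAppend(opusBytes) {
  // ASSUMPTION: the payload field is named "audio"; the source does not state it.
  return { type: 'input_audio_buffer.append', audio: bytesToBase64(opusBytes) };
}
```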

Rate Limiting

By default, the server allows at most 4 concurrent client connections. When the limit is reached, new connections are rejected with an error message.
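A client that is rejected because the server is full may want to retry after a pause rather than immediately. The following is one possible sketch, not behavior prescribed by the protocol: backoffDelayMs is a hypothetical helper, and the base and maximum delays are arbitrary illustrative values.

```javascript
// Hypothetical helper: exponential backoff delay for reconnect attempt N (0-based).
// ASSUMPTION: 500 ms base and 8000 ms cap are illustrative, not server requirements.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Usage, assuming a hypothetical connect() function that opens the WebSocket:
// setTimeout(connect, backoffDelayMs(attempt));
```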

Error Handling

The server sends error events when issues occur. See Server Events for details. Common error scenarios:
  • Invalid JSON format
  • Unrecognized event types
  • Service unavailability
  • Internal server errors
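On the client side, it helps to separate malformed input from server-reported errors before handing messages to application code. The sketch below assumes that error events carry type "error" and a nested error.message field, as in the OpenAI Realtime API; the source does not confirm this shape, so see Server Events for the actual structure. classifyServerMessage is a hypothetical helper.

```javascript
// Hypothetical helper: classifies raw WebSocket data as invalid, error, or event.
function classifyServerMessage(rawData) {
  let message;
  try {
    message = JSON.parse(rawData);
  } catch (err) {
    return { kind: 'invalid', reason: 'not valid JSON' };
  }
  if (message === null || typeof message !== 'object' || typeof message.type !== 'string') {
    return { kind: 'invalid', reason: 'missing type' };
  }
  // ASSUMPTION: error events look like { type: "error", error: { message: "..." } }.
  if (message.type === 'error') {
    const detail = message.error && message.error.message;
    return { kind: 'error', reason: detail || 'unknown error' };
  }
  return { kind: 'event', message };
}
```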

Next Steps

Client Events

Events sent from client to server

Server Events

Events sent from server to client

Session Management

Configure voice and conversation settings
