Overview
Unmute uses a WebSocket-based protocol inspired by the OpenAI Realtime API for real-time voice conversations. The protocol handles:
- Real-time audio streaming (bidirectional)
- Voice conversation transcription
- Session configuration
- Error handling and debugging
Connection Setup
Endpoint
- Path: `/v1/realtime`
- Subprotocol: `realtime` (WebSocket subprotocol)
- Development: port 8000
- Production: Routed through Traefik (HTTP port 80, HTTPS port 443)
Establishing Connection
The WebSocket connection is established using the `realtime` subprotocol. This subprotocol identifier is required for the client to connect properly.
Frontend implementation (frontend/src/app/Unmute.tsx:91-97):
Backend implementation (unmute/main_websocket.py:310-314):
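As a rough sketch of the handshake (assuming the third-party Python `websockets` library and the development host/port listed above; the helper names are illustrative, not real Unmute code):

```python
import asyncio

def realtime_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the development endpoint URL: ws://localhost:8000/v1/realtime."""
    return f"ws://{host}:{port}/v1/realtime"

async def connect() -> None:
    # `websockets` is an assumption -- any WebSocket client that can send a
    # subprotocol works. Imported lazily so the URL helper stays dependency-free.
    import websockets
    # The `realtime` subprotocol identifier is required by the backend.
    async with websockets.connect(realtime_url(), subprotocols=["realtime"]) as ws:
        await ws.send('{"type": "session.update", "session": {}}')

# asyncio.run(connect())  # requires a running Unmute backend
```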
Message Structure
All messages are JSON-encoded with a common structure defined in `unmute/openai_realtime_api_events.py`. Every message inherits from `BaseEvent`, which provides:
- `type`: The event type identifier (e.g., "session.update", "response.audio.delta")
- `event_id`: Unique identifier for the event (format: `event_<21_random_chars>`)

Client → Server Messages
Audio Input Streaming
Type: `input_audio_buffer.append`
Streams real-time audio data from the microphone to the backend.
"input_audio_buffer.append"Base64-encoded Opus audio data
- Codec: Opus
- Sample Rate: 24kHz
- Channels: Mono
- Encoding: Base64-encoded bytes
Frontend implementation (frontend/src/app/Unmute.tsx:100-110):
Backend implementation (unmute/main_websocket.py:460-477):
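As an illustration of the message shape (a sketch only; `make_append` is not a real Unmute helper, and the payload bytes are placeholders, not real Opus frames):

```python
import base64
import json
import secrets

def make_append(opus_bytes: bytes) -> str:
    """Wrap one encoded Opus chunk in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        # event_<21 random chars>, per the common envelope described above
        "event_id": f"event_{secrets.token_urlsafe(16)[:21]}",
        "audio": base64.b64encode(opus_bytes).decode("ascii"),
    })

wire = make_append(b"\x01\x02\x03")  # placeholder bytes, not real Opus data
```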
Session Configuration
Type: `session.update`
Configures the voice character and conversation instructions. The backend will not start processing until it receives this message.
"session.update"Session configuration object
Conversation instructions (Unmute extension)
Voice identifier for TTS
Whether to allow recording of the conversation
Frontend implementation (frontend/src/app/Unmute.tsx:232-240):
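A sketch of the first message a client sends (the voice identifier and instructions are placeholders; the field layout follows the list above):

```python
import json

session_update = {
    "type": "session.update",
    "session": {
        "instructions": "You are a helpful voice assistant.",  # placeholder
        "voice": "example-voice-id",                           # placeholder
        "allow_recording": False,
    },
}
wire = json.dumps(session_update)  # backend starts processing after this
```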
Server → Client Messages
Audio Response Streaming
Type: `response.audio.delta`
Streams generated speech audio to the frontend.
"response.audio.delta"Unique event identifier
Base64-encoded Opus audio data chunk
Frontend implementation (frontend/src/app/Unmute.tsx:164-175):
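Decoding on the client side can be sketched as follows (assuming the `delta` field name from the list above; `extract_audio` is illustrative):

```python
import base64
import json

def extract_audio(raw: str) -> bytes:
    """Return the Opus bytes carried by a response.audio.delta event."""
    event = json.loads(raw)
    if event["type"] != "response.audio.delta":
        raise ValueError(f"unexpected event type: {event['type']}")
    return base64.b64decode(event["delta"])
```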
Speech Transcription
Type: `conversation.item.input_audio_transcription.delta`
Real-time transcription of user speech.
"conversation.item.input_audio_transcription.delta"Transcribed text chunk
Start time of the transcription (Unmute extension)
Frontend implementation (frontend/src/app/Unmute.tsx:186-193):
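Since the transcription arrives as incremental chunks, a client can fold the deltas into a running transcript; a minimal sketch (not actual frontend code):

```python
DELTA = "conversation.item.input_audio_transcription.delta"

def apply_delta(transcript: str, event: dict) -> str:
    """Append a transcription chunk; ignore unrelated events."""
    if event.get("type") == DELTA:
        return transcript + event["delta"]
    return transcript

transcript = ""
for event in [{"type": DELTA, "delta": "Hello, "},
              {"type": DELTA, "delta": "world."}]:
    transcript = apply_delta(transcript, event)
# transcript == "Hello, world."
```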
Text Response Streaming
Type: `response.text.delta`
Streams generated text responses for display or debugging.
"response.text.delta"Text chunk from the LLM response
Frontend implementation (frontend/src/app/Unmute.tsx:194-202):
Speech Detection Events
Types: `input_audio_buffer.speech_started`, `input_audio_buffer.speech_stopped`
- `type`: "input_audio_buffer.speech_started" or "input_audio_buffer.speech_stopped"

These events are currently reported but not actively used in the Unmute frontend for UI feedback.
Response Status Updates
Type: `response.created`
Indicates when the assistant starts generating a response.
"response.created"Response metadata object
"realtime.response"One of:
"in_progress", "completed", "cancelled", "failed", "incomplete"Voice identifier being used
Array of chat history objects
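A client can key UI state off the status field; a sketch using the status values listed above (the helper is illustrative):

```python
TERMINAL_STATUSES = {"completed", "cancelled", "failed", "incomplete"}

def is_generating(event: dict) -> bool:
    """True while the assistant is still producing a response."""
    return event["response"]["status"] == "in_progress"

created = {"type": "response.created",
           "response": {"object": "realtime.response", "status": "in_progress"}}
```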
Error Handling
Type: `error`
Communicates errors and warnings to the client.
"error"Error details object
Error type (e.g.,
"warning", "fatal", "invalid_request_error")Error code (optional)
Human-readable error message
Parameter that caused the error (optional)
Additional error details (Unmute extension)
Frontend implementation (frontend/src/app/Unmute.tsx:178-185):
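Client-side handling can branch on the error type; a hedged sketch (the severity split follows the description above, not actual Unmute frontend code):

```python
def handle_error(event: dict) -> tuple[bool, str]:
    """Return (keep_connection, log_line) for an error event."""
    err = event["error"]
    if err["type"] == "warning":
        return True, f"warning: {err['message']}"   # non-disruptive, log only
    return False, f"error ({err['type']}): {err['message']}"

keep, line = handle_error({"type": "error",
                           "error": {"type": "warning", "message": "TTS slow"}})
```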
Unmute-Specific Events
Additional Outputs
Type: `unmute.additional_outputs`
Provides debugging information and additional outputs.
"unmute.additional_outputs"Debug dictionary or additional output data
Text Delta Ready
Type: `unmute.response.text.delta.ready`
Indicates that a text delta is ready to be sent.
"unmute.response.text.delta.ready"Text delta content
Audio Delta Ready
Type: `unmute.response.audio.delta.ready`
Indicates that audio samples are ready.
"unmute.response.audio.delta.ready"Number of audio samples ready
VAD Interruption
Type: `unmute.interrupted_by_vad`
Indicates that the VAD interrupted the response generation.
"unmute.interrupted_by_vad"Connection Lifecycle
1. Health Check: Frontend checks the `/v1/health` endpoint before connecting
2. WebSocket Connection: Establish the connection with the `realtime` subprotocol
3. Session Setup: Send `session.update` with voice and instructions
   - Backend will not process audio until this is received
4. Audio Streaming: Bidirectional real-time audio communication
   - Client sends `input_audio_buffer.append` messages
   - Server sends `response.audio.delta` messages
   - Transcription and text deltas flow concurrently
5. Graceful Shutdown: Handle disconnection and cleanup
   - Frontend stops audio processing
   - Backend cleans up resources via `UnmuteHandler.cleanup()`
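The ordering constraint in steps 3-4 (no audio before `session.update`) can be sketched with a stub transport; nothing below is real Unmute code:

```python
import base64
import json

class StubSocket:
    """Records outbound messages instead of sending them over a WebSocket."""
    def __init__(self) -> None:
        self.sent: list[dict] = []

    def send(self, raw: str) -> None:
        self.sent.append(json.loads(raw))

def run_client(ws: StubSocket, opus_frames: list[bytes]) -> None:
    # Step 3: configure the session before any audio.
    ws.send(json.dumps({"type": "session.update",
                        "session": {"instructions": "...", "voice": "v"}}))
    # Step 4: stream audio frames.
    for frame in opus_frames:
        ws.send(json.dumps({"type": "input_audio_buffer.append",
                            "audio": base64.b64encode(frame).decode("ascii")}))

ws = StubSocket()
run_client(ws, [b"\x01", b"\x02"])
```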
Implementation Details
Backend Message Loop
The backend runs two concurrent loops alongside a task manager (unmute/main_websocket.py:391-403):
- `receive_loop`: Receives messages from the WebSocket, processes audio, handles session updates
- `emit_loop`: Sends messages to the WebSocket from the emit queue and handler
- `quest_manager`: Manages processing quests and tasks
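The receive/emit split can be sketched with asyncio queues (illustrative only; the real loops live in unmute/main_websocket.py):

```python
import asyncio

async def receive_loop(inbound: asyncio.Queue, emit_queue: asyncio.Queue) -> None:
    """Drain inbound messages and hand results to the emit side."""
    while True:
        msg = await inbound.get()
        if msg is None:                       # sentinel: connection closed
            await emit_queue.put(None)
            return
        await emit_queue.put({"echo": msg})   # stand-in for real processing

async def emit_loop(emit_queue: asyncio.Queue, sent: list) -> None:
    """Flush queued events; appending stands in for websocket.send."""
    while True:
        event = await emit_queue.get()
        if event is None:
            return
        sent.append(event)

async def run() -> list:
    inbound, emit_queue, sent = asyncio.Queue(), asyncio.Queue(), []
    for msg in ("a", "b", None):
        inbound.put_nowait(msg)
    await asyncio.gather(receive_loop(inbound, emit_queue),
                         emit_loop(emit_queue, sent))
    return sent

sent = asyncio.run(run())
```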
Audio Encoding/Decoding
Frontend:
- Uses the `opus-recorder` library for recording microphone input
- Encodes to Opus at 24kHz sample rate
- Uses Web Audio API decoder for playback

Backend:
- Uses `sphn.OpusStreamReader` for decoding incoming audio
- Uses `sphn.OpusStreamWriter` for encoding outgoing audio
- Processes audio at 24kHz sample rate
Error Handling
The protocol includes comprehensive error handling:
- Invalid JSON: Returns `invalid_request_error` with details
- Validation Errors: Returns `invalid_request_error` with validation details
- Service Unavailable: Returns a `fatal` error and closes the connection
- Warnings: Logged but don't disrupt the connection
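The invalid-JSON case can be sketched server-side as follows (a hedged illustration; the real handling is in unmute/main_websocket.py):

```python
import json

def parse_or_error(raw: str) -> dict:
    """Parse an inbound frame, or build the error event described above."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"type": "error",
                "error": {"type": "invalid_request_error",
                          "message": f"Invalid JSON: {exc.msg}"}}

ok = parse_or_error('{"type": "session.update"}')
bad = parse_or_error("not json")
```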
Related Documentation
- WebRTC - Audio processing and streaming details
- System Architecture - Overall system design