Overview
The Voice Agent SDK uses WebSocket for bidirectional real-time communication between clients and agents. The protocol supports:
- Audio/video frame transmission
- Text transcripts
- Streaming text responses
- Speech audio chunks
- Tool execution events
- Lifecycle management
Connection Lifecycle
Message Format
All messages are JSON objects with a type field.
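As a rough illustration of that envelope, the sketch below parses an incoming text frame and checks only what the description above guarantees: a JSON object with a string type field. The names ProtocolMessage and parseMessage are hypothetical and not part of the SDK.

```typescript
// Hypothetical envelope type: only `type` is guaranteed by the protocol text.
interface ProtocolMessage {
  type: string;
  [key: string]: unknown; // payload fields vary per message type
}

// Parse a raw text frame. Returns null for anything that is not a
// JSON object carrying a string `type` field.
function parseMessage(raw: string): ProtocolMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return null; // JSON, but not an object
  }
  const type = (value as Record<string, unknown>).type;
  return typeof type === "string" ? (value as ProtocolMessage) : null;
}
```

Returning null instead of throwing keeps a malformed frame from tearing down the receive loop; a caller can log and skip it.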
Client → Agent Messages
transcript
Send transcribed text from speech-to-text (bypasses agent transcription).
- Interrupts current LLM stream and speech
- Queues text input for processing
- (VideoAgent only) Requests frame capture
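A client-side sketch of building this message follows. The page names the transcript type but not its payload, so the text field and the helper buildTranscriptMessage are assumptions made for illustration.

```typescript
// Assumed payload shape: the field name `text` is hypothetical.
interface TranscriptMessage {
  type: "transcript";
  text: string;
}

// Serialize a transcript message, rejecting empty input so the agent
// is not interrupted by a message that carries nothing to process.
function buildTranscriptMessage(text: string): string {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    throw new Error("transcript text must not be empty");
  }
  const message: TranscriptMessage = { type: "transcript", text: trimmed };
  return JSON.stringify(message);
}
```

Because a transcript interrupts the current LLM stream and speech, guarding against empty strings on the client avoids cutting off a response by accident.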
audio
Send audio data for transcription by the agent.
- Interrupts current response
- Transcribes audio using transcriptionModel
- Emits transcription event
- Queues transcribed text for processing
- (VideoAgent only) Requests frame capture
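One plausible way to send audio is sketched below, assuming the bytes travel base64-encoded inside the JSON envelope. That encoding, the data field, and the helper names are all assumptions; the page does not specify the payload.

```typescript
// Assumed payload shape: `data` (base64 audio bytes) is a hypothetical field.
interface AudioMessage {
  type: "audio";
  data: string;
}

// Encode raw bytes as base64 using the global btoa
// (available in browsers and in recent Node.js versions).
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Wrap recorded audio bytes in an `audio` message.
function buildAudioMessage(bytes: Uint8Array): string {
  const message: AudioMessage = { type: "audio", data: toBase64(bytes) };
  return JSON.stringify(message);
}
```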
video_frame
(VideoAgent only) Send a video frame for vision analysis.
- Validates frame size (rejects the frame if it exceeds maxFrameInputSize)
- Updates visual context buffer
- Emits frame_received event
- Sends frame_ack confirmation
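The steps above can be sketched as a single handler. This is not the SDK's implementation: it assumes the frame arrives as a base64 string, that maxFrameInputSize is measured in decoded bytes, and that the visual context buffer is a bounded array. handleVideoFrame and maxBufferedFrames are hypothetical names.

```typescript
interface FrameResult {
  accepted: boolean;
  reply: { type: "frame_ack" } | null; // null when the frame is rejected
}

// Validate a frame, store it in the context buffer, and produce the ack.
function handleVideoFrame(
  frameBase64: string,
  maxFrameInputSize: number, // assumed to be a byte limit
  contextBuffer: string[],
  maxBufferedFrames = 10, // hypothetical buffer capacity
): FrameResult {
  // Decoded size of a padded base64 string: 3 bytes per 4 chars, minus padding.
  const padding = frameBase64.endsWith("==") ? 2 : frameBase64.endsWith("=") ? 1 : 0;
  const byteLength = Math.floor((frameBase64.length * 3) / 4) - padding;
  if (byteLength > maxFrameInputSize) {
    return { accepted: false, reply: null }; // oversized frame is rejected
  }
  contextBuffer.push(frameBase64);
  if (contextBuffer.length > maxBufferedFrames) {
    contextBuffer.shift(); // drop the oldest frame
  }
  return { accepted: true, reply: { type: "frame_ack" } };
}
```

The size check runs before the buffer is touched, so a rejected frame never displaces an older, valid one.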
interrupt
Request cancellation of current LLM stream and speech generation.
- Aborts current LLM stream via AbortController
- Clears speech queue
- Sends speech_interrupted message
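A minimal sketch of that interrupt sequence follows, using the standard AbortController. The ResponseSession class and its methods are hypothetical and only illustrate the order of operations; the SDK's internals may differ.

```typescript
class ResponseSession {
  private controller: AbortController | null = null;
  private speechQueue: string[] = [];
  private readonly send: (message: string) => void;

  constructor(send: (message: string) => void) {
    this.send = send;
  }

  // Start a new LLM stream; pass the returned signal to the request.
  startStream(): AbortSignal {
    this.controller = new AbortController();
    return this.controller.signal;
  }

  // Queue a chunk of text awaiting speech synthesis.
  enqueueSpeech(chunk: string): void {
    this.speechQueue.push(chunk);
  }

  get pendingSpeech(): number {
    return this.speechQueue.length;
  }

  // Abort the stream, clear queued speech, then notify the client.
  interrupt(): void {
    this.controller?.abort();
    this.controller = null;
    this.speechQueue.length = 0;
    this.send(JSON.stringify({ type: "speech_interrupted" }));
  }
}
```

Any code holding the signal from startStream can check signal.aborted or listen for the abort event to stop consuming tokens.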
client_ready
(VideoAgent only) Signal that client is ready with capabilities.
- Sends session_init with session ID
- Emits client_ready event
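The handshake can be sketched as below. The sessionId field name, the shape of the capabilities payload, and the handleClientReady helper are assumptions; the page states only that session_init carries a session ID and that a client_ready event is emitted.

```typescript
// Reply to `client_ready`: send session_init first, then emit the event.
function handleClientReady(
  sessionId: string,
  capabilities: unknown, // payload shape is not specified in the source
  send: (message: string) => void,
  emit: (event: string, payload: unknown) => void,
): void {
  // `sessionId` is a hypothetical field name for the session ID.
  send(JSON.stringify({ type: "session_init", sessionId }));
  emit("client_ready", capabilities);
}
```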
Agent → Client Messages
text_delta
Streaming text token from LLM.
reasoning_delta
Streaming reasoning token (models with reasoning support).
tool_call
Tool invocation detected in LLM stream.
tool_result
Tool execution completed.
audio_chunk
Streaming audio chunk (TTS generated).
audio
Full audio response (non-streaming fallback).
speech_stream_start
Speech generation started (streaming mode).
speech_stream_end
All speech chunks sent.
speech_interrupted
Speech generation cancelled.
response_complete
Full LLM response finished.
capture_frame
(VideoAgent only) Request client to capture and send a video frame.
frame_ack
(VideoAgent only) Acknowledgment that frame was received.
session_init
(VideoAgent only) Session initialized with ID.
Error Handling
Errors are emitted as agent events, not WebSocket messages.
Example: Full Client Implementation
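The following is a reduced client sketch rather than a complete implementation. It is transport-agnostic: outgoing messages go through a send callback and incoming frames are fed to handleRaw. The class name VoiceClientSketch and the text payload field on transcript and text_delta are assumptions. No error message type is handled, since errors surface as agent events rather than WebSocket messages.

```typescript
type Send = (message: string) => void;

class VoiceClientSketch {
  text = ""; // text_delta tokens accumulated for the current response
  speaking = false; // true between speech_stream_start and its end or interruption
  completed: string[] = []; // finished responses, oldest first

  private readonly send: Send;

  constructor(send: Send) {
    this.send = send;
  }

  sendTranscript(text: string): void {
    this.send(JSON.stringify({ type: "transcript", text })); // `text` is assumed
  }

  interrupt(): void {
    this.send(JSON.stringify({ type: "interrupt" }));
  }

  // Feed one incoming text frame into the client state.
  handleRaw(raw: string): void {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      return; // ignore frames that are not JSON
    }
    if (typeof parsed !== "object" || parsed === null) return;
    const message = parsed as Record<string, unknown>;
    switch (message.type) {
      case "text_delta":
        if (typeof message.text === "string") this.text += message.text;
        break;
      case "speech_stream_start":
        this.speaking = true;
        break;
      case "speech_stream_end":
      case "speech_interrupted":
        this.speaking = false;
        break;
      case "response_complete":
        this.completed.push(this.text);
        this.text = "";
        break;
      default:
        break; // tool, audio, and frame messages are omitted from this sketch
    }
  }
}
```

With a browser WebSocket named ws, this could be wired up by passing (m) => ws.send(m) as the callback and calling client.handleRaw(String(event.data)) from ws.onmessage.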
Security Considerations
Protocol Versions
The current protocol is v1 (implicit). Future versions may include a version field.
Next Steps
VoiceAgent
Learn about the voice agent architecture
VideoAgent
Understand video frame message types
Quick Start
Build your first WebSocket client
API Reference
Full API documentation