Core Interfaces
SpeechChunk
Represents a chunk of text to be converted to speech during streaming text-to-speech generation.
- Unique identifier for this speech chunk in the generation queue
- The text content to be converted to speech
- Promise that resolves to the generated audio data, or null if generation fails
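To make the chunk lifecycle concrete, here is a minimal TypeScript sketch. The field names (`id`, `text`, `audio`) and the `toChunks` helper are assumptions for illustration, not the library's actual API; the `.catch(() => null)` mapping mirrors the documented "null if generation fails" behavior:

```typescript
// Hypothetical shape of a SpeechChunk; the real field names may differ.
interface SpeechChunk {
  id: string;                        // unique identifier in the generation queue
  text: string;                      // text content to convert to speech
  audio: Promise<Uint8Array | null>; // resolves to audio data, or null on failure
}

// Illustrative chunking: accumulate sentences until a minimum size is reached,
// then enqueue a chunk and start TTS for it immediately.
function toChunks(
  text: string,
  minChars: number,
  tts: (t: string) => Promise<Uint8Array>
): SpeechChunk[] {
  const sentences = text.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: SpeechChunk[] = [];
  let buffer = "";
  const push = () => {
    chunks.push({
      id: `chunk-${chunks.length}`,
      text: buffer.trim(),
      audio: tts(buffer).catch(() => null), // null if generation fails
    });
    buffer = "";
  };
  for (const s of sentences) {
    buffer += s;
    if (buffer.length >= minChars) push();
  }
  if (buffer.trim()) push();
  return chunks;
}
```

Starting TTS as soon as each chunk is cut is what lets playback begin before the full LLM response has streamed in.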
StreamingSpeechConfig
Configuration for streaming speech behavior and parallel TTS generation.
- Minimum characters before generating speech for a chunk. Default: 50
- Maximum characters per chunk; text will be split at a sentence boundary before reaching this limit. Default: 200
- Whether to enable parallel TTS generation for multiple chunks. Default: true
- Maximum number of parallel TTS requests allowed at once. Default: 3

HistoryConfig
Configuration for conversation history memory management and automatic trimming.
- Maximum number of messages to keep in history. When exceeded, the oldest messages are trimmed in pairs (user + assistant). Set to 0 for unlimited. Default: 100
- Maximum total character count across all messages. When exceeded, the oldest messages are trimmed. Set to 0 for unlimited. Default: 0 (unlimited)

StopWhenCondition
Type for defining when the LLM stream should stop during multi-step tool execution.

Video Agent Types
VideoFrame
Video frame data structure sent to/from the client for vision analysis.
- Message type identifier
- Unique session identifier for this video agent instance
- Sequential frame number (increments with each frame)
- Unix timestamp (milliseconds) when the frame was captured
- Reason why this frame was captured
- Hash reference to the previous frame for context
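A rough TypeScript sketch of what such a message might look like; all field names here (`sessionId`, `frameNumber`, `previousFrameHash`, etc.) are illustrative assumptions, not the documented wire format:

```typescript
// Hypothetical VideoFrame message; field names are illustrative only.
interface VideoFrame {
  type: string;               // message type identifier
  sessionId: string;          // unique session identifier for this agent instance
  frameNumber: number;        // sequential, increments with each frame
  timestamp: number;          // Unix timestamp (ms) when the frame was captured
  triggerReason: string;      // why the frame was captured
  previousFrameHash?: string; // hash reference to the previous frame
}

// Deriving the next frame in a sequence from the previous one.
function nextFrame(prev: VideoFrame, prevHash: string, reason: string): VideoFrame {
  return {
    ...prev,
    frameNumber: prev.frameNumber + 1,
    timestamp: Date.now(),
    triggerReason: reason,
    previousFrameHash: prevHash,
  };
}
```

The back-reference to the previous frame's hash is what lets the backend reconstruct visual context across a sequence of captures.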
AudioData
Audio data structure for WebSocket communication.
- Message type identifier
- Unique session identifier
- Base64-encoded audio data
- Audio format (e.g., “mp3”, “opus”, “wav”, “webm”)
- Audio sample rate in Hz (e.g., 16000, 44100)
- Audio duration in milliseconds
- Unix timestamp (milliseconds) when the audio was recorded
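Sketched in TypeScript (the field names are assumptions; the actual message schema may differ), along with the standard arithmetic for sanity-checking that payload size, sample rate, and duration agree for raw 16-bit mono PCM:

```typescript
// Hypothetical AudioData message; field names are illustrative only.
interface AudioData {
  type: string;       // message type identifier
  sessionId: string;  // unique session identifier
  audio: string;      // base64-encoded audio bytes
  format: "mp3" | "opus" | "wav" | "webm";
  sampleRate: number; // Hz, e.g. 16000 or 44100
  duration: number;   // milliseconds
  timestamp: number;  // Unix ms when the audio was recorded
}

// Standard duration math for raw 16-bit mono PCM:
// bytes / 2 bytes-per-sample / sampleRate, converted to milliseconds.
function pcmDurationMs(byteLength: number, sampleRate: number): number {
  return (byteLength / 2 / sampleRate) * 1000;
}
```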
VideoAgentConfig
Backend configuration for video processing behavior.
- Maximum frames to keep in context buffer for conversation history. Default: 10

FrameContext
Frame context for maintaining visual conversation history.
- Frame sequence number
- Unix timestamp (milliseconds) of frame capture
- Why this frame was captured
- Unique hash identifying this frame
- Optional text description of the frame content
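A small sketch of how such a context buffer might be capped, using the default of 10 frames from VideoAgentConfig; the field names and the `pushFrame` helper are assumptions for illustration:

```typescript
// Hypothetical FrameContext entry; field names are illustrative only.
interface FrameContext {
  frameNumber: number;   // frame sequence number
  timestamp: number;     // Unix ms of frame capture
  triggerReason: string; // why this frame was captured
  frameHash: string;     // unique hash identifying this frame
  description?: string;  // optional text description of the content
}

// Append a frame, dropping the oldest entries once the buffer exceeds
// maxFrames (the documented default is 10).
function pushFrame(
  buffer: FrameContext[],
  frame: FrameContext,
  maxFrames = 10
): FrameContext[] {
  const next = [...buffer, frame];
  return next.length > maxFrames ? next.slice(next.length - maxFrames) : next;
}
```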
FrameTriggerReason
Enumeration of reasons why a frame was captured.
- Frame captured due to detected scene change in video
- Frame captured because user sent a query or request
- Frame captured on a timer interval
- First frame captured when video stream starts
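In TypeScript this could be modeled as a string enum; the member names and string values below are guesses for illustration, not the library's actual identifiers:

```typescript
// Hypothetical FrameTriggerReason values; actual names/values may differ.
enum FrameTriggerReason {
  SceneChange = "scene_change", // detected scene change in the video
  UserQuery = "user_query",     // user sent a query or request
  Interval = "interval",        // timer interval elapsed
  StreamStart = "stream_start", // first frame when the stream starts
}
```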
Constants
DEFAULT_MAX_AUDIO_SIZE
Default maximum audio input size in bytes.

DEFAULT_MAX_FRAME_SIZE
Default maximum frame input size in bytes for video agents.

Related
Events
Learn about all events emitted by agents
VoiceAgent
Voice agent class reference
VideoAgent
Video agent class reference