Overview
The VideoAgent class extends voice capabilities with video frame processing, enabling multimodal AI conversations that can see and hear. It manages real-time video frames, audio input, and conversation history with visual context.
Important: Like VoiceAgent, each VideoAgent instance is designed for a single user. Create a separate instance for each connection.
wss.on("connection", (socket) => {
const agent = new VideoAgent({ model, ... });
agent.handleSocket(socket);
agent.on("disconnected", () => agent.destroy());
});
Constructor
new VideoAgent(options: VideoAgentOptions)
Creates a new video agent instance with the specified configuration.
VideoAgentOptions
model
LanguageModel
required
AI SDK Language Model for chat. Must be a vision-enabled model (e.g., openai('gpt-4o'), anthropic('claude-3-5-sonnet-latest'), google('gemini-1.5-pro')) to process video frames
transcriptionModel
TranscriptionModel
required
AI SDK Transcription Model for converting speech to text (e.g., openai.transcription('whisper-1'))
speechModel
SpeechModel
required
AI SDK Speech Model for converting text to speech (e.g., openai.speech('tts-1'))
instructions
string
default:"You are a helpful multimodal AI assistant..."
System instructions for the AI model. Default emphasizes multimodal capabilities
stopWhen
(event: StepEvent) => boolean
default:"stepCountIs(5)"
Condition to stop the stream (from AI SDK)
tools
Record<string, Tool>
AI SDK tools available to the agent
endpoint
string
WebSocket server URL to connect to
voice
string
Voice ID or name for speech synthesis
Instructions for speech synthesis behavior
Audio output format (e.g., "mp3", "opus", "pcm")
streamingSpeech
Partial<StreamingSpeechConfig>
Configuration for streaming speech generation:
minChunkSize (number, default: 50): Minimum characters before generating speech
maxChunkSize (number, default: 200): Maximum characters per chunk
parallelGeneration (boolean, default: true): Enable parallel TTS generation
maxParallelRequests (number, default: 3): Maximum parallel TTS requests
Configuration for conversation history memory limits:
maxMessages (number, default: 100): Maximum number of messages
maxTotalChars (number, default: 0): Maximum total characters (0 = unlimited)
Maximum audio input size in bytes (default: 10 MB)
maxFrameInputSize
number
Maximum frame input size in bytes (default: 5 MB)
maxContextFrames
number
Maximum frames to keep in context buffer for conversation history
Session ID for this video agent instance. Auto-generated if not provided
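The streamingSpeech options above control how streamed text is cut into individual TTS requests. The sketch below is a hypothetical approximation of that interaction between minChunkSize and maxChunkSize, not the library's actual chunker:

```typescript
// Illustrative sketch only (not the library's implementation): accumulate
// streamed text by sentence, flush a chunk once it reaches minChunkSize,
// and hard-split any run longer than maxChunkSize.
function chunkForSpeech(
  text: string,
  minChunkSize = 50,
  maxChunkSize = 200
): string[] {
  const chunks: string[] = [];
  let buffer = "";
  for (const sentence of text.split(/(?<=[.!?])\s+/)) {
    buffer = buffer ? `${buffer} ${sentence}` : sentence;
    if (buffer.length >= minChunkSize) {
      // Hard-split anything exceeding the per-request ceiling.
      while (buffer.length > maxChunkSize) {
        chunks.push(buffer.slice(0, maxChunkSize));
        buffer = buffer.slice(maxChunkSize);
      }
      if (buffer.length >= minChunkSize) {
        chunks.push(buffer);
        buffer = "";
      }
    }
  }
  if (buffer) chunks.push(buffer); // flush the remainder at end of stream
  return chunks;
}
```

Smaller chunks lower time-to-first-audio; larger chunks give the TTS model more context per request.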
Example
import { VideoAgent } from "@dimension-labs/voice-agent-ai";
import { openai } from "@ai-sdk/openai";
const agent = new VideoAgent({
model: openai("gpt-4o"), // Vision-enabled model
transcriptionModel: openai.transcription("whisper-1"),
speechModel: openai.speech("tts-1"),
instructions: "You are a helpful assistant that can see and hear.",
voice: "alloy",
maxContextFrames: 15,
maxFrameInputSize: 5 * 1024 * 1024, // 5 MB
});
Methods
Connection Management
connect()
async connect(url?: string): Promise<void>
Connect to a WebSocket server by URL.
WebSocket URL to connect to. If not provided, uses the endpoint from constructor options or defaults to "ws://localhost:8080"
Resolves when connection is established
Example:
await agent.connect("ws://localhost:8080");
handleSocket()
handleSocket(socket: WebSocket): void
Attach an existing WebSocket (server-side usage).
The WebSocket connection to attach
Example:
import { WebSocketServer } from "ws";
const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (socket) => {
const agent = new VideoAgent({ model, ... });
agent.handleSocket(socket);
});
disconnect()
disconnect(): void
Disconnect from the WebSocket and stop all in-flight work.
destroy()
destroy(): void
Permanently destroy the agent, releasing all resources. Clears conversation history, frame context, and disconnects the WebSocket.
sendText()
async sendText(text: string): Promise<string>
Send text input for processing. If a frame has been received, it will be included in the context.
The text input to process
The full text response from the AI model
Example:
const response = await agent.sendText("What do you see in this frame?");
console.log(response);
sendAudio()
async sendAudio(audioData: string): Promise<void>
Send base64-encoded audio data to be transcribed and processed.
Base64-encoded audio data
Example:
const base64Audio = fs.readFileSync("audio.wav").toString("base64");
await agent.sendAudio(base64Audio);
sendAudioBuffer()
async sendAudioBuffer(audioBuffer: Buffer | Uint8Array): Promise<void>
Send raw audio buffer to be transcribed and processed.
audioBuffer
Buffer | Uint8Array
required
Raw audio buffer
Example:
const audioBuffer = fs.readFileSync("audio.wav");
await agent.sendAudioBuffer(audioBuffer);
Video Frame Methods
sendFrame()
async sendFrame(
frameData: string,
query?: string,
options?: { width?: number; height?: number; format?: string }
): Promise<string>
Send a video frame with optional text query for vision analysis.
Base64-encoded frame image data
Optional text query to ask about the frame
Image format (e.g., "webp", "jpeg", "png")
The AI's response (empty if no query provided)
Example:
const frameData = canvas.toDataURL("image/webp").split(",")[1];
const response = await agent.sendFrame(
frameData,
"What objects do you see?",
{ width: 1280, height: 720, format: "webp" }
);
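Before calling sendFrame, a client may want to check the encoded frame against the maxFrameInputSize limit. The helper below is a hypothetical sketch (not a library export) that estimates the decoded byte size of a base64 payload:

```typescript
// Hypothetical helper (not part of the library): estimates the decoded byte
// size of a base64 string so oversized frames can be rejected or downscaled
// before they are sent.
function fitsFrameLimit(
  base64Frame: string,
  maxFrameInputSize = 5 * 1024 * 1024 // matches the documented default
): boolean {
  // Every 4 base64 characters encode 3 bytes; '=' padding encodes nothing.
  const padding = (base64Frame.match(/=+$/)?.[0] ?? "").length;
  const decodedBytes = (base64Frame.length / 4) * 3 - padding;
  return decodedBytes <= maxFrameInputSize;
}
```

If a frame exceeds the limit, re-encode it at a lower resolution or quality before sending.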
requestFrameCapture()
requestFrameCapture(reason: FrameTriggerReason): void
Request the client to capture and send a frame.
reason
'scene_change' | 'user_request' | 'timer' | 'initial'
required
Reason for requesting frame capture
Example:
agent.requestFrameCapture("user_request");
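Scene-change detection can fire in bursts, so callers may want to throttle capture requests. The wrapper below is an illustrative pattern, not a library API; the injectable clock is an assumption added for testability:

```typescript
type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";

// Hypothetical wrapper (not a library API): drops requests that arrive
// within minIntervalMs of the previous forwarded one. `now` is injectable
// so the behavior can be tested deterministically.
function makeThrottledRequester(
  request: (reason: FrameTriggerReason) => void,
  minIntervalMs = 1000,
  now: () => number = Date.now
): (reason: FrameTriggerReason) => boolean {
  let last = -Infinity;
  return (reason) => {
    if (now() - last < minIntervalMs) return false; // too soon, skipped
    last = now();
    request(reason);
    return true;
  };
}
```

In practice you would wrap agent.requestFrameCapture.bind(agent) and call the wrapper from your scene-change detector.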
Audio Processing
transcribeAudio()
async transcribeAudio(audioData: Buffer | Uint8Array): Promise<string>
Transcribe audio data to text using the configured transcription model.
audioData
Buffer | Uint8Array
required
Raw audio data to transcribe
Example:
const transcript = await agent.transcribeAudio(audioBuffer);
generateSpeechFromText()
async generateSpeechFromText(
text: string,
abortSignal?: AbortSignal
): Promise<Uint8Array>
Generate speech from text using the configured speech model.
Text to convert to speech
Optional abort signal to cancel generation
Example:
const audio = await agent.generateSpeechFromText("Hello there");
Speech Control
interruptSpeech()
interruptSpeech(reason?: string): void
Interrupt ongoing speech generation and playback.
reason
string
default:"interrupted"
Reason for interruption
Example:
agent.interruptSpeech("user_speaking");
interruptCurrentResponse()
interruptCurrentResponse(reason?: string): void
Interrupt both the current LLM stream and ongoing speech.
reason
string
default:"interrupted"
Reason for interruption
Example:
agent.interruptCurrentResponse("user_speaking");
History & Context Management
clearHistory()
Clear the conversation history and frame context buffer.
Example:
agent.clearHistory();
getHistory()
getHistory(): ModelMessage[]
Get the current conversation history.
Array of conversation messages
Example:
const history = agent.getHistory();
setHistory()
setHistory(history: ModelMessage[]): void
Set the conversation history (useful for restoring sessions).
Array of conversation messages to restore
Example:
const savedHistory = loadFromDatabase(userId);
agent.setHistory(savedHistory);
getFrameContext()
getFrameContext(): FrameContext[]
Get the current frame context buffer.
Array of frame context objects with sequence, timestamp, triggerReason, and frameHash
Example:
const frames = agent.getFrameContext();
console.log(`${frames.length} frames in context`);
getSessionId()
getSessionId(): string
Get the session ID for this agent instance.
Example:
const sessionId = agent.getSessionId();
Configuration
getConfig()
getConfig(): VideoAgentConfig
Get the current video agent configuration.
Configuration object with maxContextFrames
Example:
const config = agent.getConfig();
console.log(`Max frames: ${config.maxContextFrames}`);
updateConfig()
updateConfig(config: Partial<VideoAgentConfig>): void
Update the video agent configuration.
config
Partial<VideoAgentConfig>
required
Configuration updates to apply
Example:
agent.updateConfig({ maxContextFrames: 20 });
registerTools()
registerTools(tools: Record<string, Tool>): void
Register additional tools for the agent to use.
tools
Record<string, Tool>
required
AI SDK tools to register
Example:
import { tool } from "ai";
import { z } from "zod";
agent.registerTools({
identifyObject: tool({
description: "Identify an object in the current frame",
parameters: z.object({
objectName: z.string(),
}),
execute: async ({ objectName }) => {
return { identified: true, confidence: 0.95 };
},
}),
});
Properties
All properties are read-only.
connected
get connected(): boolean
Whether the agent is currently connected to a WebSocket.
processing
get processing(): boolean
Whether the agent is currently processing input.
speaking
get speaking(): boolean
Whether the agent is currently speaking.
pendingSpeechChunks
get pendingSpeechChunks(): number
Number of speech chunks pending generation or playback.
destroyed
get destroyed(): boolean
Whether the agent has been permanently destroyed.
currentFrameSequence
get currentFrameSequence(): number
The sequence number of the most recently processed frame.
hasVisualContext
get hasVisualContext(): boolean
Whether the agent currently has visual context (a frame has been received).
Example:
if (agent.hasVisualContext) {
console.log("Agent can see");
}
Events
The VideoAgent extends EventEmitter and emits all the same events as VoiceAgent, plus additional video-specific events.
Video-Specific Events
frame_received
Emitted when a video frame is received and processed.
agent.on("frame_received", (data) => {
console.log(`Frame ${data.sequence} received:`, data.dimensions);
console.log(`Size: ${data.size} bytes, reason: ${data.triggerReason}`);
});
frame_requested
Emitted when the agent requests a frame capture from the client.
agent.on("frame_requested", ({ reason }) => {
console.log(`Frame requested: ${reason}`);
});
client_ready
Emitted when the client signals it's ready for video streaming.
agent.on("client_ready", (capabilities) => {
console.log("Client ready with capabilities:", capabilities);
});
config_changed
Emitted when the agent configuration is updated.
agent.on("config_changed", (config) => {
console.log("Config updated:", config);
});
Connection Events
Same as VoiceAgent: connected, disconnected, error, warning
Conversation Events
Same as VoiceAgent: text, text_delta, tool_call, tool_result
Note: The text event may include hasImage: true when video frames are involved:
agent.on("text", ({ role, text, hasImage }) => {
if (hasImage) {
console.log(`${role} (with image): ${text}`);
}
});
Speech Events
Same as VoiceAgent: speech_start, speech_complete, speech_interrupted, speech_chunk_queued, audio_chunk, audio
Transcription Events
Same as VoiceAgent: transcription, audio_received
History Events
Same as VoiceAgent: history_cleared, history_trimmed
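history_trimmed fires when the conversation exceeds the memory limits set in the constructor (maxMessages, maxTotalChars). The sketch below is one plausible trimming strategy, shown for intuition only; the library's actual algorithm may differ:

```typescript
// Illustrative sketch only: keep the most recent maxMessages messages, then
// drop the oldest remaining messages until the character budget is met.
interface Message { role: string; content: string; }

function trimHistory(
  history: Message[],
  maxMessages = 100,
  maxTotalChars = 0 // 0 = unlimited, matching the documented default
): Message[] {
  let trimmed = history.slice(-maxMessages);
  if (maxTotalChars > 0) {
    let total = trimmed.reduce((n, m) => n + m.content.length, 0);
    while (trimmed.length > 1 && total > maxTotalChars) {
      total -= trimmed[0].content.length;
      trimmed = trimmed.slice(1);
    }
  }
  return trimmed;
}
```

Trimming oldest-first preserves the most recent conversational context, which is usually what the model needs.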
Types
VideoAgentConfig
interface VideoAgentConfig {
maxContextFrames: number;
}
Backend configuration for video processing.
VideoFrame
interface VideoFrame {
type: "video_frame";
sessionId: string;
sequence: number;
timestamp: number;
triggerReason: FrameTriggerReason;
previousFrameRef?: string;
image: {
data: string; // Base64-encoded image
format: string; // Image format (e.g., "webp")
width: number; // Frame width in pixels
height: number; // Frame height in pixels
};
}
Video frame data structure sent to/from the client.
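For reference, a client could assemble this message shape as follows. The builder and its module-level sequence counter are illustrative assumptions, not library exports:

```typescript
type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";

interface VideoFrame {
  type: "video_frame";
  sessionId: string;
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  previousFrameRef?: string;
  image: { data: string; format: string; width: number; height: number };
}

let nextSequence = 0; // hypothetical client-side monotonic counter

// Hypothetical builder, not a library export.
function makeVideoFrame(
  sessionId: string,
  data: string,
  triggerReason: FrameTriggerReason,
  width: number,
  height: number,
  format = "webp"
): VideoFrame {
  return {
    type: "video_frame",
    sessionId,
    sequence: nextSequence++,
    timestamp: Date.now(),
    triggerReason,
    image: { data, format, width, height },
  };
}
```

The resulting object can be serialized with JSON.stringify and sent over the WebSocket.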
FrameContext
interface FrameContext {
sequence: number;
timestamp: number;
triggerReason: FrameTriggerReason;
frameHash: string;
description?: string;
}
Frame context for maintaining visual conversation history.
FrameTriggerReason
type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";
Reason why a frame was captured.
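When validating messages coming off the wire, a runtime guard for this union can be handy. isFrameTriggerReason below is an illustrative helper, not a library export:

```typescript
type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";

const FRAME_TRIGGER_REASONS = ["scene_change", "user_request", "timer", "initial"] as const;

// Narrows unknown input (e.g. a field from a parsed WebSocket message)
// to the FrameTriggerReason union.
function isFrameTriggerReason(value: unknown): value is FrameTriggerReason {
  return (FRAME_TRIGGER_REASONS as readonly string[]).includes(value as string);
}
```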
Complete Example
import { VideoAgent } from "@dimension-labs/voice-agent-ai";
import { openai } from "@ai-sdk/openai";
import { WebSocketServer } from "ws";
const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (socket) => {
const agent = new VideoAgent({
model: openai("gpt-4o"),
transcriptionModel: openai.transcription("whisper-1"),
speechModel: openai.speech("tts-1"),
instructions: "You are a helpful assistant that can see and hear.",
voice: "alloy",
maxContextFrames: 15,
});
agent.handleSocket(socket);
// Listen to events
agent.on("frame_received", (data) => {
console.log(`Frame ${data.sequence}: ${data.size} bytes`);
});
agent.on("text", ({ role, text, hasImage }) => {
console.log(`${role}${hasImage ? " (with frame)" : ""}: ${text}`);
});
agent.on("disconnected", () => {
console.log("Client disconnected");
agent.destroy();
});
agent.on("error", (error) => {
console.error("Agent error:", error);
});
});
console.log("Video agent server listening on port 8080");