## Overview
VideoAgent extends the voice agent architecture to support vision-enabled models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. It processes video frames alongside audio/text input for multimodal conversations.
## Key Differences from VoiceAgent

- **Frame Processing**: Captures and analyzes video frames from the user's camera
- **Multimodal Context**: Combines text, audio, and visual input in a single conversation
- **Frame History**: Maintains a buffer of recent frames for temporal context
- **Session Management**: Tracks session IDs and frame sequences for client sync
## Vision Model Requirements

The `model` parameter must be a vision-enabled model to process video frames:

```typescript
import { VideoAgent } from 'voice-agent-ai-sdk';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { google } from '@ai-sdk/google';

// ✅ Vision-enabled models
const agent = new VideoAgent({
  model: openai('gpt-4o'), // OpenAI GPT-4o
  // OR
  // model: anthropic('claude-3-5-sonnet-latest'), // Anthropic Claude 3.5 Sonnet
  // OR
  // model: google('gemini-1.5-pro'), // Google Gemini 1.5 Pro
  transcriptionModel: openai.transcription('whisper-1'),
  speechModel: openai.speech('gpt-4o-mini-tts'),
  instructions: `You are a helpful multimodal AI assistant that can see through the user's camera and hear their voice.
When analyzing images, be concise but informative.
Keep responses conversational since they will be spoken aloud.`,
});
```

Using a non-vision model (e.g., `gpt-3.5-turbo`) will cause errors when video frames are sent.
## Frame Data Structure

Video frames are transmitted with metadata for tracking and synchronization:

```typescript
interface VideoFrame {
  type: 'video_frame';
  sessionId: string;                 // Unique session identifier
  sequence: number;                  // Frame sequence number (auto-incrementing)
  timestamp: number;                 // Capture timestamp (milliseconds)
  triggerReason: FrameTriggerReason; // Why this frame was captured
  previousFrameRef?: string;         // Hash of previous frame (for change detection)
  image: {
    data: string;   // Base64-encoded image
    format: string; // 'webp', 'jpeg', 'png'
    width: number;  // Frame width in pixels
    height: number; // Frame height in pixels
  };
}

type FrameTriggerReason =
  | 'scene_change'  // Automatic detection of scene change
  | 'user_request'  // User explicitly requested frame
  | 'timer'         // Periodic capture
  | 'initial';      // First frame of session
```
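The `previousFrameRef` field lets clients support change detection. As an illustration (the helper names below are hypothetical, not part of the SDK), a client might hash each outgoing frame and classify the capture accordingly:

```typescript
// Hypothetical client-side helpers for populating previousFrameRef and
// deciding a frame's triggerReason -- not part of the SDK itself.

// FNV-1a hash over the base64 payload; a cheap stand-in for a real
// perceptual hash (which would tolerate minor pixel noise).
function frameHash(base64Data: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < base64Data.length; i++) {
    hash ^= base64Data.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

type FrameTriggerReason = 'scene_change' | 'user_request' | 'timer' | 'initial';

// The first frame of a session is 'initial'; later frames are
// 'scene_change' when the hash differs from the previous frame's,
// otherwise fall back to 'timer'.
function classifyFrame(
  prevHash: string | undefined,
  newHash: string,
): FrameTriggerReason {
  if (prevHash === undefined) return 'initial';
  return prevHash === newHash ? 'timer' : 'scene_change';
}
```

The previous frame's hash would then be sent as `previousFrameRef` alongside the new frame.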
## Sending Video Frames

### From Client (WebSocket)

Clients send frames as WebSocket messages:

```typescript
// Client-side example
const frameData = canvas.toDataURL('image/webp').split(',')[1]; // Base64

webSocket.send(JSON.stringify({
  type: 'video_frame',
  sessionId: sessionId,
  sequence: frameCount++,
  timestamp: Date.now(),
  triggerReason: 'user_request',
  image: {
    data: frameData,
    format: 'webp',
    width: 640,
    height: 480,
  },
}));
```
### From Server (Direct API)

Send frames programmatically with an optional text query:

```typescript
// Send frame with query
const response = await agent.sendFrame(
  base64FrameData,
  'What do you see in this image?',
  {
    width: 1280,
    height: 720,
    format: 'jpeg',
  }
);

// Send frame without query (updates visual context only)
await agent.sendFrame(base64FrameData);
```
## Frame Context Buffer

VideoAgent maintains a sliding window of recent frames for temporal awareness:

```typescript
interface FrameContext {
  sequence: number;  // Frame sequence
  timestamp: number; // When captured
  triggerReason: FrameTriggerReason;
  frameHash: string; // Content hash for deduplication
  description?: string; // Optional AI-generated description
}
```
### Configuration

```typescript
const agent = new VideoAgent({
  model: openai('gpt-4o'),
  maxContextFrames: 10,               // Keep last 10 frames in buffer (default: 10)
  maxFrameInputSize: 5 * 1024 * 1024, // 5 MB limit (default: 5 MB)
});

// Get current frame context
const frames = agent.getFrameContext();
console.log(`Buffered ${frames.length} frames`);

// Update configuration at runtime
agent.updateConfig({
  maxContextFrames: 20,
});
```
When the buffer exceeds `maxContextFrames`, the oldest frame is removed (FIFO).
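The eviction and deduplication behavior can be sketched as follows (an illustrative model of the buffer, using the `frameHash` field for dedup; the SDK's internal implementation may differ):

```typescript
// Illustrative sketch of the frame-context buffer: FIFO eviction past
// maxContextFrames, plus frameHash-based deduplication of back-to-back
// identical frames. Not the SDK's actual internals.

interface FrameContext {
  sequence: number;
  timestamp: number;
  frameHash: string;
}

class FrameContextBuffer {
  private frames: FrameContext[] = [];

  constructor(private maxContextFrames: number) {}

  push(frame: FrameContext): void {
    // Skip exact duplicates of the most recent frame.
    const latest = this.frames[this.frames.length - 1];
    if (latest && latest.frameHash === frame.frameHash) return;

    this.frames.push(frame);
    // Evict oldest frames once the buffer exceeds its limit (FIFO).
    while (this.frames.length > this.maxContextFrames) {
      this.frames.shift();
    }
  }

  get length(): number {
    return this.frames.length;
  }
}
```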
## Multimodal Content Building

The agent constructs multimodal messages for vision models:

```typescript
// Internal implementation
private buildMultimodalContent(
  text: string,
  frameData?: string
): Array<{ type: 'text'; text: string } | { type: 'image'; image: string }> {
  const content = [];

  // Add frame context summary
  if (this.frameContextBuffer.length > 0) {
    const contextSummary = `[Visual context: ${this.frameContextBuffer.length} frames captured, latest at ${new Date(this.lastFrameTimestamp).toISOString()}]`;
    content.push({ type: 'text', text: contextSummary });
  }

  // Add image data
  const imageData = frameData || this.currentFrameData;
  if (imageData) {
    content.push({ type: 'image', image: imageData });
  }

  // Add user query
  content.push({ type: 'text', text });

  return content;
}
```
This produces messages in the AI SDK's multimodal format:

```typescript
{
  role: 'user',
  content: [
    { type: 'text', text: '[Visual context: 3 frames captured...]' },
    { type: 'image', image: 'base64EncodedImageData' },
    { type: 'text', text: 'What am I holding?' },
  ],
}
```
## Frame Capture Triggers

The agent can request frame captures from the client:

```typescript
// Request frame capture with reason
agent.requestFrameCapture('scene_change');

// Client receives message
{
  type: 'capture_frame',
  reason: 'scene_change',
  timestamp: 1678901234567,
}
```
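On the client, one way to answer a `capture_frame` request is to capture a frame and echo the server-supplied reason back as the `triggerReason`. A minimal sketch (the `buildFrameMessage` helper is hypothetical; actual canvas capture and WebSocket sending are left to your client code):

```typescript
// Hypothetical client-side helper: given a server capture request, build
// the video_frame message to send back, echoing the server's reason.
type FrameTriggerReason = 'scene_change' | 'user_request' | 'timer' | 'initial';

interface CaptureRequest {
  type: 'capture_frame';
  reason: FrameTriggerReason;
  timestamp: number;
}

interface FrameImage {
  data: string;   // Base64-encoded image
  format: string; // e.g. 'webp'
  width: number;
  height: number;
}

function buildFrameMessage(
  request: CaptureRequest,
  sessionId: string,
  sequence: number,
  image: FrameImage,
) {
  return {
    type: 'video_frame' as const,
    sessionId,
    sequence,
    timestamp: Date.now(),
    triggerReason: request.reason, // echo why the server asked
    image,
  };
}
```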
### Automatic Frame Requests

The agent automatically requests frames when:

- User speaks (transcript received)
- Audio input is received

```typescript
// Implementation
case 'transcript':
  this.interruptCurrentResponse('user_speaking');
  this.requestFrameCapture('user_request'); // ← Auto-capture
  await this.enqueueTextInput(message.text);
  break;

case 'audio':
  this.interruptCurrentResponse('user_speaking');
  this.requestFrameCapture('user_request'); // ← Auto-capture
  await this.handleAudioInput(message.data, message.format);
  break;
```
## Audio + Video Processing

VideoAgent inherits all audio capabilities from VoiceAgent:

```typescript
// Audio transcription
const text = await agent.transcribeAudio(audioBuffer);

// Text-to-speech
const audioData = await agent.generateSpeechFromText('Response text');

// Combined audio + video input
await agent.sendAudio(base64Audio); // Transcribed to text
// Agent automatically requests current frame
// LLM receives: text query + latest frame
```
## Session Management

Each VideoAgent instance has a unique session ID:

```typescript
// Auto-generated session ID
const agent = new VideoAgent({ model: openai('gpt-4o') });
console.log(agent.getSessionId()); // "vs_abc123_xyz789"

// Custom session ID
const customAgent = new VideoAgent({
  model: openai('gpt-4o'),
  sessionId: 'my-session-id',
});
```
Session IDs are sent to clients for synchronization:
```typescript
// Sent when client connects
{
  type: 'session_init',
  sessionId: 'vs_abc123_xyz789',
}
```
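A client might track this handshake with a small session object that stores the ID and restarts the per-session frame sequence (a hypothetical sketch, not an SDK API):

```typescript
// Hypothetical client-side session tracker: store the server's session ID
// on session_init and restart the frame sequence counter, since frame
// sequence numbers are scoped to a session.
class ClientSession {
  sessionId: string | null = null;
  private nextFrameSequence = 0;

  handleMessage(message: { type: string; sessionId?: string }): void {
    if (message.type === 'session_init' && message.sessionId) {
      this.sessionId = message.sessionId;
      this.nextFrameSequence = 0; // sequences restart per session
    }
  }

  // Returns the sequence number to use for the next outgoing frame.
  nextSequence(): number {
    return this.nextFrameSequence++;
  }
}
```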
## Events

VideoAgent emits all VoiceAgent events plus:

`frame_received`: a video frame was successfully processed.

```typescript
{
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  size: number; // Bytes
  dimensions: { width: number; height: number };
}
```

Frame capture requested from the client:

```typescript
{
  reason: FrameTriggerReason;
}
```

Client sent a ready signal with its capabilities:

```typescript
{
  capabilities: {
    video: boolean;
    audio: boolean;
    // ... other client capabilities
  };
}
```

Configuration updated via `updateConfig()`.
## State Properties

VideoAgent adds these properties to VoiceAgent's state:

```typescript
agent.currentFrameSequence; // Current frame count
agent.hasVisualContext;     // Has at least one frame?
```
## Example: Video Chat Application

```typescript
import { VideoAgent } from 'voice-agent-ai-sdk';
import { openai } from '@ai-sdk/openai';
import WebSocket from 'ws';

const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (socket) => {
  const agent = new VideoAgent({
    model: openai('gpt-4o'),
    transcriptionModel: openai.transcription('whisper-1'),
    speechModel: openai.speech('gpt-4o-mini-tts'),
    instructions: `You are a helpful AI assistant with vision capabilities.
Describe what you see when asked.
Keep responses conversational.`,
    maxContextFrames: 5,
    maxFrameInputSize: 5 * 1024 * 1024,
  });

  agent.handleSocket(socket);

  // Log frame events
  agent.on('frame_received', ({ sequence, size, dimensions }) => {
    console.log(`Frame #${sequence}: ${size} bytes, ${dimensions.width}x${dimensions.height}`);
  });

  agent.on('text', ({ role, text }) => {
    console.log(`${role}: ${text}`);
  });

  // Request periodic frame captures (optional)
  const frameTimer = setInterval(() => {
    if (agent.connected) {
      agent.requestFrameCapture('timer');
    }
  }, 5000);

  agent.on('disconnected', () => {
    clearInterval(frameTimer);
    agent.destroy();
  });
});

console.log('Video agent server running on ws://localhost:8080');
```
## Frame Size Limits

To prevent memory issues and excessive bandwidth:

```typescript
const agent = new VideoAgent({
  model: openai('gpt-4o'),
  maxFrameInputSize: 5 * 1024 * 1024, // 5 MB (default)
});

// Frames exceeding the limit trigger an error event
agent.on('error', (error) => {
  // Error: Frame too large (6.2 MB). Maximum allowed: 5.0 MB
});
```
Recommended frame settings:

- **Format**: `webp` (best compression)
- **Resolution**: 640x480 to 1280x720
- **Quality**: 0.7-0.85 (JPEG/WebP quality)
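To stay within the recommended resolution, a client can downscale its capture dimensions before encoding. A small helper (hypothetical, not part of the SDK) that fits a frame within 1280x720 while preserving aspect ratio:

```typescript
// Scale (width, height) down so both fit within the given bounds,
// preserving aspect ratio. Never upscales small frames.
function fitWithin(
  width: number,
  height: number,
  maxWidth = 1280,
  maxHeight = 720,
): { width: number; height: number } {
  const scale = Math.min(1, maxWidth / width, maxHeight / height);
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```

Draw the source video onto a canvas at the returned dimensions, then encode as WebP before sending.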
## Best Practices

Don't send a frame for every captured video frame (30-60 FPS). Instead:

- Capture on user speech events (automatic)
- Use scene change detection
- Timer-based capture (1-5 second intervals)
- On-demand user requests
Use WebP format with 0.8 quality for the best size/quality balance:

```typescript
canvas.toBlob((blob) => {
  // Convert blob to base64
}, 'image/webp', 0.8);
```
The model receives frame context automatically, so there is no need to reference "the image" explicitly:

```typescript
// ❌ Redundant
await agent.sendFrame(frameData, 'Look at the image. What is in the image?');

// ✅ Natural
await agent.sendFrame(frameData, 'What do you see?');
```
Always destroy agents on disconnect:

```typescript
agent.on('disconnected', () => {
  agent.destroy(); // Releases frame buffers and resources
});
```
## Next Steps

- **VoiceAgent**: Learn about the base voice agent architecture
- **Streaming Speech**: Understand audio synthesis and chunking
- **WebSocket Protocol**: Explore video frame message types
- **Quick Start**: Build your first video agent