The VideoAgent extends voice capabilities with vision, allowing your agent to see what users are showing via webcam and respond intelligently.

Complete Video Server Example

import "dotenv/config";
import { WebSocketServer } from "ws";
import { VideoAgent } from "voice-agent-ai";
import { tool } from "ai";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";
import { mkdirSync, writeFileSync } from "fs";
import { join } from "path";

// Optional: Save frames to disk for debugging
const FRAMES_DIR = join(__dirname, "frames");
mkdirSync(FRAMES_DIR, { recursive: true });

let frameCounter = 0;

function saveFrame(msg: {
    sequence?: number;
    timestamp?: number;
    triggerReason?: string;
    image: { data: string; format?: string; width?: number; height?: number };
}) {
    const idx = frameCounter++;
    const ext = msg.image.format === "jpeg" ? "jpg" : (msg.image.format || "webp");
    const filename = `frame_${String(idx).padStart(5, "0")}.${ext}`;
    const filepath = join(FRAMES_DIR, filename);

    const buf = Buffer.from(msg.image.data, "base64");
    writeFileSync(filepath, buf);

    console.log(
        `[frames] Saved ${filename} (${(buf.length / 1024).toFixed(1)} kB, ` +
        `${msg.image.width}×${msg.image.height}, ${msg.triggerReason})`
    );
}

const endpoint = process.env.VIDEO_WS_ENDPOINT || "ws://localhost:8081";
const url = new URL(endpoint);
const port = Number(url.port || 8081);
const host = url.hostname || "localhost";

// Define tools
const weatherTool = tool({
    description: "Get the weather in a location",
    inputSchema: z.object({
        location: z.string().describe("The location to get the weather for"),
    }),
    execute: async ({ location }) => ({
        location,
        temperature: 72 + Math.floor(Math.random() * 21) - 10,
        conditions: ["sunny", "cloudy", "rainy", "partly cloudy"][
            Math.floor(Math.random() * 4)
        ],
    }),
});

const timeTool = tool({
    description: "Get the current time",
    inputSchema: z.object({}),
    execute: async () => ({
        time: new Date().toLocaleTimeString(),
        timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
    }),
});

const wss = new WebSocketServer({ port, host });

wss.on("listening", () => {
    console.log(`[video-ws] listening on ${endpoint}`);
    console.log(`[video-ws] Connect your video client to ${endpoint}`);
});

wss.on("connection", (socket) => {
    console.log("[video-ws] ✓ client connected");

    const agent = new VideoAgent({
        model: openai("gpt-4o"),  // Vision-enabled model required
        transcriptionModel: openai.transcription("whisper-1"),
        speechModel: openai.speech("gpt-4o-mini-tts"),
        instructions: `You are a helpful video+voice assistant.
You can SEE what the user is showing via webcam.
Describe what you see when it helps answer the question.
Keep spoken answers concise and natural.`,
        voice: "echo",
        streamingSpeech: {
            minChunkSize: 25,
            maxChunkSize: 140,
            parallelGeneration: true,
            maxParallelRequests: 3,
        },
        tools: { getWeather: weatherTool, getTime: timeTool },
        
        // Video-specific configuration
        maxContextFrames: 6,           // Keep last 6 frames in context
        maxFrameInputSize: 2_500_000,  // ~2.5 MB max frame size
    });

    // Text and streaming events
    agent.on("text", (data: { role: string; text: string }) => {
        console.log(`[video] Text (${data.role}): ${data.text?.substring(0, 100)}...`);
    });
    
    agent.on("chunk:text_delta", (data: { text: string }) => {
        process.stdout.write(data.text || "");
    });

    // Video frame events
    agent.on("frame_received", ({ sequence, size, dimensions, triggerReason }) => {
        console.log(
            `[video] Frame #${sequence} (${triggerReason}) ` +
            `${Math.round(size / 1024)} kB ${dimensions.width}×${dimensions.height}`
        );
    });
    
    agent.on("frame_requested", ({ reason }) => {
        console.log(`[video] Requested frame: ${reason}`);
    });

    // Audio and transcription events
    agent.on("audio_received", ({ size, format }) => {
        console.log(`[video] Audio received: ${size} bytes, format: ${format}`);
    });
    
    agent.on("transcription", ({ text, language }) => {
        console.log(`[video] Transcription: "${text}" (${language || "unknown"})`);
    });

    // Speech events
    agent.on("speech_start", () => console.log(`[video] Speech started`));
    agent.on("speech_complete", () => console.log(`[video] Speech complete`));
    agent.on("audio_chunk", ({ chunkId, text }) => {
        console.log(`[video] Audio chunk #${chunkId}: "${text?.substring(0, 50)}..."`);
    });

    // Error handling
    agent.on("error", (error: Error) => {
        console.error(`[video] ERROR:`, error);
    });
    
    agent.on("warning", (warning: string) => {
        console.warn(`[video] WARNING:`, warning);
    });

    agent.on("disconnected", () => {
        agent.destroy();
        console.log("[video-ws] ✗ client disconnected (agent destroyed)");
    });

    // Intercept raw messages to save frames to disk (optional)
    socket.on("message", (raw) => {
        try {
            const msg = JSON.parse(raw.toString());
            if (msg.type === "video_frame" && msg.image?.data) {
                saveFrame(msg);
            }
        } catch {
            // Not JSON - ignore, agent will handle binary etc.
        }
    });

    // Hand socket to agent
    agent.handleSocket(socket);
});

process.on("SIGINT", () => {
    console.log("\n[video-ws] Shutting down...");
    wss.close(() => {
        console.log("[video-ws] Server closed");
        process.exit(0);
    });
});

Vision-Enabled Models

The VideoAgent requires a vision-capable model:

OpenAI

import { openai } from "@ai-sdk/openai";

const agent = new VideoAgent({
    model: openai("gpt-4o"),        // ✅ Supports vision
    // model: openai("gpt-4-turbo"), // ✅ Also supports vision
    // ...
});

Anthropic

import { anthropic } from "@ai-sdk/anthropic";

const agent = new VideoAgent({
    model: anthropic("claude-3-5-sonnet-20241022"),  // ✅ Supports vision
    // ...
});

Google

import { google } from "@ai-sdk/google";

const agent = new VideoAgent({
    model: google("gemini-1.5-flash"),  // ✅ Supports vision
    // model: google("gemini-1.5-pro"),  // ✅ Also supports vision
    // ...
});

Frame Management

Configuration Options

const agent = new VideoAgent({
    // Maximum frames to keep in context buffer
    // Higher = more visual history, more tokens used
    maxContextFrames: 6,  // Default: 10
    
    // Maximum frame size in bytes
    // Larger frames = better quality, more bandwidth
    maxFrameInputSize: 2_500_000,  // Default: 5 MB
    
    // ...
});

Frame Context Buffer

The agent maintains a rolling buffer of recent frames:

// Get current frame context
const frames = agent.getFrameContext();
console.log(`Buffered frames: ${frames.length}`);

frames.forEach(frame => {
    console.log(`Frame #${frame.sequence}: ${frame.triggerReason}`);
});

// Clear frame history
agent.clearHistory(); // Also clears frame buffer

Requesting Frames

You can programmatically request the client to capture a frame:

// Request frame with reason
agent.requestFrameCapture("user_request");

// Trigger reasons:
// - "scene_change": Automatic capture on scene change
// - "user_request": Manual request from server
// - "timer": Periodic capture
// - "initial": First frame on connection
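
The trigger reasons above form a small closed set, so it can be convenient to model them as a union type and narrow untrusted strings against it. This is an illustrative sketch (the library may not export such a type itself):

```typescript
// The four documented trigger reasons as a closed union type.
// Note: illustrative only; voice-agent-ai may not export this type.
type TriggerReason = "scene_change" | "user_request" | "timer" | "initial";

const TRIGGER_REASONS: readonly TriggerReason[] = [
    "scene_change",
    "user_request",
    "timer",
    "initial",
];

// Narrow an untrusted string (e.g. from a parsed client message).
function isTriggerReason(value: string): value is TriggerReason {
    return (TRIGGER_REASONS as readonly string[]).includes(value);
}
```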

Client WebSocket Messages

Video Frame

socket.send(JSON.stringify({
    type: "video_frame",
    sessionId: "session_123",
    sequence: 1,
    timestamp: Date.now(),
    triggerReason: "scene_change",
    image: {
        data: base64EncodedImage,
        format: "webp",  // or "jpeg", "png"
        width: 640,
        height: 480
    }
}));
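
If you assemble this message in several places, a small builder keeps the shape consistent. `buildVideoFrameMessage` below is a hypothetical helper mirroring the payload documented above, not a library export:

```typescript
// Shape of the documented "video_frame" client message.
interface VideoFrameMessage {
    type: "video_frame";
    sessionId: string;
    sequence: number;
    timestamp: number;
    triggerReason: string;
    image: {
        data: string;                       // base64-encoded image bytes
        format: "webp" | "jpeg" | "png";
        width: number;
        height: number;
    };
}

// Hypothetical helper: wraps base64 image data in the documented envelope.
function buildVideoFrameMessage(
    sessionId: string,
    sequence: number,
    triggerReason: string,
    image: VideoFrameMessage["image"],
): VideoFrameMessage {
    return {
        type: "video_frame",
        sessionId,
        sequence,
        timestamp: Date.now(),
        triggerReason,
        image,
    };
}
```

Usage would then be `socket.send(JSON.stringify(buildVideoFrameMessage(...)))`.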

Audio (same as VoiceAgent)

socket.send(JSON.stringify({
    type: "audio",
    sessionId: "session_123",
    data: base64AudioData,
    format: "mp3",
    timestamp: Date.now()
}));

Text Transcript

socket.send(JSON.stringify({
    type: "transcript",
    text: "What am I holding?"
}));

Server Responses

In addition to standard VoiceAgent responses, VideoAgent sends:

Frame Acknowledgment

{
    "type": "frame_ack",
    "sequence": 1,
    "timestamp": 1234567890
}

Frame Request

{
    "type": "capture_frame",
    "reason": "user_request",
    "timestamp": 1234567890
}

Session Initialization

{
    "type": "session_init",
    "sessionId": "vs_abc123_xyz789"
}
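
On the client, these three message types can be routed with a small dispatcher keyed on the documented `type` field. The sketch below is an assumption about how you might structure a client (the callback names are made up, not part of the protocol):

```typescript
// Union of the documented VideoAgent server messages.
type ServerMessage =
    | { type: "frame_ack"; sequence: number; timestamp: number }
    | { type: "capture_frame"; reason: string; timestamp: number }
    | { type: "session_init"; sessionId: string };

// Route a parsed server message; returns the message type for easy testing.
function routeServerMessage(
    msg: ServerMessage,
    handlers: {
        onFrameAck?: (sequence: number) => void;      // assumed callback name
        onCaptureRequest?: (reason: string) => void;  // assumed callback name
        onSessionInit?: (sessionId: string) => void;  // assumed callback name
    },
): string {
    switch (msg.type) {
        case "frame_ack":
            handlers.onFrameAck?.(msg.sequence);
            return "frame_ack";
        case "capture_frame":
            handlers.onCaptureRequest?.(msg.reason);
            return "capture_frame";
        case "session_init":
            handlers.onSessionInit?.(msg.sessionId);
            return "session_init";
    }
}
```

A typical client would call this from its WebSocket `message` handler after `JSON.parse`, capturing and sending a fresh frame whenever `capture_frame` arrives.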

Usage Patterns

Visual Question Answering

// User asks: "What color is this?"
// Agent automatically uses latest frame from buffer
// Response: "That's a blue coffee mug."

Scene Understanding

// Agent can describe scenes
// User: "Describe what you see"
// Response: "I can see a desk with a laptop, a coffee mug, 
//            and some papers. The room appears to be an office."

Object Detection

// User: "How many people are in the room?"
// Agent analyzes frame and counts
// Response: "I can see 3 people in the frame."

Visual Context in Conversations

// User: "What's the weather like?" (shows window view)
// Agent sees sunny sky in frame
// Response: "Based on what I can see through your window, 
//            it looks sunny! Let me check the forecast..." 
// [Tool call: getWeather]

Performance Optimization

Token Usage

Each frame adds approximately 100-400 tokens depending on resolution:

// Conservative: fewer frames, less cost
maxContextFrames: 3,

// Balanced: good history, moderate cost
maxContextFrames: 6,

// Rich history: more context, higher cost
maxContextFrames: 10,
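
Given the 100-400 tokens-per-frame range above, a back-of-envelope helper makes the trade-off concrete. The per-frame figure is the rough estimate from this page, not a measured value:

```typescript
// Rough visual token cost per request: frames in context × tokens per frame.
// 100-400 tokens/frame is the approximate range cited above; 250 is a midpoint guess.
function estimateFrameTokens(maxContextFrames: number, tokensPerFrame = 250): number {
    return maxContextFrames * tokensPerFrame;
}

// e.g. estimateFrameTokens(6) → 1500 tokens of visual context per request
```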

Frame Quality vs Bandwidth

// Lower quality, faster transmission
maxFrameInputSize: 500_000,  // 500 KB

// Balanced
maxFrameInputSize: 2_500_000,  // 2.5 MB

// High quality
maxFrameInputSize: 5_000_000,  // 5 MB (default)

Frame Capture Strategy

On the client side, optimize when to send frames:

// Option 1: On scene change (smart, efficient)
if (sceneChangeDetected()) {
    captureAndSendFrame("scene_change");
}

// Option 2: On user speech (contextual)
audioStream.on("start", () => {
    captureAndSendFrame("user_request");
});

// Option 3: Periodic (predictable, may be wasteful)
setInterval(() => {
    captureAndSendFrame("timer");
}, 5000); // Every 5 seconds
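
Option 1 presupposes a scene-change detector. One common, cheap approach is the mean absolute pixel difference between consecutive frames; this sketch works on raw grayscale buffers, and the default threshold is an assumption you would tune for your camera and lighting:

```typescript
// Mean absolute difference between two equal-length grayscale frames (0-255 values).
function meanAbsDiff(prev: Uint8Array, curr: Uint8Array): number {
    if (prev.length !== curr.length) throw new Error("frame size mismatch");
    let total = 0;
    for (let i = 0; i < prev.length; i++) {
        total += Math.abs(prev[i] - curr[i]);
    }
    return total / prev.length;
}

// Treat the scene as changed when the average pixel moved more than `threshold`.
function sceneChanged(prev: Uint8Array, curr: Uint8Array, threshold = 12): boolean {
    return meanAbsDiff(prev, curr) > threshold;
}
```

In a browser client you would feed this with grayscale data from a downscaled `<canvas>` snapshot; a small canvas (e.g. 64×48) is usually enough to detect gross scene changes cheaply.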

Example Output

[video-ws] listening on ws://localhost:8081
[video-ws] Connect your video client to ws://localhost:8081

[video-ws] ✓ client connected
[video] Frame #1 (initial) 234 kB 640×480
[frames] Saved frame_00000.webp (234.5 kB, 640×480, initial)

[video] Audio received: 45320 bytes, format: mp3
[video] Transcription: "What am I holding?" (en)
[video] Frame #2 (user_request) 228 kB 640×480
[frames] Saved frame_00001.webp (228.3 kB, 640×480, user_request)

[video] Text (user): What am I holding?
[video] Speech started
You're holding a blue coffee mug with a white handle.
[video] Audio chunk #1: "You're holding a blue coffee mug with a white..."
[video] Speech complete

Next Steps
