
Overview

The VideoAgent class extends voice capabilities with video frame processing, enabling multimodal AI conversations that can see and hear. It manages real-time video frames, audio input, and conversation history with visual context. Important: Like VoiceAgent, each VideoAgent instance serves a single user. Create a separate instance for each connection.
wss.on("connection", (socket) => {
  const agent = new VideoAgent({ model, ... });
  agent.handleSocket(socket);
  agent.on("disconnected", () => agent.destroy());
});

Constructor

new VideoAgent(options: VideoAgentOptions)
Creates a new video agent instance with the specified configuration.

VideoAgentOptions

model
LanguageModel
required
AI SDK Language Model for chat. Must be a vision-enabled model (e.g., openai('gpt-4o'), anthropic('claude-3-5-sonnet-latest'), google('gemini-1.5-pro')) to process video frames
transcriptionModel
TranscriptionModel
AI SDK Transcription Model for converting speech to text (e.g., openai.transcription('whisper-1'))
speechModel
SpeechModel
AI SDK Speech Model for converting text to speech (e.g., openai.speech('tts-1'))
instructions
string
default:"You are a helpful multimodal AI assistant..."
System instructions for the AI model. Default emphasizes multimodal capabilities
stopWhen
(event: StepEvent) => boolean
default:"stepCountIs(5)"
Condition to stop the stream (from AI SDK)
tools
Record<string, Tool>
AI SDK tools available to the agent
endpoint
string
WebSocket server URL to connect to
voice
string
Voice ID or name for speech synthesis
speechInstructions
string
Instructions for speech synthesis behavior
outputFormat
string
Audio output format (e.g., "mp3", "opus", "pcm")
streamingSpeech
Partial<StreamingSpeechConfig>
Configuration for streaming speech generation:
  • minChunkSize (number, default: 50): Minimum characters before generating speech
  • maxChunkSize (number, default: 200): Maximum characters per chunk
  • parallelGeneration (boolean, default: true): Enable parallel TTS generation
  • maxParallelRequests (number, default: 3): Maximum parallel TTS requests
history
Partial<HistoryConfig>
Configuration for conversation history memory limits:
  • maxMessages (number, default: 100): Maximum number of messages
  • maxTotalChars (number, default: 0): Maximum total characters (0 = unlimited)
maxAudioInputSize
number
default:"10485760"
Maximum audio input size in bytes (default: 10 MB)
maxFrameInputSize
number
default:"5242880"
Maximum frame input size in bytes (default: 5 MB)
maxContextFrames
number
default:"10"
Maximum frames to keep in context buffer for conversation history
sessionId
string
Session ID for this video agent instance. Auto-generated if not provided

Example

import { VideoAgent } from "@dimension-labs/voice-agent-ai";
import { openai } from "@ai-sdk/openai";

const agent = new VideoAgent({
  model: openai("gpt-4o"), // Vision-enabled model
  transcriptionModel: openai.transcription("whisper-1"),
  speechModel: openai.speech("tts-1"),
  instructions: "You are a helpful assistant that can see and hear.",
  voice: "alloy",
  maxContextFrames: 15,
  maxFrameInputSize: 5 * 1024 * 1024, // 5 MB
});
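To make the streamingSpeech settings above concrete, here is an illustrative sketch of how minChunkSize and maxChunkSize could govern when buffered text is flushed to TTS. This is not the library's internal implementation; it only demonstrates the semantics of the two thresholds.

```typescript
// Illustrative only: shows how minChunkSize/maxChunkSize thresholds behave.
// Text is buffered by sentence; a chunk is emitted once it reaches
// minChunkSize, and anything longer than maxChunkSize is hard-split.
function chunkForSpeech(
  text: string,
  minChunkSize = 50,
  maxChunkSize = 200
): string[] {
  const chunks: string[] = [];
  let buffer = "";
  // Split on sentence boundaries so chunks end at natural pauses.
  for (const sentence of text.split(/(?<=[.!?])\s+/)) {
    buffer += (buffer ? " " : "") + sentence;
    if (buffer.length >= minChunkSize) {
      // Hard-split anything over maxChunkSize.
      while (buffer.length > maxChunkSize) {
        chunks.push(buffer.slice(0, maxChunkSize));
        buffer = buffer.slice(maxChunkSize);
      }
      if (buffer.length >= minChunkSize) {
        chunks.push(buffer);
        buffer = "";
      }
    }
  }
  if (buffer) chunks.push(buffer); // flush any remaining tail
  return chunks;
}
```

Smaller minChunkSize lowers time-to-first-audio at the cost of more TTS requests; larger maxChunkSize reduces request count but delays playback of long sentences.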

Methods

Connection Management

connect()

async connect(url?: string): Promise<void>
Connect to a WebSocket server by URL.
url
string
WebSocket URL to connect to. If not provided, uses the endpoint from constructor options or defaults to "ws://localhost:8080"
void
Promise<void>
Resolves when connection is established
Example:
await agent.connect("ws://localhost:8080");

handleSocket()

handleSocket(socket: WebSocket): void
Attach an existing WebSocket (server-side usage).
socket
WebSocket
required
The WebSocket connection to attach
Example:
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  const agent = new VideoAgent({ model, ... });
  agent.handleSocket(socket);
});

disconnect()

disconnect(): void
Disconnect from WebSocket and stop all in-flight work.

destroy()

destroy(): void
Permanently destroy the agent, releasing all resources. Clears conversation history, frame context, and disconnects the WebSocket.

Input Methods

sendText()

async sendText(text: string): Promise<string>
Send text input for processing. If a frame has been received, it will be included in the context.
text
string
required
The text input to process
response
Promise<string>
The full text response from the AI model
Example:
const response = await agent.sendText("What do you see in this frame?");
console.log(response);

sendAudio()

async sendAudio(audioData: string): Promise<void>
Send base64-encoded audio data to be transcribed and processed.
audioData
string
required
Base64-encoded audio data
Example:
const base64Audio = "...";
await agent.sendAudio(base64Audio);

sendAudioBuffer()

async sendAudioBuffer(audioBuffer: Buffer | Uint8Array): Promise<void>
Send raw audio buffer to be transcribed and processed.
audioBuffer
Buffer | Uint8Array
required
Raw audio buffer
Example:
const audioBuffer = fs.readFileSync("audio.wav");
await agent.sendAudioBuffer(audioBuffer);

Video Frame Methods

sendFrame()

async sendFrame(
  frameData: string,
  query?: string,
  options?: { width?: number; height?: number; format?: string }
): Promise<string>
Send a video frame with optional text query for vision analysis.
frameData
string
required
Base64-encoded frame image data
query
string
Optional text query to ask about the frame
options.width
number
default:"640"
Frame width in pixels
options.height
number
default:"480"
Frame height in pixels
options.format
string
default:"webp"
Image format (e.g., "webp", "jpeg", "png")
response
Promise<string>
The AI’s response (empty if no query provided)
Example:
const frameData = canvas.toDataURL("image/webp").split(",")[1];
const response = await agent.sendFrame(
  frameData,
  "What objects do you see?",
  { width: 1280, height: 720, format: "webp" }
);

requestFrameCapture()

requestFrameCapture(reason: FrameTriggerReason): void
Request the client to capture and send a frame.
reason
'scene_change' | 'user_request' | 'timer' | 'initial'
required
Reason for requesting frame capture
Example:
agent.requestFrameCapture("user_request");

Audio Processing

transcribeAudio()

async transcribeAudio(audioData: Buffer | Uint8Array): Promise<string>
Transcribe audio data to text using the configured transcription model.
audioData
Buffer | Uint8Array
required
Raw audio data to transcribe
transcript
Promise<string>
The transcribed text
Example:
const transcript = await agent.transcribeAudio(audioBuffer);

generateSpeechFromText()

async generateSpeechFromText(
  text: string,
  abortSignal?: AbortSignal
): Promise<Uint8Array>
Generate speech from text using the configured speech model.
text
string
required
Text to convert to speech
abortSignal
AbortSignal
Optional abort signal to cancel generation
audio
Promise<Uint8Array>
Generated audio data
Example:
const audio = await agent.generateSpeechFromText("Hello there");

Speech Control

interruptSpeech()

interruptSpeech(reason?: string): void
Interrupt ongoing speech generation and playback.
reason
string
default:"interrupted"
Reason for interruption
Example:
agent.interruptSpeech("user_speaking");

interruptCurrentResponse()

interruptCurrentResponse(reason?: string): void
Interrupt both the current LLM stream and ongoing speech.
reason
string
default:"interrupted"
Reason for interruption
Example:
agent.interruptCurrentResponse("user_speaking");

History & Context Management

clearHistory()

clearHistory(): void
Clear the conversation history and frame context buffer. Example:
agent.clearHistory();

getHistory()

getHistory(): ModelMessage[]
Get the current conversation history.
history
ModelMessage[]
Array of conversation messages
Example:
const history = agent.getHistory();

setHistory()

setHistory(history: ModelMessage[]): void
Set the conversation history (useful for restoring sessions).
history
ModelMessage[]
required
Array of conversation messages to restore
Example:
const savedHistory = loadFromDatabase(userId);
agent.setHistory(savedHistory);
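When restoring a saved session, the history config limits (maxMessages, maxTotalChars) still apply. The sketch below illustrates how those limits could be enforced before calling setHistory; the library's internal trimming may differ, and the Message shape here is simplified for illustration.

```typescript
// Illustrative sketch of the history limits: oldest messages drop first.
// The real ModelMessage type is richer; this simplified shape is enough
// to show the trimming semantics.
type Message = { role: "user" | "assistant" | "system"; content: string };

function trimHistory(
  history: Message[],
  maxMessages = 100,
  maxTotalChars = 0 // 0 = unlimited
): Message[] {
  let trimmed = history.slice(-maxMessages); // keep the newest messages
  if (maxTotalChars > 0) {
    let total = trimmed.reduce((n, m) => n + m.content.length, 0);
    // Drop oldest messages until under the character budget.
    while (trimmed.length > 1 && total > maxTotalChars) {
      total -= trimmed[0].content.length;
      trimmed = trimmed.slice(1);
    }
  }
  return trimmed;
}
```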

getFrameContext()

getFrameContext(): FrameContext[]
Get the current frame context buffer.
frameContext
FrameContext[]
Array of frame context objects with sequence, timestamp, triggerReason, and frameHash
Example:
const frames = agent.getFrameContext();
console.log(`${frames.length} frames in context`);

getSessionId()

getSessionId(): string
Get the session ID for this agent instance.
sessionId
string
The session identifier
Example:
const sessionId = agent.getSessionId();

Configuration

getConfig()

getConfig(): VideoAgentConfig
Get the current video agent configuration.
config
VideoAgentConfig
Configuration object with maxContextFrames
Example:
const config = agent.getConfig();
console.log(`Max frames: ${config.maxContextFrames}`);

updateConfig()

updateConfig(config: Partial<VideoAgentConfig>): void
Update the video agent configuration.
config
Partial<VideoAgentConfig>
required
Configuration updates to apply
Example:
agent.updateConfig({ maxContextFrames: 20 });

Tool Management

registerTools()

registerTools(tools: Record<string, Tool>): void
Register additional tools for the agent to use.
tools
Record<string, Tool>
required
AI SDK tools to register
Example:
import { tool } from "ai";
import { z } from "zod";

agent.registerTools({
  identifyObject: tool({
    description: "Identify an object in the current frame",
    parameters: z.object({
      objectName: z.string(),
    }),
    execute: async ({ objectName }) => {
      // Stub: run your own detection logic for objectName here.
      return { identified: true, confidence: 0.95 };
    },
  }),
});

Properties

All properties are read-only.

connected

get connected(): boolean
Whether the agent is currently connected to a WebSocket.

processing

get processing(): boolean
Whether the agent is currently processing input.

speaking

get speaking(): boolean
Whether the agent is currently speaking.

pendingSpeechChunks

get pendingSpeechChunks(): number
Number of speech chunks pending generation or playback.

destroyed

get destroyed(): boolean
Whether the agent has been permanently destroyed.

currentFrameSequence

get currentFrameSequence(): number
The sequence number of the most recently processed frame.

hasVisualContext

get hasVisualContext(): boolean
Whether the agent currently has visual context (a frame has been received). Example:
if (agent.hasVisualContext) {
  console.log("Agent can see");
}
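The read-only properties combine naturally into a status check. The helper below is purely illustrative; the property names (connected, processing, speaking, hasVisualContext) are the ones documented above, but the helper itself is not part of the library.

```typescript
// Illustrative helper: summarize an agent's state from its read-only
// properties. A real VideoAgent instance satisfies this interface.
interface AgentStatus {
  connected: boolean;
  processing: boolean;
  speaking: boolean;
  hasVisualContext: boolean;
}

function describeAgent(a: AgentStatus): string {
  if (!a.connected) return "disconnected";
  const parts: string[] = [];
  if (a.processing) parts.push("processing");
  if (a.speaking) parts.push("speaking");
  if (a.hasVisualContext) parts.push("has visual context");
  return parts.length ? parts.join(", ") : "idle";
}
```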

Events

The VideoAgent extends EventEmitter and emits all the same events as VoiceAgent, plus additional video-specific events.

Video-Specific Events

frame_received

Emitted when a video frame is received and processed.
agent.on("frame_received", (data) => {
  console.log(`Frame ${data.sequence} received:`, data.dimensions);
  console.log(`Size: ${data.size} bytes, reason: ${data.triggerReason}`);
});

frame_requested

Emitted when the agent requests a frame capture from the client.
agent.on("frame_requested", ({ reason }) => {
  console.log(`Frame requested: ${reason}`);
});

client_ready

Emitted when the client signals it’s ready for video streaming.
agent.on("client_ready", (capabilities) => {
  console.log("Client ready with capabilities:", capabilities);
});

config_changed

Emitted when the agent configuration is updated.
agent.on("config_changed", (config) => {
  console.log("Config updated:", config);
});

Connection Events

Same as VoiceAgent: connected, disconnected, error, warning

Conversation Events

Same as VoiceAgent: text, text_delta, tool_call, tool_result. Note: The text event may include hasImage: true when video frames are involved:
agent.on("text", ({ role, text, hasImage }) => {
  if (hasImage) {
    console.log(`${role} (with image): ${text}`);
  }
});

Speech Events

Same as VoiceAgent: speech_start, speech_complete, speech_interrupted, speech_chunk_queued, audio_chunk, audio
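If you buffer audio_chunk payloads client-side for playback or recording, you eventually need to join them into one buffer. A minimal sketch, assuming chunks arrive as Uint8Array (verify the audio_chunk payload shape in your version before relying on this):

```typescript
// Illustrative: concatenate buffered audio chunks into a single payload.
// Assumes each chunk is a Uint8Array of encoded audio bytes.
function concatAudioChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```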

Transcription Events

Same as VoiceAgent: transcription, audio_received

History Events

Same as VoiceAgent: history_cleared, history_trimmed

Types

VideoAgentConfig

interface VideoAgentConfig {
  maxContextFrames: number;
}
Backend configuration for video processing.

VideoFrame

interface VideoFrame {
  type: "video_frame";
  sessionId: string;
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  previousFrameRef?: string;
  image: {
    data: string;        // Base64-encoded image
    format: string;      // Image format (e.g., "webp")
    width: number;       // Frame width in pixels
    height: number;      // Frame height in pixels
  };
}
Video frame data structure sent to/from the client.
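A client-side sketch of constructing a message matching this interface. The field values here are placeholders, and whether the wire format is exactly JSON over the WebSocket is an assumption to verify against your client integration.

```typescript
// Illustrative: build a VideoFrame-shaped message on the client.
// sessionId/sequence are placeholders supplied by the caller.
function buildVideoFrame(
  sessionId: string,
  sequence: number,
  base64Data: string,
  triggerReason: "scene_change" | "user_request" | "timer" | "initial"
) {
  return {
    type: "video_frame" as const,
    sessionId,
    sequence,
    timestamp: Date.now(),
    triggerReason,
    image: { data: base64Data, format: "webp", width: 640, height: 480 },
  };
}

// e.g. socket.send(JSON.stringify(buildVideoFrame(id, seq, data, "timer")));
```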

FrameContext

interface FrameContext {
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  frameHash: string;
  description?: string;
}
Frame context for maintaining visual conversation history.
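The maxContextFrames option bounds how many of these entries are retained. An illustrative sketch of oldest-first eviction (not the library's internal implementation; the interface is restated so the sketch is self-contained):

```typescript
// Restating the FrameContext shape from above for self-containment.
interface FrameContext {
  sequence: number;
  timestamp: number;
  triggerReason: string;
  frameHash: string;
  description?: string;
}

// Illustrative: append a frame and evict the oldest entries once the
// buffer exceeds maxContextFrames.
function pushFrame(
  buffer: FrameContext[],
  frame: FrameContext,
  maxContextFrames = 10
): FrameContext[] {
  const next = [...buffer, frame];
  return next.length > maxContextFrames ? next.slice(-maxContextFrames) : next;
}
```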

FrameTriggerReason

type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";
Reason why a frame was captured.

Complete Example

import { VideoAgent } from "@dimension-labs/voice-agent-ai";
import { openai } from "@ai-sdk/openai";
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  const agent = new VideoAgent({
    model: openai("gpt-4o"),
    transcriptionModel: openai.transcription("whisper-1"),
    speechModel: openai.speech("tts-1"),
    instructions: "You are a helpful assistant that can see and hear.",
    voice: "alloy",
    maxContextFrames: 15,
  });

  agent.handleSocket(socket);

  // Listen to events
  agent.on("frame_received", (data) => {
    console.log(`Frame ${data.sequence}: ${data.size} bytes`);
  });

  agent.on("text", ({ role, text, hasImage }) => {
    console.log(`${role}${hasImage ? " (with frame)" : ""}: ${text}`);
  });

  agent.on("disconnected", () => {
    console.log("Client disconnected");
    agent.destroy();
  });

  agent.on("error", (error) => {
    console.error("Agent error:", error);
  });
});

console.log("Video agent server listening on port 8080");
