
Overview

The VideoAgent class extends voice capabilities with video frame processing, enabling multimodal AI conversations that can see and hear. It manages real-time video frames, audio input, and conversation history with visual context. Important: Like VoiceAgent, each VideoAgent instance serves a single user. Create a separate instance for each connection.
wss.on("connection", (socket) => {
  const agent = new VideoAgent({ model, ... });
  agent.handleSocket(socket);
  agent.on("disconnected", () => agent.destroy());
});

Constructor

new VideoAgent(options: VideoAgentOptions)
Creates a new video agent instance with the specified configuration.

VideoAgentOptions

model
LanguageModel
required
AI SDK Language Model for chat. Must be a vision-enabled model (e.g., openai('gpt-4o'), anthropic('claude-3-5-sonnet-latest'), google('gemini-1.5-pro')) to process video frames
transcriptionModel
TranscriptionModel
AI SDK Transcription Model for converting speech to text (e.g., openai.transcription('whisper-1'))
speechModel
SpeechModel
AI SDK Speech Model for converting text to speech (e.g., openai.speech('tts-1'))
instructions
string
default:"You are a helpful multimodal AI assistant..."
System instructions for the AI model. Default emphasizes multimodal capabilities
stopWhen
(event: StepEvent) => boolean
default:"stepCountIs(5)"
Condition to stop the stream (from AI SDK)
tools
Record<string, Tool>
AI SDK tools available to the agent
endpoint
string
WebSocket server URL to connect to
voice
string
Voice ID or name for speech synthesis
speechInstructions
string
Instructions for speech synthesis behavior
outputFormat
string
Audio output format (e.g., "mp3", "opus", "pcm")
streamingSpeech
Partial<StreamingSpeechConfig>
Configuration for streaming speech generation:
  • minChunkSize (number, default: 50): Minimum characters before generating speech
  • maxChunkSize (number, default: 200): Maximum characters per chunk
  • parallelGeneration (boolean, default: true): Enable parallel TTS generation
  • maxParallelRequests (number, default: 3): Maximum parallel TTS requests
history
Partial<HistoryConfig>
Configuration for conversation history memory limits:
  • maxMessages (number, default: 100): Maximum number of messages
  • maxTotalChars (number, default: 0): Maximum total characters (0 = unlimited)
maxAudioInputSize
number
default:"10485760"
Maximum audio input size in bytes (default: 10 MB)
maxFrameInputSize
number
default:"5242880"
Maximum frame input size in bytes (default: 5 MB)
maxContextFrames
number
default:"10"
Maximum frames to keep in context buffer for conversation history
sessionId
string
Session ID for this video agent instance. Auto-generated if not provided

Example

import { VideoAgent } from "@dimension-labs/voice-agent-ai";
import { openai } from "@ai-sdk/openai";

const agent = new VideoAgent({
  model: openai("gpt-4o"), // Vision-enabled model
  transcriptionModel: openai.transcription("whisper-1"),
  speechModel: openai.speech("tts-1"),
  instructions: "You are a helpful assistant that can see and hear.",
  voice: "alloy",
  maxContextFrames: 15,
  maxFrameInputSize: 5 * 1024 * 1024, // 5 MB
});
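To make the streamingSpeech settings above concrete, here is an illustrative sketch of how minChunkSize and maxChunkSize could govern when buffered text is flushed to TTS. This is not the library's internal implementation; it only demonstrates the semantics of the two thresholds.

```typescript
// Illustrative only: shows how minChunkSize/maxChunkSize thresholds behave.
// Text is buffered by sentence; a chunk is emitted once it reaches
// minChunkSize, and anything longer than maxChunkSize is hard-split.
function chunkForSpeech(
  text: string,
  minChunkSize = 50,
  maxChunkSize = 200
): string[] {
  const chunks: string[] = [];
  let buffer = "";
  // Split on sentence boundaries so chunks end at natural pauses.
  for (const sentence of text.split(/(?<=[.!?])\s+/)) {
    buffer += (buffer ? " " : "") + sentence;
    if (buffer.length >= minChunkSize) {
      // Hard-split anything over maxChunkSize.
      while (buffer.length > maxChunkSize) {
        chunks.push(buffer.slice(0, maxChunkSize));
        buffer = buffer.slice(maxChunkSize);
      }
      if (buffer.length >= minChunkSize) {
        chunks.push(buffer);
        buffer = "";
      }
    }
  }
  if (buffer) chunks.push(buffer); // flush any remaining tail
  return chunks;
}
```

Smaller minChunkSize lowers time-to-first-audio at the cost of more TTS requests; larger maxChunkSize reduces request count but delays playback of long sentences.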

Methods

Connection Management

connect()

async connect(url?: string): Promise<void>
Connect to a WebSocket server by URL.
url
string
WebSocket URL to connect to. If not provided, uses the endpoint from constructor options or defaults to "ws://localhost:8080"
void
Promise<void>
Resolves when connection is established
Example:
await agent.connect("ws://localhost:8080");

handleSocket()

handleSocket(socket: WebSocket): void
Attach an existing WebSocket (server-side usage).
socket
WebSocket
required
The WebSocket connection to attach
Example:
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  const agent = new VideoAgent({ model, ... });
  agent.handleSocket(socket);
});

disconnect()

disconnect(): void
Disconnect from WebSocket and stop all in-flight work.

destroy()

destroy(): void
Permanently destroy the agent, releasing all resources. Clears conversation history, frame context, and disconnects the WebSocket.

Input Methods

sendText()

async sendText(text: string): Promise<string>
Send text input for processing. If a frame has been received, it will be included in the context.
text
string
required
The text input to process
response
Promise<string>
The full text response from the AI model
Example:
const response = await agent.sendText("What do you see in this frame?");
console.log(response);

sendAudio()

async sendAudio(audioData: string): Promise<void>
Send base64-encoded audio data to be transcribed and processed.
audioData
string
required
Base64-encoded audio data
Example:
const base64Audio = "...";
await agent.sendAudio(base64Audio);

sendAudioBuffer()

async sendAudioBuffer(audioBuffer: Buffer | Uint8Array): Promise<void>
Send raw audio buffer to be transcribed and processed.
audioBuffer
Buffer | Uint8Array
required
Raw audio buffer
Example:
const audioBuffer = fs.readFileSync("audio.wav");
await agent.sendAudioBuffer(audioBuffer);

Video Frame Methods

sendFrame()

async sendFrame(
  frameData: string,
  query?: string,
  options?: { width?: number; height?: number; format?: string }
): Promise<string>
Send a video frame with optional text query for vision analysis.
frameData
string
required
Base64-encoded frame image data
query
string
Optional text query to ask about the frame
options.width
number
default:"640"
Frame width in pixels
options.height
number
default:"480"
Frame height in pixels
options.format
string
default:"webp"
Image format (e.g., "webp", "jpeg", "png")
response
Promise<string>
The AI’s response (empty if no query provided)
Example:
const frameData = canvas.toDataURL("image/webp").split(",")[1];
const response = await agent.sendFrame(
  frameData,
  "What objects do you see?",
  { width: 1280, height: 720, format: "webp" }
);

requestFrameCapture()

requestFrameCapture(reason: FrameTriggerReason): void
Request the client to capture and send a frame.
reason
'scene_change' | 'user_request' | 'timer' | 'initial'
required
Reason for requesting frame capture
Example:
agent.requestFrameCapture("user_request");

Audio Processing

transcribeAudio()

async transcribeAudio(audioData: Buffer | Uint8Array): Promise<string>
Transcribe audio data to text using the configured transcription model.
audioData
Buffer | Uint8Array
required
Raw audio data to transcribe
transcript
Promise<string>
The transcribed text
Example:
const transcript = await agent.transcribeAudio(audioBuffer);

generateSpeechFromText()

async generateSpeechFromText(
  text: string,
  abortSignal?: AbortSignal
): Promise<Uint8Array>
Generate speech from text using the configured speech model.
text
string
required
Text to convert to speech
abortSignal
AbortSignal
Optional abort signal to cancel generation
audio
Promise<Uint8Array>
Generated audio data
Example:
const audio = await agent.generateSpeechFromText("Hello there");

Speech Control

interruptSpeech()

interruptSpeech(reason?: string): void
Interrupt ongoing speech generation and playback.
reason
string
default:"interrupted"
Reason for interruption
Example:
agent.interruptSpeech("user_speaking");

interruptCurrentResponse()

interruptCurrentResponse(reason?: string): void
Interrupt both the current LLM stream and ongoing speech.
reason
string
default:"interrupted"
Reason for interruption
Example:
agent.interruptCurrentResponse("user_speaking");

History & Context Management

clearHistory()

clearHistory(): void
Clear the conversation history and frame context buffer. Example:
agent.clearHistory();

getHistory()

getHistory(): ModelMessage[]
Get the current conversation history.
history
ModelMessage[]
Array of conversation messages
Example:
const history = agent.getHistory();

setHistory()

setHistory(history: ModelMessage[]): void
Set the conversation history (useful for restoring sessions).
history
ModelMessage[]
required
Array of conversation messages to restore
Example:
const savedHistory = loadFromDatabase(userId);
agent.setHistory(savedHistory);
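When restoring a saved session, the history config limits (maxMessages, maxTotalChars) still apply. The sketch below illustrates how those limits could be enforced before calling setHistory; the library's internal trimming may differ, and the Message shape here is simplified for illustration.

```typescript
// Illustrative sketch of the history limits: oldest messages drop first.
// The real ModelMessage type is richer; this simplified shape is enough
// to show the trimming semantics.
type Message = { role: "user" | "assistant" | "system"; content: string };

function trimHistory(
  history: Message[],
  maxMessages = 100,
  maxTotalChars = 0 // 0 = unlimited
): Message[] {
  let trimmed = history.slice(-maxMessages); // keep the newest messages
  if (maxTotalChars > 0) {
    let total = trimmed.reduce((n, m) => n + m.content.length, 0);
    // Drop oldest messages until under the character budget.
    while (trimmed.length > 1 && total > maxTotalChars) {
      total -= trimmed[0].content.length;
      trimmed = trimmed.slice(1);
    }
  }
  return trimmed;
}
```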

getFrameContext()

getFrameContext(): FrameContext[]
Get the current frame context buffer.
frameContext
FrameContext[]
Array of frame context objects with sequence, timestamp, triggerReason, and frameHash
Example:
const frames = agent.getFrameContext();
console.log(`${frames.length} frames in context`);

getSessionId()

getSessionId(): string
Get the session ID for this agent instance.
sessionId
string
The session identifier
Example:
const sessionId = agent.getSessionId();

Configuration

getConfig()

getConfig(): VideoAgentConfig
Get the current video agent configuration.
config
VideoAgentConfig
Configuration object with maxContextFrames
Example:
const config = agent.getConfig();
console.log(`Max frames: ${config.maxContextFrames}`);

updateConfig()

updateConfig(config: Partial<VideoAgentConfig>): void
Update the video agent configuration.
config
Partial<VideoAgentConfig>
required
Configuration updates to apply
Example:
agent.updateConfig({ maxContextFrames: 20 });

Tool Management

registerTools()

registerTools(tools: Record<string, Tool>): void
Register additional tools for the agent to use.
tools
Record<string, Tool>
required
AI SDK tools to register
Example:
import { tool } from "ai";
import { z } from "zod";

agent.registerTools({
  identifyObject: tool({
    description: "Identify an object in the current frame",
    parameters: z.object({
      objectName: z.string(),
    }),
    execute: async ({ objectName }) => {
      // Stub: run your own detection logic for objectName here.
      return { identified: true, confidence: 0.95 };
    },
  }),
});

Properties

All properties are read-only.

connected

get connected(): boolean
Whether the agent is currently connected to a WebSocket.

processing

get processing(): boolean
Whether the agent is currently processing input.

speaking

get speaking(): boolean
Whether the agent is currently speaking.

pendingSpeechChunks

get pendingSpeechChunks(): number
Number of speech chunks pending generation or playback.

destroyed

get destroyed(): boolean
Whether the agent has been permanently destroyed.

currentFrameSequence

get currentFrameSequence(): number
The sequence number of the most recently processed frame.

hasVisualContext

get hasVisualContext(): boolean
Whether the agent currently has visual context (a frame has been received). Example:
if (agent.hasVisualContext) {
  console.log("Agent can see");
}
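The read-only properties combine naturally into a status check. The helper below is purely illustrative; the property names (connected, processing, speaking, hasVisualContext) are the ones documented above, but the helper itself is not part of the library.

```typescript
// Illustrative helper: summarize an agent's state from its read-only
// properties. A real VideoAgent instance satisfies this interface.
interface AgentStatus {
  connected: boolean;
  processing: boolean;
  speaking: boolean;
  hasVisualContext: boolean;
}

function describeAgent(a: AgentStatus): string {
  if (!a.connected) return "disconnected";
  const parts: string[] = [];
  if (a.processing) parts.push("processing");
  if (a.speaking) parts.push("speaking");
  if (a.hasVisualContext) parts.push("has visual context");
  return parts.length ? parts.join(", ") : "idle";
}
```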

Events

The VideoAgent extends EventEmitter and emits all the same events as VoiceAgent, plus additional video-specific events.

Video-Specific Events

frame_received

Emitted when a video frame is received and processed.
agent.on("frame_received", (data) => {
  console.log(`Frame ${data.sequence} received:`, data.dimensions);
  console.log(`Size: ${data.size} bytes, reason: ${data.triggerReason}`);
});

frame_requested

Emitted when the agent requests a frame capture from the client.
agent.on("frame_requested", ({ reason }) => {
  console.log(`Frame requested: ${reason}`);
});

client_ready

Emitted when the client signals it’s ready for video streaming.
agent.on("client_ready", (capabilities) => {
  console.log("Client ready with capabilities:", capabilities);
});

config_changed

Emitted when the agent configuration is updated.
agent.on("config_changed", (config) => {
  console.log("Config updated:", config);
});

Connection Events

Same as VoiceAgent: connected, disconnected, error, warning

Conversation Events

Same as VoiceAgent: text, text_delta, tool_call, tool_result. Note: The text event may include hasImage: true when video frames are involved:
agent.on("text", ({ role, text, hasImage }) => {
  if (hasImage) {
    console.log(`${role} (with image): ${text}`);
  }
});

Speech Events

Same as VoiceAgent: speech_start, speech_complete, speech_interrupted, speech_chunk_queued, audio_chunk, audio
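If you buffer audio_chunk payloads client-side for playback or recording, you eventually need to join them into one buffer. A minimal sketch, assuming chunks arrive as Uint8Array (verify the audio_chunk payload shape in your version before relying on this):

```typescript
// Illustrative: concatenate buffered audio chunks into a single payload.
// Assumes each chunk is a Uint8Array of encoded audio bytes.
function concatAudioChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```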

Transcription Events

Same as VoiceAgent: transcription, audio_received

History Events

Same as VoiceAgent: history_cleared, history_trimmed

Types

VideoAgentConfig

interface VideoAgentConfig {
  maxContextFrames: number;
}
Backend configuration for video processing.

VideoFrame

interface VideoFrame {
  type: "video_frame";
  sessionId: string;
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  previousFrameRef?: string;
  image: {
    data: string;        // Base64-encoded image
    format: string;      // Image format (e.g., "webp")
    width: number;       // Frame width in pixels
    height: number;      // Frame height in pixels
  };
}
Video frame data structure sent to/from the client.
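A client-side sketch of constructing a message matching this interface. The field values here are placeholders, and whether the wire format is exactly JSON over the WebSocket is an assumption to verify against your client integration.

```typescript
// Illustrative: build a VideoFrame-shaped message on the client.
// sessionId/sequence are placeholders supplied by the caller.
function buildVideoFrame(
  sessionId: string,
  sequence: number,
  base64Data: string,
  triggerReason: "scene_change" | "user_request" | "timer" | "initial"
) {
  return {
    type: "video_frame" as const,
    sessionId,
    sequence,
    timestamp: Date.now(),
    triggerReason,
    image: { data: base64Data, format: "webp", width: 640, height: 480 },
  };
}

// e.g. socket.send(JSON.stringify(buildVideoFrame(id, seq, data, "timer")));
```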

FrameContext

interface FrameContext {
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  frameHash: string;
  description?: string;
}
Frame context for maintaining visual conversation history.
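The maxContextFrames option bounds how many of these entries are retained. An illustrative sketch of oldest-first eviction (not the library's internal implementation; the interface is restated so the sketch is self-contained):

```typescript
// Restating the FrameContext shape from above for self-containment.
interface FrameContext {
  sequence: number;
  timestamp: number;
  triggerReason: string;
  frameHash: string;
  description?: string;
}

// Illustrative: append a frame and evict the oldest entries once the
// buffer exceeds maxContextFrames.
function pushFrame(
  buffer: FrameContext[],
  frame: FrameContext,
  maxContextFrames = 10
): FrameContext[] {
  const next = [...buffer, frame];
  return next.length > maxContextFrames ? next.slice(-maxContextFrames) : next;
}
```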

FrameTriggerReason

type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";
Reason why a frame was captured.

Complete Example

import { VideoAgent } from "@dimension-labs/voice-agent-ai";
import { openai } from "@ai-sdk/openai";
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  const agent = new VideoAgent({
    model: openai("gpt-4o"),
    transcriptionModel: openai.transcription("whisper-1"),
    speechModel: openai.speech("tts-1"),
    instructions: "You are a helpful assistant that can see and hear.",
    voice: "alloy",
    maxContextFrames: 15,
  });

  agent.handleSocket(socket);

  // Listen to events
  agent.on("frame_received", (data) => {
    console.log(`Frame ${data.sequence}: ${data.size} bytes`);
  });

  agent.on("text", ({ role, text, hasImage }) => {
    console.log(`${role}${hasImage ? " (with frame)" : ""}: ${text}`);
  });

  agent.on("disconnected", () => {
    console.log("Client disconnected");
    agent.destroy();
  });

  agent.on("error", (error) => {
    console.error("Agent error:", error);
  });
});

console.log("Video agent server listening on port 8080");
