
Core Interfaces

SpeechChunk

Represents a chunk of text to be converted to speech during streaming text-to-speech generation.
id
number
required
Unique identifier for this speech chunk in the generation queue
text
string
required
The text content to be converted to speech
audioPromise
Promise<Uint8Array | null>
Promise that resolves to the generated audio data, or null if generation fails
interface SpeechChunk {
  id: number;
  text: string;
  audioPromise?: Promise<Uint8Array | null>;
}
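As a minimal sketch of how a consumer might use `audioPromise`, the helper below awaits each chunk in `id` order and skips chunks whose generation has not started or has failed. `collectAudioInOrder` is a hypothetical name for illustration, not part of the library.

```typescript
interface SpeechChunk {
  id: number;
  text: string;
  audioPromise?: Promise<Uint8Array | null>;
}

// Hypothetical helper: await every chunk's audio in queue order.
// Chunks without a promise, or whose promise resolves to null, are skipped.
async function collectAudioInOrder(chunks: SpeechChunk[]): Promise<Uint8Array[]> {
  const ordered = [...chunks].sort((a, b) => a.id - b.id);
  const result: Uint8Array[] = [];
  for (const chunk of ordered) {
    if (!chunk.audioPromise) continue; // generation was never started
    const audio = await chunk.audioPromise;
    if (audio === null) continue; // generation failed
    result.push(audio);
  }
  return result;
}
```

Because the promises are created before they are awaited, generation can still run in parallel; only playback order is serialized here.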

StreamingSpeechConfig

Configuration for streaming speech behavior and parallel TTS generation.
minChunkSize
number
required
Minimum number of characters to accumulate before generating speech for a chunk. Default: 50
maxChunkSize
number
required
Maximum characters per chunk. Text is split at a sentence boundary before reaching this limit. Default: 200
parallelGeneration
boolean
required
Whether to enable parallel TTS generation for multiple chunks. Default: true
maxParallelRequests
number
required
Maximum number of parallel TTS requests allowed at once. Default: 3
interface StreamingSpeechConfig {
  minChunkSize: number;
  maxChunkSize: number;
  parallelGeneration: boolean;
  maxParallelRequests: number;
}

// Default configuration
const DEFAULT_STREAMING_SPEECH_CONFIG: StreamingSpeechConfig = {
  minChunkSize: 50,
  maxChunkSize: 200,
  parallelGeneration: true,
  maxParallelRequests: 3,
};
Usage:
const agent = new VoiceAgent({
  model: openai("gpt-4o"),
  streamingSpeech: {
    minChunkSize: 40,
    maxChunkSize: 180,
    parallelGeneration: true,
    maxParallelRequests: 2,
  },
});
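To make the interplay of `minChunkSize` and `maxChunkSize` concrete, here is a rough sketch of one way a chunker could honor them. `nextChunk` is a hypothetical helper, and the rules it applies (wait until the minimum is buffered, prefer the last sentence boundary inside the maximum window, hard-split only when no boundary is usable) are assumptions for illustration, not the library's actual algorithm.

```typescript
interface ChunkSizeLimits {
  minChunkSize: number;
  maxChunkSize: number;
}

// Hypothetical helper: take the next speakable chunk off the front of a
// streaming text buffer, or return null if more text should be awaited.
function nextChunk(
  buffer: string,
  limits: ChunkSizeLimits,
): { chunk: string; rest: string } | null {
  const { minChunkSize, maxChunkSize } = limits;
  if (buffer.length < minChunkSize) return null; // too little text so far

  // Find the last sentence-ending punctuation within the max-size window.
  const window = buffer.slice(0, maxChunkSize);
  let cut = -1;
  for (let i = window.length - 1; i >= 0; i--) {
    if (".!?".includes(window.charAt(i))) {
      cut = i + 1;
      break;
    }
  }

  if (cut < minChunkSize) {
    // No boundary that yields a large enough chunk (assumed policy).
    if (buffer.length < maxChunkSize) return null; // keep waiting
    cut = maxChunkSize; // hard split at the limit
  }
  return { chunk: buffer.slice(0, cut).trim(), rest: buffer.slice(cut) };
}
```

Under these assumptions, a smaller `minChunkSize` lowers time-to-first-audio, while a smaller `maxChunkSize` produces more, shorter TTS requests.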

HistoryConfig

Configuration for conversation history memory management and automatic trimming.
maxMessages
number
required
Maximum number of messages to keep in history. When exceeded, the oldest messages are trimmed in pairs (user + assistant). Set to 0 for unlimited. Default: 100
maxTotalChars
number
required
Maximum total character count across all messages. When exceeded, the oldest messages are trimmed. Set to 0 for unlimited. Default: 0 (unlimited)
interface HistoryConfig {
  maxMessages: number;
  maxTotalChars: number;
}

// Default configuration
const DEFAULT_HISTORY_CONFIG: HistoryConfig = {
  maxMessages: 100,
  maxTotalChars: 0, // unlimited by default
};
Usage:
const agent = new VoiceAgent({
  model: openai("gpt-4o"),
  history: {
    maxMessages: 50,       // keep last 50 messages
    maxTotalChars: 100_000, // trim when total exceeds 100k chars
  },
});
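The trimming rules described above can be sketched roughly as follows. `trimHistory` and the `Message` shape are hypothetical, and the order in which the two limits are applied is an assumption; the library's internal implementation may differ.

```typescript
interface HistoryConfig {
  maxMessages: number;
  maxTotalChars: number;
}

// Hypothetical message shape, for illustration only.
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical helper: return a trimmed copy of the history.
function trimHistory(history: Message[], config: HistoryConfig): Message[] {
  const trimmed = [...history];
  const totalChars = (): number =>
    trimmed.reduce((sum, m) => sum + m.content.length, 0);

  // A limit of 0 means "unlimited", so each loop is skipped in that case.
  while (config.maxMessages > 0 && trimmed.length > config.maxMessages) {
    trimmed.splice(0, 2); // drop the oldest user + assistant pair
  }
  while (
    config.maxTotalChars > 0 &&
    trimmed.length > 0 &&
    totalChars() > config.maxTotalChars
  ) {
    trimmed.shift(); // drop the oldest message
  }
  return trimmed;
}
```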

StopWhenCondition

Type for defining when the LLM stream should stop during multi-step tool execution.
type StopWhenCondition = NonNullable<Parameters<typeof streamText>[0]["stopWhen"]>;
Common values:
import { stepCountIs } from "ai";

const agent = new VoiceAgent({
  model: openai("gpt-4o"),
  stopWhen: stepCountIs(5), // Stop after 5 tool execution steps
});

Video Agent Types

VideoFrame

Video frame data structure sent to/from the client for vision analysis.
type
'video_frame'
required
Message type identifier
sessionId
string
required
Unique session identifier for this video agent instance
sequence
number
required
Sequential frame number (increments with each frame)
timestamp
number
required
Unix timestamp (milliseconds) when the frame was captured
triggerReason
FrameTriggerReason
required
Reason why this frame was captured
previousFrameRef
string
Hash reference to the previous frame for context
image
object
required
Image data and metadata
data
string
required
Base64-encoded image data
format
string
required
Image format (e.g., “webp”, “jpeg”, “png”)
width
number
required
Image width in pixels
height
number
required
Image height in pixels
interface VideoFrame {
  type: "video_frame";
  sessionId: string;
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  previousFrameRef?: string;
  image: {
    data: string;
    format: string;
    width: number;
    height: number;
  };
}
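Since frames typically arrive as parsed JSON, a runtime check can be useful before treating a message as a `VideoFrame`. The type guard below is a minimal sketch; `isVideoFrame` is a hypothetical helper and is not exported by the library.

```typescript
type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";

interface VideoFrame {
  type: "video_frame";
  sessionId: string;
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  previousFrameRef?: string;
  image: { data: string; format: string; width: number; height: number };
}

const TRIGGER_REASONS: readonly string[] = [
  "scene_change",
  "user_request",
  "timer",
  "initial",
];

// Hypothetical type guard: checks every required field of VideoFrame.
function isVideoFrame(value: unknown): value is VideoFrame {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;

  const image = v.image;
  if (typeof image !== "object" || image === null) return false;
  const img = image as Record<string, unknown>;

  const reason = v.triggerReason;
  const prev = v.previousFrameRef;
  return (
    v.type === "video_frame" &&
    typeof v.sessionId === "string" &&
    typeof v.sequence === "number" &&
    typeof v.timestamp === "number" &&
    typeof reason === "string" &&
    TRIGGER_REASONS.includes(reason) &&
    (prev === undefined || typeof prev === "string") &&
    typeof img.data === "string" &&
    typeof img.format === "string" &&
    typeof img.width === "number" &&
    typeof img.height === "number"
  );
}
```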

AudioData

Audio data structure for WebSocket communication.
type
'audio'
required
Message type identifier
sessionId
string
required
Unique session identifier
data
string
required
Base64-encoded audio data
format
string
required
Audio format (e.g., “mp3”, “opus”, “wav”, “webm”)
sampleRate
number
Audio sample rate in Hz (e.g., 16000, 44100)
duration
number
Audio duration in milliseconds
timestamp
number
required
Unix timestamp (milliseconds) when the audio was recorded
interface AudioData {
  type: "audio";
  sessionId: string;
  data: string;
  format: string;
  sampleRate?: number;
  duration?: number;
  timestamp: number;
}

VideoAgentConfig

Backend configuration for video processing behavior.
maxContextFrames
number
required
Maximum number of frames to keep in the context buffer for conversation history. Default: 10
interface VideoAgentConfig {
  maxContextFrames: number;
}

const DEFAULT_VIDEO_AGENT_CONFIG: VideoAgentConfig = {
  maxContextFrames: 10,
};
Usage:
const videoAgent = new VideoAgent({
  model: openai("gpt-4o"), // Vision-enabled model required
  maxContextFrames: 15,
});

// Update config at runtime
videoAgent.updateConfig({ maxContextFrames: 20 });

FrameContext

Frame context for maintaining visual conversation history.
sequence
number
required
Frame sequence number
timestamp
number
required
Unix timestamp (milliseconds) of frame capture
triggerReason
FrameTriggerReason
required
Why this frame was captured
frameHash
string
required
Unique hash identifying this frame
description
string
Optional text description of the frame content
interface FrameContext {
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  frameHash: string;
  description?: string;
}
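To illustrate how `maxContextFrames` might bound a list of `FrameContext` entries, the sketch below appends a new entry and drops the oldest ones beyond the limit. `pushFrameContext` is a hypothetical helper; the eviction policy (oldest first) is an assumption.

```typescript
type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";

interface FrameContext {
  sequence: number;
  timestamp: number;
  triggerReason: FrameTriggerReason;
  frameHash: string;
  description?: string;
}

// Hypothetical helper: return a new buffer containing the new frame and at
// most `maxContextFrames` entries, evicting the oldest entries first.
function pushFrameContext(
  buffer: readonly FrameContext[],
  frame: FrameContext,
  maxContextFrames: number,
): FrameContext[] {
  const next = [...buffer, frame];
  return next.length > maxContextFrames
    ? next.slice(next.length - maxContextFrames)
    : next;
}
```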

FrameTriggerReason

String literal union of the reasons why a frame was captured.
type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";
scene_change
string
Frame captured due to detected scene change in video
user_request
string
Frame captured because user sent a query or request
timer
string
Frame captured on a timer interval
initial
string
First frame captured when video stream starts
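Because `FrameTriggerReason` is a string literal union, a `switch` with a `never` check lets the compiler flag any value that is not handled. The function below is an illustrative sketch; `describeTrigger` and its return strings are hypothetical.

```typescript
type FrameTriggerReason = "scene_change" | "user_request" | "timer" | "initial";

// Hypothetical helper: map each trigger reason to a log-friendly label.
function describeTrigger(reason: FrameTriggerReason): string {
  switch (reason) {
    case "scene_change":
      return "scene change detected";
    case "user_request":
      return "user sent a query";
    case "timer":
      return "timer interval elapsed";
    case "initial":
      return "first frame of the stream";
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // member is added to the union without a matching case above.
      const unreachable: never = reason;
      throw new Error(`Unhandled trigger reason: ${String(unreachable)}`);
    }
  }
}
```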

Constants

DEFAULT_MAX_AUDIO_SIZE

Default maximum audio input size in bytes.
const DEFAULT_MAX_AUDIO_SIZE = 10 * 1024 * 1024; // 10 MB

DEFAULT_MAX_FRAME_SIZE

Default maximum frame input size in bytes for video agents.
const DEFAULT_MAX_FRAME_SIZE = 5 * 1024 * 1024; // 5 MB
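Both `AudioData.data` and `VideoFrame.image.data` are Base64 strings, so a byte limit can be checked without decoding the payload. The sketch below assumes standard padded Base64 and assumes the limits apply to the decoded size; `base64DecodedSize` and `withinLimit` are hypothetical helpers.

```typescript
const DEFAULT_MAX_AUDIO_SIZE = 10 * 1024 * 1024; // 10 MB
const DEFAULT_MAX_FRAME_SIZE = 5 * 1024 * 1024; // 5 MB

// Hypothetical helper: decoded byte length of a padded Base64 string.
// Every 4 Base64 characters encode 3 bytes, minus one byte per '=' pad.
function base64DecodedSize(data: string): number {
  const padding = data.endsWith("==") ? 2 : data.endsWith("=") ? 1 : 0;
  return Math.floor((data.length * 3) / 4) - padding;
}

// Hypothetical helper: true if the decoded payload fits within maxBytes.
function withinLimit(data: string, maxBytes: number): boolean {
  return base64DecodedSize(data) <= maxBytes;
}
```

For example, `withinLimit(frame.image.data, DEFAULT_MAX_FRAME_SIZE)` could reject an oversized frame before any decoding work is done.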

