Core Interfaces
SpeechChunk
Represents a chunk of text to be converted to speech during streaming text-to-speech generation.
- Unique identifier for this speech chunk in the generation queue
- The text content to be converted to speech
- Promise that resolves to the generated audio data, or null if generation fails
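To make the chunk lifecycle concrete, here is a minimal TypeScript sketch. The field names (`id`, `text`, `audio`) and the `toChunks` helper are assumptions for illustration, not the library's actual API; the `.catch(() => null)` mapping mirrors the documented "null if generation fails" behavior:

```typescript
// Hypothetical shape of a SpeechChunk; the real field names may differ.
interface SpeechChunk {
  id: string;                        // unique identifier in the generation queue
  text: string;                      // text content to convert to speech
  audio: Promise<Uint8Array | null>; // resolves to audio data, or null on failure
}

// Illustrative chunking: accumulate sentences until a minimum size is reached,
// then enqueue a chunk and start TTS for it immediately.
function toChunks(
  text: string,
  minChars: number,
  tts: (t: string) => Promise<Uint8Array>
): SpeechChunk[] {
  const sentences = text.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: SpeechChunk[] = [];
  let buffer = "";
  const push = () => {
    chunks.push({
      id: `chunk-${chunks.length}`,
      text: buffer.trim(),
      audio: tts(buffer).catch(() => null), // null if generation fails
    });
    buffer = "";
  };
  for (const s of sentences) {
    buffer += s;
    if (buffer.length >= minChars) push();
  }
  if (buffer.trim()) push();
  return chunks;
}
```

Starting TTS as soon as each chunk is cut is what lets playback begin before the full LLM response has streamed in.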
StreamingSpeechConfig
Configuration for streaming speech behavior and parallel TTS generation.
- Minimum characters before generating speech for a chunk. Default: 50
- Maximum characters per chunk; text will be split at a sentence boundary before reaching this limit. Default: 200
- Whether to enable parallel TTS generation for multiple chunks. Default: true
- Maximum number of parallel TTS requests allowed at once. Default: 3

HistoryConfig
Configuration for conversation history memory management and automatic trimming.
- Maximum number of messages to keep in history. When exceeded, the oldest messages are trimmed in pairs (user + assistant). Set to 0 for unlimited. Default: 100
- Maximum total character count across all messages. When exceeded, the oldest messages are trimmed. Set to 0 for unlimited. Default: 0 (unlimited)

StopWhenCondition
Type for defining when the LLM stream should stop during multi-step tool execution.

Video Agent Types
VideoFrame
Video frame data structure sent to/from the client for vision analysis.
- Message type identifier
- Unique session identifier for this video agent instance
- Sequential frame number (increments with each frame)
- Unix timestamp (milliseconds) when the frame was captured
- Reason why this frame was captured
- Hash reference to the previous frame for context
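A rough TypeScript sketch of what such a message might look like; all field names here (`sessionId`, `frameNumber`, `previousFrameHash`, etc.) are illustrative assumptions, not the documented wire format:

```typescript
// Hypothetical VideoFrame message; field names are illustrative only.
interface VideoFrame {
  type: string;               // message type identifier
  sessionId: string;          // unique session identifier for this agent instance
  frameNumber: number;        // sequential, increments with each frame
  timestamp: number;          // Unix timestamp (ms) when the frame was captured
  triggerReason: string;      // why the frame was captured
  previousFrameHash?: string; // hash reference to the previous frame
}

// Deriving the next frame in a sequence from the previous one.
function nextFrame(prev: VideoFrame, prevHash: string, reason: string): VideoFrame {
  return {
    ...prev,
    frameNumber: prev.frameNumber + 1,
    timestamp: Date.now(),
    triggerReason: reason,
    previousFrameHash: prevHash,
  };
}
```

The back-reference to the previous frame's hash is what lets the backend reconstruct visual context across a sequence of captures.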
AudioData
Audio data structure for WebSocket communication.
- Message type identifier
- Unique session identifier
- Base64-encoded audio data
- Audio format (e.g., “mp3”, “opus”, “wav”, “webm”)
- Audio sample rate in Hz (e.g., 16000, 44100)
- Audio duration in milliseconds
- Unix timestamp (milliseconds) when the audio was recorded
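Sketched in TypeScript (the field names are assumptions; the actual message schema may differ), along with the standard arithmetic for sanity-checking that payload size, sample rate, and duration agree for raw 16-bit mono PCM:

```typescript
// Hypothetical AudioData message; field names are illustrative only.
interface AudioData {
  type: string;       // message type identifier
  sessionId: string;  // unique session identifier
  audio: string;      // base64-encoded audio bytes
  format: "mp3" | "opus" | "wav" | "webm";
  sampleRate: number; // Hz, e.g. 16000 or 44100
  duration: number;   // milliseconds
  timestamp: number;  // Unix ms when the audio was recorded
}

// Standard duration math for raw 16-bit mono PCM:
// bytes / 2 bytes-per-sample / sampleRate, converted to milliseconds.
function pcmDurationMs(byteLength: number, sampleRate: number): number {
  return (byteLength / 2 / sampleRate) * 1000;
}
```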
VideoAgentConfig
Backend configuration for video processing behavior.
- Maximum frames to keep in context buffer for conversation history. Default: 10

FrameContext
Frame context for maintaining visual conversation history.
- Frame sequence number
- Unix timestamp (milliseconds) of frame capture
- Why this frame was captured
- Unique hash identifying this frame
- Optional text description of the frame content
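A small sketch of how such a context buffer might be capped, using the default of 10 frames from VideoAgentConfig; the field names and the `pushFrame` helper are assumptions for illustration:

```typescript
// Hypothetical FrameContext entry; field names are illustrative only.
interface FrameContext {
  frameNumber: number;   // frame sequence number
  timestamp: number;     // Unix ms of frame capture
  triggerReason: string; // why this frame was captured
  frameHash: string;     // unique hash identifying this frame
  description?: string;  // optional text description of the content
}

// Append a frame, dropping the oldest entries once the buffer exceeds
// maxFrames (the documented default is 10).
function pushFrame(
  buffer: FrameContext[],
  frame: FrameContext,
  maxFrames = 10
): FrameContext[] {
  const next = [...buffer, frame];
  return next.length > maxFrames ? next.slice(next.length - maxFrames) : next;
}
```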
FrameTriggerReason
Enumeration of reasons why a frame was captured.
- Frame captured due to detected scene change in video
- Frame captured because user sent a query or request
- Frame captured on a timer interval
- First frame captured when video stream starts
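In TypeScript this could be modeled as a string enum; the member names and string values below are guesses for illustration, not the library's actual identifiers:

```typescript
// Hypothetical FrameTriggerReason values; actual names/values may differ.
enum FrameTriggerReason {
  SceneChange = "scene_change", // detected scene change in the video
  UserQuery = "user_query",     // user sent a query or request
  Interval = "interval",        // timer interval elapsed
  StreamStart = "stream_start", // first frame when the stream starts
}
```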
Constants
DEFAULT_MAX_AUDIO_SIZE
Default maximum audio input size in bytes.

DEFAULT_MAX_FRAME_SIZE
Default maximum frame input size in bytes for video agents.

Related
Events
Learn about all events emitted by agents
VoiceAgent
Voice agent class reference
VideoAgent
Video agent class reference