
Overview

Both VoiceAgent and VideoAgent accept comprehensive configuration options for models, speech synthesis, transcription, history management, and more. This guide covers all available configuration fields with their types, defaults, and usage examples.

VoiceAgent Configuration

Required Options

model
LanguageModel
required
AI SDK language model for chat generation. Use any AI SDK provider model.
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';

model: openai('gpt-4o')
model: anthropic('claude-3-5-sonnet-latest')

Optional Models

transcriptionModel
TranscriptionModel
AI SDK transcription model for speech-to-text. Required for audio input processing.
transcriptionModel: openai.transcription('whisper-1')
speechModel
SpeechModel
AI SDK speech model for text-to-speech generation.
speechModel: openai.speech('gpt-4o-mini-tts')

System Configuration

instructions
string
default:"You are a helpful voice assistant."
System prompt that defines the agent’s behavior and personality.
instructions: `You are a helpful voice assistant. 
Keep responses concise and conversational since they will be spoken aloud.
Use tools when needed to provide accurate information.`
stopWhen
StopWhenCondition
default:"stepCountIs(5)"
Stopping condition for multi-step tool execution loops. Controls when the agent stops calling tools.
import { stepCountIs } from 'ai';

stopWhen: stepCountIs(3)  // Stop after 3 tool execution steps
tools
Record<string, Tool>
default:"{}"
AI SDK tools map for function calling. See Tool Integration for details.
import { tool } from 'ai';
import { z } from 'zod';

tools: {
  getWeather: tool({
    description: 'Get weather for a location',
    inputSchema: z.object({
      location: z.string()
    }),
    execute: async ({ location }) => ({ temperature: 72 })
  })
}

Speech Configuration

voice
string
default:"alloy"
TTS voice ID. For OpenAI: alloy, echo, fable, onyx, nova, shimmer.
voice: 'nova'
speechInstructions
string
Style instructions passed to the speech model for tone and delivery.
speechInstructions: 'Speak in a friendly, natural conversational tone.'
outputFormat
string
default:"mp3"
Audio output format: mp3, opus, aac, wav, pcm, etc.
outputFormat: 'opus'
streamingSpeech
Partial<StreamingSpeechConfig>
Fine-tune streaming TTS behavior for low-latency audio.

StreamingSpeechConfig Fields:
  • minChunkSize (number, default: 50): Minimum characters before generating speech
  • maxChunkSize (number, default: 200): Maximum characters per chunk (splits at sentence boundaries)
  • parallelGeneration (boolean, default: true): Generate TTS for upcoming chunks while the current chunk plays
  • maxParallelRequests (number, default: 3): Maximum concurrent TTS requests
streamingSpeech: {
  minChunkSize: 40,
  maxChunkSize: 180,
  parallelGeneration: true,
  maxParallelRequests: 2
}

History Management

history
Partial<HistoryConfig>
Configure conversation history limits to manage memory and context window usage.

HistoryConfig Fields:
  • maxMessages (number, default: 100): Maximum messages kept in history (0 = unlimited)
  • maxTotalChars (number, default: 0): Maximum total characters across all messages (0 = unlimited)
history: {
  maxMessages: 50,        // Keep last 50 messages
  maxTotalChars: 100_000  // Or trim when total exceeds 100k chars
}
When limits are exceeded, oldest messages are trimmed in pairs (user + assistant) to preserve conversation turns. The agent emits a history_trimmed event with details.
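To illustrate the pair-wise behavior, here is a self-contained sketch of the trimming rule (not the library's actual implementation): when the message count exceeds the limit, the oldest user + assistant pair is dropped together so conversation turns stay intact.

```typescript
// Illustrative sketch of pair-wise history trimming, not the library's code.
type Message = { role: 'user' | 'assistant'; content: string };

function trimHistory(messages: Message[], maxMessages: number): Message[] {
  if (maxMessages === 0) return messages; // 0 = unlimited
  const trimmed = [...messages];
  while (trimmed.length > maxMessages) {
    trimmed.splice(0, 2); // drop the oldest user + assistant pair together
  }
  return trimmed;
}

const history: Message[] = [
  { role: 'user', content: 'hi' },
  { role: 'assistant', content: 'hello' },
  { role: 'user', content: 'weather?' },
  { role: 'assistant', content: 'sunny' },
  { role: 'user', content: 'thanks' },
  { role: 'assistant', content: 'welcome' },
];

console.log(trimHistory(history, 4).map((m) => m.content));
// → [ 'weather?', 'sunny', 'thanks', 'welcome' ]
```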

Size Limits

maxAudioInputSize
number
default:"10485760"
Maximum audio input size in bytes (default: 10 MB). Rejects larger audio inputs.
maxAudioInputSize: 5 * 1024 * 1024  // 5 MB limit

WebSocket Configuration

endpoint
string
Default WebSocket URL for connect() method. Optional for text-only usage.
endpoint: 'ws://localhost:8080'
endpoint: process.env.VOICE_WS_ENDPOINT
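A minimal usage sketch: configure the endpoint once, then call connect(). The zero-argument connect() call here is an assumption based on the description above; check the API reference for the exact signature.

```typescript
import { VoiceAgent } from 'voice-agent-ai-sdk';
import { openai } from '@ai-sdk/openai';

const agent = new VoiceAgent({
  model: openai('gpt-4o'),
  endpoint: 'ws://localhost:8080',
});

// connect() is assumed to fall back to the configured endpoint
// when called without arguments.
await agent.connect();
```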

VideoAgent Configuration

VideoAgent extends VoiceAgent with video-specific options:
maxFrameInputSize
number
default:"5242880"
Maximum frame input size in bytes (default: 5 MB).
maxFrameInputSize: 10 * 1024 * 1024  // 10 MB
maxContextFrames
number
default:"10"
Maximum number of frames to keep in context buffer for visual conversation history.
maxContextFrames: 20
sessionId
string
Session ID for this video agent instance. Auto-generated if not provided.
sessionId: 'custom-session-id'
Important: VideoAgent requires a vision-enabled model to process video frames:
  • OpenAI: gpt-4o, gpt-4o-mini
  • Anthropic: claude-3-5-sonnet, claude-3-opus
  • Google: gemini-1.5-pro, gemini-1.5-flash
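For example, a VideoAgent wired to a vision-capable model, using only the options documented above (a sketch, not a complete setup):

```typescript
import { VideoAgent } from 'voice-agent-ai-sdk';
import { openai } from '@ai-sdk/openai';

const videoAgent = new VideoAgent({
  model: openai('gpt-4o'),          // vision-enabled model, required for frames
  transcriptionModel: openai.transcription('whisper-1'),
  speechModel: openai.speech('gpt-4o-mini-tts'),
  maxFrameInputSize: 10 * 1024 * 1024, // 10 MB per frame
  maxContextFrames: 20,                // visual history buffer
});
```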

Complete Example

import { VoiceAgent } from 'voice-agent-ai-sdk';
import { openai } from '@ai-sdk/openai';
import { stepCountIs, tool } from 'ai';
import { z } from 'zod';

const agent = new VoiceAgent({
  // === Required ===
  model: openai('gpt-4o'),
  
  // === Models ===
  transcriptionModel: openai.transcription('whisper-1'),
  speechModel: openai.speech('gpt-4o-mini-tts'),
  
  // === System ===
  instructions: `You are a helpful voice assistant.
  Keep responses concise and conversational.
  Use tools when needed to provide accurate information.`,
  stopWhen: stepCountIs(5),
  
  // === Tools ===
  tools: {
    getWeather: tool({
      description: 'Get weather in a location',
      inputSchema: z.object({ location: z.string() }),
      execute: async ({ location }) => ({
        location,
        temperature: 72,
        conditions: 'sunny'
      })
    })
  },
  
  // === Speech ===
  voice: 'alloy',
  speechInstructions: 'Speak naturally and conversationally.',
  outputFormat: 'mp3',
  streamingSpeech: {
    minChunkSize: 40,
    maxChunkSize: 180,
    parallelGeneration: true,
    maxParallelRequests: 2
  },
  
  // === History Management ===
  history: {
    maxMessages: 50,
    maxTotalChars: 100_000
  },
  
  // === Limits ===
  maxAudioInputSize: 5 * 1024 * 1024,  // 5 MB
  
  // === WebSocket ===
  endpoint: process.env.VOICE_WS_ENDPOINT
});

Runtime Configuration Updates

Updating Tools

You can add or update tools after initialization:
agent.registerTools({
  getTime: tool({
    description: 'Get current time',
    inputSchema: z.object({}),
    execute: async () => ({ time: new Date().toISOString() })
  })
});

Updating VideoAgent Config

VideoAgent supports runtime config updates:
const videoAgent = new VideoAgent({ /* ... */ });

// Get current config
const config = videoAgent.getConfig();
console.log(config.maxContextFrames);

// Update config
videoAgent.updateConfig({
  maxContextFrames: 20
});

// Listen for config changes
videoAgent.on('config_changed', (newConfig) => {
  console.log('Config updated:', newConfig);
});

Environment Variables Pattern

A common pattern is to use environment variables for sensitive configuration:
import 'dotenv/config';
import { openai } from '@ai-sdk/openai';

const agent = new VoiceAgent({
  model: openai(process.env.OPENAI_MODEL || 'gpt-4o'),
  transcriptionModel: openai.transcription('whisper-1'),
  speechModel: openai.speech('gpt-4o-mini-tts'),
  endpoint: process.env.VOICE_WS_ENDPOINT,
  // ... other options
});

Next Steps

Tool Integration

Learn how to integrate AI SDK tools for function calling

Browser Client

Build a real-time voice interface in the browser
