What is Voice Agent AI SDK?
Voice Agent AI SDK is an npm package that combines:
- Streaming text generation via AI SDK with multi-step tool calling
- Real-time speech synthesis with chunked streaming for low time-to-first-audio
- Audio transcription using models like OpenAI Whisper
- WebSocket transport for bidirectional voice communication
- Barge-in support for natural conversation interruptions
This SDK is published to npm as voice-agent-ai-sdk and can be used in any Node.js 20+ environment.
Who is this for?
This SDK is ideal for developers building:
- Voice-enabled chatbots and virtual assistants
- Customer service automation with voice support
- Interactive voice response (IVR) systems
- Real-time AI voice applications with tool calling
- Server-side text-to-speech streaming services
Key Capabilities
Streaming Text & Speech
Text is split at sentence boundaries and converted to speech in parallel as the LLM streams. This keeps time-to-first-audio low, because playback can begin before the full response has been generated.
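The sentence-level pipelining described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SDK's actual implementation: splitSentences, speakStream, and the synthesize callback are hypothetical names, and the boundary rule (., ! or ? followed by whitespace) is deliberately naive.

```typescript
// Hypothetical helpers for illustration; not part of voice-agent-ai-sdk's API.

// Buffers streamed text chunks and yields each complete sentence as soon as
// its boundary arrives. Naive rule: ., ! or ? followed by whitespace.
async function* splitSentences(
  chunks: AsyncIterable<string>
): AsyncGenerator<string> {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk;
    const boundary = /[.!?]+\s+/g;
    let match: RegExpExecArray | null;
    let start = 0;
    while ((match = boundary.exec(buffer)) !== null) {
      const end = match.index + match[0].length;
      yield buffer.slice(start, end).trim();
      start = end;
    }
    // Keep the unfinished tail for the next chunk.
    buffer = buffer.slice(start);
  }
  const rest = buffer.trim();
  if (rest) yield rest; // flush whatever remains when the stream ends
}

// Assumed TTS callback: takes a sentence, resolves to encoded audio.
type Synthesize = (sentence: string) => Promise<Uint8Array>;

// Starts synthesis for every sentence immediately (so requests overlap),
// but hands audio to onAudio strictly in sentence order.
async function speakStream(
  chunks: AsyncIterable<string>,
  synthesize: Synthesize,
  onAudio: (audio: Uint8Array) => void
): Promise<void> {
  let delivered: Promise<void> = Promise.resolve();
  for await (const sentence of splitSentences(chunks)) {
    const audio = synthesize(sentence); // not awaited here: runs in parallel
    delivered = delivered.then(async () => onAudio(await audio));
  }
  await delivered;
}
```

The first sentence's audio can be delivered while later sentences are still being generated and synthesized. A production splitter would also need to handle abbreviations, decimals, and text that never contains sentence punctuation.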
Tool Calling
Full support for AI SDK tools with multi-step execution. Define functions the agent can call to fetch data, perform actions, or integrate with external APIs.
Barge-in & Interruption
User speech automatically cancels in-flight LLM streams and pending TTS, saving tokens and reducing latency for natural conversation flow.
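One common way to implement this kind of cancellation is with the standard AbortController. The sketch below shows the general pattern only: TurnController and respond are hypothetical names, and how the SDK wires cancellation internally is not documented here. The AI SDK's streamText accepts an abortSignal option, which is presumably where such a signal would be passed in a real agent.

```typescript
// Hypothetical illustration of barge-in; not voice-agent-ai-sdk's actual API.

// Owns the AbortController for the turn currently being answered.
class TurnController {
  private current: AbortController | null = null;

  // Begin a new turn, cancelling any turn that is still running.
  begin(): AbortSignal {
    this.current?.abort();
    this.current = new AbortController();
    return this.current.signal;
  }

  // Call this when user speech is detected mid-response.
  interrupt(): void {
    this.current?.abort();
    this.current = null;
  }
}

// Consumes a token stream and speaks each token until the signal is aborted.
// Breaking out of for-await also closes the underlying iterator, so a
// generator-backed source stops producing. Returns the number of tokens spoken.
async function respond(
  signal: AbortSignal,
  tokens: AsyncIterable<string>,
  speak: (text: string) => void
): Promise<number> {
  let spoken = 0;
  for await (const token of tokens) {
    if (signal.aborted) break;
    speak(token);
    spoken++;
  }
  return spoken;
}
```

Because each turn gets its own signal, an interrupted response cannot leak audio into the next turn: anything still holding the old signal sees it as aborted.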
Memory Management
Configurable sliding-window conversation history with maxMessages and maxTotalChars limits, plus audio input size constraints.
WebSocket Transport
Built-in WebSocket protocol with stream, tool, and speech lifecycle events. Also works without WebSocket for text-only or server-side usage.
Graceful Lifecycle
disconnect() aborts all in-flight work cleanly; destroy() permanently releases every resource. A serial request queue prevents race conditions.
Architecture
The SDK is designed around a single VoiceAgent instance per user session.
Use Cases
Text-Only Mode
Use sendText() directly for server-side applications, chatbots, or any scenario where you don't need real-time audio streaming.
WebSocket Voice Mode
Connect to a WebSocket endpoint with connect() or handleSocket() to enable bidirectional voice communication with automatic transcription and speech synthesis.
Hybrid Applications
Combine both modes: use text input for some interactions and audio for others. The agent handles both input types.
Next Steps
Installation
Install the SDK and set up your development environment
Quickstart
Build your first voice agent in 5 minutes