Overview
The Voice Agent enables real-time voice conversations with AI characters using OpenAI's GPT-4o Realtime API. It supports bidirectional audio streaming over WebSockets, with character personalities and tool integrations.
Architecture
The voice agent uses:
- OpenAI Realtime API (gpt-4o-realtime-preview) for voice processing
- WebSockets for bidirectional audio streaming
- Starlette as the web framework
- LangChain OpenAI Voice wrapper for agent orchestration
Server Implementation
The voice server is built with Starlette and provides WebSocket endpoints; see server/src/server/app.py.
Character-Specific Endpoints
Create dedicated endpoints for different characters; see server/src/server/app.py.
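One way to derive a dedicated path per character; the CHARACTERS mapping, its contents, and the helper name are hypothetical, not the actual source:

```python
# Illustrative: map each character to its own WebSocket path.
CHARACTERS = {
    "satoshi": "A calm, precise crypto historian.",
    "ada": "A warm, curious math tutor.",
}

def make_character_endpoints(characters: dict) -> dict:
    """Map a dedicated path (e.g. /ws/satoshi) to that character's personality."""
    return {f"/ws/{name}": personality for name, personality in characters.items()}
```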
Voice Instructions
Character personalities are injected into the voice agent prompt; see server/src/server/prompt.py.
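A sketch of the injection idea; the template text here is illustrative, not the project's actual prompt:

```python
# Hypothetical shape of server/src/server/prompt.py: the character's
# personality is formatted into a base set of voice instructions.
BASE_PROMPT = (
    "You are {name}. {personality} "
    "You are on a live voice call, so keep responses short and conversational."
)

def build_instructions(name: str, personality: str) -> str:
    return BASE_PROMPT.format(name=name, personality=personality)
```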
Health Monitoring
The server includes health checks for load balancing; see server/src/server/app.py.
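A minimal sketch of a health payload a load balancer could poll; the field names and the capacity rule are assumptions:

```python
# Illustrative health-check body: report capacity so a load balancer can
# route new sessions away from full instances.
import json

def health_payload(active: int, max_connections: int = 10) -> str:
    return json.dumps({
        "status": "ok" if active < max_connections else "at_capacity",
        "active_connections": active,
    })
```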
Voice Options
OpenAI Realtime API supports multiple voices:
- alloy: Neutral, balanced tone
- ash: Deep, calm voice
- ballad: Warm, friendly tone
- coral: Bright, energetic voice
- echo: Clear, professional tone
- sage: Wise, measured voice
- shimmer: Soft, gentle tone
- verse: Dynamic, expressive voice
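The voice is chosen per session. The session.update event shape below follows the Realtime API, but treat the exact set of fields as a sketch:

```python
# Build a Realtime API session.update payload selecting one of the voices above.
VOICES = {"alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse"}

def session_update(voice: str) -> dict:
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    return {
        "type": "session.update",
        "session": {
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
        },
    }
```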
Configuration
Set up your voice agent server's environment in .env.
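A sketch of the file: OPENAI_API_KEY and MAX_CONNECTIONS_PER_INSTANCE appear elsewhere in this doc, while PORT and the default value of 10 are assumptions.

```bash
# .env (values are placeholders)
OPENAI_API_KEY=your-openai-api-key
MAX_CONNECTIONS_PER_INSTANCE=10
PORT=3000
```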
Running the Server
Start the voice agent server:
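The exact command depends on the project's entry point; a typical uvicorn invocation might look like:

```shell
# Assumed module path; adjust to your project's layout.
uvicorn server.app:app --host 0.0.0.0 --port 3000
```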
The server listens at http://0.0.0.0:3000 with WebSocket endpoints ready.
Client Integration
WebSocket Connection
Connect from a web browser; see client.js.
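A hypothetical client.js sketch: the /ws path, message handling, and the playAudioChunk helper are assumptions, not the project's actual client code:

```javascript
// Connect to the voice server and hand incoming audio to playback code.
function connectVoiceAgent(url) {
  const ws = new WebSocket(url || "ws://localhost:3000/ws");
  ws.binaryType = "arraybuffer"; // audio arrives as binary PCM16 frames
  ws.onopen = () => console.log("voice agent connected");
  ws.onmessage = (event) => {
    // Hand each PCM16 chunk to your playback code (e.g. the Web Audio API).
    playAudioChunk(event.data);
  };
  ws.onclose = () => console.log("disconnected; consider reconnecting");
  return ws;
}
```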
Audio Format
The agent expects audio in PCM16 format at a 24 kHz sample rate.
Tool Integration
Voice agents can use the same tools as chat agents; see server/tools.py.
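A hypothetical sketch of server/tools.py: tools as plain functions in a registry shared between chat and voice agents. get_token_price and its stubbed data are illustrative stand-ins, not the project's actual tools:

```python
# Illustrative shared tool: both chat and voice agents can call entries
# from the same registry.
def get_token_price(symbol: str) -> str:
    """Return a price quote for a token symbol (stubbed for illustration)."""
    prices = {"ETH": "2500.00", "BTC": "65000.00"}  # placeholder data
    return prices.get(symbol.upper(), "unknown")

TOOLS = {"get_token_price": get_token_price}
```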
Load Balancing
The server limits concurrent connections to maintain performance.
Deployment
Docker
Create a Dockerfile for the voice server:
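A hypothetical Dockerfile sketch; the package layout and the start command are assumptions about this project:

```dockerfile
# Assumed layout: installable Python package with a Starlette app at
# server.app:app (adjust to your project).
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir .
EXPOSE 3000
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "3000"]
```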
Then build and run the image:
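Standard Docker commands; the image name is a placeholder:

```shell
docker build -t voice-agent .
docker run --env-file .env -p 3000:3000 voice-agent
```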
Cloud Deployment
For production deployment with auto-scaling, run multiple instances behind a load balancer.
Best Practices
Latency Optimization
- Use WebSocket compression
- Deploy close to users geographically
- Stream audio in small chunks (100ms)
- Minimize tool execution time
Connection Management
- Implement reconnection logic
- Handle network interruptions gracefully
- Monitor active connections
- Set appropriate timeouts
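The connection cap mentioned under Load Balancing can be sketched with an asyncio semaphore; MAX_CONNECTIONS_PER_INSTANCE mirrors the env var used in this doc, and the default of 10 plus the rejection behavior are assumptions:

```python
# Illustrative: cap concurrent voice sessions per instance.
import asyncio

MAX_CONNECTIONS_PER_INSTANCE = 10  # assumed default

connection_slots = asyncio.Semaphore(MAX_CONNECTIONS_PER_INSTANCE)

async def with_connection_slot(handler):
    """Run one voice session, rejecting it if the instance is at capacity."""
    if connection_slots.locked():  # all slots taken
        raise ConnectionError("server at capacity; retry or use another instance")
    async with connection_slots:
        await handler()
```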
Audio Quality
- Use 24kHz sample rate minimum
- Implement noise reduction
- Handle audio buffering properly
- Test with various microphones
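The PCM16/24 kHz requirement from the Audio Format section can be illustrated with a small conversion helper: Realtime audio is 16-bit signed little-endian PCM, and float samples in [-1.0, 1.0] (e.g. from Web Audio capture) must be scaled into that range. The function name is illustrative:

```python
# Convert float audio samples to 16-bit signed little-endian PCM bytes.
import struct

SAMPLE_RATE = 24_000  # Hz

def floats_to_pcm16(samples) -> bytes:
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip before scaling
        out += struct.pack("<h", int(s * 32767))
    return bytes(out)
```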
Security
- Use WSS (WebSocket Secure) in production
- Implement authentication
- Rate limit connections
- Validate audio data size
Troubleshooting
Connection fails immediately
- Verify OPENAI_API_KEY is set correctly
- Check the WebSocket URL (ws:// for local, wss:// for production)
- Ensure firewall allows WebSocket connections
- Review browser console for CORS errors
No audio output
- Check browser microphone permissions
- Verify audio format is PCM16 at 24kHz
- Test audio playback independently
- Review WebSocket message format
High latency
- Reduce audio chunk size
- Check network bandwidth
- Monitor server CPU/memory usage
- Consider deploying closer to users
Server capacity errors
- Scale to more instances
- Increase MAX_CONNECTIONS_PER_INSTANCE
- Implement connection queuing
- Add load balancer
Example Use Cases
Crypto Advisor
Voice agent that explains DeFi concepts and executes trades
Podcast Assistant
Query podcast transcripts via voice and get audio summaries
Customer Support
Real-time voice support with blockchain transaction help
Next Steps
Chat Agent
Build text-based conversational agents
Twitter Automation
Create autonomous social media bots
Custom Tools
Add custom capabilities to your voice agent