Overview
Agentic AI is a real-time voice agent system that combines the OpenAI Realtime API (or Gemini Live), Twilio for telephony, and ClawdBot for command execution. The architecture is designed for low-latency bidirectional audio streaming with intelligent intent understanding.
Architecture Diagram
Core Components
Call Manager
The CallManager (call_manager.py:41) orchestrates the entire call lifecycle:
- Initiates outbound calls via Twilio REST API
- Handles incoming call registration
- Manages active call sessions and state
- Coordinates between Twilio, audio processing, and AI handlers
- Tracks call metadata and duration
The CallManager maintains a registry of active sessions and pending calls, allowing it to route media streams to the correct audio bridge when Twilio connects.
- initiate_call() - Start an outbound call (call_manager.py:126)
- register_incoming_call() - Handle incoming calls (call_manager.py:189)
- handle_media_stream() - Route WebSocket to audio bridge (call_manager.py:284)
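The registry behaviour described above can be pictured with a minimal sketch. This is not the project's CallManager: the names SessionStub, CallRegistry, register, attach_stream and end are hypothetical, and the sketch only assumes that a call is keyed by its Twilio call SID and moves from a pending map to an active map once its media stream connects.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SessionStub:
    """Hypothetical stand-in for a call session; fields are illustrative."""
    call_sid: str
    to_number: str
    status: str = "initiating"


@dataclass
class CallRegistry:
    """Tracks pending calls until the media stream for them connects."""
    pending: Dict[str, SessionStub] = field(default_factory=dict)
    active: Dict[str, SessionStub] = field(default_factory=dict)

    def register(self, session: SessionStub) -> None:
        # Called when a call is created or announced, before any audio flows.
        self.pending[session.call_sid] = session

    def attach_stream(self, call_sid: str) -> Optional[SessionStub]:
        """Promote a pending call to active when its media stream connects."""
        session = self.pending.pop(call_sid, None)
        if session is None:
            return None  # unknown call: the caller should close the WebSocket
        session.status = "in-progress"
        self.active[call_sid] = session
        return session

    def end(self, call_sid: str, status: str = "completed") -> None:
        # Remove the call from the active map and record its final status.
        session = self.active.pop(call_sid, None)
        if session is not None:
            session.status = status
```

Keeping pending and active calls in separate maps makes it cheap to reject a media stream that arrives for a call the server never registered.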
Audio Bridge
The AudioBridge (audio_bridge.py:32) is the heart of real-time audio processing:
- Receives audio from Twilio WebSocket
- Converts audio formats using AudioConverter
- Routes audio to/from OpenAI or Gemini handlers
- Manages transcript collection
- Feeds transcripts to ConversationBrain for analysis
The AudioBridge buffers small audio chunks (~50ms) before sending to improve STT accuracy and reduce API calls.
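As a rough illustration of the buffering idea, the sketch below accumulates small frames until about 50 ms of audio is available. It is not the AudioBridge code: ChunkBuffer and its parameters are hypothetical, and the 160-byte frame size in the usage note assumes 20 ms frames of 8 kHz mulaw (one byte per sample).

```python
from typing import List


class ChunkBuffer:
    """Accumulates small audio frames and emits fixed-size chunks."""

    def __init__(self, sample_rate: int = 8000, bytes_per_sample: int = 1,
                 target_ms: int = 50) -> None:
        # 8000 samples/s * 1 byte * 50 ms = 400 bytes per emitted chunk.
        self.target_bytes = sample_rate * bytes_per_sample * target_ms // 1000
        self._buf = bytearray()

    def push(self, frame: bytes) -> List[bytes]:
        """Add a frame; return zero or more complete chunks ready to send."""
        self._buf.extend(frame)
        ready = []
        while len(self._buf) >= self.target_bytes:
            ready.append(bytes(self._buf[:self.target_bytes]))
            del self._buf[:self.target_bytes]
        return ready

    def flush(self) -> bytes:
        """Return whatever is left, for example when the call ends."""
        remainder = bytes(self._buf)
        self._buf.clear()
        return remainder
```

With 160-byte frames, the first two calls to push() return nothing and the third returns one 400-byte chunk, leaving 80 bytes buffered for the next chunk.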
Conversation Brain
The ConversationBrain (conversation_brain.py:76) provides intelligence:
- Analyzes user intent from transcripts
- Distinguishes actionable commands from casual conversation
- Routes commands to ClawdBot for execution
- Maintains conversation memory and context
- Feeds ClawdBot responses back to the AI to speak
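The routing decision in the list above might look roughly like the following. This is a sketch only: Intent, route_transcript, and the analyze and execute callables are hypothetical placeholders for the real intent analysis and the ClawdBot call, neither of which is shown here. It does reflect one behaviour stated elsewhere in this page, namely that an analysis error is treated as actionable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Intent:
    """Hypothetical result of intent analysis."""
    actionable: bool
    text: str


def route_transcript(
    text: str,
    analyze: Callable[[str], Intent],
    execute: Callable[[str], str],
) -> Optional[str]:
    """Return a reply for the AI to speak, or None for casual conversation."""
    try:
        intent = analyze(text)
    except Exception:
        # If analysis fails, fall back to treating the utterance as a command.
        intent = Intent(actionable=True, text=text)
    if not intent.actionable:
        return None  # small talk: let the voice model answer on its own
    return execute(intent.text)
```

Passing analyze and execute as callables keeps the routing logic testable without a live model or gateway connection.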
Audio Converter
The AudioConverter (audio/converter.py:25) handles all audio format transformations:
Format Support:
- Twilio: mulaw 8kHz mono
- Gemini Input: PCM 16-bit 16kHz mono
- Gemini Output: PCM 16-bit 24kHz mono
- OpenAI: PCM 16-bit 24kHz mono (both directions)
- High-quality resampling using soxr
- Efficient mulaw ↔ PCM conversion
- Reusable resampler instances for performance
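To make the mulaw ↔ PCM step concrete, here is a small pure-Python sketch of G.711 mu-law coding for single samples. It is not the AudioConverter implementation, which is not shown on this page; the function names are hypothetical, and resampling between 8 kHz, 16 kHz and 24 kHz (done with soxr according to the list above) is left out.

```python
BIAS = 0x84   # 132, added before the segment (exponent) is located
CLIP = 32635  # largest magnitude that still fits after adding the bias


def ulaw_to_linear(byte: int) -> int:
    """Decode one mu-law byte to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF            # mu-law bytes are stored bit-inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the highest set bit between bit 14 (exponent 7) and bit 7 (exponent 0).
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def decode_frame(payload: bytes) -> list:
    """Decode a whole mulaw frame into a list of PCM sample values."""
    return [ulaw_to_linear(b) for b in payload]
```

A production converter would normally use a 256-entry lookup table or a vectorised library rather than a per-sample Python loop; the loop form is used here only because it shows the bit layout.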
Data Flow
Outbound Call Flow
Intent Processing Flow
Configuration
The system uses a centralized config.yaml, loaded by the Config class (core/config.py).
Session Management
Each call is tracked as a CallSession (call_manager.py:28):
- initiating - Call being created
- ringing - Phone is ringing
- in-progress - Active conversation
- completed / failed - Call ended
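The states listed above map naturally onto an enumeration. The sketch below is illustrative rather than the actual CallSession definition: CallStatus, can_transition and is_terminal are hypothetical names, and the transition table is an assumption (forward-only, with failure possible from any non-terminal state).

```python
from enum import Enum


class CallStatus(str, Enum):
    INITIATING = "initiating"
    RINGING = "ringing"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transitions; the real session logic may allow others.
_ALLOWED = {
    CallStatus.INITIATING: {CallStatus.RINGING, CallStatus.FAILED},
    CallStatus.RINGING: {CallStatus.IN_PROGRESS, CallStatus.COMPLETED,
                         CallStatus.FAILED},
    CallStatus.IN_PROGRESS: {CallStatus.COMPLETED, CallStatus.FAILED},
    CallStatus.COMPLETED: set(),
    CallStatus.FAILED: set(),
}


def can_transition(current: CallStatus, new: CallStatus) -> bool:
    """True if moving from current to new is allowed by the assumed table."""
    return new in _ALLOWED[current]


def is_terminal(status: CallStatus) -> bool:
    """True for states with no outgoing transitions (the call has ended)."""
    return not _ALLOWED[status]
```

Because the members carry the same string values as the list above, CallStatus("in-progress") converts a raw status string into the matching member.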
WebSocket Protocols
Twilio Media Streams
Twilio sends and receives media as WebSocket messages, handled by TwilioMediaStreamHandler (twilio/websocket.py).
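For orientation, Twilio Media Streams messages are JSON objects with an "event" field, and audio travels as a base64 payload under "media". The sketch below parses an inbound message and builds an outbound one. It is not the TwilioMediaStreamHandler code; the two function names are hypothetical and only the "media" event is handled.

```python
import base64
import json
from typing import Optional, Tuple


def parse_twilio_message(raw: str) -> Tuple[Optional[str], Optional[bytes]]:
    """Return (event, audio). Audio is mulaw bytes for 'media' events only."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "media":
        return event, base64.b64decode(msg["media"]["payload"])
    # Other events (such as 'start' or 'stop') carry no audio.
    return event, None


def build_media_message(stream_sid: str, mulaw_audio: bytes) -> str:
    """Wrap outbound mulaw audio in a 'media' message for the given stream."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_audio).decode("ascii")},
    })
```

The stream SID arrives in the "start" event, so a handler would typically store it there and reuse it for every outbound message.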
OpenClaw Gateway
Communication with ClawdBot uses JSON-RPC 2.0 through GatewayClient (gateway/client.py:16).
See ClawdBot Integration for details.
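As a reminder of what JSON-RPC 2.0 framing looks like, the sketch below builds a request and unpacks a response. It follows the JSON-RPC 2.0 envelope only; the function names are hypothetical, the method name "agent.run" in the usage note is invented for illustration, and the actual methods exposed by the OpenClaw Gateway are not described on this page.

```python
import itertools
import json
from typing import Any, Dict

_request_ids = itertools.count(1)


def make_request(method: str, params: Dict[str, Any]) -> str:
    """Serialise a JSON-RPC 2.0 request with a fresh numeric id."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    })


def parse_response(raw: str) -> Any:
    """Return the result of a JSON-RPC 2.0 response, or raise on error."""
    msg = json.loads(raw)
    if msg.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    if "error" in msg:
        err = msg["error"]
        raise RuntimeError(f"RPC error {err['code']}: {err['message']}")
    return msg["result"]
```

For example, make_request("agent.run", {"text": "turn on the lights"}) produces a request whose id the client can later match against the id in the response.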
Scalability Considerations
The current architecture is designed for single-instance deployments handling concurrent calls on one server.
Per-call resources:
- 1 WebSocket to Twilio
- 1 WebSocket to Realtime API (OpenAI/Gemini)
- Audio processing: ~2-5% CPU per call
- Memory: ~50-100 MB per active call
Scaling options:
- Vertical: Increase server resources for more concurrent calls
- Horizontal: Deploy multiple instances with load balancing (requires session affinity)
- Queue-based: Use message queue for intent processing to reduce realtime load
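The per-call figures above allow a back-of-the-envelope capacity estimate. The sketch below is only arithmetic on those figures under stated assumptions: the CPU percentage is read as a share of one core, the worst-case values (5% CPU, 100 MB) are used as defaults, and the headroom fraction is a hypothetical safety margin. Measured numbers from a real deployment should replace all of these.

```python
import math


def max_concurrent_calls(
    cpu_cores: int,
    ram_mb: float,
    cpu_pct_per_call: float = 5.0,   # assumed: percent of ONE core per call
    mb_per_call: float = 100.0,      # upper end of the quoted memory range
    headroom: float = 0.5,           # assumed fraction of resources to use
) -> int:
    """Estimate how many calls fit, limited by the scarcer resource."""
    by_cpu = cpu_cores * 100.0 * headroom / cpu_pct_per_call
    by_mem = ram_mb * headroom / mb_per_call
    return math.floor(min(by_cpu, by_mem))
```

For a 4-core server with 8192 MB of RAM and the defaults above, the CPU bound is 40 calls and the memory bound is about 40.96, so the estimate is 40 concurrent calls.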
Error Handling
The system implements graceful degradation:
- WebSocket disconnections: Automatic reconnection with exponential backoff
- API failures: Fallback to error messages spoken to user
- Intent analysis errors: Default to treating as actionable (safer)
- ClawdBot timeout: User notified of delay, continues conversation
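Reconnection with exponential backoff can be sketched as follows. This is a generic illustration, not the project's reconnection code: the function names, the base delay, multiplier, cap and attempt count are all assumed values, and the connect callable stands in for whatever opens the WebSocket.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 6) -> Iterator[float]:
    """Yield delays that grow geometrically and never exceed the cap."""
    for n in range(attempts):
        yield min(cap, base * factor ** n)


def connect_with_retry(connect: Callable[[], T],
                       sleep: Callable[[float], None] = time.sleep,
                       **backoff_kwargs: float) -> T:
    """Call connect() until it succeeds or the attempts run out."""
    last_error = None
    for delay in backoff_delays(**backoff_kwargs):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # wait longer after each failure
    raise ConnectionError("gave up after repeated failures") from last_error
```

Injecting sleep as a parameter keeps the retry loop testable without real waiting. In practice a random jitter is often added to each delay so that many clients do not reconnect at the same instant.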
Next Steps
Conversation Brain
Deep dive into intent understanding and command routing
Audio Pipeline
Learn about audio format conversion and processing
ClawdBot Integration
How commands are executed via OpenClaw Gateway
Getting Started
Set up your own Agentic AI instance