The AudioBridge class handles real-time bidirectional audio streaming between Twilio phone calls and AI models (Gemini Live API or OpenAI Realtime API).
## Overview
AudioBridge manages:
- Audio format conversion between Twilio (mulaw 8kHz) and AI models (PCM 16/24kHz)
- Bidirectional audio streaming with minimal latency
- Real-time transcription (via Gemini, OpenAI, or Whisper STT)
- Conversation memory and intent detection via ConversationBrain
- Audio buffering and silence detection
## Constructor

```python
from agenticai.core.audio_bridge import AudioBridge
from agenticai.twilio.websocket import TwilioMediaStreamHandler
from agenticai.gemini.realtime_handler import GeminiRealtimeHandler

bridge = AudioBridge(
    twilio_handler=twilio_handler,
    gemini_handler=gemini_handler,
    telegram_chat_id="123456789",
    call_id="call-123",
    gemini_api_key="your-gemini-api-key",
    whisper_api_key="your-openai-api-key",
    whisper_enabled=False,
    use_openai=False
)
```
### Parameters

| Parameter | Type | Description |
|---|---|---|
| `twilio_handler` | `TwilioMediaStreamHandler` | Required. Twilio WebSocket handler for the media stream. |
| `gemini_handler` | `GeminiRealtimeHandler \| OpenAIRealtimeHandler` | Required. Real-time AI handler (works with both Gemini and OpenAI). |
| `telegram_client` | `TelegramDirectClient` | Default `None`. Legacy parameter, not used in the current implementation. |
| `telegram_chat_id` | `str` | Telegram chat ID for ClawdBot agent integration. |
| `call_id` | `str` | Unique identifier for the call session. |
| `gemini_api_key` | `str` | Gemini API key for the ConversationBrain. |
| `whisper_api_key` | `str` | OpenAI API key for Whisper STT (optional). |
| `whisper_enabled` | `bool` | Default `False`. Whether to use Whisper for accurate speech-to-text instead of the AI model's built-in STT. |
| `use_openai` | `bool` | Default `False`. Whether the OpenAI Realtime API is in use (affects audio format conversion: 24 kHz vs 16 kHz). |
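The same constructor drives an OpenAI-backed call; a minimal sketch, assuming `openai_handler` is an already-connected `OpenAIRealtimeHandler`:

```python
bridge = AudioBridge(
    twilio_handler=twilio_handler,
    gemini_handler=openai_handler,   # accepts an OpenAIRealtimeHandler too
    call_id="call-456",
    whisper_api_key="your-openai-api-key",
    use_openai=True,                 # switches PCM conversion to 24 kHz
)
```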
## Properties
### is_running

```python
@property
def is_running(self) -> bool
```

Returns `True` if the audio bridge is currently active.
### transcripts

```python
@property
def transcripts(self) -> list[TranscriptEntry]
```

Returns all collected transcript entries, each with speaker, text, timestamp, and finality.
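For example, to print the finalized entries in capture order:

```python
for entry in bridge.transcripts:
    if entry.is_final:
        print(f"[{entry.timestamp:%H:%M:%S}] {entry.speaker}: {entry.text}")
```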
### brain

```python
@property
def brain(self) -> ConversationBrain
```

Returns the ConversationBrain instance for accessing conversation memory and intent analysis.
## Methods

### start

Starts the audio bridge, initializes callbacks, and begins audio processing tasks.
Example:

```python
bridge = AudioBridge(
    twilio_handler=twilio_handler,
    gemini_handler=gemini_handler,
    call_id="call-123"
)

await bridge.start()
print("Audio bridge is now streaming...")
```
### stop

Stops the audio bridge and cleans up all resources.

Example:

```python
await bridge.stop()
print("Audio bridge stopped")
```
### get_full_transcript

```python
def get_full_transcript(self) -> str
```

Returns the complete call transcript as a formatted, multi-line string with speaker labels.
Example:

```python
transcript = bridge.get_full_transcript()
print(transcript)
# Output:
# User: Hello, can you help me?
# Assistant: Of course! How can I assist you today?
# User: I need to send an email
# Assistant: I'll help you send that email.
```
### get_conversation_summary

```python
def get_conversation_summary(self) -> str
```

Returns the brain's conversation summary, including detected intents and commands, along with metadata and context.
Example:

```python
summary = bridge.get_conversation_summary()
print(summary)
```
## Audio Processing Pipeline

### Twilio → AI Model (User Speech)

1. Receive: Twilio sends mulaw 8 kHz audio chunks
2. Convert: mulaw → PCM 16 kHz (Gemini) or PCM 24 kHz (OpenAI), as shown in the sketch below
3. Buffer: accumulate ~50 ms chunks for better STT accuracy
4. Send: forward to the AI model's WebSocket
5. Transcribe: via the AI model's STT or Whisper
6. Analyze: ConversationBrain detects intent and extracts commands
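The conversion in step 2 can be done with the standard-library `audioop` module (deprecated since Python 3.11 and removed in 3.13, where the `audioop-lts` backport fills in). This is an illustrative sketch with a hypothetical helper name, not the bridge's actual implementation:

```python
import audioop  # stdlib up to Python 3.12; use the audioop-lts package on 3.13+

def twilio_to_model(mulaw_8k: bytes, use_openai: bool = False, state=None):
    """Hypothetical helper: mulaw 8 kHz -> 16-bit mono PCM at the model's rate."""
    pcm_8k = audioop.ulaw2lin(mulaw_8k, 2)        # mulaw -> 16-bit linear PCM
    target_rate = 24000 if use_openai else 16000  # OpenAI: 24 kHz, Gemini: 16 kHz
    pcm, state = audioop.ratecv(pcm_8k, 2, 1, 8000, target_rate, state)
    return pcm, state  # carry `state` across chunks for continuous resampling
```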
### AI Model → Twilio (AI Speech)

1. Receive: PCM 16/24 kHz audio from the AI model
2. Convert: PCM → mulaw 8 kHz for Twilio, as shown in the sketch below
3. Send: stream to the phone call via the Twilio WebSocket
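The return path mirrors the same `audioop` calls; again a hypothetical sketch rather than the actual implementation:

```python
def model_to_twilio(pcm: bytes, use_openai: bool = False, state=None):
    """Hypothetical helper: 16-bit mono PCM at the model's rate -> mulaw 8 kHz."""
    source_rate = 24000 if use_openai else 16000
    pcm_8k, state = audioop.ratecv(pcm, 2, 1, source_rate, 8000, state)
    return audioop.lin2ulaw(pcm_8k, 2), state
```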
## Transcript Processing

### With Gemini/OpenAI Built-in STT

```python
# Word-by-word fragments are buffered
brain.add_user_transcript("Send ")
brain.add_user_transcript("an ")
brain.add_user_transcript("email")

# Complete turn is flushed when the user stops speaking
await brain.flush_user_turn()
# → "Send an email" analyzed for intent
```
### With Whisper STT

```python
# Audio is buffered until silence is detected
silence_detector.process(audio_chunk)

# The complete utterance is transcribed at once
transcript = await whisper.transcribe(audio_buffer)
# → "Send an email to John" returned as a complete phrase
await brain.flush_user_turn()
```
## Data Classes

### TranscriptEntry

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TranscriptEntry:
    speaker: str         # "user" or "assistant"
    text: str            # Transcript text
    timestamp: datetime  # When spoken
    is_final: bool       # Whether this is a final transcript
```
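For instance, a finalized user utterance would be recorded as:

```python
entry = TranscriptEntry(
    speaker="user",
    text="Send an email to John",
    timestamp=datetime.now(),
    is_final=True,
)
```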
## Complete Example

```python
import asyncio

from agenticai.core.audio_bridge import AudioBridge
from agenticai.twilio.websocket import TwilioMediaStreamHandler
from agenticai.gemini.realtime_handler import GeminiRealtimeHandler

async def run_call_with_bridge(websocket, call_id: str):
    # Initialize handlers
    twilio_handler = TwilioMediaStreamHandler(websocket)
    gemini_handler = GeminiRealtimeHandler(
        api_key="your-api-key",
        model="models/gemini-2.5-flash-native-audio-preview-12-2025",
        voice="Puck",
        system_instruction="You are a helpful AI assistant."
    )
    await gemini_handler.connect(initial_prompt="Greet the caller.")

    # Create audio bridge
    bridge = AudioBridge(
        twilio_handler=twilio_handler,
        gemini_handler=gemini_handler,
        call_id=call_id,
        gemini_api_key="your-api-key",
        telegram_chat_id="123456789",
        whisper_enabled=True,
        whisper_api_key="your-openai-api-key"
    )

    # Start streaming
    await bridge.start()
    print("Audio bridge running...")

    # Wait for the call to complete
    while bridge.is_running:
        await asyncio.sleep(0.5)

    # Get results
    transcript = bridge.get_full_transcript()
    summary = bridge.get_conversation_summary()

    print("Call completed!")
    print(f"Transcript:\n{transcript}")
    print(f"\nSummary:\n{summary}")

    # Cleanup
    await bridge.stop()
    await gemini_handler.disconnect()
```
## Audio Formats

| Source | Format | Sample Rate | Bit Depth |
|---|---|---|---|
| Twilio | Mulaw | 8 kHz | 8-bit |
| Gemini | PCM | 16 kHz | 16-bit |
| OpenAI | PCM | 24 kHz | 16-bit |
| Whisper | PCM | 16 kHz | 16-bit |
## Latency Optimization

The minimum chunk size buffered before sending audio to the model trades latency against STT accuracy:

```python
# Smaller buffer = lower latency (50 ms of 16-bit PCM: 2400 B at 24 kHz, 1600 B at 16 kHz)
self._min_chunk_size = 2400 if use_openai else 1600

# Larger buffer = better STT accuracy (100 ms)
self._min_chunk_size = 4800 if use_openai else 3200
```
## Silence Detection

```python
from agenticai.audio.whisper_stt import SilenceDetector

silence_detector = SilenceDetector(
    silence_threshold=500,    # Amplitude threshold
    silence_duration_ms=500,  # Required silence duration
    sample_rate=16000
)
```
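Conceptually, the detector treats a chunk as silent when its amplitude stays below `silence_threshold`; once silence persists for `silence_duration_ms`, the buffered utterance is flushed to Whisper. A minimal sketch of the per-chunk amplitude check, assuming `audioop` (illustrative only, not the SilenceDetector internals):

```python
import audioop

def is_silent(pcm_chunk: bytes, threshold: int = 500) -> bool:
    # RMS amplitude of 16-bit mono PCM; below the threshold counts as silence
    return audioop.rms(pcm_chunk, 2) < threshold
```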