The AudioBridge class handles real-time bidirectional audio streaming between Twilio phone calls and AI models (Gemini Live API or OpenAI Realtime API).

Overview

AudioBridge manages:
  • Audio format conversion between Twilio (mulaw 8kHz) and AI models (PCM 16/24kHz)
  • Bidirectional audio streaming with minimal latency
  • Real-time transcription (via Gemini, OpenAI, or Whisper STT)
  • Conversation memory and intent detection via ConversationBrain
  • Audio buffering and silence detection

Constructor

from agenticai.core.audio_bridge import AudioBridge
from agenticai.twilio.websocket import TwilioMediaStreamHandler
from agenticai.gemini.realtime_handler import GeminiRealtimeHandler

bridge = AudioBridge(
    twilio_handler=twilio_handler,
    gemini_handler=gemini_handler,
    telegram_chat_id="123456789",
    call_id="call-123",
    gemini_api_key="your-gemini-api-key",
    whisper_api_key="your-openai-api-key",
    whisper_enabled=False,
    use_openai=False
)
twilio_handler: TwilioMediaStreamHandler (required)
  Twilio WebSocket handler for the media stream.

gemini_handler: GeminiRealtimeHandler | OpenAIRealtimeHandler (required)
  Real-time AI handler (works with both Gemini and OpenAI).

telegram_client: TelegramDirectClient (default: None)
  Legacy parameter, not used in the current implementation.

telegram_chat_id: str (default: "")
  Telegram chat ID for ClawdBot agent integration.

call_id: str (default: "")
  Unique identifier for the call session.

gemini_api_key: str (default: "")
  Gemini API key for the ConversationBrain.

whisper_api_key: str (default: "")
  OpenAI API key for Whisper STT (optional).

whisper_enabled: bool (default: False)
  Whether to use Whisper for accurate speech-to-text instead of the AI model's built-in STT.

use_openai: bool (default: False)
  Whether the bridge targets the OpenAI Realtime API (affects audio format conversion: 24 kHz vs 16 kHz).

Properties

is_running

@property
def is_running(self) -> bool

Returns True if the audio bridge is currently active.
Returns: bool (bridge running status).

transcripts

@property
def transcripts(self) -> list[TranscriptEntry]

Returns a list of all collected transcript entries.
Returns: list[TranscriptEntry] (entries with speaker, text, timestamp, and finality).

brain

@property
def brain(self) -> ConversationBrain

Returns the ConversationBrain instance for accessing conversation memory and intent analysis.
Returns: ConversationBrain (the conversation brain instance).

Methods

start

async def start(self)
Starts the audio bridge, initializes callbacks, and begins audio processing tasks. Example:
bridge = AudioBridge(
    twilio_handler=twilio_handler,
    gemini_handler=gemini_handler,
    call_id="call-123"
)

await bridge.start()
print("Audio bridge is now streaming...")

stop

async def stop(self)
Stops the audio bridge and cleans up all resources. Example:
await bridge.stop()
print("Audio bridge stopped")

get_full_transcript

def get_full_transcript(self) -> str

Returns the complete call transcript as a formatted string.
Returns: str (multi-line transcript with speaker labels).
Example:
transcript = bridge.get_full_transcript()
print(transcript)
# Output:
# User: Hello, can you help me?
# Assistant: Of course! How can I assist you today?
# User: I need to send an email
# Assistant: I'll help you send that email.

get_conversation_summary

def get_conversation_summary(self) -> str

Returns the brain's conversation summary, including detected intents and commands.
Returns: str (conversation summary with metadata and context).
Example:
summary = bridge.get_conversation_summary()
print(summary)

Audio Processing Pipeline

Twilio → AI Model (User Speech)

  1. Receive: Twilio sends mulaw 8kHz audio chunks
  2. Convert: Mulaw → PCM 16kHz (Gemini) or PCM 24kHz (OpenAI)
  3. Buffer: Accumulate ~50ms chunks for better STT accuracy
  4. Send: Forward to AI model’s WebSocket
  5. Transcribe: Either via AI model’s STT or Whisper
  6. Analyze: ConversationBrain detects intent and extracts commands
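Steps 1 and 2 above can be sketched in pure Python. This is an illustrative stand-alone G.711 mu-law decoder with naive sample-doubling upsampling, not the actual AudioBridge implementation (which would typically use an optimized, filtered resampler):

```python
import struct

def ulaw_byte_to_pcm16(b: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    b = ~b & 0xFF
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def ulaw_to_pcm16k(ulaw: bytes) -> bytes:
    """Decode mulaw 8 kHz audio and upsample to PCM 16 kHz for Gemini."""
    out = []
    for b in ulaw:
        s = ulaw_byte_to_pcm16(b)
        out.append(s)
        out.append(s)  # naive 2x upsample; real code would interpolate/filter
    return struct.pack(f"<{len(out)}h", *out)

# 20 ms of mulaw silence from Twilio (160 bytes at 8 kHz)
chunk = bytes([0xFF]) * 160
pcm = ulaw_to_pcm16k(chunk)
print(len(pcm))  # 640 bytes: 320 samples of 16-bit PCM at 16 kHz
```

Mu-law byte 0xFF decodes to amplitude 0, so a chunk of 0xFF bytes is digital silence.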

AI Model → Twilio (AI Speech)

  1. Receive: PCM 16/24kHz audio from AI model
  2. Convert: PCM → mulaw 8kHz for Twilio
  3. Send: Stream to phone call via Twilio WebSocket
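The reverse direction can be sketched the same way: a stand-alone mu-law encoder with naive decimation from 16 kHz to 8 kHz. Illustrative only; a production path would low-pass filter before dropping samples:

```python
import struct

def pcm16_to_ulaw_byte(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0
    sample = min(abs(sample), CLIP) + BIAS
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (sample & mask):
        mask >>= 1
        exponent -= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def pcm16k_to_ulaw8k(pcm: bytes) -> bytes:
    """Downsample PCM 16 kHz to 8 kHz (drop every other sample) and mulaw-encode."""
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return bytes(pcm16_to_ulaw_byte(s) for s in samples[::2])

pcm = struct.pack("<4h", 0, 1000, 0, -1000)  # 4 samples at 16 kHz
ulaw = pcm16k_to_ulaw8k(pcm)
print(len(ulaw))  # 2 bytes: half the sample count, one mulaw byte each
```

Encoding and decoding are inverses at the codec's quantization resolution, e.g. PCM 0 encodes to 0xFF, which decodes back to 0.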

Transcript Processing

With Gemini/OpenAI Built-in STT

# Word-by-word fragments are buffered
brain.add_user_transcript("Send ")
brain.add_user_transcript("an ")
brain.add_user_transcript("email")

# Complete turn is flushed when user stops speaking
await brain.flush_user_turn()
# → "Send an email" analyzed for intent

With Whisper STT

# Audio buffered until silence detected
silence_detector.process(audio_chunk)

# Complete utterance transcribed at once
transcript = await whisper.transcribe(audio_buffer)
# → "Send an email to John" returned as complete phrase

await brain.flush_user_turn()

Data Classes

TranscriptEntry

@dataclass
class TranscriptEntry:
    speaker: str          # "user" or "assistant"
    text: str             # Transcript text
    timestamp: datetime   # When spoken
    is_final: bool        # Whether this is a final transcript
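Entries from bridge.transcripts can be filtered and formatted by hand. The sketch below redefines TranscriptEntry so it runs stand-alone; the exact label formatting used by get_full_transcript is an assumption:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TranscriptEntry:
    speaker: str          # "user" or "assistant"
    text: str             # Transcript text
    timestamp: datetime   # When spoken
    is_final: bool        # Whether this is a final transcript

entries = [
    TranscriptEntry("user", "Hello, can you help me?", datetime(2025, 1, 1, 12, 0, 0), True),
    TranscriptEntry("assistant", "Of course!", datetime(2025, 1, 1, 12, 0, 2), True),
    TranscriptEntry("user", "I need to send an em", datetime(2025, 1, 1, 12, 0, 5), False),
]

# Keep only final entries and label each line by speaker
lines = [f"{e.speaker.capitalize()}: {e.text}" for e in entries if e.is_final]
print("\n".join(lines))
# User: Hello, can you help me?
# Assistant: Of course!
```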

Complete Example

import asyncio
from agenticai.core.audio_bridge import AudioBridge
from agenticai.core.call_manager import CallManager
from agenticai.twilio.websocket import TwilioMediaStreamHandler
from agenticai.gemini.realtime_handler import GeminiRealtimeHandler

async def run_call_with_bridge(websocket, call_id: str):
    # Initialize handlers
    twilio_handler = TwilioMediaStreamHandler(websocket)
    
    gemini_handler = GeminiRealtimeHandler(
        api_key="your-api-key",
        model="models/gemini-2.5-flash-native-audio-preview-12-2025",
        voice="Puck",
        system_instruction="You are a helpful AI assistant."
    )
    
    await gemini_handler.connect(initial_prompt="Greet the caller.")
    
    # Create audio bridge
    bridge = AudioBridge(
        twilio_handler=twilio_handler,
        gemini_handler=gemini_handler,
        call_id=call_id,
        gemini_api_key="your-api-key",
        telegram_chat_id="123456789",
        whisper_enabled=True,
        whisper_api_key="your-openai-api-key"
    )
    
    # Start streaming
    await bridge.start()
    print("Audio bridge running...")
    
    # Wait for call to complete
    while bridge.is_running:
        await asyncio.sleep(0.5)
    
    # Get results
    transcript = bridge.get_full_transcript()
    summary = bridge.get_conversation_summary()
    
    print("Call completed!")
    print(f"Transcript:\n{transcript}")
    print(f"\nSummary:\n{summary}")
    
    # Cleanup
    await bridge.stop()
    await gemini_handler.disconnect()

Audio Format Reference

Source    Format  Sample Rate  Bit Depth
Twilio    mulaw   8 kHz        8-bit
Gemini    PCM     16 kHz       16-bit
OpenAI    PCM     24 kHz       16-bit
Whisper   PCM     16 kHz       16-bit

Performance Tuning

Latency Optimization

# Smaller buffer = lower latency (50ms)
self._min_chunk_size = 2400 if use_openai else 1600

# Larger buffer = better STT accuracy (100ms)
self._min_chunk_size = 4800 if use_openai else 3200
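The buffer sizes above follow directly from the PCM formats: bytes = sample rate x 2 bytes per sample x duration. A quick check:

```python
def min_chunk_bytes(sample_rate_hz: int, chunk_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes needed to hold chunk_ms of mono 16-bit PCM at the given rate."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000

print(min_chunk_bytes(16000, 50))   # 1600  (Gemini, 50 ms)
print(min_chunk_bytes(24000, 50))   # 2400  (OpenAI, 50 ms)
print(min_chunk_bytes(16000, 100))  # 3200  (Gemini, 100 ms)
print(min_chunk_bytes(24000, 100))  # 4800  (OpenAI, 100 ms)
```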

Silence Detection

from agenticai.audio.whisper_stt import SilenceDetector

silence_detector = SilenceDetector(
    silence_threshold=500,      # Amplitude threshold
    silence_duration_ms=500,    # Required silence duration
    sample_rate=16000
)
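An amplitude threshold like the one above is typically an RMS check over each chunk. This is an illustrative stand-alone version, not the SilenceDetector implementation; the silence_duration_ms parameter would additionally require the condition to hold across consecutive chunks:

```python
import math
import struct

def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit mono PCM chunk."""
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_silent(chunk: bytes, threshold: float = 500.0) -> bool:
    return rms(chunk) < threshold

quiet = struct.pack("<160h", *([10] * 160))    # near-silence
loud = struct.pack("<160h", *([4000] * 160))   # speech-level amplitude
print(is_silent(quiet), is_silent(loud))  # True False
```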
