WebSocket Endpoint

Real-time audio streaming endpoint for Telnyx telephony integration

Endpoint

WS /ws
WSS /ws (production with TLS)

Description

WebSocket endpoint that receives real-time audio streams from Telnyx during active calls. The endpoint performs live transcription using Deepgram, analyzes audio for distress signals, and maintains real-time call state.

Connection

Telnyx initiates the WebSocket connection when the system calls the streaming_start action on a call.

Connection Example (JavaScript test client)

const ws = new WebSocket('wss://your-domain.com/ws');

ws.onopen = () => {
  console.log('WebSocket connected');
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Event:', data.event);
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('WebSocket closed');
};

Message Protocol

All messages are JSON-formatted text frames.

Client → Server Messages

Event: connected

Initial connection acknowledgment from Telnyx.
{
  "event": "connected"
}

Event: start

Streaming session started. Contains call metadata and audio configuration.
  • event (string, required): Always "start"
  • start (object, required): Stream configuration
    • start.call_control_id (string, required): Telnyx call control ID
    • start.sampling_rate (number, required): Audio sample rate in Hz (typically 8000)

Example

{
  "event": "start",
  "start": {
    "call_control_id": "v3:xyz789-abc123",
    "sampling_rate": 8000
  }
}
Server Processing:
  1. Resolves call_control_id to call_session_id using CALL_CONTROL_TO_CALL_ID map
  2. Initializes call-specific state in LIVE_SIGNALS dictionary
  3. Starts Deepgram real-time transcription session
  4. Prepares audio processing buffers
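The four steps above can be sketched as a small handler. Only CALL_CONTROL_TO_CALL_ID, LIVE_SIGNALS, and the state keys come from this page; the function name and wiring are assumptions (the real logic lives in app/api/ws/handler.py):

```python
# Hypothetical sketch of "start" handling; CALL_CONTROL_TO_CALL_ID and
# LIVE_SIGNALS are from this page, the handler itself is illustrative.
CALL_CONTROL_TO_CALL_ID = {}  # populated by the webhook handler
LIVE_SIGNALS = {}

def handle_start(msg):
    """Resolve the Telnyx call and initialize per-call state."""
    call_control_id = msg["start"]["call_control_id"]
    call_id = CALL_CONTROL_TO_CALL_ID.get(call_control_id)
    if call_id is None:
        return None  # unknown call: skip processing (see Error Handling)
    LIVE_SIGNALS[call_id] = {
        "chunks": 0, "voiced_chunks": 0, "voiced_seconds": 0.0,
        "ema": 0.0, "distress": 0.0, "max_distress": 0.0,
        "transcript": "", "transcript_live": "",
        "wav_path": None, "emotion": None,
    }
    return call_id
```

Steps 3 and 4 (Deepgram session, audio buffers) would hang off the same call_id.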

Event: media

Audio data packet (sent repeatedly during the call).
  • event (string, required): Always "media"
  • media (object, required): Audio payload
    • media.payload (string, required): Base64-encoded audio data (PCMU/μ-law format, typically 80 or 160 bytes)

Example

{
  "event": "media",
  "media": {
    "payload": "/v/9//v+//3/+P/5//j/9//3//f/+f/4//j/+f/5//n/+P/4//j/+f/5"
  }
}
Server Processing:
  1. Audio Decoding
    • Base64 decode payload
    • Convert μ-law to PCM16 little-endian format
    • Typical frame: 80-160 bytes = 10-20ms of audio
  2. Transcription
    • Stream PCM16 data to Deepgram WebSocket
    • Update LIVE_SIGNALS[call_id]["transcript_live"] with partial results
    • Store finalized transcript segments
  3. Audio Analysis
    • Buffer audio into 160ms chunks (2560 bytes @ 8kHz)
    • Calculate RMS (root mean square) for voice activity detection
    • Apply exponential moving average (EMA) for baseline
    • Compute distress score from deviation above baseline
    • Update LIVE_SIGNALS[call_id] with metrics:
      • chunks: Total audio chunks processed
      • voiced_chunks: Chunks with voice activity
      • voiced_seconds: Total voice duration
      • distress: Current distress score (0.0-1.0)
      • max_distress: Peak distress score
      • ema: Rolling average baseline
  4. WAV Recording
    • Append PCM16 data to buffer for offline processing
    • Saved to data/calls/{timestamp}.wav on disconnect
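The per-chunk metric updates in steps 2-3 can be sketched as one function over a decoded 160ms PCM16 chunk. The constants and formulas mirror the Audio Processing Details section below; the function shape and field wiring are assumptions:

```python
# Sketch of the per-chunk metric update for one 160ms PCM16 LE chunk
# (2560 bytes @ 8kHz); constants follow the Audio Processing Details section.
import struct

ALPHA = 0.15          # EMA smoothing factor
VAD_THRESHOLD = 0.02  # 2% of full scale
CHUNK_SECONDS = 0.16  # 160ms per buffered chunk

def process_chunk(sig, chunk):
    """Update one LIVE_SIGNALS entry from a PCM16 little-endian chunk."""
    samples = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5 / 32768.0
    sig["chunks"] += 1
    if rms >= VAD_THRESHOLD:          # voice activity detection
        sig["voiced_chunks"] += 1
        sig["voiced_seconds"] += CHUNK_SECONDS
    sig["ema"] = ALPHA * rms + (1 - ALPHA) * sig["ema"]
    diff = max(0.0, rms - sig["ema"])  # deviation above baseline
    sig["distress"] = max(sig["distress"] * 0.9, min(1.0, diff * 8.0))
    sig["max_distress"] = max(sig["max_distress"], sig["distress"])
```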

Event: stop

Streaming session ended.
{
  "event": "stop"
}
Server Processing:
  1. Finalize Deepgram transcription
  2. Write WAV file to disk
  3. Update final transcript in LIVE_SIGNALS
  4. Close WebSocket connection

Audio Processing Details

Voice Activity Detection (VAD)

vad_threshold = 0.02  # 2% of full-scale
rms = rms_norm_pcm16le(chunk)  # 0.0 to 1.0
voiced = rms >= vad_threshold
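The rms_norm_pcm16le helper is not shown on this page; a plausible implementation (the real one is in app/api/ws/handler.py) is:

```python
# Plausible implementation of the rms_norm_pcm16le helper used above.
import struct

def rms_norm_pcm16le(chunk):
    """RMS of a PCM16 little-endian chunk, normalized to 0.0-1.0."""
    if len(chunk) < 2:
        return 0.0
    samples = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
    mean_sq = sum(s * s for s in samples) / len(samples)
    return (mean_sq ** 0.5) / 32768.0  # normalize by full scale
```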

Distress Score Computation

ema = alpha * rms + (1 - alpha) * ema  # alpha = 0.15
diff = max(0.0, rms - ema)  # deviation above baseline
score = max(distress * 0.9, min(1.0, diff * 8.0))
The distress score:
  • Tracks sudden increases in volume/intensity
  • Uses EMA baseline to adapt to call dynamics
  • Decays slowly (0.9 multiplier) when intensity drops
  • Ranges from 0.0 (calm) to 1.0 (high distress)
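A worked trace of the formulas above shows the behavior: a sudden loud chunk spikes the score, then the 0.9 multiplier decays it as intensity returns to baseline (the step function here just restates the two formulas):

```python
# Worked example of the distress formulas: spike on a loud chunk, slow decay.
def distress_step(rms, ema, distress, alpha=0.15):
    ema = alpha * rms + (1 - alpha) * ema       # update baseline
    diff = max(0.0, rms - ema)                  # deviation above baseline
    return ema, max(distress * 0.9, min(1.0, diff * 8.0))

ema = distress = 0.0
for rms in [0.03, 0.03, 0.40, 0.05, 0.05, 0.05]:
    ema, distress = distress_step(rms, ema, distress)
    print("rms=%.2f ema=%.3f distress=%.3f" % (rms, ema, distress))
```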

Audio Format Conversion

Telnyx sends PCMU (μ-law) encoded audio. The system converts to PCM16:
μ-law (8-bit) → PCM16 (16-bit little-endian)
80 bytes → 160 bytes
10ms @ 8kHz
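The conversion can be hand-rolled from the G.711 μ-law tables; a sketch is below. (The stdlib audioop module's ulaw2lin did the same but was removed in Python 3.13; the function name here is illustrative.)

```python
# G.711 μ-law → PCM16 little-endian decode, doubling the byte count.
import struct

def ulaw_to_pcm16le(payload):
    """Decode 8-bit μ-law bytes to PCM16 little-endian."""
    out = []
    for b in payload:
        u = ~b & 0xFF                 # μ-law stores the complement
        sign = u & 0x80
        exponent = (u >> 4) & 0x07
        mantissa = u & 0x0F
        sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
        out.append(-sample if sign else sample)
    return struct.pack("<%dh" % len(out), *out)
```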

Live State Management

The LIVE_SIGNALS dictionary maintains per-call state:
LIVE_SIGNALS = {
  "call_id_123": {
    "chunks": 0,                 # Total chunks processed
    "voiced_chunks": 0,          # Chunks with voice
    "voiced_seconds": 0.0,       # Total voice duration
    "ema": 0.0,                  # Rolling baseline
    "distress": 0.0,             # Current distress score
    "max_distress": 0.0,         # Peak distress
    "transcript": "",            # Final transcript
    "transcript_live": "",       # Partial transcript
    "wav_path": None,            # Path to saved audio
    "emotion": None              # Emotion analysis (set on hangup)
  }
}

Real-time Transcription

The system uses Deepgram’s streaming API:
  1. Connection: Opens WebSocket to Deepgram when start event received
  2. Streaming: Forwards PCM16 audio chunks to Deepgram
  3. Callbacks:
    • _on_partial(text, call_id): Updates transcript_live with interim results
    • _on_final(text, call_id): Updates transcript with finalized segments
  4. Finalization: On stop event, closes Deepgram connection and gets final transcript

Implementation

Location: app/api/ws/handler.py:344-549

Configuration

Set in environment variables:
  • WS_PUBLIC_URL (string, required): Public WebSocket URL that Telnyx can reach (e.g., wss://your-domain.com/ws)
  • DEEPGRAM_API_KEY (string, required): Deepgram API key for speech-to-text

Error Handling

  • Call ID Not Found: If call_control_id doesn’t map to a call_session_id, processing is skipped
  • WebSocket Disconnect: Gracefully closes, saves audio, finalizes transcript
  • Deepgram Errors: Logged but don’t crash connection; falls back to audio-only analysis
  • Missing Audio Data: Empty payloads are skipped silently

Performance Characteristics

  • Latency: 100-300ms for partial transcripts
  • Throughput: Handles multiple concurrent calls (one WebSocket per call)
  • Buffer Size: 160ms chunks (2560 bytes @ 8kHz PCM16)
  • Memory: ~2MB per minute of audio buffered

Example Debug Output

[ws] other: connected
[ws] start sr=8000
[chunk] seq=0001 rms=0.234 voiced=1 distress=0.045
[live-partial] call=abc123... "help there's a"
[chunk] seq=0002 rms=0.412 voiced=1 distress=0.189
[live-partial] call=abc123... "help there's a fire"
[chunk] seq=0003 rms=0.587 voiced=1 distress=0.456
[live-final] call=abc123... "help there's a fire in the kitchen"
[ws] stop received; frames=245
[ws] saved data/calls/1738539845.wav  duration=39.20s  chunks=245
[ws] closed; total_frames=245

Integration with Call Lifecycle

  1. Incoming Call (call.initiated webhook)
    • System answers and calls streaming_start action
    • Telnyx opens WebSocket to /ws
  2. Active Call (call.answered webhook)
    • Audio streams via WebSocket
    • Real-time processing updates LIVE_SIGNALS
    • UI polls /api/v1/live_queue for updates
  3. Call End (call.hangup webhook)
    • WebSocket receives stop event
    • Final audio saved, transcript finalized
    • Webhook handler performs full analysis
    • Queue item created

Security Considerations

  • No Authentication: WebSocket accepts any connection (internal use only)
  • Production: Should add:
    • Token-based authentication
    • Rate limiting per IP
    • Connection timeout enforcement
    • Input validation on all events
  • Data Privacy: Audio files stored locally in data/calls/
  • Phone Numbers: Automatically masked (last 4 digits only)

Debugging

To view live calls in browser:
http://localhost:8000/debug/live_calls/
This page polls /api/v1/live_queue and displays:
  • Real-time transcript
  • Distress scores
  • Risk levels
  • Call metadata
