WebSocket Endpoint

Real-time audio streaming endpoint for Telnyx telephony integration

Endpoint

WS /ws
WSS /ws (production with TLS)

Description

WebSocket endpoint that receives real-time audio streams from Telnyx during active calls. The endpoint performs live transcription using Deepgram, analyzes audio for distress signals, and maintains real-time call state.

Connection

Telnyx initiates the WebSocket connection when the system calls the streaming_start action on a call.

Connection Example (JavaScript test client)

const ws = new WebSocket('wss://your-domain.com/ws');

ws.onopen = () => {
  console.log('WebSocket connected');
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Event:', data.event);
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('WebSocket closed');
};

Message Protocol

All messages are JSON-formatted text frames.

Client → Server Messages

Event: connected

Initial connection acknowledgment from Telnyx.
{
  "event": "connected"
}

Event: start

Streaming session started. Contains call metadata and audio configuration.
  • event (string, required): Always "start"
  • start (object, required): Stream configuration
    • start.call_control_id (string, required): Telnyx call control ID
    • start.sampling_rate (number, required): Audio sample rate in Hz (typically 8000)

Example

{
  "event": "start",
  "start": {
    "call_control_id": "v3:xyz789-abc123",
    "sampling_rate": 8000
  }
}
Server Processing:
  1. Resolves call_control_id to call_session_id using CALL_CONTROL_TO_CALL_ID map
  2. Initializes call-specific state in LIVE_SIGNALS dictionary
  3. Starts Deepgram real-time transcription session
  4. Prepares audio processing buffers
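The four steps above can be sketched as a small handler. Only CALL_CONTROL_TO_CALL_ID, LIVE_SIGNALS, and the state keys come from this page; the function name and wiring are assumptions (the real logic lives in app/api/ws/handler.py):

```python
# Hypothetical sketch of "start" handling; CALL_CONTROL_TO_CALL_ID and
# LIVE_SIGNALS are from this page, the handler itself is illustrative.
CALL_CONTROL_TO_CALL_ID = {}  # populated by the webhook handler
LIVE_SIGNALS = {}

def handle_start(msg):
    """Resolve the Telnyx call and initialize per-call state."""
    call_control_id = msg["start"]["call_control_id"]
    call_id = CALL_CONTROL_TO_CALL_ID.get(call_control_id)
    if call_id is None:
        return None  # unknown call: skip processing (see Error Handling)
    LIVE_SIGNALS[call_id] = {
        "chunks": 0, "voiced_chunks": 0, "voiced_seconds": 0.0,
        "ema": 0.0, "distress": 0.0, "max_distress": 0.0,
        "transcript": "", "transcript_live": "",
        "wav_path": None, "emotion": None,
    }
    return call_id
```

Steps 3 and 4 (Deepgram session, audio buffers) would hang off the same call_id.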

Event: media

Audio data packet (sent repeatedly during the call).
  • event (string, required): Always "media"
  • media (object, required): Audio payload
    • media.payload (string, required): Base64-encoded audio data (PCMU/μ-law format, typically 80 or 160 bytes)

Example

{
  "event": "media",
  "media": {
    "payload": "/v/9//v+//3/+P/5//j/9//3//f/+f/4//j/+f/5//n/+P/4//j/+f/5"
  }
}
Server Processing:
  1. Audio Decoding
    • Base64 decode payload
    • Convert μ-law to PCM16 little-endian format
    • Typical frame: 80-160 bytes = 10-20ms of audio
  2. Transcription
    • Stream PCM16 data to Deepgram WebSocket
    • Update LIVE_SIGNALS[call_id]["transcript_live"] with partial results
    • Store finalized transcript segments
  3. Audio Analysis
    • Buffer audio into 160ms chunks (2560 bytes @ 8kHz)
    • Calculate RMS (root mean square) for voice activity detection
    • Apply exponential moving average (EMA) for baseline
    • Compute distress score from deviation above baseline
    • Update LIVE_SIGNALS[call_id] with metrics:
      • chunks: Total audio chunks processed
      • voiced_chunks: Chunks with voice activity
      • voiced_seconds: Total voice duration
      • distress: Current distress score (0.0-1.0)
      • max_distress: Peak distress score
      • ema: Rolling average baseline
  4. WAV Recording
    • Append PCM16 data to buffer for offline processing
    • Saved to data/calls/{timestamp}.wav on disconnect
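The per-chunk metric updates in steps 2-3 can be sketched as one function over a decoded 160ms PCM16 chunk. The constants and formulas mirror the Audio Processing Details section below; the function shape and field wiring are assumptions:

```python
# Sketch of the per-chunk metric update for one 160ms PCM16 LE chunk
# (2560 bytes @ 8kHz); constants follow the Audio Processing Details section.
import struct

ALPHA = 0.15          # EMA smoothing factor
VAD_THRESHOLD = 0.02  # 2% of full scale
CHUNK_SECONDS = 0.16  # 160ms per buffered chunk

def process_chunk(sig, chunk):
    """Update one LIVE_SIGNALS entry from a PCM16 little-endian chunk."""
    samples = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5 / 32768.0
    sig["chunks"] += 1
    if rms >= VAD_THRESHOLD:          # voice activity detection
        sig["voiced_chunks"] += 1
        sig["voiced_seconds"] += CHUNK_SECONDS
    sig["ema"] = ALPHA * rms + (1 - ALPHA) * sig["ema"]
    diff = max(0.0, rms - sig["ema"])  # deviation above baseline
    sig["distress"] = max(sig["distress"] * 0.9, min(1.0, diff * 8.0))
    sig["max_distress"] = max(sig["max_distress"], sig["distress"])
```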

Event: stop

Streaming session ended.
{
  "event": "stop"
}
Server Processing:
  1. Finalize Deepgram transcription
  2. Write WAV file to disk
  3. Update final transcript in LIVE_SIGNALS
  4. Close WebSocket connection

Audio Processing Details

Voice Activity Detection (VAD)

vad_threshold = 0.02  # 2% of full-scale
rms = rms_norm_pcm16le(chunk)  # 0.0 to 1.0
voiced = rms >= vad_threshold
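The rms_norm_pcm16le helper is not shown on this page; a plausible implementation (the real one is in app/api/ws/handler.py) is:

```python
# Plausible implementation of the rms_norm_pcm16le helper used above.
import struct

def rms_norm_pcm16le(chunk):
    """RMS of a PCM16 little-endian chunk, normalized to 0.0-1.0."""
    if len(chunk) < 2:
        return 0.0
    samples = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
    mean_sq = sum(s * s for s in samples) / len(samples)
    return (mean_sq ** 0.5) / 32768.0  # normalize by full scale
```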

Distress Score Computation

ema = alpha * rms + (1 - alpha) * ema  # alpha = 0.15
diff = max(0.0, rms - ema)  # deviation above baseline
score = max(distress * 0.9, min(1.0, diff * 8.0))
The distress score:
  • Tracks sudden increases in volume/intensity
  • Uses EMA baseline to adapt to call dynamics
  • Decays slowly (0.9 multiplier) when intensity drops
  • Ranges from 0.0 (calm) to 1.0 (high distress)
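A worked trace of the formulas above shows the behavior: a sudden loud chunk spikes the score, then the 0.9 multiplier decays it as intensity returns to baseline (the step function here just restates the two formulas):

```python
# Worked example of the distress formulas: spike on a loud chunk, slow decay.
def distress_step(rms, ema, distress, alpha=0.15):
    ema = alpha * rms + (1 - alpha) * ema       # update baseline
    diff = max(0.0, rms - ema)                  # deviation above baseline
    return ema, max(distress * 0.9, min(1.0, diff * 8.0))

ema = distress = 0.0
for rms in [0.03, 0.03, 0.40, 0.05, 0.05, 0.05]:
    ema, distress = distress_step(rms, ema, distress)
    print("rms=%.2f ema=%.3f distress=%.3f" % (rms, ema, distress))
```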

Audio Format Conversion

Telnyx sends PCMU (μ-law) encoded audio. The system converts to PCM16:
μ-law (8-bit) → PCM16 (16-bit little-endian)
80 bytes → 160 bytes
10ms @ 8kHz
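The conversion can be hand-rolled from the G.711 μ-law tables; a sketch is below. (The stdlib audioop module's ulaw2lin did the same but was removed in Python 3.13; the function name here is illustrative.)

```python
# G.711 μ-law → PCM16 little-endian decode, doubling the byte count.
import struct

def ulaw_to_pcm16le(payload):
    """Decode 8-bit μ-law bytes to PCM16 little-endian."""
    out = []
    for b in payload:
        u = ~b & 0xFF                 # μ-law stores the complement
        sign = u & 0x80
        exponent = (u >> 4) & 0x07
        mantissa = u & 0x0F
        sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
        out.append(-sample if sign else sample)
    return struct.pack("<%dh" % len(out), *out)
```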

Live State Management

The LIVE_SIGNALS dictionary maintains per-call state:
LIVE_SIGNALS = {
  "call_id_123": {
    "chunks": 0,                 # Total chunks processed
    "voiced_chunks": 0,          # Chunks with voice
    "voiced_seconds": 0.0,       # Total voice duration
    "ema": 0.0,                  # Rolling baseline
    "distress": 0.0,             # Current distress score
    "max_distress": 0.0,         # Peak distress
    "transcript": "",            # Final transcript
    "transcript_live": "",       # Partial transcript
    "wav_path": None,            # Path to saved audio
    "emotion": None              # Emotion analysis (set on hangup)
  }
}

Real-time Transcription

The system uses Deepgram’s streaming API:
  1. Connection: Opens WebSocket to Deepgram when start event received
  2. Streaming: Forwards PCM16 audio chunks to Deepgram
  3. Callbacks:
    • _on_partial(text, call_id): Updates transcript_live with interim results
    • _on_final(text, call_id): Updates transcript with finalized segments
  4. Finalization: On stop event, closes Deepgram connection and gets final transcript

Implementation

Location: app/api/ws/handler.py:344-549

Configuration

Set in environment variables:
  • WS_PUBLIC_URL (string, required): Public WebSocket URL that Telnyx can reach (e.g., wss://your-domain.com/ws)
  • DEEPGRAM_API_KEY (string, required): Deepgram API key for speech-to-text

Error Handling

  • Call ID Not Found: If call_control_id doesn’t map to a call_session_id, processing is skipped
  • WebSocket Disconnect: Gracefully closes, saves audio, finalizes transcript
  • Deepgram Errors: Logged but don’t crash connection; falls back to audio-only analysis
  • Missing Audio Data: Empty payloads are skipped silently

Performance Characteristics

  • Latency: 100-300ms for partial transcripts
  • Throughput: Handles multiple concurrent calls (one WebSocket per call)
  • Buffer Size: 160ms chunks (2560 bytes @ 8kHz PCM16)
  • Memory: ~2MB per minute of audio buffered

Example Debug Output

[ws] other: connected
[ws] start sr=8000
[chunk] seq=0001 rms=0.234 voiced=1 distress=0.045
[live-partial] call=abc123... "help there's a"
[chunk] seq=0002 rms=0.412 voiced=1 distress=0.189
[live-partial] call=abc123... "help there's a fire"
[chunk] seq=0003 rms=0.587 voiced=1 distress=0.456
[live-final] call=abc123... "help there's a fire in the kitchen"
[ws] stop received; frames=245
[ws] saved data/calls/1738539845.wav  duration=39.20s  chunks=245
[ws] closed; total_frames=245

Integration with Call Lifecycle

  1. Incoming Call (call.initiated webhook)
    • System answers and calls streaming_start action
    • Telnyx opens WebSocket to /ws
  2. Active Call (call.answered webhook)
    • Audio streams via WebSocket
    • Real-time processing updates LIVE_SIGNALS
    • UI polls /api/v1/live_queue for updates
  3. Call End (call.hangup webhook)
    • WebSocket receives stop event
    • Final audio saved, transcript finalized
    • Webhook handler performs full analysis
    • Queue item created

Security Considerations

  • No Authentication: WebSocket accepts any connection (internal use only)
  • Production: Should add:
    • Token-based authentication
    • Rate limiting per IP
    • Connection timeout enforcement
    • Input validation on all events
  • Data Privacy: Audio files stored locally in data/calls/
  • Phone Numbers: Automatically masked (last 4 digits only)

Debugging

To view live calls in browser:
http://localhost:8000/debug/live_calls/
This page polls /api/v1/live_queue and displays:
  • Real-time transcript
  • Distress scores
  • Risk levels
  • Call metadata
