
Starting the Application

Development Mode

Use the provided startup script that handles both the server and ngrok tunnel:
./scripts/dev_start.sh
This script:
  1. Loads environment variables from .env
  2. Starts uvicorn server on configured port (default: 8000)
  3. Starts ngrok tunnel with dual HTTP/WebSocket endpoints
  4. Prints the public URL after tunnel establishment
  5. Shows FastAPI logs in the terminal
Script location: scripts/dev_start.sh

Manual Startup

If you need more control:
# Load environment
export $(grep -vE '^\s*#' .env | xargs)  # note: assumes values contain no spaces or quotes

# Start server
uvicorn app.main:app --host 0.0.0.0 --port ${PORT:-8000}

# In separate terminal: start ngrok
ngrok http ${PORT:-8000} --config ops/ngrok.yml

Production Mode

Production deployment currently uses the same in-memory storage as development. For production use, implement persistent storage (PostgreSQL + S3) before deploying.
# Set production environment
export APP_ENV=prod

# Run with production-grade server
gunicorn app.main:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --access-logfile - \
  --error-logfile -

Stopping the Application

Graceful Shutdown

# Press Ctrl+C in the terminal running dev_start.sh
# The script handles cleanup of both uvicorn and ngrok processes
The dev_start.sh script includes a trap handler that automatically kills ngrok when the script exits:
cleanup() {
  [[ -n "${NGROK_PID-}" ]] && kill "${NGROK_PID}" 2>/dev/null || true
}
trap cleanup EXIT INT TERM

Force Kill

If the graceful shutdown fails:
# Kill uvicorn processes
pkill -f "uvicorn app.main:app"

# Kill ngrok processes
pkill -f ngrok

Health Checks

Basic Health Check

curl http://localhost:8000/health
Expected response:
{"status": "ok"}

WebSocket Connection Test

Verify the WebSocket endpoint is accessible (wscat is available via npm):
wscat -c ws://localhost:8000/ws

External Webhook Test

Test from outside your network (requires ngrok):
curl https://your-ngrok-url.ngrok.io/health

Configuration Updates

Updating Webhook URLs

When your ngrok tunnel URL changes (which happens on every restart unless you have a paid plan):
  1. Note the new URL from dev_start.sh output:
    ngrok public URL: https://abc123.ngrok.io
    
  2. Update .env file:
    WS_PUBLIC_URL=wss://abc123.ngrok.io/ws
    HTTP_PUBLIC_URL=https://abc123.ngrok.io
    
  3. Update Telnyx portal:
    • Log in to https://portal.telnyx.com
    • Navigate to “Messaging” or “Voice” webhooks
    • Update webhook URL to: https://abc123.ngrok.io/api/v1/call/incoming
  4. Restart application:
    ./scripts/dev_start.sh
    
Webhook delivery fails silently if URLs are not updated. Always verify URLs match after tunnel restart.
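Steps 1 and 2 above can be automated. A minimal sketch of rewriting the two URL lines in .env for a new tunnel base URL (update_env_urls is a hypothetical helper, not part of the repo):

```python
def update_env_urls(env_text: str, ngrok_base: str) -> str:
    """Rewrite WS_PUBLIC_URL and HTTP_PUBLIC_URL for a new ngrok base URL.

    ngrok_base is the https:// URL printed by dev_start.sh,
    e.g. "https://abc123.ngrok.io".
    """
    host = ngrok_base.removeprefix("https://")
    lines = []
    for line in env_text.splitlines():
        if line.startswith("WS_PUBLIC_URL="):
            line = f"WS_PUBLIC_URL=wss://{host}/ws"
        elif line.startswith("HTTP_PUBLIC_URL="):
            line = f"HTTP_PUBLIC_URL=https://{host}"
        lines.append(line)
    return "\n".join(lines)
```

Writing the result back to .env and restarting the app still needs to be done by hand (or by the script).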

Updating API Keys

# Edit .env
vim .env

# Update the relevant key
OPENAI_API_KEY=sk-new-key-here

# Restart required
./scripts/dev_start.sh

Monitoring Active Calls

Live Call Dashboard

Browser-based debug view showing active calls with real-time updates:
open http://localhost:8000/debug/live_calls/
Displays:
  • Call status (RINGING, ACTIVE, ENDED)
  • Live transcript (streaming)
  • Audio distress metrics
  • Emotion classification
  • Risk level assessment
  • Service category and tags
Update frequency: 1 second

Queue Dashboard

View the dispatcher queue with priority ranking:
open http://localhost:8000/debug/list_queue/
Displays:
  • Queue status (OPEN, IN_PROGRESS, DISPATCHED, RESOLVED)
  • Risk level and score
  • Emergency category (EMS, FIRE, POLICE, OTHER)
  • Call summary
  • Timestamp

Recent Calls

Historical view of processed calls:
open http://localhost:8000/debug/list_calls/

API Operations

Query the Queue

curl http://localhost:8000/api/v1/queue | jq
Response format:
[
  {
    "id": "call_session_123",
    "risk_level": "CRITICAL",
    "risk_score": 0.87,
    "category": "EMS",
    "tags": ["CARDIAC_ARREST", "UNCONSCIOUS"],
    "emotion": "HIGHLY_DISTRESSED",
    "summary": "Caller reports person collapsed and not breathing.",
    "status": "OPEN",
    "created_at": "2026-03-03T23:45:12Z",
    "from_masked": "•••5678",
    "to": "+15551234567"
  }
]
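The from_masked field shows that caller numbers are masked to the last four digits before being exposed via the API. A hypothetical sketch of such a mask (the app's actual masking code is not shown in this doc):

```python
def mask_number(e164: str, visible: int = 4) -> str:
    """Mask a caller's number, keeping only the trailing digits visible.

    Matches the "•••5678" shape seen in the queue response.
    """
    digits = "".join(ch for ch in e164 if ch.isdigit())
    return "\u2022\u2022\u2022" + digits[-visible:]
```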

Get Call Details

curl http://localhost:8000/api/v1/calls/{call_id} | jq

Update Queue Status

Mark a call as in progress:
curl -X PATCH http://localhost:8000/api/v1/queue/{call_id}/status \
  -H "Content-Type: application/json" \
  -d '{
    "status": "IN_PROGRESS",
    "note": "Dispatching EMS unit 42"
  }'
Valid status values: OPEN, IN_PROGRESS, DISPATCHED, RESOLVED, CANCELLED
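A client-side sketch of validating a status payload before sending the PATCH, assuming only the five values listed above are valid (the server's actual validation may differ):

```python
VALID_STATUSES = {"OPEN", "IN_PROGRESS", "DISPATCHED", "RESOLVED", "CANCELLED"}

def validate_status_update(payload: dict) -> str:
    """Reject an invalid status locally instead of round-tripping a 4xx."""
    status = payload.get("status")
    if status not in VALID_STATUSES:
        raise ValueError(f"invalid status: {status!r}")
    return status
```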

Live Queue (Active Calls)

curl http://localhost:8000/api/v1/live_queue | jq
Returns calls currently streaming (not yet ended).

Data Management

Call Recordings

Recordings are saved locally during development:
# View recordings
ls -lh data/recordings/

# Location referenced in LIVE_SIGNALS dict
# wav_path: data/recordings/{call_id}.wav

Replay a Call

Re-process a saved recording through the pipeline:
python scripts/replay_call.py data/recordings/call_session_123.wav
Use cases:
  • Testing agent improvements
  • Debugging classification issues
  • Training data collection

Export Call Data

Export processed packets for analysis:
python scripts/export_packets.py --output calls.json --limit 100

In-Memory Data Limits

Development mode stores data in memory with fixed limits:
  • Recent calls: 200 (configurable via InMemoryCallStore max_recent parameter)
  • Queue items: unlimited (stored in dict)
Data is lost on application restart. For production, implement persistent storage.
Current storage backend selection (app/main.py:150-154):
if APP_ENV == AppEnv.DEV:
    CALL_STORE: CallStore = InMemoryCallStore(max_recent=200)
else:
    # For now, still in-memory; later drop in Postgres-backed class here.
    CALL_STORE: CallStore = InMemoryCallStore(max_recent=200)
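The max_recent cap works because the store keeps recent calls in a deque with maxlen, which silently drops the oldest entry once full. A small demonstration of that rotation behavior:

```python
from collections import deque

# max_recent=200 in the real store; 3 here to keep the demo short.
recent = deque(maxlen=3)
for call_id in ["call_1", "call_2", "call_3", "call_4"]:
    recent.append(call_id)

print(list(recent))  # → ['call_2', 'call_3', 'call_4'] — oldest rotated out
```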

Troubleshooting

Webhooks Not Received

Symptoms:
  • No logs when calling the Telnyx number
  • Incoming calls not appearing in queue
Diagnosis:
# 1. Check server is running
curl http://localhost:8000/health

# 2. Check ngrok tunnel
curl https://your-ngrok-url.ngrok.io/health

# 3. Verify webhook configuration
cat ops/urls.txt

# 4. Check Telnyx portal webhook settings
Solutions:
  1. Verify ngrok is running: ps aux | grep ngrok
  2. Check WS_PUBLIC_URL in .env matches current ngrok URL
  3. Update webhook URL in Telnyx portal
  4. Restart application: ./scripts/dev_start.sh

Audio Stream Not Starting

Symptoms:
  • Call answered but no transcription
  • Empty transcript_live in debug dashboard
Diagnosis:
# Check logs for streaming_start errors
grep "streaming_start" logs.txt

# Verify WS_PUBLIC_URL is set
echo $WS_PUBLIC_URL
Solutions:
  1. Ensure WS_PUBLIC_URL uses the WebSocket scheme (wss://), not https://
  2. Check ngrok WebSocket tunnel: cat ops/ngrok.yml
  3. Verify Telnyx API key has call control permissions

Transcription Failures

Symptoms:
  • transcript: null in call packets
  • No final transcript after call ends
Diagnosis:
# Check for Deepgram errors
grep "deepgram error" logs.txt

# Test API key
curl -X POST https://api.deepgram.com/v1/listen \
  -H "Authorization: Token $DEEPGRAM_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @test.wav
Solutions:
  1. Verify DEEPGRAM_API_KEY is set and valid
  2. Check Deepgram account balance/quota
  3. Ensure WAV file was saved: ls data/recordings/
  4. Check file format: 8kHz, 16-bit PCM
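A quick sanity check for step 4, using the standard-library wave module. The doc specifies only 8 kHz, 16-bit PCM; the mono-channel check here is an assumption typical of telephony audio:

```python
import wave

def check_wav_format(path: str) -> bool:
    """Return True if the file looks like 8 kHz, 16-bit mono PCM."""
    with wave.open(path, "rb") as w:
        return (
            w.getframerate() == 8000
            and w.getsampwidth() == 2  # 2 bytes per sample = 16-bit
            and w.getnchannels() == 1  # assumption: telephony audio is mono
        )
```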

High Distress False Positives

Symptoms:
  • Non-emergency calls marked CRITICAL
  • Distress scores too high
Diagnosis:
# Review audio analysis agent
cat app/agents/audio_track.py

# Check distress calculation
grep "distress_score" logs.txt | tail -20
Solutions:
  1. Review distress bucketing logic in compute_risk_level() (app/main.py:225-321)
  2. Adjust thresholds based on observed patterns
  3. Enable OpenAI emotion analysis for better accuracy: EMOTION_PROVIDER=openai
  4. Review semantic tag detection (CRITICAL_TAGS, ELEVATED_TAGS)
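To illustrate the shape of the bucketing logic being tuned: critical tags short-circuit to CRITICAL, otherwise the distress score is thresholded. The thresholds below are made up for illustration; the real logic is compute_risk_level() in app/main.py:

```python
def bucket_risk(distress_score: float, critical_tags: set, tags: set) -> str:
    """Illustrative risk bucketing only — thresholds are hypothetical."""
    if tags & critical_tags:
        return "CRITICAL"  # semantic tags override the audio score
    if distress_score >= 0.8:
        return "CRITICAL"
    if distress_score >= 0.5:
        return "ELEVATED"
    return "LOW"
```

Lowering false positives usually means raising these thresholds or demanding tag corroboration before CRITICAL.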

Summary Generation Failures

Symptoms:
  • summary: null or generic placeholder text
  • “No transcript available” for valid calls
Diagnosis:
# Check if OpenAI key is set
echo $OPENAI_API_KEY

# Look for summary errors
grep "summary" logs.txt | grep -i error
Solutions:
  1. Set OPENAI_API_KEY for GPT-powered summaries
  2. Without an OpenAI key, the system falls back to a heuristic (first-sentence extraction)
  3. Verify transcript exists before summary generation
  4. Check OpenAI account quota/balance
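A sketch of what the first-sentence fallback amounts to (the app's actual heuristic may differ in detail; the "No transcript available" placeholder matches the symptom above):

```python
import re
from typing import Optional

def heuristic_summary(transcript: Optional[str]) -> str:
    """Fallback when no OpenAI key is set: first sentence of the transcript."""
    if not transcript:
        return "No transcript available"
    # Split on sentence-ending punctuation followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", transcript.strip())[0]
```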

Memory Growth

Symptoms:
  • Application memory usage increasing over time
  • Slower response times
Diagnosis:
# Monitor process memory
ps aux | grep uvicorn

# Check in-memory storage size
curl http://localhost:8000/api/v1/calls | jq length
Solutions:
  1. In-memory store has fixed limits (200 recent calls)
  2. Old calls are automatically rotated out (deque with maxlen)
  3. For production, implement persistent storage to prevent memory issues
  4. Restart application if memory usage is concerning

ngrok Tunnel Disconnects

Symptoms:
  • Webhooks work then suddenly stop
  • ngrok process terminated
Diagnosis:
# Check ngrok process
ps aux | grep ngrok

# Check ngrok logs
tail -f ${NGROK_LOG_FILE:-/tmp/ngrok.log}
Solutions:
  1. Free ngrok tunnels disconnect after 2 hours; restart dev_start.sh
  2. Upgrade to paid ngrok plan for persistent URLs
  3. Implement automatic webhook URL update on tunnel restart
  4. Monitor ngrok API: curl http://127.0.0.1:4040/api/tunnels
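For solution 4, the ngrok local API (GET http://127.0.0.1:4040/api/tunnels) returns JSON with a "tunnels" array. A small parser sketch that extracts the public URLs from that response — a building block for automating solution 3:

```python
def public_urls(tunnels_json: dict) -> list:
    """Extract public URLs from the ngrok local API /api/tunnels response."""
    return [t["public_url"] for t in tunnels_json.get("tunnels", [])]
```

If this returns an empty list, the tunnel is down and dev_start.sh needs a restart.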

Log Analysis

Log Format

Logs are written to stdout in structured format:
# app/core/logging.py
import logging

def setup_logging():
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s"
    )

Key Log Patterns

# Call lifecycle
grep "\[telephony\]" logs.txt

# Audio streaming
grep "streaming_start" logs.txt

# Transcription
grep "\[stt\]" logs.txt

# Emotion analysis
grep "\[emotion\]" logs.txt

# Triage decisions
grep "\[triage:minimal\]" logs.txt

# Summary generation
grep "\[summary\]" logs.txt

Redirecting Logs

# Save logs to file
./scripts/dev_start.sh 2>&1 | tee logs.txt

# Filter errors only
./scripts/dev_start.sh 2>&1 | grep -i error

# JSON formatted logs (future)
# Configure structured logging in app/core/logging.py
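Beyond grep, a quick way to get an overview of a captured logs.txt is to count the bracketed subsystem tags. A small sketch (tag_counts is a hypothetical helper, not part of the repo):

```python
import re
from collections import Counter

def tag_counts(log_text: str) -> Counter:
    """Count bracketed subsystem tags like [stt] or [triage:minimal]."""
    return Counter(re.findall(r"\[[a-z_:]+\]", log_text))
```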

Emergency Procedures

System Unresponsive

  1. Check if process is running:
    ps aux | grep uvicorn
    
  2. Check port availability:
    lsof -i :8000
    
  3. Force restart:
    pkill -f "uvicorn app.main:app"
    ./scripts/dev_start.sh
    

Data Loss Prevention

In development mode, ALL DATA IS LOST on restart. For production:
  1. Implement PostgreSQL backend for queue persistence
  2. Implement S3 storage for call recordings
  3. Enable database backups

Critical Call Missed

If a high-priority call was not properly triaged:
  1. Find the call recording:
    ls -lt data/recordings/ | head
    
  2. Replay through pipeline:
    python scripts/replay_call.py data/recordings/call_session_XYZ.wav
    
  3. Manually review classification:
    curl http://localhost:8000/api/v1/calls/call_session_XYZ | jq .risk
    
  4. Report issue for agent tuning

Maintenance

Dependencies

Update Python packages:
pip install -U pip
pip install -r requirements.txt --upgrade

ngrok Configuration

The ops/ngrok.yml file configures dual tunnels:
version: "2"
tunnels:
  api:
    addr: 8000
    proto: http
    inspect: false
  ws:
    addr: 8000
    proto: http
    inspect: false
Both tunnels point to the same port: Telnyx delivers webhooks over plain HTTP and streams audio via a WebSocket upgrade on the same server.

Clearing In-Memory State

Simply restart the application:
# All queues and calls cleared
./scripts/dev_start.sh

Performance Tuning

Worker Processes

For production, adjust based on CPU cores:
# General rule: (2 x cores) + 1
gunicorn app.main:app \
  --workers $((2 * $(nproc) + 1)) \
  --worker-class uvicorn.workers.UvicornWorker
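The same (2 x cores) + 1 rule of thumb, computed in Python (e.g. to feed into a deployment template):

```python
import os

def recommended_workers() -> int:
    """Gunicorn worker count per the (2 x cores) + 1 rule of thumb."""
    return 2 * (os.cpu_count() or 1) + 1
```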

Request Timeouts

HTTP client timeouts are hardcoded in app/main.py:
# Telnyx API calls
async with httpx.AsyncClient(timeout=8) as c:

# Deepgram transcription
async with httpx.AsyncClient(timeout=30) as c:

# OpenAI/Deepgram emotion
async with httpx.AsyncClient(timeout=10) as c:
Adjust based on observed latency.

Database Connection Pool

When PostgreSQL is implemented, configure connection pooling:
# Future: app/core/database.py
engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,
    max_overflow=10,
    pool_timeout=30,
)
