Monitoring & Observability

Overview

Sonore Phone Agent includes built-in logging and metrics systems for monitoring application health, call activity, and system performance. This guide covers the logging infrastructure and available metrics.

Logging System

Architecture

The logging system is implemented in src/core/logger.py and provides:

Structured JSON logging for easy parsing and analysis
Context variables for request tracing (tenant_id, call_id)
Configurable log levels via environment variable
Event-based logging with custom fields

Configuration

Configure logging via environment variables:

# Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
LOG_LEVEL=INFO

# In .env file
LOG_LEVEL=WARNING

The application validates and applies the log level at startup (source/src/core/logger.py:73-74).
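As a rough illustration of startup validation, the sketch below reads LOG_LEVEL and maps it to a numeric logging level. The function name resolve_log_level and the fall-back-to-default behaviour are assumptions made for this example, not the code in src/core/logger.py.

```python
import logging
import os

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def resolve_log_level(default: str = "INFO") -> int:
    """Read LOG_LEVEL from the environment and return the numeric level."""
    raw = os.environ.get("LOG_LEVEL", default).strip().upper()
    if raw not in VALID_LEVELS:
        # Assumption: unknown values fall back to the default instead of raising.
        raw = default
    return getattr(logging, raw)
```

Accepting lower-case input and falling back on unknown values keeps a typo in .env from stopping the service; a stricter deployment could raise instead.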

Log Format

All logs are emitted as JSON objects with the following structure:

{
  "ts": "2026-03-02T14:30:00",
  "level": "INFO",
  "logger": "app",
  "msg": "Call accepted",
  "tenant_id": "tenant_123",
  "call_id": "call_456",
  "event": "call.accepted",
  "duration_ms": 1250
}

Standard fields (source/src/core/logger.py:26-33):

ts - Timestamp in ISO 8601 format (UTC)
level - Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
logger - Logger name (typically “app”)
msg - Human-readable message
tenant_id - Tenant identifier from context
call_id - Call identifier from context

Additional fields can be passed via the extra parameter and are automatically included in the JSON output (source/src/core/logger.py:36-64).
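One common way to merge extra fields into JSON output is a custom logging.Formatter that copies every non-standard attribute from the log record. The sketch below is a minimal version under that assumption; the class name JsonFormatter is hypothetical, the tenant_id and call_id context fields are left out for brevity, and the timestamp here carries an explicit +00:00 offset, which may differ from the project's own format.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord carries; anything else arrived through `extra`.
_RESERVED = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: standard fields first, then any extra fields."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "ts": ts.isoformat(timespec="seconds"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        for key, value in record.__dict__.items():
            if key not in _RESERVED:
                payload[key] = value
        # default=str keeps non-serializable extras from breaking the log line.
        return json.dumps(payload, default=str)
```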

Context Variables

The logging system uses context variables to automatically include tenant and call IDs in all log messages:

import logging

from src.core.logger import tenant_id_var, call_id_var

# Assumes the application logger is named "app", as in the sample log output above
logger = logging.getLogger("app")

# Set context for current request
tenant_id_var.set("tenant_123")
call_id_var.set("call_456")

# All subsequent logs will include these IDs
logger.info("Processing call")  # Automatically includes tenant_id and call_id

Context variables are implemented using contextvars for async-safe request tracking (source/src/core/logger.py:9-14).
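To see why contextvars suits async request tracking, the following self-contained sketch runs two simulated calls concurrently. The two variables are local stand-ins for the ones exported by the logger module, and handle_call is a hypothetical handler.

```python
import asyncio
import contextvars

# Stand-ins for the variables exported by src/core/logger.py.
tenant_id_var = contextvars.ContextVar("tenant_id", default=None)
call_id_var = contextvars.ContextVar("call_id", default=None)


async def handle_call(tenant_id: str, call_id: str) -> tuple:
    tenant_id_var.set(tenant_id)
    call_id_var.set(call_id)
    await asyncio.sleep(0)  # Yield so the other task runs in between.
    return tenant_id_var.get(), call_id_var.get()


async def main() -> list:
    # gather() wraps each coroutine in a task with its own copy of the context.
    return await asyncio.gather(
        handle_call("tenant_a", "call_1"),
        handle_call("tenant_b", "call_2"),
    )


results = asyncio.run(main())
```

Each task reads back the IDs it set, even though both tasks interleave on the same event loop.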

Event Logging

Use the log_event helper for structured event logging:

from src.core.logger import log_event
import logging

# Log an event with custom fields
log_event(
    level=logging.INFO,
    event="call.completed",
    msg="Call completed successfully",
    duration_seconds=120.5,
    end_reason="hangup"
)

This adds an event field to the log output (source/src/core/logger.py:97-98).
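A helper with this call signature can be a thin wrapper over the standard extra mechanism. The sketch below is an assumption about how such a helper might look, not the project's implementation.

```python
import logging

logger = logging.getLogger("app")


def log_event(level: int, event: str, msg: str, **fields) -> None:
    """Log `msg` with an `event` field plus any custom fields."""
    # Keys in `extra` become attributes on the LogRecord, where a JSON
    # formatter can pick them up.
    logger.log(level, msg, extra={"event": event, **fields})
```

With this approach, custom field names must not collide with built-in LogRecord attributes such as name or message; the standard library raises KeyError when they do.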

Third-Party Log Filtering

The logger automatically reduces noise from third-party libraries (source/src/core/logger.py:86-89):

httpx - Set to WARNING level
websockets - Set to WARNING level
uvicorn.access - Set to WARNING level
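Raising the level of a named logger is enough to achieve this kind of filtering. The sketch below shows the general technique with the three logger names listed above; the function name is hypothetical.

```python
import logging

NOISY_LOGGERS = ("httpx", "websockets", "uvicorn.access")


def quiet_third_party_loggers(level: int = logging.WARNING) -> None:
    """Raise the level of chatty library loggers so only warnings and up pass."""
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(level)
```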

Metrics System

LiveMetricsStore

The LiveMetricsStore class provides real-time metrics tracking for call activity (source/src/apps/calls/metrics/live_store.py:9-238). Metrics tracked per tenant and globally:

active_calls (int) - Number of currently active calls
accepted_calls (int) - Total calls accepted by the system
rejected_calls_capacity (int) - Calls rejected due to capacity limits
rejected_calls_tenant_not_configured (int) - Calls rejected because tenant is not configured
rejected_calls_instructions_missing (int) - Calls rejected due to missing instructions
instructions_db_errors (int) - Database errors when fetching instructions
fallback_instructions_used (int) - Number of times fallback instructions were used
started_calls (int) - Total calls that started successfully
ended_calls (int) - Total calls that have ended
failed_calls (int) - Calls that ended with an error
referred_calls (int) - Calls that were referred to another destination
minutes_processed (float) - Total minutes of call time processed

Metric Storage

Metrics are stored in-memory with:

Per-tenant tracking - Isolated metrics for each tenant
Global aggregation - System-wide metrics under the __global__ key (source/src/apps/calls/metrics/live_store.py:6)
Concurrency-safe operations - All metric updates use async locks, so concurrent coroutines cannot interleave updates (source/src/apps/calls/metrics/live_store.py:19)

Call Gates

The system uses “call gates” to prevent duplicate metric recording (source/src/apps/calls/metrics/live_store.py:8-13):

class CallGatesMetrics:
    accepted: bool = False
    started: bool = False
    ended: bool = False
    ended_at: float | None = None

Each call is tracked by call_id to ensure metrics are only incremented once per state transition.
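The interplay of per-tenant counters, the global key, the async lock, and call gates can be sketched in a few lines. MiniMetricsStore and Gate below are simplified, hypothetical stand-ins for LiveMetricsStore and CallGatesMetrics, tracking only accepted_calls.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass

GLOBAL_KEY = "__global__"


@dataclass
class Gate:
    accepted: bool = False


class MiniMetricsStore:
    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._counters = defaultdict(lambda: defaultdict(int))
        self._gates = {}

    async def record_accept(self, tenant_id: str, call_id: str) -> None:
        async with self._lock:
            gate = self._gates.setdefault(call_id, Gate())
            if gate.accepted:
                return  # Already counted for this call_id.
            gate.accepted = True
            for key in (tenant_id, GLOBAL_KEY):
                self._counters[key]["accepted_calls"] += 1

    async def snapshot(self, tenant_id=None) -> dict:
        async with self._lock:
            return dict(self._counters[tenant_id or GLOBAL_KEY])


async def demo() -> tuple:
    store = MiniMetricsStore()
    await store.record_accept("tenant_123", "call_456")
    await store.record_accept("tenant_123", "call_456")  # Duplicate, ignored.
    await store.record_accept("tenant_123", "call_789")
    return await store.snapshot("tenant_123"), await store.snapshot()


tenant_view, global_view = asyncio.run(demo())
```

The duplicate record_accept for call_456 leaves both the tenant and the global counter at 2, which is the behaviour the gate exists to guarantee.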

Metric Operations

Recording call acceptance

await metrics_store.record_accept(
    tenant_id="tenant_123",
    call_id="call_456"
)

Increments accepted_calls for the tenant and globally (source/src/apps/calls/metrics/live_store.py:30-48).

Recording call start

await metrics_store.record_start(
    tenant_id="tenant_123",
    call_id="call_456"
)

Increments started_calls and active_calls (source/src/apps/calls/metrics/live_store.py:50-71).

Recording call end

from src.models.metrics.store import EndReason

await metrics_store.record_end(
    tenant_id="tenant_123",
    call_id="call_456",
    end_reason=EndReason.COMPLETED
)

Increments ended_calls, decrements active_calls, and updates reason-specific counters (source/src/apps/calls/metrics/live_store.py:165-196).

End reasons (source/src/models/metrics/store.py:15-19):

COMPLETED - Call completed normally
HANGUP - User hung up
REFERRED - Call transferred (increments referred_calls)
ERROR - Call failed (increments failed_calls)
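The mapping from end reason to counters described above can be pictured as a small lookup table. In this sketch the enum is a local stand-in whose member names come from the list above and whose string values are assumptions; apply_end and the clamping of active_calls at zero are also hypothetical.

```python
from enum import Enum


class EndReason(str, Enum):
    # Member names follow the documented list; the values are assumed.
    COMPLETED = "completed"
    HANGUP = "hangup"
    REFERRED = "referred"
    ERROR = "error"


# Only some reasons touch an additional counter.
REASON_COUNTERS = {
    EndReason.REFERRED: "referred_calls",
    EndReason.ERROR: "failed_calls",
}


def apply_end(counters: dict, reason: EndReason) -> dict:
    """Return a copy of `counters` updated for one ended call."""
    updated = dict(counters)
    updated["ended_calls"] = updated.get("ended_calls", 0) + 1
    # Assumption: never let the active count drop below zero.
    updated["active_calls"] = max(updated.get("active_calls", 0) - 1, 0)
    extra = REASON_COUNTERS.get(reason)
    if extra:
        updated[extra] = updated.get(extra, 0) + 1
    return updated
```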

Recording rejections

# Capacity limit reached
await metrics_store.record_reject_capacity(
    call_id="call_456",
    tenant_id="tenant_123"
)

# Tenant not configured
await metrics_store.record_reject_tenant_not_configured(
    call_id="call_456",
    tenant_id="tenant_123"
)

# Instructions missing
await metrics_store.record_reject_instructions_missing(
    call_id="call_456",
    tenant_id="tenant_123"
)

Recording usage

# Record call duration in minutes
await metrics_store.record_minutes_processed(
    tenant_id="tenant_123",
    minutes=2.5
)

Accumulates total processing time (source/src/apps/calls/metrics/live_store.py:198-208).

Retrieving Metrics

Get a snapshot of current metrics:

# Global metrics
global_metrics = await metrics_store.snapshot(tenant_id=None)

# Tenant-specific metrics
tenant_metrics = await metrics_store.snapshot(tenant_id="tenant_123")

Returns a dictionary with all metric values (source/src/apps/calls/metrics/live_store.py:210-218).

Metrics Cleanup

The system automatically prunes old call gates to prevent memory growth:

# Prune call gates older than 5 minutes
await metrics_store.prune_call_gates_older_than(age_seconds=300)

This background task runs automatically via the application lifespan (source/src/apps/calls/main.py:19-27):

async def prune_loop(
    store: LiveMetricsStore,
    interval: float = 60.0,  # Check every 60 seconds
    expiry: float = 300.0,   # Delete gates older than 5 minutes
) -> None:
    ...  # Body omitted here; see the source file
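A periodic pruning task of this shape is usually a sleep-then-prune loop that is started on application startup and cancelled on shutdown. The sketch below shows that pattern under stated assumptions: the loop body is a guess at the behaviour, and pruning_lifespan takes the store directly instead of a FastAPI app so that it runs standalone.

```python
import asyncio
import contextlib


async def prune_loop(store, interval: float = 60.0, expiry: float = 300.0) -> None:
    """Assumed loop body: wait, prune expired gates, repeat until cancelled."""
    while True:
        await asyncio.sleep(interval)
        await store.prune_call_gates_older_than(age_seconds=expiry)


@contextlib.asynccontextmanager
async def pruning_lifespan(store, interval: float = 60.0, expiry: float = 300.0):
    """Hypothetical lifespan helper: run the loop while the app is up."""
    task = asyncio.create_task(prune_loop(store, interval, expiry))
    try:
        yield
    finally:
        # Stop the background task cleanly on shutdown.
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

Cancelling and then awaiting the task on exit avoids "task was destroyed but it is pending" warnings at shutdown.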

Integration Examples

Exporting to Prometheus

Create a metrics endpoint for Prometheus scraping:

from fastapi import APIRouter, Request, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Gauge, generate_latest

router = APIRouter()

# Define Prometheus metrics
active_calls = Gauge('active_calls', 'Number of active calls', ['tenant'])
accepted_calls = Counter('accepted_calls_total', 'Total accepted calls', ['tenant'])
failed_calls = Counter('failed_calls_total', 'Total failed calls', ['tenant'])

@router.get("/metrics")
async def metrics_endpoint(request: Request) -> Response:
    metrics_store = request.app.state.metrics_store

    # Get global metrics
    global_metrics = await metrics_store.snapshot(tenant_id=None)

    # Update Prometheus metrics
    active_calls.labels(tenant='global').set(global_metrics['active_calls'])
    # Counter has no public set(); copying the store's cumulative totals relies on
    # the private _value attribute, which may change between prometheus_client releases
    accepted_calls.labels(tenant='global')._value.set(global_metrics['accepted_calls'])
    failed_calls.labels(tenant='global')._value.set(global_metrics['failed_calls'])

    # Return the text exposition format with the content type Prometheus expects
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

Logging to CloudWatch

The JSON log output can be shipped to AWS CloudWatch Logs with the watchtower handler:

import watchtower

from src.core.logger import setup_logging

# Add CloudWatch handler (keyword names are those of watchtower 2.x;
# older releases used log_group and stream_name)
logger = setup_logging()
cloudwatch_handler = watchtower.CloudWatchLogHandler(
    log_group_name='/aws/ecs/sonore-phone-agent',
    log_stream_name='calls-service'
)
logger.addHandler(cloudwatch_handler)

Sending to Datadog

from datadog import initialize, statsd

# Initialize Datadog
options = {
    'api_key': 'YOUR_API_KEY',
    'app_key': 'YOUR_APP_KEY'
}
initialize(**options)

# Send metrics
async def report_metrics_to_datadog(metrics_store):
    metrics = await metrics_store.snapshot(tenant_id=None)

    statsd.gauge('sonore.calls.active', metrics['active_calls'])
    # Snapshot values are cumulative totals, so report them as gauges;
    # statsd.increment() would add the full total again on every report
    statsd.gauge('sonore.calls.accepted', metrics['accepted_calls'])
    statsd.gauge('sonore.calls.failed', metrics['failed_calls'])
    statsd.gauge('sonore.calls.minutes_processed', metrics['minutes_processed'])

Health Checks

The application provides a basic health check endpoint (source/src/apps/calls/main.py:113-115):

GET /health

Returns:

{
  "status": "ok"
}

Enhanced Health Check

Create a more comprehensive health check that includes system status:

from fastapi import Request

# `app` is the existing FastAPI application instance
@app.get("/health/detailed")
async def detailed_health(request: Request) -> dict:
    metrics_store = request.app.state.metrics_store
    mongo_client = request.app.state.mongo_client

    # Check MongoDB (awaiting the command assumes an async MongoDB client)
    try:
        await mongo_client.admin.command('ping')
        db_status = "healthy"
    except Exception as e:
        db_status = f"unhealthy: {e}"

    # Get metrics
    metrics = await metrics_store.snapshot(tenant_id=None)

    return {
        "status": "ok" if db_status == "healthy" else "degraded",
        "database": db_status,
        "active_calls": metrics['active_calls'],
        "total_calls": metrics['accepted_calls']
    }

Alerting Strategies

High Error Rate

Alert when failed_calls / ended_calls > 5%

error_rate = failed_calls / max(ended_calls, 1)
if error_rate > 0.05:
    send_alert("High error rate detected")

Capacity Issues

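Alert when rejected_calls_capacity increases between checks

# The counter is cumulative, so compare it with the value from the previous check
if rejected_calls_capacity > previous_rejected_calls_capacity:
    send_alert("Capacity limit reached")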

Database Errors

Alert on instructions_db_errors > 0

if instructions_db_errors > 0:
    send_alert("Database connectivity issues")

Long-Running Calls

Monitor active_calls that don’t decrease

if active_calls > expected and duration > threshold:
    send_alert("Possible stuck calls")
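The rules above can be combined into one function that compares two consecutive snapshots. This is a sketch under assumptions: evaluate_alerts is a hypothetical name, the inputs are plain dictionaries shaped like the snapshot output, and the database rule here fires on new errors since the previous check instead of on any non-zero total. The long-running-calls rule is left out because it needs per-call durations that a snapshot does not carry.

```python
def evaluate_alerts(current: dict, previous: dict, max_error_rate: float = 0.05) -> list:
    """Return alert messages for the rules that fire between two snapshots."""
    alerts = []

    # High error rate: share of ended calls that failed.
    ended = current.get("ended_calls", 0)
    failed = current.get("failed_calls", 0)
    if failed / max(ended, 1) > max_error_rate:
        alerts.append("High error rate detected")

    # Counters are cumulative, so an increase means new events since last check.
    if current.get("rejected_calls_capacity", 0) > previous.get("rejected_calls_capacity", 0):
        alerts.append("Capacity limit reached")
    if current.get("instructions_db_errors", 0) > previous.get("instructions_db_errors", 0):
        alerts.append("Database connectivity issues")

    return alerts
```

Returning messages instead of sending them keeps the rules easy to test and leaves delivery (email, pager, chat) to the caller.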

Best Practices

Log aggregation

Use centralized logging (ELK, Splunk, CloudWatch)
Parse JSON logs for filtering and analysis
Set up log retention policies
Index by tenant_id and call_id for tracing

Metrics retention

Export metrics to time-series database (Prometheus, InfluxDB)
Keep in-memory metrics for real-time dashboards
Archive historical metrics for trend analysis
Set up automated reporting

Performance monitoring

Track active_calls for load balancing
Monitor minutes_processed for billing
Watch rejected_calls_capacity to scale resources
Analyze failed_calls for reliability improvements

Debugging

Use call_id to trace entire call lifecycle
Filter logs by tenant_id for tenant-specific issues
Set LOG_LEVEL=DEBUG temporarily for troubleshooting
Correlate metrics with log events
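Because every log line is a JSON object, tracing one call comes down to parsing lines and matching fields. The helper below is a minimal sketch of that idea; filter_logs is a hypothetical name and works on any iterable of log lines, such as an open file.

```python
import json


def filter_logs(lines, **criteria):
    """Yield parsed JSON log entries whose fields match all `criteria`."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip non-JSON lines such as stack traces.
        if isinstance(entry, dict) and all(
            entry.get(key) == value for key, value in criteria.items()
        ):
            yield entry
```

For example, list(filter_logs(open("app.log"), call_id="call_456")) would collect every entry for one call, assuming the logs were written to a file named app.log.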

Next Steps

Installation

Set up the application from scratch

API Reference

Explore available endpoints
