
Overview

Sonore Phone Agent includes built-in logging and metrics systems for monitoring application health, call activity, and system performance. This guide covers the logging infrastructure and available metrics.

Logging System

Architecture

The logging system is implemented in src/core/logger.py and provides:
  • Structured JSON logging for easy parsing and analysis
  • Context variables for request tracing (tenant_id, call_id)
  • Configurable log levels via environment variable
  • Event-based logging with custom fields

Configuration

Configure logging via environment variables:
# Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
LOG_LEVEL=INFO

# In .env file
LOG_LEVEL=WARNING
The application validates and applies the log level at startup (source/src/core/logger.py:73-74).

Log Format

All logs are emitted as JSON objects with the following structure:
{
  "ts": "2026-03-02T14:30:00",
  "level": "INFO",
  "logger": "app",
  "msg": "Call accepted",
  "tenant_id": "tenant_123",
  "call_id": "call_456",
  "event": "call.accepted",
  "duration_ms": 1250
}
Standard fields (source/src/core/logger.py:26-33):
  • ts - Timestamp in ISO 8601 format (UTC)
  • level - Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  • logger - Logger name (typically “app”)
  • msg - Human-readable message
  • tenant_id - Tenant identifier from context
  • call_id - Call identifier from context
Additional fields can be passed via the extra parameter and are automatically included in the JSON output (source/src/core/logger.py:36-64).
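The way extra fields end up in the JSON output can be sketched with a small standard-library formatter. This is an illustrative stand-in, not the formatter in logger.py: the JsonFormatter name is hypothetical, tenant_id and call_id handling is left out, and the timestamp here carries an explicit +00:00 offset, which may differ from the real output.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes present on every LogRecord; anything else came from `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that mirrors the field layout shown above."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(
                timespec="seconds"
            ),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Copy fields passed through `extra=` into the JSON object.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-serializable values from raising.
        return json.dumps(payload, default=str)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("app.example")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("Call accepted", extra={"event": "call.accepted", "duration_ms": 1250})
```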

Context Variables

The logging system uses context variables to automatically include tenant and call IDs in all log messages:
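import logging

from src.core.logger import tenant_id_var, call_id_var

# Assumes the application logger is named "app", as in the log format above
logger = logging.getLogger("app")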

# Set context for current request
tenant_id_var.set("tenant_123")
call_id_var.set("call_456")

# All subsequent logs will include these IDs
logger.info("Processing call")  # Automatically includes tenant_id and call_id
Context variables are implemented using contextvars for async-safe request tracking (source/src/core/logger.py:9-14).

Event Logging

Use the log_event helper for structured event logging:
from src.core.logger import log_event
import logging

# Log an event with custom fields
log_event(
    level=logging.INFO,
    event="call.completed",
    msg="Call completed successfully",
    duration_seconds=120.5,
    end_reason="hangup"
)
This adds an event field to the log output (source/src/core/logger.py:97-98).
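A helper of this shape can be approximated in a few lines with the standard logging module. The sketch below is an assumption about how such a helper might work, not the implementation in logger.py; the logger name and the ListHandler class exist only for the example.

```python
import logging

logger = logging.getLogger("app.events")  # hypothetical logger name


def log_event(level: int, event: str, msg: str, **fields: object) -> None:
    """Sketch of an event helper: forwards custom fields through `extra`."""
    # logging raises KeyError if an extra key clashes with a built-in
    # LogRecord attribute such as "name" or "message".
    logger.log(level, msg, extra={"event": event, **fields})


class ListHandler(logging.Handler):
    """Collects records in memory so the example can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


if __name__ == "__main__":
    handler = ListHandler()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    log_event(logging.INFO, "call.completed", "Call completed", end_reason="hangup")
    print(handler.records[0].event, handler.records[0].end_reason)
```

Because the custom fields become attributes on the LogRecord, a JSON formatter can pick them up without the helper knowing anything about the output format.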

Third-Party Log Filtering

The logger automatically reduces noise from third-party libraries (source/src/core/logger.py:86-89):
  • httpx - Set to WARNING level
  • websockets - Set to WARNING level
  • uvicorn.access - Set to WARNING level
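The effect of this filtering can be reproduced with the standard logging API. The function name below is hypothetical; only the three logger names and the WARNING level come from the list above.

```python
import logging

NOISY_LOGGERS = ("httpx", "websockets", "uvicorn.access")


def quiet_third_party_loggers(level: int = logging.WARNING) -> None:
    """Raise the threshold of chatty library loggers (illustrative helper)."""
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(level)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    quiet_third_party_loggers()
    logging.getLogger("httpx").info("suppressed")      # below WARNING, dropped
    logging.getLogger("httpx").warning("still shown")  # passes the threshold
```

Setting the level on the named logger leaves the application's own LOG_LEVEL untouched, so DEBUG output from application code stays available while library chatter is suppressed.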

Metrics System

LiveMetricsStore

The LiveMetricsStore class provides real-time metrics tracking for call activity (source/src/apps/calls/metrics/live_store.py:9-238). The following metrics are tracked per tenant and globally:
  • active_calls (int) - Number of currently active calls
  • accepted_calls (int) - Total calls accepted by the system
  • rejected_calls_capacity (int) - Calls rejected due to capacity limits
  • rejected_calls_tenant_not_configured (int) - Calls rejected because the tenant is not configured
  • rejected_calls_instructions_missing (int) - Calls rejected due to missing instructions
  • instructions_db_errors (int) - Database errors when fetching instructions
  • fallback_instructions_used (int) - Number of times fallback instructions were used
  • started_calls (int) - Total calls that started successfully
  • ended_calls (int) - Total calls that have ended
  • failed_calls (int) - Calls that ended with an error
  • referred_calls (int) - Calls that were referred to another destination
  • minutes_processed (float) - Total minutes of call time processed

Metric Storage

Metrics are stored in-memory with:
  • Per-tenant tracking - Isolated metrics for each tenant
  • Global aggregation - System-wide metrics under the __global__ key (source/src/apps/calls/metrics/live_store.py:6)
  • Concurrency-safe operations - All metric updates are guarded by an async lock, which serializes updates from concurrent tasks on the same event loop (source/src/apps/calls/metrics/live_store.py:19)

Call Gates

The system uses “call gates” to prevent duplicate metric recording (source/src/apps/calls/metrics/live_store.py:8-13):
class CallGatesMetrics:
    accepted: bool = False
    started: bool = False
    ended: bool = False
    ended_at: float | None = None
Each call is tracked by call_id to ensure metrics are only incremented once per state transition.

Metric Operations

await metrics_store.record_accept(
    tenant_id="tenant_123",
    call_id="call_456"
)
Increments accepted_calls for the tenant and globally (source/src/apps/calls/metrics/live_store.py:30-48).
await metrics_store.record_start(
    tenant_id="tenant_123",
    call_id="call_456"
)
Increments started_calls and active_calls (source/src/apps/calls/metrics/live_store.py:50-71).
from src.models.metrics.store import EndReason

await metrics_store.record_end(
    tenant_id="tenant_123",
    call_id="call_456",
    end_reason=EndReason.COMPLETED
)
Increments ended_calls, decrements active_calls, and updates reason-specific counters (source/src/apps/calls/metrics/live_store.py:165-196).
End reasons (source/src/models/metrics/store.py:15-19):
  • COMPLETED - Call completed normally
  • HANGUP - User hung up
  • REFERRED - Call transferred (increments referred_calls)
  • ERROR - Call failed (increments failed_calls)
# Capacity limit reached
await metrics_store.record_reject_capacity(
    call_id="call_456",
    tenant_id="tenant_123"
)

# Tenant not configured
await metrics_store.record_reject_tenant_not_configured(
    call_id="call_456",
    tenant_id="tenant_123"
)

# Instructions missing
await metrics_store.record_reject_instructions_missing(
    call_id="call_456",
    tenant_id="tenant_123"
)
# Record call duration in minutes
await metrics_store.record_minutes_processed(
    tenant_id="tenant_123",
    minutes=2.5
)
Accumulates total processing time (source/src/apps/calls/metrics/live_store.py:198-208).

Retrieving Metrics

Get a snapshot of current metrics:
# Global metrics
global_metrics = await metrics_store.snapshot(tenant_id=None)

# Tenant-specific metrics
tenant_metrics = await metrics_store.snapshot(tenant_id="tenant_123")
Returns a dictionary with all metric values (source/src/apps/calls/metrics/live_store.py:210-218).

Metrics Cleanup

The system automatically prunes old call gates to prevent memory growth:
# Prune call gates older than 5 minutes
await metrics_store.prune_call_gates_older_than(age_seconds=300)
This background task runs automatically via the application lifespan (source/src/apps/calls/main.py:19-27):
async def prune_loop(
    store: LiveMetricsStore,
    interval: float = 60.0,  # Check every 60 seconds
    expiry: float = 300.0,   # Delete gates older than 5 minutes
) -> None:
    ...  # body omitted here; see source/src/apps/calls/main.py
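How a loop like this is typically tied to application startup and shutdown can be sketched with asyncio alone. Everything below is an assumption for illustration: StubStore replaces LiveMetricsStore, the loop body is a guess based on the signature and comments above, and this lifespan takes the store directly, whereas a FastAPI lifespan receives the app.

```python
import asyncio
import contextlib


class StubStore:
    """Stand-in for LiveMetricsStore that only counts prune calls."""

    def __init__(self) -> None:
        self.prune_calls = 0

    async def prune_call_gates_older_than(self, age_seconds: float) -> None:
        self.prune_calls += 1


async def prune_loop(
    store: StubStore, interval: float = 60.0, expiry: float = 300.0
) -> None:
    # Assumed body: sleep, prune, repeat until the task is cancelled.
    while True:
        await asyncio.sleep(interval)
        await store.prune_call_gates_older_than(age_seconds=expiry)


@contextlib.asynccontextmanager
async def lifespan(store: StubStore, interval: float = 60.0):
    """Start the pruner on entry and cancel it cleanly on exit."""
    task = asyncio.create_task(prune_loop(store, interval=interval))
    try:
        yield
    finally:
        task.cancel()
        # Awaiting the cancelled task ensures it has fully stopped.
        with contextlib.suppress(asyncio.CancelledError):
            await task


async def demo() -> int:
    store = StubStore()
    async with lifespan(store, interval=0.01):
        await asyncio.sleep(0.1)  # the application would serve requests here
    return store.prune_calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Cancelling and then awaiting the task in the finally block avoids "task was destroyed but it is pending" warnings at shutdown.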

Integration Examples

Exporting to Prometheus

Create a metrics endpoint for Prometheus scraping:
from fastapi import APIRouter, Request, Response
from prometheus_client import CONTENT_TYPE_LATEST, Gauge, generate_latest

router = APIRouter()

# Define Prometheus metrics. The store already holds cumulative totals and
# Counter has no public set() method, so the totals are mirrored as gauges.
active_calls = Gauge('active_calls', 'Number of active calls', ['tenant'])
accepted_calls = Gauge('accepted_calls_total', 'Total accepted calls', ['tenant'])
failed_calls = Gauge('failed_calls_total', 'Total failed calls', ['tenant'])

@router.get("/metrics")
async def metrics_endpoint(request: Request) -> Response:
    metrics_store = request.app.state.metrics_store

    # Get global metrics
    global_metrics = await metrics_store.snapshot(tenant_id=None)

    # Update Prometheus metrics
    active_calls.labels(tenant='global').set(global_metrics['active_calls'])
    accepted_calls.labels(tenant='global').set(global_metrics['accepted_calls'])
    failed_calls.labels(tenant='global').set(global_metrics['failed_calls'])

    # Return the text exposition format with the content type Prometheus expects
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

Logging to CloudWatch

The JSON log format integrates seamlessly with AWS CloudWatch Logs:
import watchtower
import logging

from src.core.logger import setup_logging

# Add CloudWatch handler (parameter names as of watchtower 2.0;
# older releases used log_group= and stream_name=)
logger = setup_logging()
cloudwatch_handler = watchtower.CloudWatchLogHandler(
    log_group_name='/aws/ecs/sonore-phone-agent',
    log_stream_name='calls-service'
)
logger.addHandler(cloudwatch_handler)

Sending to Datadog

from datadog import initialize, statsd

# Initialize Datadog
options = {
    'api_key': 'YOUR_API_KEY',
    'app_key': 'YOUR_APP_KEY'
}
initialize(**options)

# Send metrics
async def report_metrics_to_datadog(metrics_store):
    metrics = await metrics_store.snapshot(tenant_id=None)
    
    # Snapshot values are cumulative totals, so report them as gauges;
    # statsd.increment() would re-add the full total on every report.
    statsd.gauge('sonore.calls.active', metrics['active_calls'])
    statsd.gauge('sonore.calls.accepted', metrics['accepted_calls'])
    statsd.gauge('sonore.calls.failed', metrics['failed_calls'])
    statsd.gauge('sonore.calls.minutes_processed', metrics['minutes_processed'])

Health Checks

The application provides a basic health check endpoint (source/src/apps/calls/main.py:113-115):
GET /health
Returns:
{
  "status": "ok"
}

Enhanced Health Check

Create a more comprehensive health check that includes system status:
@app.get("/health/detailed")
async def detailed_health(request: Request) -> dict:
    metrics_store = request.app.state.metrics_store
    mongo_client = request.app.state.mongo_client
    
    # Check MongoDB
    try:
        await mongo_client.admin.command('ping')
        db_status = "healthy"
    except Exception as e:
        db_status = f"unhealthy: {str(e)}"
    
    # Get metrics
    metrics = await metrics_store.snapshot(tenant_id=None)
    
    return {
        "status": "ok" if db_status == "healthy" else "degraded",
        "database": db_status,
        "active_calls": metrics['active_calls'],
        "total_calls": metrics['accepted_calls']
    }

Alerting Strategies

High Error Rate

Alert when failed_calls / ended_calls > 5%
error_rate = failed_calls / max(ended_calls, 1)
if error_rate > 0.05:
    send_alert("High error rate detected")

Capacity Issues

Alert when rejected_calls_capacity increases between two checks
if rejected_calls_capacity > previous_rejected_calls_capacity:
    send_alert("Capacity limit reached")

Database Errors

Alert on instructions_db_errors > 0
if instructions_db_errors > 0:
    send_alert("Database connectivity issues")

Long-Running Calls

Alert when active_calls stays above the expected level for longer than a threshold
if active_calls > expected and seconds_above_expected > threshold:
    send_alert("Possible stuck calls")
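Because the counters in a snapshot are cumulative, alert rules are usually easier to reason about when they compare two snapshots taken some interval apart. The sketch below shows one way to do that; evaluate_alerts is a hypothetical helper, and it computes the error rate over the window between snapshots rather than over the process lifetime, which is a deliberate change from the lifetime ratio shown above.

```python
def evaluate_alerts(
    previous: dict[str, float],
    current: dict[str, float],
    max_error_rate: float = 0.05,
) -> list[str]:
    """Compare two snapshots and return alert messages (hypothetical helper)."""

    def delta(name: str) -> float:
        # Growth of a cumulative counter between the two snapshots.
        return current.get(name, 0) - previous.get(name, 0)

    alerts: list[str] = []
    ended = delta("ended_calls")
    if ended > 0 and delta("failed_calls") / ended > max_error_rate:
        alerts.append("High error rate detected")
    if delta("rejected_calls_capacity") > 0:
        alerts.append("Capacity limit reached")
    if delta("instructions_db_errors") > 0:
        alerts.append("Database connectivity issues")
    return alerts


if __name__ == "__main__":
    before = {"ended_calls": 100, "failed_calls": 2, "rejected_calls_capacity": 3}
    after = {"ended_calls": 120, "failed_calls": 5, "rejected_calls_capacity": 3}
    print(evaluate_alerts(before, after))
```

Working on deltas also means an alert clears by itself once the counter stops growing, instead of firing forever after the first event.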

Best Practices

  • Use centralized logging (ELK, Splunk, CloudWatch)
  • Parse JSON logs for filtering and analysis
  • Set up log retention policies
  • Index by tenant_id and call_id for tracing
  • Export metrics to time-series database (Prometheus, InfluxDB)
  • Keep in-memory metrics for real-time dashboards
  • Archive historical metrics for trend analysis
  • Set up automated reporting
  • Track active_calls for load balancing
  • Monitor minutes_processed for billing
  • Watch rejected_calls_capacity to scale resources
  • Analyze failed_calls for reliability improvements
  • Use call_id to trace entire call lifecycle
  • Filter logs by tenant_id for tenant-specific issues
  • Set LOG_LEVEL=DEBUG temporarily for troubleshooting
  • Correlate metrics with log events
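Tracing a call through the JSON logs needs nothing beyond the standard library. The sketch below filters log lines by any combination of fields; filter_logs is a hypothetical helper, and the sample lines are made up in the shape of the log format documented earlier.

```python
import json
from collections.abc import Iterable, Iterator


def filter_logs(lines: Iterable[str], **match: object) -> Iterator[dict]:
    """Yield parsed log entries whose fields equal every keyword given."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as tracebacks
        if isinstance(entry, dict) and all(
            entry.get(key) == value for key, value in match.items()
        ):
            yield entry


if __name__ == "__main__":
    sample = [
        '{"msg": "Call accepted", "tenant_id": "tenant_123", "call_id": "call_456"}',
        "Traceback (most recent call last):",
        '{"msg": "Call accepted", "tenant_id": "tenant_123", "call_id": "call_789"}',
    ]
    for entry in filter_logs(sample, call_id="call_456"):
        print(entry["msg"], entry["call_id"])
```

The same function covers tenant-level filtering by passing tenant_id instead of, or together with, call_id.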

Next Steps

Installation

Set up the application from scratch

API Reference

Explore available endpoints
