
Overview

The system tracks real-time call metrics using LiveMetricsStore (src/apps/calls/metrics/live_store.py). Metrics are maintained per-tenant and globally.

Available Metrics

All metrics are defined in the MetricStore model (src/models/metrics/store.py:22):
class MetricStore(BaseModel):
    active_calls: int = 0
    accepted_calls: int = 0
    rejected_calls_capacity: int = 0
    rejected_calls_tenant_not_configured: int = 0
    rejected_calls_instructions_missing: int = 0
    instructions_db_errors: int = 0
    fallback_instructions_used: int = 0
    started_calls: int = 0
    ended_calls: int = 0
    failed_calls: int = 0
    referred_calls: int = 0
    minutes_processed: float = 0.0

Metric Definitions

| Metric | Type | Description |
| --- | --- | --- |
| active_calls | Gauge | Current number of active call sessions |
| accepted_calls | Counter | Total calls accepted (cumulative) |
| rejected_calls_capacity | Counter | Calls rejected due to capacity limits |
| rejected_calls_tenant_not_configured | Counter | Calls rejected due to missing tenant config |
| rejected_calls_instructions_missing | Counter | Calls rejected due to missing instructions |
| instructions_db_errors | Counter | Database errors when fetching instructions |
| fallback_instructions_used | Counter | Times fallback instructions were used |
| started_calls | Counter | Calls that successfully started sessions |
| ended_calls | Counter | Calls that completed or terminated |
| failed_calls | Counter | Calls that ended with errors |
| referred_calls | Counter | Calls transferred to another destination |
| minutes_processed | Float | Total call minutes processed |

Accessing Metrics

Global Metrics

Get system-wide metrics across all tenants:
GET /metrics
Response:
{
  "active_calls": 12,
  "accepted_calls": 1543,
  "rejected_calls_capacity": 45,
  "rejected_calls_tenant_not_configured": 8,
  "rejected_calls_instructions_missing": 2,
  "instructions_db_errors": 1,
  "fallback_instructions_used": 1,
  "started_calls": 1490,
  "ended_calls": 1478,
  "failed_calls": 5,
  "referred_calls": 23,
  "minutes_processed": 3842.5
}

Per-Tenant Metrics

Get metrics for a specific tenant:
GET /metrics?tenant_id=acme-corp
Response:
{
  "active_calls": 3,
  "accepted_calls": 456,
  "rejected_calls_capacity": 12,
  "rejected_calls_tenant_not_configured": 0,
  "rejected_calls_instructions_missing": 0,
  "instructions_db_errors": 0,
  "fallback_instructions_used": 0,
  "started_calls": 444,
  "ended_calls": 441,
  "failed_calls": 1,
  "referred_calls": 8,
  "minutes_processed": 1147.2
}

Metrics Implementation

LiveMetricsStore Architecture

The store maintains two data structures (live_store.py:14):
class LiveMetricsStore:
    def __init__(self):
        # Call-level gates for deduplication
        self._call_gates: dict[str, CallGatesMetrics] = {}
        
        # Per-tenant metric stores
        self.metric_store: dict[str, MetricStore] = {}
        
        # Global metrics stored as tenant "__global__"
        self.metric_store["__global__"] = MetricStore()
        
        # Lock for thread safety
        self._lock = asyncio.Lock()

Call Gates

Prevents double-counting metrics for the same call (live_store.py:8):
@dataclass
class CallGatesMetrics:
    accepted: bool = False
    started: bool = False
    ended: bool = False
    ended_at: float | None = None
When recording metrics, gates are checked:
async def record_accept(self, tenant_id: str, call_id: str) -> None:
    async with self._lock:
        if self._call_gates[call_id].accepted:
            return  # Already counted
        
        self.metric_store[tenant_id].accepted_calls += 1
        self.metric_store["__global__"].accepted_calls += 1
        
        self._call_gates[call_id].accepted = True
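Condensed into a runnable sketch, the gate pattern makes duplicate recordings idempotent. This mini-store uses defaultdict to create gates on first access, which is an assumption about how the real store seeds them:

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CallGatesMetrics:
    accepted: bool = False

class MiniStore:
    """Illustrative subset of LiveMetricsStore: one counter, one gate."""

    def __init__(self):
        self._call_gates = defaultdict(CallGatesMetrics)
        self.accepted_calls = defaultdict(int)
        self._lock = asyncio.Lock()

    async def record_accept(self, tenant_id: str, call_id: str) -> None:
        async with self._lock:
            if self._call_gates[call_id].accepted:
                return  # gate closed: already counted
            self.accepted_calls[tenant_id] += 1
            self.accepted_calls["__global__"] += 1
            self._call_gates[call_id].accepted = True

async def main():
    store = MiniStore()
    # A duplicate webhook delivery for the same call...
    await store.record_accept("acme-corp", "call-1")
    await store.record_accept("acme-corp", "call-1")
    return store.accepted_calls["acme-corp"], store.accepted_calls["__global__"]

counts = asyncio.run(main())
print(counts)  # (1, 1) — the duplicate was absorbed by the gate
```

Without the gate, a retried webhook would inflate accepted_calls by one per delivery.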

Recording Metrics

Metrics are recorded at key points in the call lifecycle:

1. Call Accepted

Location: openai_webhook.py:553
await metrics_store.record_accept(tenant_id=tenant_id, call_id=call_id)
Increments:
  • accepted_calls (tenant and global)
  • Sets call_gates.accepted = True

2. Call Started

Location: Call manager when session begins
await metrics_store.record_start(tenant_id=tenant_id, call_id=call_id)
Increments:
  • started_calls (tenant and global)
  • active_calls (tenant and global)
  • Sets call_gates.started = True

3. Call Rejected - Capacity

Location: openai_webhook.py:309
await metrics_store.record_reject_capacity(
    call_id=call_id,
    tenant_id=tenant_id,
)
Increments:
  • rejected_calls_capacity (tenant and global)
  • Sets call_gates.ended = True

4. Call Rejected - Tenant Not Configured

Location: openai_webhook.py:227, 348, 465
await metrics_store.record_reject_tenant_not_configured(
    call_id=call_id,
    tenant_id=tenant_id,
)
Increments:
  • rejected_calls_tenant_not_configured (tenant and global)

5. Call Rejected - Instructions Missing

Location: openai_webhook.py:370
await metrics_store.record_reject_instructions_missing(
    call_id=call_id,
    tenant_id=tenant_id,
)
Increments:
  • rejected_calls_instructions_missing (tenant and global)

6. Instructions DB Error

Location: openai_webhook.py:392, 411, 438
await metrics_store.record_instructions_db_error(
    call_id=call_id,
    tenant_id=tenant_id,
)
Increments:
  • instructions_db_errors (tenant and global)
Note: Does NOT set ended gate (call may proceed with fallback)

7. Fallback Instructions Used

Location: openai_webhook.py:456
await metrics_store.record_fallback_instructions_used(tenant_id=tenant_id)
Increments:
  • fallback_instructions_used (tenant and global)

8. Call Ended

Location: openai_webhook.py:649, 668
await metrics_store.record_end(
    tenant_id=tenant_id,
    call_id=call_id,
    end_reason=EndReason.ERROR  # or COMPLETED, HANGUP, REFERRED
)
Increments:
  • ended_calls (tenant and global)
  • Decrements active_calls if call was started
  • Increments failed_calls if end_reason == ERROR
  • Increments referred_calls if end_reason == REFERRED
  • Sets call_gates.ended = True
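The branching above can be condensed into a sketch (simplified and synchronous; the real method also takes the lock, checks the ended gate, and updates both the tenant and global stores):

```python
from enum import Enum

class EndReason(str, Enum):
    COMPLETED = "completed"
    HANGUP = "hangup"
    REFERRED = "referred"
    ERROR = "error"

def apply_end(metrics: dict, end_reason: EndReason, was_started: bool) -> None:
    metrics["ended_calls"] += 1
    if was_started:
        # Only started calls ever incremented active_calls.
        metrics["active_calls"] -= 1
    if end_reason is EndReason.ERROR:
        metrics["failed_calls"] += 1
    elif end_reason is EndReason.REFERRED:
        metrics["referred_calls"] += 1

m = {"ended_calls": 0, "active_calls": 1, "failed_calls": 0, "referred_calls": 0}
apply_end(m, EndReason.ERROR, was_started=True)
print(m)  # {'ended_calls': 1, 'active_calls': 0, 'failed_calls': 1, 'referred_calls': 0}
```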

9. Minutes Processed

Location: Call manager after call completes
await metrics_store.record_minutes_processed(
    tenant_id=tenant_id,
    minutes=call_duration_minutes,
)
Increments:
  • minutes_processed (tenant and global)
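The duration itself might be derived from start/end timestamps; this is a hypothetical helper, and the real call manager may compute it differently:

```python
def call_duration_minutes(started_at: float, ended_at: float) -> float:
    """Convert an epoch-seconds interval to fractional minutes, clamped at zero."""
    return max(0.0, ended_at - started_at) / 60.0

minutes = call_duration_minutes(started_at=1_000.0, ended_at=1_150.0)
print(minutes)  # 2.5
```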

End Reasons

Defined in src/models/metrics/store.py:15:
class EndReason(str, Enum):
    COMPLETED = "completed"  # Normal call completion
    HANGUP = "hangup"        # User hung up
    REFERRED = "referred"    # Call transferred
    ERROR = "error"          # Call failed with error

Snapshot API

Get thread-safe metric snapshot (live_store.py:210):
metrics_store = request.app.state.metrics_store

# Global snapshot
global_metrics = await metrics_store.snapshot(tenant_id=None)

# Tenant snapshot
tenant_metrics = await metrics_store.snapshot(tenant_id="acme-corp")
Returns a copy to prevent mutation during use.

Call Gates Pruning

Old call gates are pruned to prevent memory leaks (live_store.py:220):
await metrics_store.prune_call_gates_older_than(age_seconds=3600)
Removes gates for calls that ended more than one hour ago. Run the pruner periodically, for example:
import asyncio

async def prune_periodically():
    while True:
        await asyncio.sleep(1800)  # Every 30 minutes
        await metrics_store.prune_call_gates_older_than(3600)

Monitoring Best Practices

1. Track Active Calls

Monitor for capacity planning:
active_calls = metrics["active_calls"]
max_concurrent_calls = settings.max_concurrent_calls

utilization = (active_calls / max_concurrent_calls) * 100
# Alert if utilization > 80%

2. Monitor Rejection Rates

High rejection rates indicate issues:
total_rejected = (
    rejected_calls_capacity
    + rejected_calls_tenant_not_configured
    + rejected_calls_instructions_missing
)
total_calls = accepted_calls + total_rejected

rejection_rate = (total_rejected / total_calls) * 100
# Alert if rejection_rate > 5%

3. Track Failure Rate

Monitor call quality:
failure_rate = (failed_calls / ended_calls) * 100
# Alert if failure_rate > 1%

4. Monitor DB Health

Track database errors:
if instructions_db_errors > 0 or fallback_instructions_used > 0:
    alert("Database issues detected")  # alert() is a hypothetical notification hook

5. Track Average Call Duration

avg_duration = minutes_processed / ended_calls
# Monitor for anomalies

Metric Export

For long-term storage and analysis, export metrics periodically:
import asyncio
import json
from datetime import datetime, timezone

async def export_metrics():
    while True:
        # Export global metrics
        global_metrics = await metrics_store.snapshot(tenant_id=None)

        # Export per-tenant metrics (tenant_ids comes from your tenant registry)
        tenant_metrics = {}
        for tenant_id in tenant_ids:
            tenant_metrics[tenant_id] = await metrics_store.snapshot(
                tenant_id=tenant_id
            )

        # Write to a time-series database, logs, etc. Snapshots are MetricStore
        # models, so serialize via model_dump() (pydantic v2; .dict() on v1).
        timestamp = datetime.now(timezone.utc).isoformat()
        safe_ts = timestamp.replace(":", "-")  # keep the filename portable
        with open(f"metrics-{safe_ts}.json", "w") as f:
            json.dump({
                "timestamp": timestamp,
                "global": global_metrics.model_dump(),
                "tenants": {
                    tid: m.model_dump() for tid, m in tenant_metrics.items()
                },
            }, f)

        await asyncio.sleep(60)  # Export every minute

Alerting Recommendations

| Metric | Threshold | Alert |
| --- | --- | --- |
| active_calls / max_concurrent_calls | > 80% | High capacity utilization |
| rejected_calls_capacity | Increasing | Capacity limits too low |
| failed_calls / ended_calls | > 1% | High failure rate |
| instructions_db_errors | > 0 | Database connectivity issues |
| fallback_instructions_used | > 0 | Serving degraded experience |
| rejected_calls_tenant_not_configured | > 0 | Tenant provisioning issues |
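These thresholds can be evaluated in one pass over a snapshot. A hypothetical helper (the ratio-based checks guard against empty denominators; tune thresholds to your traffic):

```python
def check_alerts(m: dict, max_concurrent_calls: int) -> list[str]:
    alerts = []
    if max_concurrent_calls and m["active_calls"] / max_concurrent_calls > 0.80:
        alerts.append("High capacity utilization")
    if m["ended_calls"] and m["failed_calls"] / m["ended_calls"] > 0.01:
        alerts.append("High failure rate")
    if m["instructions_db_errors"] > 0:
        alerts.append("Database connectivity issues")
    if m["fallback_instructions_used"] > 0:
        alerts.append("Serving degraded experience")
    if m["rejected_calls_tenant_not_configured"] > 0:
        alerts.append("Tenant provisioning issues")
    return alerts

metrics = {
    "active_calls": 9, "ended_calls": 1478, "failed_calls": 5,
    "instructions_db_errors": 1, "fallback_instructions_used": 1,
    "rejected_calls_tenant_not_configured": 0,
}
alerts_fired = check_alerts(metrics, max_concurrent_calls=10)
print(alerts_fired)
```

Note that the "increasing" check on rejected_calls_capacity needs two samples over time, so it belongs in your monitoring system rather than a single-snapshot helper.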

Thread Safety

All metric operations are protected by asyncio.Lock (live_store.py:19):
async with self._lock:
    # Atomic read-modify-write operations
    self.metric_store[tenant_id].accepted_calls += 1
This ensures accurate counts in concurrent environments.
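Why the lock matters: if a coroutine yields between reading and writing a counter, concurrent updates are lost. A contrived demonstration (the sleep(0) forces a yield mid-update; real code hits this whenever an await sits inside a read-modify-write):

```python
import asyncio

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = asyncio.Lock()

    async def incr_unsafe(self):
        v = self.value
        await asyncio.sleep(0)  # yield point inside the read-modify-write
        self.value = v + 1

    async def incr_safe(self):
        async with self._lock:
            v = self.value
            await asyncio.sleep(0)
            self.value = v + 1

async def main():
    unsafe, safe = Counter(), Counter()
    await asyncio.gather(*(unsafe.incr_unsafe() for _ in range(100)))
    await asyncio.gather(*(safe.incr_safe() for _ in range(100)))
    return unsafe.value, safe.value

unsafe_total, safe_total = asyncio.run(main())
print(unsafe_total, safe_total)  # the unlocked counter loses updates; the locked one reaches 100
```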

Performance Considerations

In-Memory Storage

Metrics are stored in-memory for fast access. For production:
  1. Periodically export to persistent storage
  2. Prune call gates to prevent memory growth
  3. Monitor memory usage if tracking many tenants

Lock Contention

All metric updates acquire a lock. For high-throughput scenarios:
  • Keep metric operations fast (simple increments)
  • Avoid blocking I/O inside lock context
  • Consider batching metric updates
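Batching can be as simple as accumulating deltas locally and applying them under a single lock acquisition. An illustrative sketch, not part of the codebase:

```python
import asyncio
from collections import Counter

class BatchedMetrics:
    def __init__(self):
        self.totals = Counter()
        self._pending = Counter()
        self._lock = asyncio.Lock()

    def record(self, metric: str, delta: int = 1) -> None:
        # Cheap, lock-free accumulation on the event-loop thread.
        self._pending[metric] += delta

    async def flush(self) -> None:
        # One lock acquisition applies the whole batch.
        async with self._lock:
            self.totals.update(self._pending)
            self._pending.clear()

async def main():
    m = BatchedMetrics()
    for _ in range(3):
        m.record("accepted_calls")
    await m.flush()
    return m.totals["accepted_calls"]

batched_total = asyncio.run(main())
print(batched_total)  # 3
```

The trade-off: readers see slightly stale totals between flushes, so this fits high-throughput counters better than gates that must be checked synchronously.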

Debugging Metrics

Enable debug logging to trace metric updates:
LOG_LEVEL=DEBUG
Add debug logging in metric methods:
log_event(
    logging.DEBUG,
    "metric_recorded",
    f"Recorded accept for {call_id}",
    tenant_id=tenant_id,
    metric="accepted_calls",
)
