
Overview

The system tracks real-time call metrics using LiveMetricsStore (src/apps/calls/metrics/live_store.py). Metrics are maintained per-tenant and globally.

Available Metrics

All metrics are defined in the MetricStore model (src/models/metrics/store.py:22):
class MetricStore(BaseModel):
    active_calls: int = 0
    accepted_calls: int = 0
    rejected_calls_capacity: int = 0
    rejected_calls_tenant_not_configured: int = 0
    rejected_calls_instructions_missing: int = 0
    instructions_db_errors: int = 0
    fallback_instructions_used: int = 0
    started_calls: int = 0
    ended_calls: int = 0
    failed_calls: int = 0
    referred_calls: int = 0
    minutes_processed: float = 0.0

Metric Definitions

| Metric | Type | Description |
| --- | --- | --- |
| active_calls | Gauge | Current number of active call sessions |
| accepted_calls | Counter | Total calls accepted (cumulative) |
| rejected_calls_capacity | Counter | Calls rejected due to capacity limits |
| rejected_calls_tenant_not_configured | Counter | Calls rejected due to missing tenant config |
| rejected_calls_instructions_missing | Counter | Calls rejected due to missing instructions |
| instructions_db_errors | Counter | Database errors when fetching instructions |
| fallback_instructions_used | Counter | Times fallback instructions were used |
| started_calls | Counter | Calls that successfully started sessions |
| ended_calls | Counter | Calls that completed or terminated |
| failed_calls | Counter | Calls that ended with errors |
| referred_calls | Counter | Calls transferred to another destination |
| minutes_processed | Float | Total call minutes processed |

Accessing Metrics

Global Metrics

Get system-wide metrics across all tenants:
GET /metrics
Response:
{
  "active_calls": 12,
  "accepted_calls": 1543,
  "rejected_calls_capacity": 45,
  "rejected_calls_tenant_not_configured": 8,
  "rejected_calls_instructions_missing": 2,
  "instructions_db_errors": 1,
  "fallback_instructions_used": 1,
  "started_calls": 1490,
  "ended_calls": 1478,
  "failed_calls": 5,
  "referred_calls": 23,
  "minutes_processed": 3842.5
}

Per-Tenant Metrics

Get metrics for a specific tenant:
GET /metrics?tenant_id=acme-corp
Response:
{
  "active_calls": 3,
  "accepted_calls": 456,
  "rejected_calls_capacity": 12,
  "rejected_calls_tenant_not_configured": 0,
  "rejected_calls_instructions_missing": 0,
  "instructions_db_errors": 0,
  "fallback_instructions_used": 0,
  "started_calls": 444,
  "ended_calls": 441,
  "failed_calls": 1,
  "referred_calls": 8,
  "minutes_processed": 1147.2
}

Metrics Implementation

LiveMetricsStore Architecture

The store maintains two data structures (live_store.py:14):
class LiveMetricsStore:
    def __init__(self):
        # Call-level gates for deduplication
        self._call_gates: dict[str, CallGatesMetrics] = {}
        
        # Per-tenant metric stores
        self.metric_store: dict[str, MetricStore] = {}
        
        # Global metrics stored as tenant "__global__"
        self.metric_store["__global__"] = MetricStore()
        
        # Lock for thread safety
        self._lock = asyncio.Lock()

Call Gates

Prevents double-counting metrics for the same call (live_store.py:8):
@dataclass
class CallGatesMetrics:
    accepted: bool = False
    started: bool = False
    ended: bool = False
    ended_at: float | None = None
When recording metrics, gates are checked:
async def record_accept(self, tenant_id: str, call_id: str) -> None:
    async with self._lock:
        if self._call_gates[call_id].accepted:
            return  # Already counted
        
        self.metric_store[tenant_id].accepted_calls += 1
        self.metric_store["__global__"].accepted_calls += 1
        
        self._call_gates[call_id].accepted = True
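Condensed into a runnable sketch, the gate pattern makes duplicate recordings idempotent. This mini-store uses defaultdict to create gates on first access, which is an assumption about how the real store seeds them:

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CallGatesMetrics:
    accepted: bool = False

class MiniStore:
    """Illustrative subset of LiveMetricsStore: one counter, one gate."""

    def __init__(self):
        self._call_gates = defaultdict(CallGatesMetrics)
        self.accepted_calls = defaultdict(int)
        self._lock = asyncio.Lock()

    async def record_accept(self, tenant_id: str, call_id: str) -> None:
        async with self._lock:
            if self._call_gates[call_id].accepted:
                return  # gate closed: already counted
            self.accepted_calls[tenant_id] += 1
            self.accepted_calls["__global__"] += 1
            self._call_gates[call_id].accepted = True

async def main():
    store = MiniStore()
    # A duplicate webhook delivery for the same call...
    await store.record_accept("acme-corp", "call-1")
    await store.record_accept("acme-corp", "call-1")
    return store.accepted_calls["acme-corp"], store.accepted_calls["__global__"]

counts = asyncio.run(main())
print(counts)  # (1, 1) — the duplicate was absorbed by the gate
```

Without the gate, a retried webhook would inflate accepted_calls by one per delivery.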

Recording Metrics

Metrics are recorded at key points in the call lifecycle:

1. Call Accepted

Location: openai_webhook.py:553
await metrics_store.record_accept(tenant_id=tenant_id, call_id=call_id)
Increments:
  • accepted_calls (tenant and global)
  • Sets call_gates.accepted = True

2. Call Started

Location: Call manager when session begins
await metrics_store.record_start(tenant_id=tenant_id, call_id=call_id)
Increments:
  • started_calls (tenant and global)
  • active_calls (tenant and global)
  • Sets call_gates.started = True

3. Call Rejected - Capacity

Location: openai_webhook.py:309
await metrics_store.record_reject_capacity(
    call_id=call_id,
    tenant_id=tenant_id,
)
Increments:
  • rejected_calls_capacity (tenant and global)
  • Sets call_gates.ended = True

4. Call Rejected - Tenant Not Configured

Location: openai_webhook.py:227, 348, 465
await metrics_store.record_reject_tenant_not_configured(
    call_id=call_id,
    tenant_id=tenant_id,
)
Increments:
  • rejected_calls_tenant_not_configured (tenant and global)

5. Call Rejected - Instructions Missing

Location: openai_webhook.py:370
await metrics_store.record_reject_instructions_missing(
    call_id=call_id,
    tenant_id=tenant_id,
)
Increments:
  • rejected_calls_instructions_missing (tenant and global)

6. Instructions DB Error

Location: openai_webhook.py:392, 411, 438
await metrics_store.record_instructions_db_error(
    call_id=call_id,
    tenant_id=tenant_id,
)
Increments:
  • instructions_db_errors (tenant and global)
Note: Does NOT set ended gate (call may proceed with fallback)

7. Fallback Instructions Used

Location: openai_webhook.py:456
await metrics_store.record_fallback_instructions_used(tenant_id=tenant_id)
Increments:
  • fallback_instructions_used (tenant and global)

8. Call Ended

Location: openai_webhook.py:649, 668
await metrics_store.record_end(
    tenant_id=tenant_id,
    call_id=call_id,
    end_reason=EndReason.ERROR  # or COMPLETED, HANGUP, REFERRED
)
Increments:
  • ended_calls (tenant and global)
  • Decrements active_calls if call was started
  • Increments failed_calls if end_reason == ERROR
  • Increments referred_calls if end_reason == REFERRED
  • Sets call_gates.ended = True
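The branching above can be condensed into a sketch (simplified and synchronous; the real method also takes the lock, checks the ended gate, and updates both the tenant and global stores):

```python
from enum import Enum

class EndReason(str, Enum):
    COMPLETED = "completed"
    HANGUP = "hangup"
    REFERRED = "referred"
    ERROR = "error"

def apply_end(metrics: dict, end_reason: EndReason, was_started: bool) -> None:
    metrics["ended_calls"] += 1
    if was_started:
        # Only started calls ever incremented active_calls.
        metrics["active_calls"] -= 1
    if end_reason is EndReason.ERROR:
        metrics["failed_calls"] += 1
    elif end_reason is EndReason.REFERRED:
        metrics["referred_calls"] += 1

m = {"ended_calls": 0, "active_calls": 1, "failed_calls": 0, "referred_calls": 0}
apply_end(m, EndReason.ERROR, was_started=True)
print(m)  # {'ended_calls': 1, 'active_calls': 0, 'failed_calls': 1, 'referred_calls': 0}
```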

9. Minutes Processed

Location: Call manager after call completes
await metrics_store.record_minutes_processed(
    tenant_id=tenant_id,
    minutes=call_duration_minutes,
)
Increments:
  • minutes_processed (tenant and global)
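The duration itself might be derived from start/end timestamps; this is a hypothetical helper, and the real call manager may compute it differently:

```python
def call_duration_minutes(started_at: float, ended_at: float) -> float:
    """Convert an epoch-seconds interval to fractional minutes, clamped at zero."""
    return max(0.0, ended_at - started_at) / 60.0

minutes = call_duration_minutes(started_at=1_000.0, ended_at=1_150.0)
print(minutes)  # 2.5
```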

End Reasons

Defined in src/models/metrics/store.py:15:
class EndReason(str, Enum):
    COMPLETED = "completed"  # Normal call completion
    HANGUP = "hangup"        # User hung up
    REFERRED = "referred"    # Call transferred
    ERROR = "error"          # Call failed with error

Snapshot API

Get thread-safe metric snapshot (live_store.py:210):
metrics_store = request.app.state.metrics_store

# Global snapshot
global_metrics = await metrics_store.snapshot(tenant_id=None)

# Tenant snapshot
tenant_metrics = await metrics_store.snapshot(tenant_id="acme-corp")
Returns a copy to prevent mutation during use.

Call Gates Pruning

Old call gates are pruned to prevent memory leaks (live_store.py:220):
await metrics_store.prune_call_gates_older_than(age_seconds=3600)
Removes gates for calls that ended more than one hour ago. Run the pruner periodically, for example:
import asyncio

async def prune_periodically():
    while True:
        await asyncio.sleep(1800)  # Every 30 minutes
        await metrics_store.prune_call_gates_older_than(3600)

Monitoring Best Practices

1. Track Active Calls

Monitor for capacity planning:
active_calls = metrics["active_calls"]
max_concurrent_calls = settings.max_concurrent_calls

utilization = (active_calls / max_concurrent_calls) * 100
# Alert if utilization > 80%

2. Monitor Rejection Rates

High rejection rates indicate issues:
total_rejected = (
    rejected_calls_capacity
    + rejected_calls_tenant_not_configured
    + rejected_calls_instructions_missing
)
total_calls = accepted_calls + total_rejected

rejection_rate = (total_rejected / total_calls) * 100
# Alert if rejection_rate > 5%

3. Track Failure Rate

Monitor call quality:
failure_rate = (failed_calls / ended_calls) * 100
# Alert if failure_rate > 1%

4. Monitor DB Health

Track database errors:
if instructions_db_errors > 0 or fallback_instructions_used > 0:
    alert("Database issues detected")  # alert() is a hypothetical notification hook

5. Track Average Call Duration

avg_duration = minutes_processed / ended_calls
# Monitor for anomalies

Metric Export

For long-term storage and analysis, export metrics periodically:
import asyncio
import json
from datetime import datetime, timezone

async def export_metrics():
    while True:
        # Export global metrics
        global_metrics = await metrics_store.snapshot(tenant_id=None)

        # Export per-tenant metrics (tenant_ids comes from your tenant registry)
        tenant_metrics = {}
        for tenant_id in tenant_ids:
            tenant_metrics[tenant_id] = await metrics_store.snapshot(
                tenant_id=tenant_id
            )

        # Write to a time-series database, logs, etc. Snapshots are MetricStore
        # models, so serialize via model_dump() (pydantic v2; .dict() on v1).
        timestamp = datetime.now(timezone.utc).isoformat()
        safe_ts = timestamp.replace(":", "-")  # keep the filename portable
        with open(f"metrics-{safe_ts}.json", "w") as f:
            json.dump({
                "timestamp": timestamp,
                "global": global_metrics.model_dump(),
                "tenants": {
                    tid: m.model_dump() for tid, m in tenant_metrics.items()
                },
            }, f)

        await asyncio.sleep(60)  # Export every minute

Alerting Recommendations

| Metric | Threshold | Alert |
| --- | --- | --- |
| active_calls / max_concurrent_calls | > 80% | High capacity utilization |
| rejected_calls_capacity | Increasing | Capacity limits too low |
| failed_calls / ended_calls | > 1% | High failure rate |
| instructions_db_errors | > 0 | Database connectivity issues |
| fallback_instructions_used | > 0 | Serving degraded experience |
| rejected_calls_tenant_not_configured | > 0 | Tenant provisioning issues |
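These thresholds can be evaluated in one pass over a snapshot. A hypothetical helper (the ratio-based checks guard against empty denominators; tune thresholds to your traffic):

```python
def check_alerts(m: dict, max_concurrent_calls: int) -> list[str]:
    alerts = []
    if max_concurrent_calls and m["active_calls"] / max_concurrent_calls > 0.80:
        alerts.append("High capacity utilization")
    if m["ended_calls"] and m["failed_calls"] / m["ended_calls"] > 0.01:
        alerts.append("High failure rate")
    if m["instructions_db_errors"] > 0:
        alerts.append("Database connectivity issues")
    if m["fallback_instructions_used"] > 0:
        alerts.append("Serving degraded experience")
    if m["rejected_calls_tenant_not_configured"] > 0:
        alerts.append("Tenant provisioning issues")
    return alerts

metrics = {
    "active_calls": 9, "ended_calls": 1478, "failed_calls": 5,
    "instructions_db_errors": 1, "fallback_instructions_used": 1,
    "rejected_calls_tenant_not_configured": 0,
}
alerts_fired = check_alerts(metrics, max_concurrent_calls=10)
print(alerts_fired)
```

Note that the "increasing" check on rejected_calls_capacity needs two samples over time, so it belongs in your monitoring system rather than a single-snapshot helper.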

Thread Safety

All metric operations are protected by asyncio.Lock (live_store.py:19):
async with self._lock:
    # Atomic read-modify-write operations
    self.metric_store[tenant_id].accepted_calls += 1
This ensures accurate counts in concurrent environments.
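Why the lock matters: if a coroutine yields between reading and writing a counter, concurrent updates are lost. A contrived demonstration (the sleep(0) forces a yield mid-update; real code hits this whenever an await sits inside a read-modify-write):

```python
import asyncio

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = asyncio.Lock()

    async def incr_unsafe(self):
        v = self.value
        await asyncio.sleep(0)  # yield point inside the read-modify-write
        self.value = v + 1

    async def incr_safe(self):
        async with self._lock:
            v = self.value
            await asyncio.sleep(0)
            self.value = v + 1

async def main():
    unsafe, safe = Counter(), Counter()
    await asyncio.gather(*(unsafe.incr_unsafe() for _ in range(100)))
    await asyncio.gather(*(safe.incr_safe() for _ in range(100)))
    return unsafe.value, safe.value

unsafe_total, safe_total = asyncio.run(main())
print(unsafe_total, safe_total)  # the unlocked counter loses updates; the locked one reaches 100
```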

Performance Considerations

In-Memory Storage

Metrics are stored in-memory for fast access. For production:
  1. Periodically export to persistent storage
  2. Prune call gates to prevent memory growth
  3. Monitor memory usage if tracking many tenants

Lock Contention

All metric updates acquire a lock. For high-throughput scenarios:
  • Keep metric operations fast (simple increments)
  • Avoid blocking I/O inside lock context
  • Consider batching metric updates
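Batching can be as simple as accumulating deltas locally and applying them under a single lock acquisition. An illustrative sketch, not part of the codebase:

```python
import asyncio
from collections import Counter

class BatchedMetrics:
    def __init__(self):
        self.totals = Counter()
        self._pending = Counter()
        self._lock = asyncio.Lock()

    def record(self, metric: str, delta: int = 1) -> None:
        # Cheap, lock-free accumulation on the event-loop thread.
        self._pending[metric] += delta

    async def flush(self) -> None:
        # One lock acquisition applies the whole batch.
        async with self._lock:
            self.totals.update(self._pending)
            self._pending.clear()

async def main():
    m = BatchedMetrics()
    for _ in range(3):
        m.record("accepted_calls")
    await m.flush()
    return m.totals["accepted_calls"]

batched_total = asyncio.run(main())
print(batched_total)  # 3
```

The trade-off: readers see slightly stale totals between flushes, so this fits high-throughput counters better than gates that must be checked synchronously.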

Debugging Metrics

Enable debug logging to trace metric updates:
LOG_LEVEL=DEBUG
Add debug logging in metric methods:
log_event(
    logging.DEBUG,
    "metric_recorded",
    f"Recorded accept for {call_id}",
    tenant_id=tenant_id,
    metric="accepted_calls",
)
