
Overview

The system implements two-tier capacity management to control concurrent calls:
  • Global capacity limit: Maximum concurrent calls across all tenants
  • Per-tenant capacity limit: Maximum concurrent calls for a single tenant
Capacity gating occurs in the webhook handler before accepting calls, ensuring the system operates within defined resource constraints.
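The whole gate can be summarized as a small pure function (a sketch only; the real check lives in the webhook handler and is walked through step by step below):

```python
def gate(tenant_active: int, tenant_pending: int,
         global_active: int, global_pending: int,
         tenant_limit: int, global_limit: int) -> bool:
    """Return True when a new call may be accepted.

    Both active and pending calls count toward capacity, so a burst of
    not-yet-started calls can exhaust a limit before any session begins.
    """
    tenant_in_use = tenant_active + tenant_pending
    global_in_use = global_active + global_pending
    return tenant_in_use < tenant_limit and global_in_use < global_limit
```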

Configuration

Global Capacity Limit

Set the maximum concurrent calls globally via environment variable:
MAX_CONCURRENT_CALLS=100
This setting is defined in src/core/settings.py:84:
max_concurrent_calls: int = Field(
    default=100, validation_alias="MAX_CONCURRENT_CALLS"
)

Per-Tenant Capacity Limit

By default, the per-tenant limit inherits the global limit. You can configure a separate per-tenant limit if needed:
# In settings.py, add:
max_concurrent_calls_per_tenant: int = Field(
    default=50, validation_alias="MAX_CONCURRENT_CALLS_PER_TENANT"
)
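The fallback-to-global behavior could be resolved like this (a sketch; `effective_tenant_limit` is a hypothetical helper, not part of the codebase):

```python
from typing import Optional

def effective_tenant_limit(global_limit: int,
                           per_tenant_limit: Optional[int] = None) -> int:
    """Per-tenant limit inherits the global limit when not configured."""
    return per_tenant_limit if per_tenant_limit is not None else global_limit
```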

How Capacity Gating Works

The capacity gating logic is implemented in src/apps/calls/api/v1/endpoints/openai_webhook.py:237-297. Here’s how it works:

1. Calculate Current Usage

When an incoming call arrives, the system calculates:
# Active calls currently in progress
tenant_active = call_manager.active_count_by_tenant(tenant_id)
global_active = call_manager.active_count()

# Pending calls being processed but not yet started
tenant_pending = len(request.app.state.pending_by_tenant.get(tenant_id, set()))
global_pending = len(request.app.state.pending_call_ids)

# Total in-use capacity
tenant_in_use = tenant_active + tenant_pending
global_in_use = global_active + global_pending

2. Check Capacity Limits

The system rejects calls when either limit is reached:
reject_capacity = (tenant_in_use >= tenant_limit) or (
    global_in_use >= global_limit
)

3. Reserve or Reject

If capacity is available, the call is marked as pending:
if not reject_capacity:
    # Reserve a slot for this tenant
    request.app.state.pending_call_ids.add(call_id)
    request.app.state.pending_tenant_by_call_id[call_id] = tenant_id
    request.app.state.pending_by_tenant.setdefault(tenant_id, set()).add(call_id)
If capacity is exhausted, the call is rejected:
if reject_capacity:
    response = await openai_calls_service.reject_call(
        call_id, idempotency_key=f"reject_{webhook_id}"
    )
    await metrics_store.record_reject_capacity(
        call_id=call_id,
        tenant_id=tenant_id,
    )

Capacity State Management

The system maintains three capacity-related states:

Pending Calls

Calls that have been accepted but not yet started:
  • pending_call_ids: Set of all pending call IDs
  • pending_tenant_by_call_id: Map of call_id → tenant_id
  • pending_by_tenant: Map of tenant_id → set of call_ids
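The three structures above might be initialized together like this (a hypothetical sketch; in production they live on `request.app.state` and are created at application startup):

```python
import asyncio

class AppState:
    """Sketch of the pending-call bookkeeping described above."""

    def __init__(self) -> None:
        self.pending_call_ids: set[str] = set()               # all pending call IDs
        self.pending_tenant_by_call_id: dict[str, str] = {}   # call_id -> tenant_id
        self.pending_by_tenant: dict[str, set[str]] = {}      # tenant_id -> call_ids
        self.capacity_lock = asyncio.Lock()

    def reserve(self, call_id: str, tenant_id: str) -> None:
        """Record a pending call in all three structures (call under the lock)."""
        self.pending_call_ids.add(call_id)
        self.pending_tenant_by_call_id[call_id] = tenant_id
        self.pending_by_tenant.setdefault(tenant_id, set()).add(call_id)
```

Keeping the three views consistent is why every mutation in the real handler happens under `capacity_lock`.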

Accepted Calls

Calls that have been accepted (tracked for deduplication):
request.app.state.accepted_call_ids[call_id] = time.time()
Accepted call IDs older than one hour are pruned:
request.app.state.accepted_call_ids = {
    cid: ts
    for cid, ts in request.app.state.accepted_call_ids.items()
    if now - ts < 3600  # 1 hour
}

Active Calls

Managed by CallManager, representing calls in active sessions.
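The counting interface the gate relies on (`active_count`, `active_count_by_tenant`, as used in step 1 above) could be sketched as follows; the real `CallManager` also owns the call sessions, and `start`/`end` here are illustrative names:

```python
from collections import Counter

class CallManager:
    """Minimal sketch: tracks which tenant owns each active call."""

    def __init__(self) -> None:
        self._tenant_by_call: dict[str, str] = {}

    def start(self, call_id: str, tenant_id: str) -> None:
        self._tenant_by_call[call_id] = tenant_id

    def end(self, call_id: str) -> None:
        self._tenant_by_call.pop(call_id, None)

    def active_count(self) -> int:
        return len(self._tenant_by_call)

    def active_count_by_tenant(self, tenant_id: str) -> int:
        return Counter(self._tenant_by_call.values())[tenant_id]
```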

Releasing Capacity

Pending capacity is released in these scenarios:

1. Call Session Starts Successfully

See openai_webhook.py:555-563:
async with request.app.state.capacity_lock:
    request.app.state.accepted_call_ids[call_id] = time.time()
    request.app.state.pending_call_ids.discard(call_id)
    request.app.state.pending_tenant_by_call_id.pop(call_id, None)
    tenant_pending = request.app.state.pending_by_tenant.get(tenant_id)
    if tenant_pending:
        tenant_pending.discard(call_id)

2. Call Rejected or Fails to Start

The _release_pending_capacity_state function (lines 39-49) cleans up:
async def _release_pending_capacity_state(request: Request, call_id: str) -> None:
    async with request.app.state.capacity_lock:
        request.app.state.pending_call_ids.discard(call_id)
        
        tenant = request.app.state.pending_tenant_by_call_id.pop(call_id, None)
        if tenant:
            tenant_pending = request.app.state.pending_by_tenant.get(tenant)
            if tenant_pending:
                tenant_pending.discard(call_id)
                if not tenant_pending:
                    request.app.state.pending_by_tenant.pop(tenant, None)

3. Call Ends

See openai_webhook.py:694-696:
await _release_pending_capacity_state(request, call_id)
async with request.app.state.capacity_lock:
    request.app.state.accepted_call_ids.pop(call_id, None)

Monitoring Capacity

Track capacity rejections using metrics:
# Check total capacity rejections
GET /metrics
GET /metrics?tenant_id=<tenant_id>
Key metrics:
  • rejected_calls_capacity: Count of calls rejected due to capacity limits
  • active_calls: Current active call count
  • accepted_calls: Total accepted calls
See the Metrics Guide for details.

Capacity Rejection Response

When a call is rejected due to capacity, the webhook returns:
{
  "ok": true,
  "rejected": "capacity"
}
The OpenAI service receives the rejection and the caller hears a busy signal.

Best Practices

1. Set Conservative Limits

Start with lower limits and increase based on observed performance:
# Start conservative
MAX_CONCURRENT_CALLS=50

# Monitor metrics and gradually increase
MAX_CONCURRENT_CALLS=100

2. Monitor Rejection Rates

High rejection rates indicate insufficient capacity:
rejection_rate = rejected_calls_capacity / (accepted_calls + rejected_calls_capacity)
If the rejection rate exceeds 5-10%, consider increasing the limits or scaling the infrastructure.
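This calculation can be wrapped in a small helper (`rejection_rate` is hypothetical; the inputs correspond to the `rejected_calls_capacity` and `accepted_calls` metrics described in the Monitoring section):

```python
def rejection_rate(rejected: int, accepted: int) -> float:
    """Fraction of incoming calls rejected for capacity; 0.0 when no traffic."""
    total = accepted + rejected
    return rejected / total if total else 0.0
```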

3. Implement Tenant-Specific Limits

For multi-tenant deployments, prevent single tenants from monopolizing capacity:
max_concurrent_calls_per_tenant = max_concurrent_calls // 2  # 50% max per tenant

4. Use Async Locks Properly

All capacity state modifications use capacity_lock to prevent race conditions:
async with request.app.state.capacity_lock:
    # Atomic capacity checks and updates
    if call_id in request.app.state.pending_call_ids:
        return  # Already pending

Thread Safety

The capacity gating state is safe under concurrent webhook handling (handlers run as coroutines on the asyncio event loop, so an asyncio.Lock is sufficient):
  • All capacity state reads/writes protected by capacity_lock
  • Atomic check-and-reserve operations
  • Idempotent cleanup operations
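The atomic check-and-reserve property can be demonstrated with a self-contained sketch (`try_reserve` is a hypothetical helper; the production handler inlines the same pattern under `capacity_lock`):

```python
import asyncio

async def try_reserve(lock: asyncio.Lock, pending: set[str], call_id: str) -> bool:
    """Atomically reserve a pending slot; False if the call is already pending."""
    async with lock:
        if call_id in pending:
            return False
        pending.add(call_id)
        return True

async def main() -> None:
    lock, pending = asyncio.Lock(), set()
    first = await try_reserve(lock, pending, "call_1")
    second = await try_reserve(lock, pending, "call_1")  # duplicate webhook delivery
    print(first, second)  # prints: True False

asyncio.run(main())
```

Because the membership check and the insertion happen inside one critical section, two concurrent deliveries of the same webhook can never both reserve capacity.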

Troubleshooting

Calls Rejected Despite Low Active Count

Check pending calls:
len(request.app.state.pending_call_ids)  # May be high
Pending calls consume capacity until they start or fail.

Capacity Not Released

Check for exceptions in _start_call_session that prevent cleanup. Review logs:
grep "call_session_start_failed" logs.json

Deduplication False Positives

If legitimate calls are marked as duplicates:
# Check dedup cleanup interval (30 minutes default)
if now - request.app.state.last_dedup_cleanup > 1800:
    # Prune old webhook IDs
Reducing the cleanup interval prunes old webhook IDs sooner, which lowers false positives at the cost of a shorter deduplication window.
