
Overview

The system implements two-tier capacity management to control concurrent calls:
  • Global capacity limit: Maximum concurrent calls across all tenants
  • Per-tenant capacity limit: Maximum concurrent calls for a single tenant
Capacity gating occurs in the webhook handler before accepting calls, ensuring the system operates within defined resource constraints.
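The whole gate can be summarized as a small pure function (a sketch only; the real check lives in the webhook handler and is walked through step by step below):

```python
def gate(tenant_active: int, tenant_pending: int,
         global_active: int, global_pending: int,
         tenant_limit: int, global_limit: int) -> bool:
    """Return True when a new call may be accepted.

    Both active and pending calls count toward capacity, so a burst of
    not-yet-started calls can exhaust a limit before any session begins.
    """
    tenant_in_use = tenant_active + tenant_pending
    global_in_use = global_active + global_pending
    return tenant_in_use < tenant_limit and global_in_use < global_limit
```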

Configuration

Global Capacity Limit

Set the maximum concurrent calls globally via environment variable:
MAX_CONCURRENT_CALLS=100
This setting is defined in src/core/settings.py:84:
max_concurrent_calls: int = Field(
    default=100, validation_alias="MAX_CONCURRENT_CALLS"
)

Per-Tenant Capacity Limit

By default, the per-tenant limit inherits the global limit. You can configure a separate per-tenant limit if needed:
# In settings.py, add:
max_concurrent_calls_per_tenant: int = Field(
    default=50, validation_alias="MAX_CONCURRENT_CALLS_PER_TENANT"
)
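The fallback-to-global behavior could be resolved like this (a sketch; `effective_tenant_limit` is a hypothetical helper, not part of the codebase):

```python
from typing import Optional

def effective_tenant_limit(global_limit: int,
                           per_tenant_limit: Optional[int] = None) -> int:
    """Per-tenant limit inherits the global limit when not configured."""
    return per_tenant_limit if per_tenant_limit is not None else global_limit
```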

How Capacity Gating Works

The capacity gating logic is implemented in src/apps/calls/api/v1/endpoints/openai_webhook.py:237-297. Here’s how it works:

1. Calculate Current Usage

When an incoming call arrives, the system calculates:
# Active calls currently in progress
tenant_active = call_manager.active_count_by_tenant(tenant_id)
global_active = call_manager.active_count()

# Pending calls being processed but not yet started
tenant_pending = len(request.app.state.pending_by_tenant.get(tenant_id, set()))
global_pending = len(request.app.state.pending_call_ids)

# Total in-use capacity
tenant_in_use = tenant_active + tenant_pending
global_in_use = global_active + global_pending

2. Check Capacity Limits

The system rejects calls when either limit is reached:
reject_capacity = (tenant_in_use >= tenant_limit) or (
    global_in_use >= global_limit
)

3. Reserve or Reject

If capacity is available, the call is marked as pending:
if not reject_capacity:
    # Reserve a slot for this tenant
    request.app.state.pending_call_ids.add(call_id)
    request.app.state.pending_tenant_by_call_id[call_id] = tenant_id
    request.app.state.pending_by_tenant.setdefault(tenant_id, set()).add(call_id)
If capacity is exhausted, the call is rejected:
if reject_capacity:
    response = await openai_calls_service.reject_call(
        call_id, idempotency_key=f"reject_{webhook_id}"
    )
    await metrics_store.record_reject_capacity(
        call_id=call_id,
        tenant_id=tenant_id,
    )

Capacity State Management

The system maintains three capacity-related states:

Pending Calls

Calls that have been accepted but not yet started:
  • pending_call_ids: Set of all pending call IDs
  • pending_tenant_by_call_id: Map of call_id → tenant_id
  • pending_by_tenant: Map of tenant_id → set of call_ids
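The three structures above might be initialized together like this (a hypothetical sketch; in production they live on `request.app.state` and are created at application startup):

```python
import asyncio

class AppState:
    """Sketch of the pending-call bookkeeping described above."""

    def __init__(self) -> None:
        self.pending_call_ids: set[str] = set()               # all pending call IDs
        self.pending_tenant_by_call_id: dict[str, str] = {}   # call_id -> tenant_id
        self.pending_by_tenant: dict[str, set[str]] = {}      # tenant_id -> call_ids
        self.capacity_lock = asyncio.Lock()

    def reserve(self, call_id: str, tenant_id: str) -> None:
        """Record a pending call in all three structures (call under the lock)."""
        self.pending_call_ids.add(call_id)
        self.pending_tenant_by_call_id[call_id] = tenant_id
        self.pending_by_tenant.setdefault(tenant_id, set()).add(call_id)
```

Keeping the three views consistent is why every mutation in the real handler happens under `capacity_lock`.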

Accepted Calls

Calls that have been accepted (tracked for deduplication):
request.app.state.accepted_call_ids[call_id] = time.time()
Accepted call IDs older than one hour are pruned:
request.app.state.accepted_call_ids = {
    cid: ts
    for cid, ts in request.app.state.accepted_call_ids.items()
    if now - ts < 3600  # 1 hour
}

Active Calls

Managed by CallManager, representing calls in active sessions.
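The counting interface the gate relies on (`active_count`, `active_count_by_tenant`, as used in step 1 above) could be sketched as follows; the real `CallManager` also owns the call sessions, and `start`/`end` here are illustrative names:

```python
from collections import Counter

class CallManager:
    """Minimal sketch: tracks which tenant owns each active call."""

    def __init__(self) -> None:
        self._tenant_by_call: dict[str, str] = {}

    def start(self, call_id: str, tenant_id: str) -> None:
        self._tenant_by_call[call_id] = tenant_id

    def end(self, call_id: str) -> None:
        self._tenant_by_call.pop(call_id, None)

    def active_count(self) -> int:
        return len(self._tenant_by_call)

    def active_count_by_tenant(self, tenant_id: str) -> int:
        return Counter(self._tenant_by_call.values())[tenant_id]
```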

Releasing Capacity

Pending capacity is released in these scenarios:

1. Call Session Starts Successfully

See openai_webhook.py:555-563:
async with request.app.state.capacity_lock:
    request.app.state.accepted_call_ids[call_id] = time.time()
    request.app.state.pending_call_ids.discard(call_id)
    request.app.state.pending_tenant_by_call_id.pop(call_id, None)
    tenant_pending = request.app.state.pending_by_tenant.get(tenant_id)
    if tenant_pending:
        tenant_pending.discard(call_id)

2. Call Rejected or Fails to Start

The _release_pending_capacity_state function (lines 39-49) cleans up:
async def _release_pending_capacity_state(request: Request, call_id: str) -> None:
    async with request.app.state.capacity_lock:
        request.app.state.pending_call_ids.discard(call_id)
        
        tenant = request.app.state.pending_tenant_by_call_id.pop(call_id, None)
        if tenant:
            tenant_pending = request.app.state.pending_by_tenant.get(tenant)
            if tenant_pending:
                tenant_pending.discard(call_id)
                if not tenant_pending:
                    request.app.state.pending_by_tenant.pop(tenant, None)

3. Call Ends

See openai_webhook.py:694-696:
await _release_pending_capacity_state(request, call_id)
async with request.app.state.capacity_lock:
    request.app.state.accepted_call_ids.pop(call_id, None)

Monitoring Capacity

Track capacity rejections using metrics:
# Check total capacity rejections
GET /metrics
GET /metrics?tenant_id=<tenant_id>
Key metrics:
  • rejected_calls_capacity: Count of calls rejected due to capacity limits
  • active_calls: Current active call count
  • accepted_calls: Total accepted calls
See the Metrics Guide for details.

Capacity Rejection Response

When a call is rejected due to capacity, the webhook returns:
{
  "ok": true,
  "rejected": "capacity"
}
The OpenAI service receives the rejection and the caller hears a busy signal.

Best Practices

1. Set Conservative Limits

Start with lower limits and increase based on observed performance:
# Start conservative
MAX_CONCURRENT_CALLS=50

# Monitor metrics and gradually increase
MAX_CONCURRENT_CALLS=100

2. Monitor Rejection Rates

High rejection rates indicate insufficient capacity:
rejection_rate = rejected_calls_capacity / (accepted_calls + rejected_calls_capacity)
If the rejection rate exceeds 5-10%, consider increasing the limits or scaling the infrastructure.
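This calculation can be wrapped in a small helper (`rejection_rate` is hypothetical; the inputs correspond to the `rejected_calls_capacity` and `accepted_calls` metrics described in the Monitoring section):

```python
def rejection_rate(rejected: int, accepted: int) -> float:
    """Fraction of incoming calls rejected for capacity; 0.0 when no traffic."""
    total = accepted + rejected
    return rejected / total if total else 0.0
```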

3. Implement Tenant-Specific Limits

For multi-tenant deployments, prevent single tenants from monopolizing capacity:
max_concurrent_calls_per_tenant = max_concurrent_calls // 2  # 50% max per tenant

4. Use Async Locks Properly

All capacity state modifications use capacity_lock to prevent race conditions:
async with request.app.state.capacity_lock:
    # Atomic capacity checks and updates
    if call_id in request.app.state.pending_call_ids:
        return  # Already pending

Thread Safety

The capacity gating state is safe under concurrent webhook handling (handlers run as coroutines on the asyncio event loop, so an asyncio.Lock is sufficient):
  • All capacity state reads/writes protected by capacity_lock
  • Atomic check-and-reserve operations
  • Idempotent cleanup operations
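The atomic check-and-reserve property can be demonstrated with a self-contained sketch (`try_reserve` is a hypothetical helper; the production handler inlines the same pattern under `capacity_lock`):

```python
import asyncio

async def try_reserve(lock: asyncio.Lock, pending: set[str], call_id: str) -> bool:
    """Atomically reserve a pending slot; False if the call is already pending."""
    async with lock:
        if call_id in pending:
            return False
        pending.add(call_id)
        return True

async def main() -> None:
    lock, pending = asyncio.Lock(), set()
    first = await try_reserve(lock, pending, "call_1")
    second = await try_reserve(lock, pending, "call_1")  # duplicate webhook delivery
    print(first, second)  # prints: True False

asyncio.run(main())
```

Because the membership check and the insertion happen inside one critical section, two concurrent deliveries of the same webhook can never both reserve capacity.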

Troubleshooting

Calls Rejected Despite Low Active Count

Check pending calls:
len(request.app.state.pending_call_ids)  # May be high
Pending calls consume capacity until they start or fail.

Capacity Not Released

Check for exceptions in _start_call_session that prevent cleanup. Review logs:
grep "call_session_start_failed" logs.json

Deduplication False Positives

If legitimate calls are marked as duplicates:
# Check dedup cleanup interval (30 minutes default)
if now - request.app.state.last_dedup_cleanup > 1800:
    # Prune old webhook IDs
Reducing the cleanup interval prunes old webhook IDs sooner, which lowers false positives at the cost of a shorter deduplication window.
