Overview

CLI Proxy API intelligently routes requests across multiple credentials to maximize availability and balance load. The routing system handles:
  • Credential selection - Choosing which account to use
  • Load balancing - Distributing requests evenly
  • Quota management - Handling rate limits and daily quotas
  • Automatic failover - Retrying with different credentials
  • Model aliasing - Mapping model names

Routing Strategies

Two built-in strategies control credential selection:

Round-Robin (Default)

Distributes requests evenly across all available credentials:
config.yaml
routing:
  strategy: "round-robin"
How it works:
Request 1 → Account A
Request 2 → Account B
Request 3 → Account C
Request 4 → Account A  (cycles back)
Request 5 → Account B
...
// RoundRobinSelector provides provider-scoped round-robin selection.
type RoundRobinSelector struct {
    mu      sync.Mutex
    cursors map[string]int
    maxKeys int
}

func (s *RoundRobinSelector) Pick(ctx context.Context, 
    provider, model string, opts Options, auths []*Auth) (*Auth, error) {
    
    // Filter ready auths
    ready := filterReady(auths)
    if len(ready) == 0 {
        return nil, ErrNoCredentials
    }
    
    // Cursor per provider+model key; the modulo guards against the
    // ready set shrinking between requests.
    key := provider + "/" + model
    s.mu.Lock()
    cursor := s.cursors[key] % len(ready)
    s.cursors[key] = cursor + 1
    s.mu.Unlock()
    
    return ready[cursor], nil
}
Best for:
  • Even distribution across accounts
  • Maximizing total quota usage
  • Avoiding concentration on single account

Fill-First

Uses the first credential until it hits quota, then moves to the next:
config.yaml
routing:
  strategy: "fill-first"
How it works:
Request 1-100   → Account A
Request 101     → Account A (quota exceeded)
Request 102-200 → Account B
Request 201     → Account B (quota exceeded)
Request 202-300 → Account C
...
// FillFirstSelector selects the first available credential.
// This "burns" one account before moving to the next.
type FillFirstSelector struct{}

func (FillFirstSelector) Pick(ctx context.Context, 
    provider, model string, opts Options, auths []*Auth) (*Auth, error) {
    
    // Filter ready auths and sort by priority
    ready := filterReady(auths)
    if len(ready) == 0 {
        return nil, ErrNoCredentials
    }
    
    sortByPriority(ready)
    return ready[0], nil
}
Best for:
  • Staggering rolling-window limits (e.g., chat message caps)
  • Minimizing active accounts
  • Preserving specific accounts for peak times

Credential States

Each credential can be in one of four states:
type scheduledState int

const (
    scheduledStateReady    scheduledState = iota // Available for requests
    scheduledStateCooldown                       // Quota exceeded, waiting
    scheduledStateBlocked                        // Temporarily disabled
    scheduledStateDisabled                       // Permanently disabled
)

Ready

Credential is available and will be selected by routing strategy.

Cooldown

Credential exceeded quota and is temporarily blocked:
sdk/cliproxy/auth/conductor.go
const (
    quotaBackoffBase = time.Second
    quotaBackoffMax  = 30 * time.Minute
)
Cooldown behavior:
  1. Detect quota error (HTTP 429 or provider-specific message)
  2. Calculate backoff exponentially (matching the sequence below):
    backoff = min(quotaBackoffBase * 2^(failures-1), quotaBackoffMax)
    
  3. Enter cooldown for calculated duration
  4. Return to ready after cooldown expires
Example cooldown sequence:
Failure 1 → 1 second cooldown
Failure 2 → 2 seconds cooldown
Failure 3 → 4 seconds cooldown
Failure 4 → 8 seconds cooldown
...
Failure N → 30 minutes cooldown (max)

Blocked

Manually blocked via Management API or attributes.

Disabled

Permanently disabled (e.g., deleted auth file).

Priority-Based Selection

Credentials can have priority levels:
~/.cli-proxy-api/&lt;account-a&gt;.json
{
  "access_token": "...",
  "attributes": {
    "priority": "10"  // Higher = selected first
  }
}
~/.cli-proxy-api/&lt;account-b&gt;.json
{
  "access_token": "...",
  "attributes": {
    "priority": "1"
  }
}
Selection order:
  1. Priority 10 accounts selected first
  2. Priority 1 accounts used as fallback
  3. Priority 0 (default) used last
Round-robin operates within each priority level:
Priority 10: Account A, Account B
Priority 1:  Account C, Account D

Request flow:
A → B → A → B → ... (until all priority 10 hit quota)
C → D → C → D → ... (fallback to priority 1)
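The flow above can be sketched as "group by priority, then round-robin within the highest non-empty bucket". The `account` type and `pickWithPriority` helper below are hypothetical stand-ins for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// account is a simplified stand-in for a credential (hypothetical type).
type account struct {
	id       string
	priority int
}

// pickWithPriority groups accounts by priority and round-robins within the
// highest-priority bucket, using cursor as the rotating index.
func pickWithPriority(accounts []account, cursor int) account {
	buckets := map[int][]account{}
	for _, a := range accounts {
		buckets[a.priority] = append(buckets[a.priority], a)
	}
	// Collect priorities in descending order (higher = selected first).
	prios := make([]int, 0, len(buckets))
	for p := range buckets {
		prios = append(prios, p)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(prios)))
	top := buckets[prios[0]]
	return top[cursor%len(top)]
}

func main() {
	accts := []account{{"A", 10}, {"B", 10}, {"C", 1}, {"D", 1}}
	for i := 0; i < 4; i++ {
		fmt.Println(pickWithPriority(accts, i).id) // A B A B
	}
}
```

Lower-priority buckets only come into play once every account in the higher bucket is filtered out (quota, cooldown, blocked).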

Model Prefix Routing

Force specific credentials using model prefixes:

Configuring Prefixes

config.yaml
gemini-api-key:
  - api-key: "AIzaSyPersonal..."
    prefix: "personal"
  - api-key: "AIzaSyWork..."
    prefix: "work"
  - api-key: "AIzaSyTeam..."
    prefix: "team"

Using Prefixes

# Use personal account
curl -X POST http://localhost:8317/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "personal/gemini-2.5-pro",
    "messages": [...]
  }'

# Use work account
curl -X POST http://localhost:8317/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "work/gemini-2.5-pro",
    "messages": [...]
  }'
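Conceptually, the proxy splits the model string at the first slash to recover the prefix. A minimal sketch, assuming that behavior; `splitModelPrefix` is a hypothetical helper, and real model names that themselves contain a slash (e.g. OpenRouter's `anthropic/claude-3.5-sonnet`) would need prefix-aware handling beyond this sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// splitModelPrefix separates an optional credential prefix from the model
// name, e.g. "work/gemini-2.5-pro" → ("work", "gemini-2.5-pro").
func splitModelPrefix(model string) (prefix, name string) {
	if i := strings.IndexByte(model, '/'); i >= 0 {
		return model[:i], model[i+1:]
	}
	return "", model
}

func main() {
	p, m := splitModelPrefix("personal/gemini-2.5-pro")
	fmt.Println(p, m) // personal gemini-2.5-pro
}
```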

Force Prefix Mode

Require prefixes for all requests:
config.yaml
force-model-prefix: true
When enabled, unprefixed requests only use credentials without a prefix.

Model Aliasing

Map client model names to provider model names:

Global OAuth Aliases

config.yaml
oauth-model-alias:
  gemini-cli:
    - name: "gemini-2.5-pro"          # Upstream name
      alias: "g2.5p"                  # Client alias
      fork: false                     # Replace original
  claude:
    - name: "claude-sonnet-4-5-20250929"
      alias: "cs4.5"
      fork: true                      # Keep both
  codex:
    - name: "gpt-5"
      alias: "g5"
fork: false (default):
Client sees: "g2.5p"
Client does NOT see: "gemini-2.5-pro"
Request to "g2.5p" → upstream "gemini-2.5-pro"
fork: true:
Client sees: "cs4.5" AND "claude-sonnet-4-5-20250929"
Request to either → upstream "claude-sonnet-4-5-20250929"
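The fork semantics can be summarized in a few lines of Go. This is an illustrative sketch of the rules above, not the project's actual resolver; `aliasEntry`, `visibleModels`, and `resolve` are hypothetical names:

```go
package main

import "fmt"

// aliasEntry mirrors one oauth-model-alias item.
type aliasEntry struct {
	name  string // upstream model name
	alias string // client-facing alias
	fork  bool   // true: expose both names; false: expose alias only
}

// visibleModels lists the names a client would see in the model list.
func visibleModels(entries []aliasEntry) []string {
	var out []string
	for _, e := range entries {
		out = append(out, e.alias)
		if e.fork {
			out = append(out, e.name)
		}
	}
	return out
}

// resolve maps a requested name back to the upstream name.
func resolve(entries []aliasEntry, requested string) string {
	for _, e := range entries {
		if requested == e.alias || (e.fork && requested == e.name) {
			return e.name
		}
	}
	return requested
}

func main() {
	entries := []aliasEntry{
		{name: "gemini-2.5-pro", alias: "g2.5p", fork: false},
		{name: "claude-sonnet-4-5-20250929", alias: "cs4.5", fork: true},
	}
	fmt.Println(visibleModels(entries))
	fmt.Println(resolve(entries, "g2.5p"))
}
```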

API Key Aliases

config.yaml
gemini-api-key:
  - api-key: "AIzaSy..."
    models:
      - name: "gemini-2.5-flash"      # Upstream name
        alias: "gemini-flash"         # Client alias
      - name: "gemini-2.5-pro"
        alias: "gemini-pro"

codex-api-key:
  - api-key: "sk-atSM..."
    models:
      - name: "gpt-5-codex"           # Upstream name
        alias: "codex-latest"         # Client alias

Model Pools (Internal Failover)

Map multiple upstream models to the same alias:
config.yaml
openai-compatibility:
  - name: "openrouter"
    models:
      # All map to "best-model" alias
      - name: "anthropic/claude-3.5-sonnet"
        alias: "best-model"
      - name: "google/gemini-pro"
        alias: "best-model"
      - name: "openai/gpt-4"
        alias: "best-model"
Behavior:
  1. Client requests best-model
  2. Round-robin selects: claude-3.5-sonnet
  3. If fails before producing output → retry with gemini-pro
  4. If fails again → retry with gpt-4
  5. If all fail → return error

Model Exclusion

Hide models from the model list:

OAuth Exclusions

config.yaml
oauth-excluded-models:
  gemini-cli:
    - "gemini-2.5-pro"     # Exact match
    - "gemini-2.5-*"       # Prefix wildcard
    - "*-preview"          # Suffix wildcard
    - "*flash*"            # Substring wildcard
  claude:
    - "claude-3-5-haiku-20241022"
    - "*-thinking"
  codex:
    - "gpt-5-codex-mini"
    - "*-mini"

API Key Exclusions

config.yaml
gemini-api-key:
  - api-key: "AIzaSy..."
    excluded-models:
      - "gemini-2.5-pro"
      - "*-preview"
Wildcard patterns:
  • model-name - Exact match
  • prefix-* - Matches prefix-anything
  • *-suffix - Matches anything-suffix
  • *substring* - Matches any-substring-here
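The four pattern forms can be implemented with simple string checks. This is an illustrative sketch of the matching rules listed above, not the project's actual matcher:

```go
package main

import (
	"fmt"
	"strings"
)

// matchPattern implements the four wildcard forms: exact, "prefix-*",
// "*-suffix", and "*substring*".
func matchPattern(pattern, model string) bool {
	switch {
	case strings.HasPrefix(pattern, "*") && strings.HasSuffix(pattern, "*"):
		return strings.Contains(model, strings.Trim(pattern, "*"))
	case strings.HasPrefix(pattern, "*"):
		return strings.HasSuffix(model, strings.TrimPrefix(pattern, "*"))
	case strings.HasSuffix(pattern, "*"):
		return strings.HasPrefix(model, strings.TrimSuffix(pattern, "*"))
	default:
		return pattern == model
	}
}

func main() {
	fmt.Println(matchPattern("gemini-2.5-*", "gemini-2.5-flash")) // prefix
	fmt.Println(matchPattern("*-preview", "gemini-exp-preview"))  // suffix
	fmt.Println(matchPattern("*flash*", "gemini-2.5-flash-lite")) // substring
}
```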

Automatic Failover

When a request fails, CLI Proxy API automatically retries:

Retry Configuration

config.yaml
request-retry: 3              # Retry failed requests 3 times
max-retry-credentials: 5      # Try up to 5 different credentials
max-retry-interval: 30        # Wait max 30 seconds for cooldown

Retry Logic

// Manager fields controlling request retry behavior
type Manager struct {
    requestRetry        atomic.Int32  // Number of retries
    maxRetryCredentials atomic.Int32  // Max credentials to try
    maxRetryInterval    atomic.Int64  // Max cooldown wait (seconds)
}
Retry flow:
  1. Attempt 1: Credential A → Fails (quota exceeded)
  2. Attempt 2: Credential B → Fails (503 error)
  3. Attempt 3: Credential C → Fails (timeout)
  4. Attempt 4: Credential D → Success ✓
Retry conditions: Retries occur for these HTTP status codes:
  • 403 - Forbidden
  • 408 - Request Timeout
  • 429 - Too Many Requests (quota)
  • 500 - Internal Server Error
  • 502 - Bad Gateway
  • 503 - Service Unavailable
  • 504 - Gateway Timeout
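The status-code list above translates directly into a predicate. A sketch for illustration; `isRetryable` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryable reports whether a response status should trigger failover to
// the next credential, per the list of retryable codes.
func isRetryable(status int) bool {
	switch status {
	case http.StatusForbidden, // 403
		http.StatusRequestTimeout,      // 408
		http.StatusTooManyRequests,     // 429
		http.StatusInternalServerError, // 500
		http.StatusBadGateway,          // 502
		http.StatusServiceUnavailable,  // 503
		http.StatusGatewayTimeout:      // 504
		return true
	}
	return false
}

func main() {
	fmt.Println(isRetryable(429), isRetryable(404)) // true false
}
```

Client errors like 400 or 404 are deliberately absent: retrying a malformed request or an unknown model against another credential would fail identically.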

Quota Failover

Special handling for quota-related errors:
config.yaml
quota-exceeded:
  switch-project: true         # Try other credentials
  switch-preview-model: true   # Try preview models if available
switch-project: When true, quota errors trigger an immediate retry with the next credential:
Request → Account A → Quota exceeded
       → Account B → Quota exceeded
       → Account C → Success
switch-preview-model: When true, falls back to preview models:
Request: gemini-2.5-pro → Quota exceeded
      → gemini-2.5-pro-preview → Success
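The preview fallback can be modeled as a lookup from a stable model name to its preview variant, consulted only after a quota error. A hypothetical sketch; the mapping below is illustrative, not the project's actual table:

```go
package main

import "fmt"

// previewFallbacks maps stable models to preview variants tried after a
// quota error when switch-preview-model is enabled.
var previewFallbacks = map[string]string{
	"gemini-2.5-pro":   "gemini-2.5-pro-preview",
	"gemini-2.5-flash": "gemini-2.5-flash-preview",
}

// onQuotaExceeded returns the model to retry with, or "" if no preview
// fallback is known for this model.
func onQuotaExceeded(model string) string {
	return previewFallbacks[model]
}

func main() {
	fmt.Println(onQuotaExceeded("gemini-2.5-pro")) // gemini-2.5-pro-preview
}
```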

Multi-Provider Routing

Some models are available from multiple providers:
config.yaml
# Gemini from multiple sources
gemini-api-key:
  - api-key: "AIzaSyOfficial..."  # Official API

vertex-api-key:
  - api-key: "vk-relay1..."       # Relay service 1
  - api-key: "vk-relay2..."       # Relay service 2

# All provide "gemini-2.5-pro"
Selection order:
  1. Filter to credentials offering the requested model
  2. Apply routing strategy within available credentials
  3. Round-robin across providers (not just accounts)

Request Metadata

Control routing via request metadata:

Pin to Specific Credential

opts := executor.Options{
    Metadata: map[string]any{
        executor.PinnedAuthMetadataKey: "auth-id-123",
    },
}
Forces the request to use credential with ID auth-id-123.

Track Selected Credential

var selectedAuthID string
opts := executor.Options{
    Metadata: map[string]any{
        executor.SelectedAuthCallbackMetadataKey: func(authID string) {
            selectedAuthID = authID
        },
    },
}
Callback receives the ID of the selected credential.

Model Registry

The registry dynamically tracks which credentials can serve which models:
type ModelRegistration struct {
    Info  *ModelInfo
    Count int  // Number of credentials offering this model
    
    // Quota tracking
    QuotaExceededClients map[string]*time.Time
    
    // Provider breakdown
    Providers map[string]int  // provider → count
    
    // Suspended credentials
    SuspendedClients map[string]string
}
Dynamic visibility:
3 credentials offer "gemini-2.5-pro" → model appears in /v1/models
2 credentials hit quota              → model still visible (1 remaining)
Last credential hits quota           → model hidden from /v1/models
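The visibility rule reduces to a count comparison: a model is listed while at least one offering credential is not quota-exhausted. A simplified sketch of that check; `registration` and `visible` are illustrative names modeled on `ModelRegistration` above:

```go
package main

import "fmt"

// registration is a simplified view of ModelRegistration.
type registration struct {
	count         int             // credentials offering the model
	quotaExceeded map[string]bool // credential ID → currently in cooldown
}

// visible reports whether the model should appear in /v1/models: at least
// one offering credential must not be quota-exhausted.
func (r registration) visible() bool {
	return r.count-len(r.quotaExceeded) > 0
}

func main() {
	r := registration{count: 3, quotaExceeded: map[string]bool{}}
	fmt.Println(r.visible()) // all 3 ready
	r.quotaExceeded["a"] = true
	r.quotaExceeded["b"] = true
	fmt.Println(r.visible()) // 1 remaining, still visible
	r.quotaExceeded["c"] = true
	fmt.Println(r.visible()) // all exhausted, hidden
}
```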

Streaming Bootstrap Retries

For streaming requests, retries happen before the first byte is sent:
config.yaml
streaming:
  keepalive-seconds: 15     # Send blank lines every 15s
  bootstrap-retries: 1      # Retry once before streaming starts
Bootstrap retry flow:
  1. Attempt 1: Credential A → Error before streaming
  2. Attempt 2: Credential B → Starts streaming → Success
Once streaming starts, no more retries (client already receiving data).

Performance Considerations

Scheduler Optimization

The scheduler pre-builds selection views:
sdk/cliproxy/auth/scheduler.go
// Per-model scheduler tracks ready credentials
type modelScheduler struct {
    entries         map[string]*scheduledAuth
    priorityOrder   []int
    readyByPriority map[int]*readyBucket  // Pre-sorted
    blocked         cooldownQueue
}
Benefits:
  • O(1) credential selection (no sorting on hot path)
  • Efficient priority handling
  • Fast cooldown management

Concurrency

Credential selection takes only a read lock on the hot path:
sdk/cliproxy/auth/conductor.go
type Manager struct {
    mu    sync.RWMutex
    auths map[string]*Auth  // Read-locked during selection
}
Multiple requests can hold the read lock at once, so selection proceeds concurrently without blocking.

Debugging Routing

Enable debug logging:
config.yaml
debug: true
Logs include:
  • Credential selection decisions
  • Cooldown state changes
  • Retry attempts
  • Provider routing
Example log:
[DEBUG] Selecting credential for model=gemini-2.5-pro provider=gemini-cli
[DEBUG] Ready credentials: 3
[DEBUG] Selected credential: auth-id-abc123 (priority=10)
[DEBUG] Credential auth-id-xyz789 entered cooldown for 4s

Best Practices

  • Round-robin maximizes total quota usage by spreading load evenly across all accounts.
  • Fill-first staggers rolling-window limits so multiple accounts don't hit daily caps simultaneously.
  • Keep low-priority accounts as an emergency backup for when primary accounts hit quota.
  • Assign each team a prefixed credential pool to prevent quota conflicts.
  • Set max-retry-credentials to cap retry attempts so failures surface quickly instead of stalling.

Next Steps

Configuration

Configure routing behavior

Model Mappings

Set up model aliases and pools

Providers

Learn about provider-specific features

Management API

Monitor routing via API
