Overview

CLI Proxy API intelligently routes requests across multiple credentials to maximize availability and balance load. The routing system handles:
  • Credential selection - Choosing which account to use
  • Load balancing - Distributing requests evenly
  • Quota management - Handling rate limits and daily quotas
  • Automatic failover - Retrying with different credentials
  • Model aliasing - Mapping model names

Routing Strategies

Two built-in strategies control credential selection:

Round-Robin (Default)

Distributes requests evenly across all available credentials:
config.yaml
routing:
  strategy: "round-robin"
How it works:
Request 1 → Account A
Request 2 → Account B
Request 3 → Account C
Request 4 → Account A  (cycles back)
Request 5 → Account B
...
// RoundRobinSelector provides provider-scoped round-robin selection.
type RoundRobinSelector struct {
    mu      sync.Mutex
    cursors map[string]int
    maxKeys int
}

func (s *RoundRobinSelector) Pick(ctx context.Context, 
    provider, model string, opts Options, auths []*Auth) (*Auth, error) {
    
    // Filter ready auths
    ready := filterReady(auths)
    if len(ready) == 0 {
        return nil, ErrNoCredentials
    }
    
    // Cursor per provider+model key; the modulo guards against the
    // ready set shrinking between requests.
    key := provider + "/" + model
    s.mu.Lock()
    cursor := s.cursors[key] % len(ready)
    s.cursors[key] = cursor + 1
    s.mu.Unlock()
    
    return ready[cursor], nil
}
Best for:
  • Even distribution across accounts
  • Maximizing total quota usage
  • Avoiding concentration on single account

Fill-First

Uses the first credential until it hits quota, then moves to the next:
config.yaml
routing:
  strategy: "fill-first"
How it works:
Request 1-100   → Account A
Request 101     → Account A (quota exceeded)
Request 102-200 → Account B
Request 201     → Account B (quota exceeded)
Request 202-300 → Account C
...
// FillFirstSelector selects the first available credential.
// This "burns" one account before moving to the next.
type FillFirstSelector struct{}

func (FillFirstSelector) Pick(ctx context.Context, 
    provider, model string, opts Options, auths []*Auth) (*Auth, error) {
    
    // Filter ready auths and sort by priority
    ready := filterReady(auths)
    if len(ready) == 0 {
        return nil, ErrNoCredentials
    }
    
    sortByPriority(ready)
    return ready[0], nil
}
Best for:
  • Staggering rolling-window limits (e.g., chat message caps)
  • Minimizing active accounts
  • Preserving specific accounts for peak times

Credential States

Each credential can be in one of four states:
type scheduledState int

const (
    scheduledStateReady    scheduledState = iota // Available for requests
    scheduledStateCooldown                       // Quota exceeded, waiting
    scheduledStateBlocked                        // Temporarily disabled
    scheduledStateDisabled                       // Permanently disabled
)

Ready

Credential is available and will be selected by routing strategy.

Cooldown

Credential exceeded quota and is temporarily blocked:
sdk/cliproxy/auth/conductor.go
const (
    quotaBackoffBase = time.Second
    quotaBackoffMax  = 30 * time.Minute
)
Cooldown behavior:
  1. Detect quota error (HTTP 429 or provider-specific message)
  2. Calculate backoff exponentially (matching the sequence below):
    backoff = min(quotaBackoffBase * 2^(failures-1), quotaBackoffMax)
    
  3. Enter cooldown for calculated duration
  4. Return to ready after cooldown expires
Example cooldown sequence:
Failure 1 → 1 second cooldown
Failure 2 → 2 seconds cooldown
Failure 3 → 4 seconds cooldown
Failure 4 → 8 seconds cooldown
...
Failure N → 30 minutes cooldown (max)

Blocked

Manually blocked via Management API or attributes.

Disabled

Permanently disabled (e.g., deleted auth file).

Priority-Based Selection

Credentials can have priority levels:
~/.cli-proxy-api/&lt;account-a&gt;.json
{
  "access_token": "...",
  "attributes": {
    "priority": "10"  // Higher = selected first
  }
}
~/.cli-proxy-api/&lt;account-b&gt;.json
{
  "access_token": "...",
  "attributes": {
    "priority": "1"
  }
}
Selection order:
  1. Priority 10 accounts selected first
  2. Priority 1 accounts used as fallback
  3. Priority 0 (default) used last
Round-robin operates within each priority level:
Priority 10: Account A, Account B
Priority 1:  Account C, Account D

Request flow:
A → B → A → B → ... (until all priority 10 hit quota)
C → D → C → D → ... (fallback to priority 1)
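The flow above can be sketched as "group by priority, then round-robin within the highest non-empty bucket". The `account` type and `pickWithPriority` helper below are hypothetical stand-ins for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// account is a simplified stand-in for a credential (hypothetical type).
type account struct {
	id       string
	priority int
}

// pickWithPriority groups accounts by priority and round-robins within the
// highest-priority bucket, using cursor as the rotating index.
func pickWithPriority(accounts []account, cursor int) account {
	buckets := map[int][]account{}
	for _, a := range accounts {
		buckets[a.priority] = append(buckets[a.priority], a)
	}
	// Collect priorities in descending order (higher = selected first).
	prios := make([]int, 0, len(buckets))
	for p := range buckets {
		prios = append(prios, p)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(prios)))
	top := buckets[prios[0]]
	return top[cursor%len(top)]
}

func main() {
	accts := []account{{"A", 10}, {"B", 10}, {"C", 1}, {"D", 1}}
	for i := 0; i < 4; i++ {
		fmt.Println(pickWithPriority(accts, i).id) // A B A B
	}
}
```

Lower-priority buckets only come into play once every account in the higher bucket is filtered out (quota, cooldown, blocked).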

Model Prefix Routing

Force specific credentials using model prefixes:

Configuring Prefixes

config.yaml
gemini-api-key:
  - api-key: "AIzaSyPersonal..."
    prefix: "personal"
  - api-key: "AIzaSyWork..."
    prefix: "work"
  - api-key: "AIzaSyTeam..."
    prefix: "team"

Using Prefixes

# Use personal account
curl -X POST http://localhost:8317/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "personal/gemini-2.5-pro",
    "messages": [...]
  }'

# Use work account
curl -X POST http://localhost:8317/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "work/gemini-2.5-pro",
    "messages": [...]
  }'
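Conceptually, the proxy splits the model string at the first slash to recover the prefix. A minimal sketch, assuming that behavior; `splitModelPrefix` is a hypothetical helper, and real model names that themselves contain a slash (e.g. OpenRouter's `anthropic/claude-3.5-sonnet`) would need prefix-aware handling beyond this sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// splitModelPrefix separates an optional credential prefix from the model
// name, e.g. "work/gemini-2.5-pro" → ("work", "gemini-2.5-pro").
func splitModelPrefix(model string) (prefix, name string) {
	if i := strings.IndexByte(model, '/'); i >= 0 {
		return model[:i], model[i+1:]
	}
	return "", model
}

func main() {
	p, m := splitModelPrefix("personal/gemini-2.5-pro")
	fmt.Println(p, m) // personal gemini-2.5-pro
}
```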

Force Prefix Mode

Require prefixes for all requests:
config.yaml
force-model-prefix: true
When enabled, unprefixed requests only use credentials without a prefix.

Model Aliasing

Map client model names to provider model names:

Global OAuth Aliases

config.yaml
oauth-model-alias:
  gemini-cli:
    - name: "gemini-2.5-pro"          # Upstream name
      alias: "g2.5p"                  # Client alias
      fork: false                     # Replace original
  claude:
    - name: "claude-sonnet-4-5-20250929"
      alias: "cs4.5"
      fork: true                      # Keep both
  codex:
    - name: "gpt-5"
      alias: "g5"
fork: false (default):
Client sees: "g2.5p"
Client does NOT see: "gemini-2.5-pro"
Request to "g2.5p" → upstream "gemini-2.5-pro"
fork: true:
Client sees: "cs4.5" AND "claude-sonnet-4-5-20250929"
Request to either → upstream "claude-sonnet-4-5-20250929"
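The fork semantics can be summarized in a few lines of Go. This is an illustrative sketch of the rules above, not the project's actual resolver; `aliasEntry`, `visibleModels`, and `resolve` are hypothetical names:

```go
package main

import "fmt"

// aliasEntry mirrors one oauth-model-alias item.
type aliasEntry struct {
	name  string // upstream model name
	alias string // client-facing alias
	fork  bool   // true: expose both names; false: expose alias only
}

// visibleModels lists the names a client would see in the model list.
func visibleModels(entries []aliasEntry) []string {
	var out []string
	for _, e := range entries {
		out = append(out, e.alias)
		if e.fork {
			out = append(out, e.name)
		}
	}
	return out
}

// resolve maps a requested name back to the upstream name.
func resolve(entries []aliasEntry, requested string) string {
	for _, e := range entries {
		if requested == e.alias || (e.fork && requested == e.name) {
			return e.name
		}
	}
	return requested
}

func main() {
	entries := []aliasEntry{
		{name: "gemini-2.5-pro", alias: "g2.5p", fork: false},
		{name: "claude-sonnet-4-5-20250929", alias: "cs4.5", fork: true},
	}
	fmt.Println(visibleModels(entries))
	fmt.Println(resolve(entries, "g2.5p"))
}
```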

API Key Aliases

config.yaml
gemini-api-key:
  - api-key: "AIzaSy..."
    models:
      - name: "gemini-2.5-flash"      # Upstream name
        alias: "gemini-flash"         # Client alias
      - name: "gemini-2.5-pro"
        alias: "gemini-pro"

codex-api-key:
  - api-key: "sk-atSM..."
    models:
      - name: "gpt-5-codex"           # Upstream name
        alias: "codex-latest"         # Client alias

Model Pools (Internal Failover)

Map multiple upstream models to the same alias:
config.yaml
openai-compatibility:
  - name: "openrouter"
    models:
      # All map to "best-model" alias
      - name: "anthropic/claude-3.5-sonnet"
        alias: "best-model"
      - name: "google/gemini-pro"
        alias: "best-model"
      - name: "openai/gpt-4"
        alias: "best-model"
Behavior:
  1. Client requests best-model
  2. Round-robin selects: claude-3.5-sonnet
  3. If fails before producing output → retry with gemini-pro
  4. If fails again → retry with gpt-4
  5. If all fail → return error

Model Exclusion

Hide models from the model list:

OAuth Exclusions

config.yaml
oauth-excluded-models:
  gemini-cli:
    - "gemini-2.5-pro"     # Exact match
    - "gemini-2.5-*"       # Prefix wildcard
    - "*-preview"          # Suffix wildcard
    - "*flash*"            # Substring wildcard
  claude:
    - "claude-3-5-haiku-20241022"
    - "*-thinking"
  codex:
    - "gpt-5-codex-mini"
    - "*-mini"

API Key Exclusions

config.yaml
gemini-api-key:
  - api-key: "AIzaSy..."
    excluded-models:
      - "gemini-2.5-pro"
      - "*-preview"
Wildcard patterns:
  • model-name - Exact match
  • prefix-* - Matches prefix-anything
  • *-suffix - Matches anything-suffix
  • *substring* - Matches any-substring-here
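The four pattern forms can be implemented with simple string checks. This is an illustrative sketch of the matching rules listed above, not the project's actual matcher:

```go
package main

import (
	"fmt"
	"strings"
)

// matchPattern implements the four wildcard forms: exact, "prefix-*",
// "*-suffix", and "*substring*".
func matchPattern(pattern, model string) bool {
	switch {
	case strings.HasPrefix(pattern, "*") && strings.HasSuffix(pattern, "*"):
		return strings.Contains(model, strings.Trim(pattern, "*"))
	case strings.HasPrefix(pattern, "*"):
		return strings.HasSuffix(model, strings.TrimPrefix(pattern, "*"))
	case strings.HasSuffix(pattern, "*"):
		return strings.HasPrefix(model, strings.TrimSuffix(pattern, "*"))
	default:
		return pattern == model
	}
}

func main() {
	fmt.Println(matchPattern("gemini-2.5-*", "gemini-2.5-flash")) // prefix
	fmt.Println(matchPattern("*-preview", "gemini-exp-preview"))  // suffix
	fmt.Println(matchPattern("*flash*", "gemini-2.5-flash-lite")) // substring
}
```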

Automatic Failover

When a request fails, CLI Proxy API automatically retries:

Retry Configuration

config.yaml
request-retry: 3              # Retry failed requests 3 times
max-retry-credentials: 5      # Try up to 5 different credentials
max-retry-interval: 30        # Wait max 30 seconds for cooldown

Retry Logic

// Manager fields controlling request retry behavior
type Manager struct {
    requestRetry        atomic.Int32  // Number of retries
    maxRetryCredentials atomic.Int32  // Max credentials to try
    maxRetryInterval    atomic.Int64  // Max cooldown wait (seconds)
}
Retry flow:
  1. Attempt 1: Credential A → Fails (quota exceeded)
  2. Attempt 2: Credential B → Fails (503 error)
  3. Attempt 3: Credential C → Fails (timeout)
  4. Attempt 4: Credential D → Success ✓
Retry conditions: Retries occur for these HTTP status codes:
  • 403 - Forbidden
  • 408 - Request Timeout
  • 429 - Too Many Requests (quota)
  • 500 - Internal Server Error
  • 502 - Bad Gateway
  • 503 - Service Unavailable
  • 504 - Gateway Timeout
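The status-code list above translates directly into a predicate. A sketch for illustration; `isRetryable` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryable reports whether a response status should trigger failover to
// the next credential, per the list of retryable codes.
func isRetryable(status int) bool {
	switch status {
	case http.StatusForbidden, // 403
		http.StatusRequestTimeout,      // 408
		http.StatusTooManyRequests,     // 429
		http.StatusInternalServerError, // 500
		http.StatusBadGateway,          // 502
		http.StatusServiceUnavailable,  // 503
		http.StatusGatewayTimeout:      // 504
		return true
	}
	return false
}

func main() {
	fmt.Println(isRetryable(429), isRetryable(404)) // true false
}
```

Client errors like 400 or 404 are deliberately absent: retrying a malformed request or an unknown model against another credential would fail identically.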

Quota Failover

Special handling for quota-related errors:
config.yaml
quota-exceeded:
  switch-project: true         # Try other credentials
  switch-preview-model: true   # Try preview models if available
switch-project: When true, quota errors trigger an immediate retry with the next credential:
Request → Account A → Quota exceeded
       → Account B → Quota exceeded
       → Account C → Success
switch-preview-model: When true, falls back to preview models:
Request: gemini-2.5-pro → Quota exceeded
      → gemini-2.5-pro-preview → Success
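The preview fallback can be modeled as a lookup from a stable model name to its preview variant, consulted only after a quota error. A hypothetical sketch; the mapping below is illustrative, not the project's actual table:

```go
package main

import "fmt"

// previewFallbacks maps stable models to preview variants tried after a
// quota error when switch-preview-model is enabled.
var previewFallbacks = map[string]string{
	"gemini-2.5-pro":   "gemini-2.5-pro-preview",
	"gemini-2.5-flash": "gemini-2.5-flash-preview",
}

// onQuotaExceeded returns the model to retry with, or "" if no preview
// fallback is known for this model.
func onQuotaExceeded(model string) string {
	return previewFallbacks[model]
}

func main() {
	fmt.Println(onQuotaExceeded("gemini-2.5-pro")) // gemini-2.5-pro-preview
}
```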

Multi-Provider Routing

Some models are available from multiple providers:
config.yaml
# Gemini from multiple sources
gemini-api-key:
  - api-key: "AIzaSyOfficial..."  # Official API

vertex-api-key:
  - api-key: "vk-relay1..."       # Relay service 1
  - api-key: "vk-relay2..."       # Relay service 2

# All provide "gemini-2.5-pro"
Selection order:
  1. Filter to credentials offering the requested model
  2. Apply routing strategy within available credentials
  3. Round-robin across providers (not just accounts)

Request Metadata

Control routing via request metadata:

Pin to Specific Credential

opts := executor.Options{
    Metadata: map[string]any{
        executor.PinnedAuthMetadataKey: "auth-id-123",
    },
}
Forces the request to use credential with ID auth-id-123.

Track Selected Credential

var selectedAuthID string
opts := executor.Options{
    Metadata: map[string]any{
        executor.SelectedAuthCallbackMetadataKey: func(authID string) {
            selectedAuthID = authID
        },
    },
}
Callback receives the ID of the selected credential.

Model Registry

The registry dynamically tracks which credentials can serve which models:
type ModelRegistration struct {
    Info  *ModelInfo
    Count int  // Number of credentials offering this model
    
    // Quota tracking
    QuotaExceededClients map[string]*time.Time
    
    // Provider breakdown
    Providers map[string]int  // provider → count
    
    // Suspended credentials
    SuspendedClients map[string]string
}
Dynamic visibility:
3 credentials offer "gemini-2.5-pro" → model appears in /v1/models
2 credentials hit quota              → model still visible (1 remaining)
Last credential hits quota           → model hidden from /v1/models
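The visibility rule reduces to a count comparison: a model is listed while at least one offering credential is not quota-exhausted. A simplified sketch of that check; `registration` and `visible` are illustrative names modeled on `ModelRegistration` above:

```go
package main

import "fmt"

// registration is a simplified view of ModelRegistration.
type registration struct {
	count         int             // credentials offering the model
	quotaExceeded map[string]bool // credential ID → currently in cooldown
}

// visible reports whether the model should appear in /v1/models: at least
// one offering credential must not be quota-exhausted.
func (r registration) visible() bool {
	return r.count-len(r.quotaExceeded) > 0
}

func main() {
	r := registration{count: 3, quotaExceeded: map[string]bool{}}
	fmt.Println(r.visible()) // all 3 ready
	r.quotaExceeded["a"] = true
	r.quotaExceeded["b"] = true
	fmt.Println(r.visible()) // 1 remaining, still visible
	r.quotaExceeded["c"] = true
	fmt.Println(r.visible()) // all exhausted, hidden
}
```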

Streaming Bootstrap Retries

For streaming requests, retries happen before the first byte is sent:
config.yaml
streaming:
  keepalive-seconds: 15     # Send blank lines every 15s
  bootstrap-retries: 1      # Retry once before streaming starts
Bootstrap retry flow:
  1. Attempt 1: Credential A → Error before streaming
  2. Attempt 2: Credential B → Starts streaming → Success
Once streaming starts, no more retries (client already receiving data).

Performance Considerations

Scheduler Optimization

The scheduler pre-builds selection views:
sdk/cliproxy/auth/scheduler.go
// Per-model scheduler tracks ready credentials
type modelScheduler struct {
    entries         map[string]*scheduledAuth
    priorityOrder   []int
    readyByPriority map[int]*readyBucket  // Pre-sorted
    blocked         cooldownQueue
}
Benefits:
  • O(1) credential selection (no sorting on hot path)
  • Efficient priority handling
  • Fast cooldown management

Concurrency

Credential selection takes only a read lock on the hot path:
sdk/cliproxy/auth/conductor.go
type Manager struct {
    mu    sync.RWMutex
    auths map[string]*Auth  // Read-locked during selection
}
Multiple requests can hold the read lock at once, so selection proceeds concurrently without blocking.

Debugging Routing

Enable debug logging:
config.yaml
debug: true
Logs include:
  • Credential selection decisions
  • Cooldown state changes
  • Retry attempts
  • Provider routing
Example log:
[DEBUG] Selecting credential for model=gemini-2.5-pro provider=gemini-cli
[DEBUG] Ready credentials: 3
[DEBUG] Selected credential: auth-id-abc123 (priority=10)
[DEBUG] Credential auth-id-xyz789 entered cooldown for 4s

Best Practices

  • Round-robin maximizes total quota usage by spreading load evenly across all accounts.
  • Fill-first staggers rolling-window limits so multiple accounts don't hit daily caps simultaneously.
  • Keep low-priority accounts as an emergency backup for when primary accounts hit quota.
  • Assign each team a prefixed credential pool to prevent quota conflicts.
  • Set max-retry-credentials to cap retry attempts so failures surface quickly instead of stalling.

Next Steps

Configuration

Configure routing behavior

Model Mappings

Set up model aliases and pools

Providers

Learn about provider-specific features

Management API

Monitor routing via API
