Overview

Codex Multi-Auth provides production-grade runtime reliability features that keep authentication seamless under adverse conditions such as rate limits, network errors, and token expiry.

Live Account Sync

Reload account state without restarting your editor or process.

File System Watching

Uses fs.watch to detect account storage changes in real time, with a debounced reload.

Polling Fallback

Polls the file's mtime every 2 seconds on platforms where fs.watch is unreliable (such as Windows).

Zero Downtime

Reloads accounts in the background without interrupting active requests or streams.

Concurrency Safe

Prevents concurrent reloads with in-flight request queuing.

How It Works

const liveSync = new LiveAccountSync(async () => {
  // Reload callback: reloads account manager from disk
  await accountManager.reload();
});

// Start watching storage file
await liveSync.syncToPath('/path/to/accounts.json');
  • Debounce: Changes are debounced by 250 ms to batch rapid writes.
  • Polling Interval: 2 seconds (configurable via pollIntervalMs).
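The watch-and-debounce combination can be sketched as follows. This is an illustrative sketch, not the library's implementation: `watchWithDebounce` is a hypothetical helper name, while the fs.watch wiring and the 250 ms window come from this page.

```typescript
import fs from 'node:fs';

// Debounce rapid file events so a burst of writes triggers one reload.
function debounce(fn: () => void, waitMs: number): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return () => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(fn, waitMs);
  };
}

// Illustrative watcher: reload on change, debounced by 250 ms.
function watchWithDebounce(path: string, reload: () => Promise<void>): fs.FSWatcher {
  const trigger = debounce(() => { void reload(); }, 250);
  return fs.watch(path, trigger);
}
```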

Use Cases

  • Multi-Instance Sync: Keep multiple editor windows in sync
  • External CLI Updates: Reflect codex auth login changes immediately
  • Team Workflows: Share account updates via version control (with encrypted tokens)
  • CI/CD: Reload accounts after secret injection

Monitoring

const snapshot = liveSync.getSnapshot();
console.log(snapshot);
// {
//   path: '~/.codex/multi-auth/openai-codex-accounts.json',
//   running: true,
//   lastKnownMtimeMs: 1709481234567,
//   lastSyncAt: 1709481238901,
//   reloadCount: 5,
//   errorCount: 0
// }

Proactive Token Refresh

Refresh OAuth tokens before they expire to prevent mid-request failures.

Refresh Guardian

// Default: refresh 5 minutes before expiry
const DEFAULT_PROACTIVE_BUFFER_MS = 5 * 60 * 1000;

// Check if account needs proactive refresh
if (shouldRefreshProactively(account, bufferMs)) {
  await proactiveRefreshAccount(account);
}

Refresh Strategy

  • Buffer Window: 5-minute default (configurable via tokenRefreshSkewMs)
  • Parallel Refresh: Refreshes multiple accounts concurrently
  • Queued Deduplication: Uses refresh queue to prevent duplicate refresh requests
  • Failure Handling: Logs failures but doesn’t block request flow
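The buffer-window check can be sketched as below. The `OAuthAccount` shape with an `expiresAt` epoch-milliseconds field is an assumption for illustration; only the 5-minute default buffer is documented above.

```typescript
// Illustrative account shape: token expiry as epoch milliseconds.
interface OAuthAccount {
  expiresAt?: number;
}

const DEFAULT_PROACTIVE_BUFFER_MS = 5 * 60 * 1000; // 5-minute default buffer

// Refresh when the token expires within the buffer window.
function shouldRefreshProactively(
  account: OAuthAccount,
  bufferMs: number = DEFAULT_PROACTIVE_BUFFER_MS,
  now: number = Date.now(),
): boolean {
  if (account.expiresAt === undefined) return false; // no expiry info: skip
  return account.expiresAt - now <= bufferMs;
}
```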

Bulk Refresh

// Refresh all expiring accounts
const results = await refreshExpiringAccounts(accounts);

for (const [index, result] of results.entries()) {
  if (result.reason === 'success') {
    console.log(`Account ${index}: refreshed successfully`);
  } else if (result.reason === 'failed') {
    console.log(`Account ${index}: refresh failed`);
  }
}
Summary Logging:
Proactively refreshing 3 account(s)
Proactive refresh complete: 3 total, 2 succeeded, 1 failed
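The counts in that summary line can be derived from the per-account results. A sketch, assuming result entries shaped like those in the loop above; `summarizeRefresh` is an illustrative name.

```typescript
interface RefreshResult { reason: 'success' | 'failed' }

// Summarize bulk-refresh outcomes into the counts logged above.
function summarizeRefresh(results: RefreshResult[]): { total: number; succeeded: number; failed: number } {
  const succeeded = results.filter((r) => r.reason === 'success').length;
  return { total: results.length, succeeded, failed: results.length - succeeded };
}
```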

Benefits

  • Reduces auth failures during long-running requests
  • Improves UX with seamless token rotation
  • Works alongside reactive refresh in fetch pipeline
  • No configuration required (enabled by default)

Failure Policy

Unified retry and failover decisions for network errors, auth failures, and rate limits.

Policy Decision Tree

type FailureAction = 'retry' | 'rotate' | 'fail';

// Network errors → retry with backoff
if (isNetworkError(error)) {
  return { action: 'retry', backoffMs: 1000 };
}

// Auth failures → rotate to next account
if (error.statusCode === 401) {
  return { action: 'rotate', reason: 'auth-failure' };
}

// Rate limits → rotate with cooldown
if (error.statusCode === 429) {
  return { 
    action: 'rotate', 
    cooldownMs: parseRetryAfter(error.headers),
    reason: 'rate-limit'
  };
}

// Fatal errors → fail fast
if (error.statusCode === 400) {
  return { action: 'fail', reason: 'client-error' };
}

Retry Categories

Error Type           Status Code      Action   Backoff
Network timeout      -                Retry    Exponential (1s, 2s, 4s)
Connection refused   -                Retry    Exponential (1s, 2s, 4s)
DNS failure          -                Retry    Exponential (1s, 2s, 4s)
Auth failure         401              Rotate   Immediate
Rate limit           429              Rotate   Parse Retry-After header
Server error         5xx              Rotate   Immediate
Client error         400, 403, 404    Fail     None
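The table above collapses into a single classification function. This is a sketch: the `ClassifiedError` shape and the `classifyFailure` name are assumptions, while the action mapping follows the table.

```typescript
type FailureAction = 'retry' | 'rotate' | 'fail';

// Illustrative error shape: statusCode is absent for network-level failures.
interface ClassifiedError {
  statusCode?: number;
  isNetworkError?: boolean;
}

// Map an error to the Action column of the retry-category table.
function classifyFailure(error: ClassifiedError): FailureAction {
  if (error.isNetworkError || error.statusCode === undefined) return 'retry'; // timeouts, DNS, refused
  if (error.statusCode === 401 || error.statusCode === 429) return 'rotate';
  if (error.statusCode >= 500) return 'rotate'; // 5xx server errors
  return 'fail'; // 400, 403, 404 and other client errors
}
```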

Cooldown Management

type CooldownReason = 'auth-failure' | 'network-error' | 'rate-limit';

interface AccountMetadata {
  coolingDownUntil?: number;    // Timestamp when cooldown expires
  cooldownReason?: CooldownReason;
}
Cooldown Durations:
  • Auth Failure: 60 seconds (hard failure cooldown)
  • Network Error: 30 seconds (soft retry cooldown)
  • Rate Limit: Parse from Retry-After header or default 60 seconds
Cooldown Behavior:
if (account.coolingDownUntil && account.coolingDownUntil > Date.now()) {
  // Skip account during selection
  continue;
}

Rate Limit Backoff

Exponential backoff with jitter for retry attempts.

Backoff Algorithm

const baseDelay = 1000;  // 1 second
const maxDelay = 32000;  // 32 seconds
const jitterFactor = 0.1;  // ±10% randomization

function calculateBackoff(attempt: number): number {
  // attempt is 1-based: attempt 1 → ~1s, attempt 2 → ~2s, …
  const exponential = Math.min(baseDelay * Math.pow(2, attempt - 1), maxDelay);
  const jitter = exponential * jitterFactor * (Math.random() * 2 - 1);
  return Math.floor(exponential + jitter);
}
Attempt Progression:
  • Attempt 1: ~1000ms ± 10%
  • Attempt 2: ~2000ms ± 10%
  • Attempt 3: ~4000ms ± 10%
  • Attempt 4: ~8000ms ± 10%
  • Attempt 5: ~16000ms ± 10%
  • Attempt 6+: ~32000ms ± 10% (capped)
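Putting the progression to work, a retry wrapper might look like the sketch below. The `withRetry` name is hypothetical, the default of 3 attempts mirrors `maxRetryAttempts`, and `calculateBackoff` is inlined (with 1-based attempts) so the block stands alone.

```typescript
const baseDelay = 1000;    // 1 second
const maxDelay = 32000;    // 32 seconds
const jitterFactor = 0.1;  // ±10% randomization

// Exponential backoff with jitter; attempt is 1-based.
function calculateBackoff(attempt: number): number {
  const exponential = Math.min(baseDelay * Math.pow(2, attempt - 1), maxDelay);
  const jitter = exponential * jitterFactor * (Math.random() * 2 - 1);
  return Math.floor(exponential + jitter);
}

// Retry an async operation, sleeping per the backoff schedule between attempts.
async function withRetry<T>(op: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, calculateBackoff(attempt)));
      }
    }
  }
  throw lastError;
}
```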

Retry-After Header

Respects server-provided retry hints:
// Parse the Retry-After header into a delay in milliseconds
function parseRetryAfter(headers: Headers): number | undefined {
  const retryAfter = headers.get('Retry-After');
  if (!retryAfter) return undefined;
  // Numeric form: seconds until retry
  if (/^\d+$/.test(retryAfter)) {
    return parseInt(retryAfter, 10) * 1000;
  }
  // HTTP-date form: parse and compute the delta from now
  const retryDate = new Date(retryAfter);
  if (Number.isNaN(retryDate.getTime())) return undefined;
  return Math.max(0, retryDate.getTime() - Date.now());
}

Stream Failover

Recover from stalled SSE streams with automatic failover.

Stall Detection

// Timeout if no data received for 30 seconds
const STREAM_STALL_TIMEOUT_MS = 30_000;

let lastDataTimestamp = Date.now();
stream.on('data', (chunk) => {
  lastDataTimestamp = Date.now();
  processChunk(chunk);
});

const stallTimer = setInterval(() => {
  if (Date.now() - lastDataTimestamp > STREAM_STALL_TIMEOUT_MS) {
    clearInterval(stallTimer);  // stop polling once the stream is abandoned
    stream.abort();
    initiateFailover();
  }
}, 5000);

Failover Strategy

Recovery Steps

  1. Detect Stall: No data received for 30 seconds
  2. Abort Stream: Close stalled connection
  3. Account Rotation: Switch to next healthy account
  4. Resume Request: Retry from last successful chunk
  5. State Reconstruction: Rebuild partial response if possible
  6. Fallback: Return partial content or error if unrecoverable

Partial Content Recovery

interface StreamState {
  chunksReceived: number;
  lastCompleteMessage: string;
  partialBuffer: string;
}

// Resume after failover
if (state.lastCompleteMessage) {
  // Continue from last complete SSE message
  yield state.lastCompleteMessage;
}
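Maintaining that state as chunks arrive comes down to splitting the buffer on SSE message boundaries (the blank-line `\n\n` delimiter is standard SSE framing). A sketch; `ingestChunk` is an illustrative name, not the library's API.

```typescript
interface StreamState {
  chunksReceived: number;
  lastCompleteMessage: string;
  partialBuffer: string;
}

// Fold an incoming chunk into the stream state: complete SSE messages
// (terminated by a blank line) are extracted and returned; the trailing
// remainder stays in partialBuffer until the next chunk arrives.
function ingestChunk(state: StreamState, chunk: string): string[] {
  state.chunksReceived++;
  state.partialBuffer += chunk;
  const parts = state.partialBuffer.split('\n\n');
  state.partialBuffer = parts.pop() ?? '';   // incomplete tail, if any
  const complete = parts.filter((p) => p.length > 0);
  if (complete.length > 0) {
    state.lastCompleteMessage = complete[complete.length - 1];
  }
  return complete;
}
```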

Session Affinity

Reduce account thrash by maintaining session-to-account affinity.

Affinity Cache

const sessionAffinity = new Map<string, number>();  // sessionId → accountIndex

// Sticky account for conversation
function selectAccount(sessionId?: string): number {
  if (sessionId && sessionAffinity.has(sessionId)) {
    const affinityIndex = sessionAffinity.get(sessionId)!;
    if (isAccountHealthy(affinityIndex)) {
      return affinityIndex;  // Reuse same account
    }
  }
  
  // Select new account if no affinity or account unhealthy
  const newIndex = selectHealthyAccount();
  if (sessionId) {
    sessionAffinity.set(sessionId, newIndex);
  }
  return newIndex;
}
Benefits:
  • Reduces auth header changes mid-conversation
  • Improves quota tracking accuracy
  • Minimizes account switching overhead
Eviction: Affinity cleared when account fails or becomes unhealthy.
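Eviction can be handled with a sweep over the same map. A sketch; `evictAffinity` is an illustrative name, not the library's API.

```typescript
const sessionAffinity = new Map<string, number>();  // sessionId → accountIndex

// Clear every session pinned to a failed account so the next request
// re-selects a healthy one. Returns how many sessions were evicted.
function evictAffinity(failedAccountIndex: number): number {
  let evicted = 0;
  for (const [sessionId, accountIndex] of sessionAffinity) {
    if (accountIndex === failedAccountIndex) {
      sessionAffinity.delete(sessionId);
      evicted++;
    }
  }
  return evicted;
}
```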

Circuit Breaker

Isolate failing accounts to prevent cascade failures.

Breaker States

type CircuitState = 'closed' | 'open' | 'half-open';

interface CircuitBreaker {
  state: CircuitState;
  failureCount: number;
  lastFailureTime: number;
  nextRetryTime: number;
}
State Transitions:
  • Closed: Normal operation, requests allowed
  • Open: Failure threshold exceeded, fast-fail all requests
  • Half-Open: Test request allowed after timeout, auto-close on success

Thresholds

const FAILURE_THRESHOLD = 5;        // Open circuit after 5 failures
const OPEN_DURATION_MS = 60_000;    // Stay open for 60 seconds
const HALF_OPEN_DURATION_MS = 5_000; // Test for 5 seconds before closing

Integration

if (circuitBreaker.state === 'open') {
  // Skip account, try next
  continue;
}

try {
  const result = await makeRequest(account);
  circuitBreaker.recordSuccess();
  return result;
} catch (error) {
  circuitBreaker.recordFailure();
  throw error;
}

Observability

Runtime telemetry for monitoring reliability features:
interface RuntimeMetrics {
  liveSync: {
    reloadCount: number;
    errorCount: number;
    lastSyncAt: number;
  };
  proactiveRefresh: {
    refreshCount: number;
    successCount: number;
    failureCount: number;
  };
  failover: {
    rotationCount: number;
    retryCount: number;
    failCount: number;
  };
  cooldowns: {
    activeCount: number;
    totalCooldownTime: number;
  };
}
Access Metrics:
const metrics = await codexManager.getMetrics();
console.log(JSON.stringify(metrics, null, 2));

Best Practices

Reliability Recommendations

  1. Enable Live Sync: Keep liveAccountSync: true for multi-instance setups
  2. Monitor Cooldowns: High cooldown rates indicate account or network issues
  3. Proactive Refresh: Use default 5-minute buffer unless latency-sensitive
  4. Respect Rate Limits: Don’t override cooldown timers manually
  5. Session Affinity: Enable for conversational workloads to reduce churn
  6. Circuit Breakers: Isolate chronically failing accounts with enabled: false
  7. Logs: Monitor lib/logger.ts output for failure patterns

Key configuration options:
  • tokenRefreshSkewMs - Proactive refresh buffer (default: 5 minutes)
  • liveAccountSync - Enable live file watching (default: true)
  • maxRetryAttempts - Maximum retry attempts per request (default: 3)
  • cooldownDurationMs - Default cooldown duration (default: 60 seconds)
