Load balancing allows you to distribute LLM requests across multiple providers or API keys to optimize for availability, cost, performance, and rate limit management.

How Load Balancing Works

The gateway uses weighted random selection to distribute requests:
  1. Assign weights to each target (default: 1)
  2. Calculate total weight across all targets
  3. Generate random value between 0 and total weight
  4. Select target by iterating and subtracting weights
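The four steps can be traced with a small deterministic sketch (a hypothetical helper, not the gateway's code; the random draw is passed in explicitly so the arithmetic can be followed by hand):

```typescript
// Hypothetical walkthrough of the steps above; `draw` stands in for the
// random value (a number in [0, 1)) so the result is reproducible.
function pickIndex(weights: number[], draw: number): number {
  const total = weights.reduce((sum, w) => sum + w, 0); // step 2
  let r = draw * total;                                 // step 3
  for (let i = 0; i < weights.length; i++) {            // step 4
    if (r < weights[i]) return i;
    r -= weights[i];
  }
  return weights.length - 1; // unreachable while draw < 1
}

// Weights [3, 1] (total 4): a draw of 0.5 scales to 2.0, which falls in
// the first target's [0, 3) slice; 0.9 scales to 3.6, landing in [3, 4).
console.log(pickIndex([3, 1], 0.5)); // → 0
console.log(pickIndex([3, 1], 0.9)); // → 1
```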

Configuration

Basic Load Balancing

Distribute requests equally across targets:
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-1"
    },
    {
      "provider": "openai",
      "api_key": "sk-2"
    },
    {
      "provider": "openai",
      "api_key": "sk-3"
    }
  ]
}
Each target receives approximately 33% of requests.

Weighted Load Balancing

Control the distribution with explicit weights:
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-1",
      "weight": 0.7
    },
    {
      "provider": "anthropic",
      "api_key": "sk-ant-1",
      "weight": 0.2
    },
    {
      "provider": "groq",
      "api_key": "gsk-1",
      "weight": 0.1
    }
  ]
}
Distribution:
  • OpenAI: 70% of requests
  • Anthropic: 20% of requests
  • Groq: 10% of requests

Cross-Provider Load Balancing

Balance across different providers:
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-...",
      "weight": 0.5,
      "override_params": {
        "model": "gpt-4o-mini"
      }
    },
    {
      "provider": "anthropic",
      "api_key": "sk-ant-...",
      "weight": 0.5,
      "override_params": {
        "model": "claude-3-5-haiku-20241022"
      }
    }
  ]
}
When load balancing across providers, ensure the models have similar capabilities to maintain consistent user experience.

Weight Selection Algorithm

The gateway implements weighted random selection:
function selectProviderByWeight(providers: Options[]): Options {
  // Default weight to 1 if not specified
  providers = providers.map(provider => ({
    ...provider,
    weight: provider.weight ?? 1
  }));
  
  // Calculate total weight
  const totalWeight = providers.reduce(
    (sum, provider) => sum + provider.weight,
    0
  );
  
  // Generate random value
  let randomWeight = Math.random() * totalWeight;
  
  // Select provider
  for (const [index, provider] of providers.entries()) {
    if (randomWeight < provider.weight) {
      return { ...provider, index };
    }
    randomWeight -= provider.weight;
  }
  
  throw new Error('No provider selected, please check the weights');
}
Source: src/handlers/handlerUtils.ts:204-231

Weight Normalization

Weights don’t need to sum to 1.0; they can be any positive numbers:
// These are equivalent:
{
  "targets": [
    { "weight": 0.5 },
    { "weight": 0.5 }
  ]
}

{
  "targets": [
    { "weight": 1 },
    { "weight": 1 }
  ]
}

{
  "targets": [
    { "weight": 50 },
    { "weight": 50 }
  ]
}
Only the proportions matter: each target’s selection probability is its weight divided by the total weight.
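To see why, compute each target's share of the total weight; scaling every weight by the same factor leaves the shares unchanged (an illustrative helper, not gateway code):

```typescript
// Each target's selection probability is weight / totalWeight, so only
// the ratios between weights matter.
function shares(weights: number[]): number[] {
  const total = weights.reduce((sum, w) => sum + w, 0);
  return weights.map((w) => w / total);
}

console.log(shares([0.5, 0.5])); // → [0.5, 0.5]
console.log(shares([1, 1]));     // → [0.5, 0.5]
console.log(shares([50, 50]));   // → [0.5, 0.5]
```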

Use Cases

Rate Limit Management

Distribute load across multiple API keys to avoid rate limits:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "provider": "openai", "api_key": "sk-1" },
    { "provider": "openai", "api_key": "sk-2" },
    { "provider": "openai", "api_key": "sk-3" },
    { "provider": "openai", "api_key": "sk-4" },
    { "provider": "openai", "api_key": "sk-5" }
  ]
}
With 5 API keys, you can sustain roughly 5x the per-key rate limit.

Cost Optimization

Route to cheaper providers while maintaining a fallback:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "groq",
      "api_key": "gsk-...",
      "weight": 0.8,
      "override_params": { "model": "llama-3.3-70b-versatile" }
    },
    {
      "provider": "openai",
      "api_key": "sk-...",
      "weight": 0.2,
      "override_params": { "model": "gpt-4o-mini" }
    }
  ]
}
Send 80% of traffic to cheaper Groq, 20% to OpenAI.

A/B Testing Models

Test different models with controlled traffic splits:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-...",
      "weight": 0.9,
      "override_params": { "model": "gpt-4o-mini" }
    },
    {
      "provider": "openai",
      "api_key": "sk-...",
      "weight": 0.1,
      "override_params": { "model": "gpt-4o" }
    }
  ]
}
Test the new model with 10% of traffic before full rollout.

Geographic Distribution

Route to region-specific endpoints:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "azure-openai",
      "api_key": "...",
      "resource_name": "us-east-openai",
      "weight": 0.5
    },
    {
      "provider": "azure-openai",
      "api_key": "...",
      "resource_name": "eu-west-openai",
      "weight": 0.5
    }
  ]
}

Combining with Other Strategies

Load Balance + Fallback

Balance across primary keys, fallback to secondary provider:
{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "strategy": { "mode": "loadbalance" },
      "targets": [
        { "provider": "openai", "api_key": "sk-1", "weight": 0.5 },
        { "provider": "openai", "api_key": "sk-2", "weight": 0.5 }
      ]
    },
    {
      "provider": "anthropic",
      "api_key": "sk-ant-..."
    }
  ]
}
Flow:
  1. Load balance between sk-1 and sk-2
  2. If both fail, fall back to Anthropic

Load Balance + Retry

Add retries to each load-balanced target:
{
  "strategy": { "mode": "loadbalance" },
  "retry": { "attempts": 3 },
  "targets": [
    { "provider": "openai", "api_key": "sk-1" },
    { "provider": "openai", "api_key": "sk-2" },
    { "provider": "openai", "api_key": "sk-3" }
  ]
}
Each selected target will retry up to 3 times before failing.
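The interaction can be sketched as a wrapper around each attempt against the already-selected target (a simplified sketch with hypothetical names, not the gateway's retry implementation):

```typescript
// Hypothetical sketch: the target chosen by load balancing is attempted up
// to `attempts` times; load balancing is not re-run between attempts.
async function withRetry<T>(fn: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // a real implementation might back off here
    }
  }
  throw lastError;
}
```

Per the behavior described above, the retries apply to the target that was selected, not to the pool as a whole.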

Load Balance + Cache

{
  "cache": { "mode": "simple", "max_age": 3600000 },
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "provider": "openai", "api_key": "sk-1", "weight": 0.5 },
    { "provider": "openai", "api_key": "sk-2", "weight": 0.5 }
  ]
}
Cache hits skip load balancing entirely, saving both latency and costs.
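The ordering can be sketched with a hypothetical in-memory cache (not the gateway's cache implementation): a hit short-circuits before any target is selected.

```typescript
// Hypothetical request flow: check the cache first; only a miss reaches
// the load balancer (represented here by the `forward` callback).
const cache = new Map<string, { body: string; storedAt: number }>();
const MAX_AGE_MS = 3_600_000; // matches the max_age in the config above

function handle(requestKey: string, forward: () => string): string {
  const hit = cache.get(requestKey);
  if (hit && Date.now() - hit.storedAt < MAX_AGE_MS) {
    return hit.body; // cache hit: load balancing never runs
  }
  const body = forward(); // cache miss: select a target and call the provider
  cache.set(requestKey, { body, storedAt: Date.now() });
  return body;
}
```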

Implementation Details

Weight Processing

When entering loadbalance mode:
case StrategyModes.LOADBALANCE:
  // Set default weight of 1 for targets without weights
  currentTarget.targets.forEach((t: Options) => {
    if (t.weight === undefined) {
      t.weight = 1;
    }
  });
  
  // Calculate total weight
  let totalWeight = currentTarget.targets.reduce(
    (sum: number, provider: any) => sum + provider.weight,
    0
  );
  
  // Select provider by weight
  const selectedProvider = selectProviderByWeight(
    currentTarget.targets
  );
Source: src/handlers/handlerUtils.ts:693-712

Request Flow

A load-balanced request passes through these stages:
  1. Check the cache (if configured); a hit returns immediately
  2. Drop targets whose circuit breaker is open
  3. Default missing weights to 1 and sum the total weight
  4. Pick a target by weighted random selection
  5. Forward the request, applying any retry configuration

Circuit Breaker Integration

Load balancing integrates with circuit breakers to exclude unhealthy targets:
const isHandlingCircuitBreaker = currentInheritedConfig.id;
if (isHandlingCircuitBreaker) {
  const healthyTargets = (currentTarget.targets || [])
    .map((t: any, index: number) => ({
      ...t,
      originalIndex: index,
    }))
    .filter((t: any) => !t.isOpen);  // Filter out open (unhealthy) targets
  
  if (healthyTargets.length) {
    currentTarget.targets = healthyTargets;
  }
}
Source: src/handlers/handlerUtils.ts:646-658 Unhealthy targets are automatically removed from the load balancing pool.
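The filtering, including the keep-the-original-pool behavior when every breaker is open (the `if (healthyTargets.length)` guard above), can be illustrated in isolation (hypothetical shapes, not gateway code):

```typescript
// Targets with an open circuit breaker are dropped before weighted
// selection; if all breakers are open, the original pool is kept so the
// request still has somewhere to go.
type Target = { provider: string; weight: number; isOpen?: boolean };

function healthyPool(targets: Target[]): Target[] {
  const healthy = targets.filter((t) => !t.isOpen);
  return healthy.length ? healthy : targets;
}

const pool: Target[] = [
  { provider: "openai", weight: 0.5, isOpen: true }, // breaker open
  { provider: "anthropic", weight: 0.5 },
];
console.log(healthyPool(pool).map((t) => t.provider)); // → ["anthropic"]
```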

Monitoring and Metrics

Request Distribution

With proper weights, distribution should match configured percentages over time:
# For weight configuration:
# Provider A: 0.7, Provider B: 0.3

# Expected over 1000 requests:
# Provider A: ~700 requests
# Provider B: ~300 requests

Actual Distribution

Monitor actual distribution to ensure weights are working:
const distribution: Record<string, number> = {
  'provider-1': 0,
  'provider-2': 0,
  'provider-3': 0
};

// Tally selections across many requests
for (let i = 0; i < 10_000; i++) {
  const selectedProvider = selectProviderByWeight(targets);
  distribution[selectedProvider.provider]++;
}
Small sample sizes may show variance from expected distribution. Statistical convergence occurs over hundreds or thousands of requests.

Best Practices

Start with Equal Weights

Begin with equal distribution and adjust based on performance data.

Monitor Provider Health

Track error rates and latency per provider to inform weight adjustments.

Use Fallbacks Too

Combine load balancing with fallbacks for maximum reliability.

Test Weight Changes

Gradually adjust weights and monitor impact before large changes.

Performance Characteristics

Selection Overhead

  • Weight calculation: O(n) where n = number of targets
  • Random selection: O(n) worst case
  • Total overhead: < 0.1ms for typical configs

Memory Usage

Minimal per-request overhead:
  • Weight array: 8 bytes per target
  • Random value: 8 bytes
  • Selected index: 4 bytes

Distribution Quality

The gateway uses Math.random() which provides:
  • Uniform distribution
  • Sufficient randomness for load balancing
  • Fast execution (< 0.01ms)

Common Patterns

Primary-Secondary Split

{
  "targets": [
    { "provider": "openai", "api_key": "primary", "weight": 0.9 },
    { "provider": "anthropic", "api_key": "secondary", "weight": 0.1 }
  ]
}
Use secondary provider to validate primary or test alternatives.

Multi-Key Rate Limit Avoidance

{
  "targets": [
    { "provider": "openai", "api_key": "key-1" },
    { "provider": "openai", "api_key": "key-2" },
    { "provider": "openai", "api_key": "key-3" },
    { "provider": "openai", "api_key": "key-4" }
  ]
}
Each key gets 25% of traffic, avoiding rate limits.

Cost-Performance Trade-off

{
  "targets": [
    { "provider": "groq", "weight": 0.6, "override_params": { "model": "llama-3.3-70b-versatile" } },
    { "provider": "openai", "weight": 0.4, "override_params": { "model": "gpt-4o" } }
  ]
}
Balance cost (Groq) with quality (OpenAI).

Next Steps

Routing

Learn about other routing strategies like fallback and conditional.

Configs

Master the complete configuration system.

Retries

Add retries to load-balanced requests.

Providers

Understand provider capabilities for load balancing.
