Load balancing allows you to distribute LLM requests across multiple providers or API keys to optimize for availability, cost, performance, and rate limit management.
How Load Balancing Works
The gateway uses weighted random selection to distribute requests:
1. Assign weights to each target (default: 1)
2. Calculate the total weight across all targets
3. Generate a random value between 0 and the total weight
4. Select a target by iterating and subtracting weights
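The steps above can be sketched with concrete numbers (the target names and weights below are illustrative, not gateway defaults):

```typescript
// Two illustrative targets with weights 3 and 1 (a 75% / 25% split).
const targets = [
  { name: "primary", weight: 3 },
  { name: "secondary", weight: 1 },
];

// Step 2: total weight across all targets.
const totalWeight = targets.reduce((sum, t) => sum + t.weight, 0); // 4

// Step 3: random value in [0, totalWeight).
let r = Math.random() * totalWeight;

// Step 4: iterate, subtracting each weight until r falls inside a target's band.
let selected = targets[targets.length - 1];
for (const t of targets) {
  if (r < t.weight) {
    selected = t;
    break;
  }
  r -= t.weight;
}
```

Because `r` lands in `[0, 3)` three quarters of the time, `primary` is chosen for roughly 75% of requests.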
Configuration
Basic Load Balancing
Distribute requests equally across targets:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "provider": "openai", "api_key": "sk-1" },
    { "provider": "openai", "api_key": "sk-2" },
    { "provider": "openai", "api_key": "sk-3" }
  ]
}
Each target receives approximately 33% of requests.
Weighted Load Balancing
Control the distribution with explicit weights:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "provider": "openai", "api_key": "sk-1", "weight": 0.7 },
    { "provider": "anthropic", "api_key": "sk-ant-1", "weight": 0.2 },
    { "provider": "groq", "api_key": "gsk-1", "weight": 0.1 }
  ]
}
Distribution:
OpenAI: 70% of requests
Anthropic: 20% of requests
Groq: 10% of requests
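A standalone simulation (illustrative code, not the gateway's) shows the observed split converging on these percentages:

```typescript
// Simulate 100,000 weighted selections over weights 0.7 / 0.2 / 0.1.
const weights = [0.7, 0.2, 0.1];
const counts = [0, 0, 0];
const trials = 100_000;
const total = weights.reduce((a, b) => a + b, 0);

for (let i = 0; i < trials; i++) {
  let r = Math.random() * total;
  for (let j = 0; j < weights.length; j++) {
    if (r < weights[j]) {
      counts[j]++;
      break;
    }
    r -= weights[j];
  }
}

// Each share converges toward its configured weight.
const shares = counts.map((c) => c / trials);
```

At this sample size the shares land within a fraction of a percent of 0.70 / 0.20 / 0.10.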
Cross-Provider Load Balancing
Balance across different providers:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-...",
      "weight": 0.5,
      "override_params": { "model": "gpt-4o-mini" }
    },
    {
      "provider": "anthropic",
      "api_key": "sk-ant-...",
      "weight": 0.5,
      "override_params": { "model": "claude-3-5-haiku-20241022" }
    }
  ]
}
When load balancing across providers, ensure the models have similar capabilities to maintain a consistent user experience.
Weight Selection Algorithm
The gateway implements weighted random selection:
function selectProviderByWeight(providers: Options[]): Options {
  // Default weight to 1 if not specified
  providers = providers.map((provider) => ({
    ...provider,
    weight: provider.weight ?? 1,
  }));

  // Calculate total weight
  const totalWeight = providers.reduce(
    (sum, provider) => sum + provider.weight,
    0
  );

  // Generate random value
  let randomWeight = Math.random() * totalWeight;

  // Select provider
  for (const [index, provider] of providers.entries()) {
    if (randomWeight < provider.weight) {
      return { ...provider, index };
    }
    randomWeight -= provider.weight;
  }

  throw new Error('No provider selected, please check the weights');
}
Source: src/handlers/handlerUtils.ts:204-231
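As a usage sketch, a trimmed, self-contained copy of the same selection logic shows the weight-defaulting behavior (the `Target` type and `selectByWeight` name here are illustrative, not the gateway's `Options` API):

```typescript
interface Target {
  provider: string;
  weight?: number;
}

function selectByWeight(targets: Target[]): Target {
  // Default missing weights to 1, mirroring the gateway's behavior.
  const withWeights = targets.map((t) => ({ ...t, weight: t.weight ?? 1 }));
  const total = withWeights.reduce((s, t) => s + t.weight, 0);
  let r = Math.random() * total;
  for (const t of withWeights) {
    if (r < t.weight) return t;
    r -= t.weight;
  }
  throw new Error("No provider selected, please check the weights");
}

// A target with no weight defaults to 1, so it is selected as often
// as a target with an explicit weight of 1.
const pick = selectByWeight([
  { provider: "openai" },            // implicit weight 1
  { provider: "openai", weight: 1 }, // explicit weight 1
]);
```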
Weight Normalization
Weights don't need to sum to 1.0; they can be any positive numbers:
// These are equivalent:
{ "targets": [{ "weight": 0.5 }, { "weight": 0.5 }] }

{ "targets": [{ "weight": 1 }, { "weight": 1 }] }

{ "targets": [{ "weight": 50 }, { "weight": 50 }] }
The gateway treats weights proportionally, so only their ratios matter, not their absolute values.
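A quick check (an illustrative snippet, not gateway code) confirms the three weight sets above reduce to the same proportions:

```typescript
// Normalize each weight set to proportions; only the ratios matter.
const weightSets = [
  [0.5, 0.5],
  [1, 1],
  [50, 50],
];

const proportions = weightSets.map((ws) => {
  const total = ws.reduce((a, b) => a + b, 0);
  return ws.map((w) => w / total);
});
// Every set normalizes to [0.5, 0.5].
```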
Use Cases
Rate Limit Management
Distribute load across multiple API keys to avoid rate limits:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "provider": "openai", "api_key": "sk-1" },
    { "provider": "openai", "api_key": "sk-2" },
    { "provider": "openai", "api_key": "sk-3" },
    { "provider": "openai", "api_key": "sk-4" },
    { "provider": "openai", "api_key": "sk-5" }
  ]
}
With five API keys, you can handle up to 5x the throughput allowed by a single key's rate limit.
Cost Optimization
Route to cheaper providers while maintaining a fallback:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "groq",
      "api_key": "gsk-...",
      "weight": 0.8,
      "override_params": { "model": "llama-3.3-70b-versatile" }
    },
    {
      "provider": "openai",
      "api_key": "sk-...",
      "weight": 0.2,
      "override_params": { "model": "gpt-4o-mini" }
    }
  ]
}
Send 80% of traffic to cheaper Groq, 20% to OpenAI.
A/B Testing Models
Test different models with controlled traffic splits:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "openai",
      "api_key": "sk-...",
      "weight": 0.9,
      "override_params": { "model": "gpt-4o-mini" }
    },
    {
      "provider": "openai",
      "api_key": "sk-...",
      "weight": 0.1,
      "override_params": { "model": "gpt-4o" }
    }
  ]
}
Test the new model with 10% of traffic before full rollout.
Geographic Distribution
Route to region-specific endpoints:
{
  "strategy": { "mode": "loadbalance" },
  "targets": [
    {
      "provider": "azure-openai",
      "api_key": "...",
      "resource_name": "us-east-openai",
      "weight": 0.5
    },
    {
      "provider": "azure-openai",
      "api_key": "...",
      "resource_name": "eu-west-openai",
      "weight": 0.5
    }
  ]
}
Combining with Other Strategies
Load Balance + Fallback
Balance across primary keys, fallback to secondary provider:
{
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "strategy": { "mode": "loadbalance" },
      "targets": [
        { "provider": "openai", "api_key": "sk-1", "weight": 0.5 },
        { "provider": "openai", "api_key": "sk-2", "weight": 0.5 }
      ]
    },
    {
      "provider": "anthropic",
      "api_key": "sk-ant-..."
    }
  ]
}
Flow:
Load balance between sk-1 and sk-2
If both fail, fall back to Anthropic
Load Balance + Retry
Add retries to each load-balanced target:
{
  "strategy": { "mode": "loadbalance" },
  "retry": { "attempts": 3 },
  "targets": [
    { "provider": "openai", "api_key": "sk-1" },
    { "provider": "openai", "api_key": "sk-2" },
    { "provider": "openai", "api_key": "sk-3" }
  ]
}
Each selected target will retry up to 3 times before failing.
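The retry behavior can be sketched as a loop around the selected target (a simplified synchronous sketch; the gateway's real retry handling also accounts for status codes and backoff):

```typescript
// Retry a request against one selected target up to `attempts` times
// before the request is considered failed.
function requestWithRetry(sendRequest: () => string, attempts: number): string {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return sendRequest();
    } catch (err) {
      lastError = err; // transient failure: try the same target again
    }
  }
  throw lastError;
}

// Usage: fails twice, succeeds on the third attempt.
let calls = 0;
const result = requestWithRetry(() => {
  calls++;
  if (calls < 3) throw new Error("transient");
  return "ok";
}, 3);
```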
Load Balance + Cache
{
  "cache": { "mode": "simple", "max_age": 3600000 },
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "provider": "openai", "api_key": "sk-1", "weight": 0.5 },
    { "provider": "openai", "api_key": "sk-2", "weight": 0.5 }
  ]
}
Cache hits skip load balancing entirely, saving both latency and costs.
Implementation Details
Weight Processing
When entering loadbalance mode:
case StrategyModes.LOADBALANCE:
  // Set default weight of 1 for targets without weights
  currentTarget.targets.forEach((t: Options) => {
    if (t.weight === undefined) {
      t.weight = 1;
    }
  });

  // Calculate total weight
  let totalWeight = currentTarget.targets.reduce(
    (sum: number, provider: any) => sum + provider.weight,
    0
  );

  // Select provider by weight
  const selectedProvider = selectProviderByWeight(currentTarget.targets);
Source: src/handlers/handlerUtils.ts:693-712
Circuit Breaker Integration
Load balancing integrates with circuit breakers to exclude unhealthy targets:
const isHandlingCircuitBreaker = currentInheritedConfig.id;

if (isHandlingCircuitBreaker) {
  const healthyTargets = (currentTarget.targets || [])
    .map((t: any, index: number) => ({
      ...t,
      originalIndex: index,
    }))
    .filter((t: any) => !t.isOpen); // Filter out open (unhealthy) targets

  if (healthyTargets.length) {
    currentTarget.targets = healthyTargets;
  }
}
Source: src/handlers/handlerUtils.ts:646-658
Unhealthy targets are automatically removed from the load balancing pool.
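In outline (an illustrative sketch, not the gateway's exact code), the filtering works like this; note the guard that keeps the original pool when every circuit is open:

```typescript
interface PoolTarget {
  provider: string;
  isOpen?: boolean; // true when the circuit breaker has tripped
}

// Keep only targets whose circuit is closed (healthy), remembering
// each one's original index in the configured list.
function healthyPool(targets: PoolTarget[]) {
  const indexed = targets.map((t, originalIndex) => ({ ...t, originalIndex }));
  const healthy = indexed.filter((t) => !t.isOpen);
  // If every circuit is open, fall back to the full pool rather than
  // failing outright (mirrors the `if (healthyTargets.length)` guard).
  return healthy.length ? healthy : indexed;
}

const pool = healthyPool([
  { provider: "openai" },
  { provider: "anthropic", isOpen: true }, // circuit open: excluded
  { provider: "groq" },
]);
// pool keeps openai (index 0) and groq (index 2)
```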
Monitoring and Metrics
Request Distribution
With proper weights, distribution should match configured percentages over time:
# For weight configuration:
# Provider A: 0.7, Provider B: 0.3
# Expected over 1000 requests:
# Provider A: ~700 requests
# Provider B: ~300 requests
Actual Distribution
Monitor actual distribution to ensure weights are working:
const distribution = {
  'provider-1': 0,
  'provider-2': 0,
  'provider-3': 0,
};

// Track each request
const selectedProvider = selectProviderByWeight(targets);
distribution[selectedProvider.provider]++;
Small sample sizes may show variance from expected distribution. Statistical convergence occurs over hundreds or thousands of requests.
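The expected variance can be quantified: a target with selection probability p over n requests receives a binomially distributed count, with standard deviation sqrt(n * p * (1 - p)):

```typescript
// Standard deviation of a target's request count over n requests,
// treating each selection as an independent Bernoulli trial.
function countStdDev(n: number, p: number): number {
  return Math.sqrt(n * p * (1 - p));
}

// For a 0.7-weight target: noticeable relative spread at 100 requests,
// proportionally tiny at 100,000.
const small = countStdDev(100, 0.7);     // about 4.58 requests (~4.6%)
const large = countStdDev(100_000, 0.7); // about 144.9 requests (~0.14%)
```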
Best Practices
Start with equal weights: begin with equal distribution and adjust based on performance data.
Monitor provider health: track error rates and latency per provider to inform weight adjustments.
Use fallbacks too: combine load balancing with fallbacks for maximum reliability.
Test weight changes: adjust weights gradually and measure the impact before making large changes.
Selection Overhead
Weight calculation: O(n) where n = number of targets
Random selection: O(n) worst case
Total overhead: < 0.1ms for typical configs
Memory Usage
Minimal per-request overhead:
Weight array: 8 bytes per target
Random value: 8 bytes
Selected index: 4 bytes
Distribution Quality
The gateway uses Math.random() which provides:
Uniform distribution
Sufficient randomness for load balancing
Fast execution (< 0.01ms)
Common Patterns
Primary-Secondary Split
{
  "targets": [
    { "provider": "openai", "api_key": "primary", "weight": 0.9 },
    { "provider": "anthropic", "api_key": "secondary", "weight": 0.1 }
  ]
}
Use secondary provider to validate primary or test alternatives.
Multi-Key Rate Limit Avoidance
{
  "targets": [
    { "provider": "openai", "api_key": "key-1" },
    { "provider": "openai", "api_key": "key-2" },
    { "provider": "openai", "api_key": "key-3" },
    { "provider": "openai", "api_key": "key-4" }
  ]
}
Each key gets 25% of traffic, avoiding rate limits.
Cost-Quality Split
{
  "targets": [
    { "provider": "groq", "weight": 0.6, "override_params": { "model": "llama-3" } },
    { "provider": "openai", "weight": 0.4, "override_params": { "model": "gpt-4o" } }
  ]
}
Balance cost (Groq) with quality (OpenAI).
Next Steps
Routing: learn about other routing strategies like fallback and conditional.
Configs: master the complete configuration system.
Retries: add retries to load-balanced requests.
Providers: understand provider capabilities for load balancing.