## Overview
PicoClaw supports load balancing across multiple LLM provider endpoints. This enables:

- **High availability** - Automatic failover if one endpoint is down
- **Rate limit avoidance** - Distribute requests across multiple API keys
- **Cost optimization** - Route to cheaper endpoints or free tiers
- **Geographic distribution** - Use regional endpoints for lower latency
## How Load Balancing Works
When you configure multiple entries with the same `model_name`, PicoClaw uses round-robin selection:

- First request → Entry 1 (sk-key1)
- Second request → Entry 2 (sk-key2)
- Third request → Entry 3 (sk-key3)
- Fourth request → Entry 1 (sk-key1) (round-robin restarts)
Round-robin selection happens at the time of each LLM request, distributing load evenly across all configured endpoints.
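The rotation above corresponds to a configuration like the following sketch. A LiteLLM-style `model_list` is assumed; only `model_name` and the `sk-key1`/`sk-key2`/`sk-key3` keys come from this guide, and the other field names are illustrative:

```yaml
# Sketch: three entries share one model_name, so requests to
# "gpt-4" rotate across the three API keys in round-robin order.
model_list:
  - model_name: gpt-4      # same name -> joins the rotation
    api_key: sk-key1
  - model_name: gpt-4
    api_key: sk-key2
  - model_name: gpt-4
    api_key: sk-key3
```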
## Use Cases
### 1. Multiple API Keys (Rate Limit Avoidance)
Distribute requests across multiple API keys for the same provider to:

- Avoid hitting rate limits on a single key
- Increase effective request throughput
- Maintain service during key rotation
### 2. Geographic Distribution
Use regional endpoints for lower latency, for example one entry per region sharing the same `model_name`.

### 3. Mixed Providers (Cost Optimization)

Balance between expensive and cheap providers by listing both under the same `model_name`.

### 4. Primary + Backup (High Availability)
Combine load balancing with fallback chains:

- Normal: Round-robin between primary and backup
- If primary fails: Use backup endpoint
- If both fail: Fall back to DeepSeek
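A minimal sketch of this primary-plus-backup setup, assuming a LiteLLM-style configuration (only `model_name`, the `fallbacks` array, and the DeepSeek fallback are stated in this guide; the other field names and key values are illustrative):

```yaml
# Sketch: two gpt-4 entries round-robin; DeepSeek is the last resort.
model_list:
  - model_name: gpt-4            # primary endpoint
    api_key: sk-primary
  - model_name: gpt-4            # backup endpoint, same model_name
    api_key: sk-backup
  - model_name: deepseek-chat    # last-resort provider
    api_key: sk-deepseek

# If both gpt-4 endpoints fail, fall back to deepseek-chat.
fallbacks:
  - gpt-4: [deepseek-chat]
```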
## Configuration Examples
### Load Balancing with Timeout
Set per-endpoint timeouts for faster failover.

### Multi-Region LiteLLM Proxy

Balance across multiple LiteLLM proxy instances.

### Self-Hosted vLLM Cluster

Balance across multiple vLLM inference servers.

## Monitoring Load Distribution

Check which endpoints are being used.

## Best Practices
- **Use same model across endpoints** - For consistent behavior, use the same model (e.g., gpt-4) across all balanced endpoints
- **Set reasonable timeouts** - Configure `request_timeout` to detect slow endpoints quickly
- **Monitor usage** - Track which endpoints are being used and adjust distribution as needed
- **Test failover** - Regularly test that backup endpoints work correctly
- **Combine with fallbacks** - Use fallback chains for ultimate reliability
- **Consider costs** - Balance load across free tiers to maximize value
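The timeout practice above can be sketched with a per-entry `request_timeout`, which appears in this guide; the surrounding schema and key values are assumptions modeled on LiteLLM-style config:

```yaml
# Sketch: a slow endpoint times out quickly, so the request can
# be retried on the next entry in the rotation.
model_list:
  - model_name: gpt-4
    api_key: sk-key1
    request_timeout: 10   # seconds (assumed unit)
  - model_name: gpt-4
    api_key: sk-key2
    request_timeout: 10
```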
## Load Balancing vs Fallbacks
| Feature | Load Balancing | Fallback Chain |
|---|---|---|
| Purpose | Distribute requests evenly | Handle failures gracefully |
| Selection | Round-robin | Sequential on error |
| Use case | Rate limits, cost, latency | High availability, redundancy |
| Configuration | Same `model_name` multiple times | `fallbacks` array |
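The configuration difference in the table can be sketched side by side (field names beyond `model_name` and `fallbacks` are assumptions, not the confirmed schema):

```yaml
# Load balancing: repeat the same model_name to join the rotation.
model_list:
  - model_name: gpt-4
    api_key: sk-key1
  - model_name: gpt-4
    api_key: sk-key2

# Fallback chain: entries tried sequentially only on error.
fallbacks:
  - gpt-4: [deepseek-chat]
```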
## Troubleshooting
### Requests only going to one endpoint

- Ensure all entries have the exact same `model_name`
- Check that all endpoints are configured correctly
- Verify API keys are valid
### High latency on some requests

- One or more endpoints may be slow
- Set `request_timeout` to fail fast and retry
- Consider removing slow endpoints from rotation
### Hitting rate limits despite load balancing
- You may need more API keys
- Check if requests are concentrated on certain endpoints
- Ensure round-robin is working (check logs)
## Next Steps

- **Model Configuration** - Complete guide to `model_list` configuration
- **Provider API** - Set up custom endpoints and proxies
- **Agent Config** - Configure agents with fallback chains
- **Providers** - Understand the provider system architecture