Overview

PicoClaw supports load balancing across multiple LLM provider endpoints. This enables:
  • High availability - Automatic failover if one endpoint is down
  • Rate limit avoidance - Distribute requests across multiple API keys
  • Cost optimization - Route to cheaper endpoints or free tiers
  • Geographic distribution - Use regional endpoints for lower latency

How Load Balancing Works

When you configure multiple entries with the same model_name, PicoClaw uses round-robin selection:
{
  "model_list": [
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_key": "sk-key1"
    },
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_key": "sk-key2"
    },
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_key": "sk-key3"
    }
  ]
}
Request routing:
  1. First request → Entry 1 (sk-key1)
  2. Second request → Entry 2 (sk-key2)
  3. Third request → Entry 3 (sk-key3)
  4. Fourth request → Entry 1 (sk-key1) (round-robin restarts)
Round-robin selection happens at the time of each LLM request, distributing load evenly across all configured endpoints.
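The selection described above can be sketched as cycling through the entries that share a model_name. This is a minimal illustration of the idea, not PicoClaw's actual implementation:

```python
import itertools
from collections import defaultdict

# The model_list from the config above, as Python data.
model_list = [
    {"model_name": "gpt-4", "model": "openai/gpt-4", "api_key": "sk-key1"},
    {"model_name": "gpt-4", "model": "openai/gpt-4", "api_key": "sk-key2"},
    {"model_name": "gpt-4", "model": "openai/gpt-4", "api_key": "sk-key3"},
]

class RoundRobinRouter:
    """Cycle through all entries sharing the same model_name."""
    def __init__(self, entries):
        groups = defaultdict(list)
        for entry in entries:
            groups[entry["model_name"]].append(entry)
        # One independent rotation per model_name group.
        self._cycles = {name: itertools.cycle(g) for name, g in groups.items()}

    def pick(self, model_name):
        return next(self._cycles[model_name])

router = RoundRobinRouter(model_list)
keys = [router.pick("gpt-4")["api_key"] for _ in range(4)]
print(keys)  # ['sk-key1', 'sk-key2', 'sk-key3', 'sk-key1']
```

The fourth pick wraps back to the first entry, matching the request-routing sequence listed above.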

Use Cases

1. Multiple API Keys (Rate Limit Avoidance)

Distribute requests across multiple API keys for the same provider:
{
  "model_list": [
    {
      "model_name": "claude-sonnet",
      "model": "anthropic/claude-sonnet-4.6",
      "api_key": "sk-ant-key1"
    },
    {
      "model_name": "claude-sonnet",
      "model": "anthropic/claude-sonnet-4.6",
      "api_key": "sk-ant-key2"
    },
    {
      "model_name": "claude-sonnet",
      "model": "anthropic/claude-sonnet-4.6",
      "api_key": "sk-ant-key3"
    }
  ]
}
Benefits:
  • Avoid hitting rate limits on a single key
  • Increase effective request throughput
  • Maintain service during key rotation

2. Geographic Distribution

Use regional endpoints for lower latency:
{
  "model_list": [
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_base": "https://us-east.api.openai.com/v1",
      "api_key": "sk-us-key"
    },
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_base": "https://eu-west.api.openai.com/v1",
      "api_key": "sk-eu-key"
    },
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_base": "https://ap-southeast.api.openai.com/v1",
      "api_key": "sk-ap-key"
    }
  ]
}

3. Mixed Providers (Cost Optimization)

Balance between expensive and cheap providers:
{
  "model_list": [
    {
      "model_name": "smart-agent",
      "model": "openai/gpt-4",
      "api_key": "sk-openai-key"
    },
    {
      "model_name": "smart-agent",
      "model": "anthropic/claude-sonnet-4.6",
      "api_key": "sk-ant-key"
    },
    {
      "model_name": "smart-agent",
      "model": "deepseek/deepseek-chat",
      "api_key": "sk-deepseek-key"
    }
  ]
}
Result: Requests are distributed across OpenAI, Anthropic, and DeepSeek, averaging out costs. Note that output quality and style will differ between the underlying models.
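With an even round-robin split, the blended per-token cost is simply the average across the providers in rotation. A quick sketch with hypothetical prices (illustrative numbers only, not real rates):

```python
# Hypothetical per-1M-token prices -- for illustration, not actual pricing.
prices = {
    "openai/gpt-4": 30.0,
    "anthropic/claude-sonnet-4.6": 15.0,
    "deepseek/deepseek-chat": 1.0,
}

# Even round-robin distribution means each provider serves 1/3 of requests,
# so the blended cost is the plain average.
blended = sum(prices.values()) / len(prices)
print(f"${blended:.2f} per 1M tokens")  # $15.33 per 1M tokens
```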

4. Primary + Backup (High Availability)

Combine load balancing with fallback chains:
{
  "model_list": [
    {
      "model_name": "production",
      "model": "openai/gpt-4",
      "api_base": "https://primary.api.com/v1",
      "api_key": "primary-key"
    },
    {
      "model_name": "production",
      "model": "openai/gpt-4",
      "api_base": "https://backup.api.com/v1",
      "api_key": "backup-key"
    }
  ],
  "agents": {
    "defaults": {
      "model": "production",
      "fallbacks": ["production", "deepseek/deepseek-chat"]
    }
  }
}
Behavior:
  • Normal: Round-robin between primary and backup
  • If primary fails: Use backup endpoint
  • If both fail: Fallback to DeepSeek
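The interplay of the two mechanisms can be sketched as: rotate within a model_name group first, and only move down the fallbacks chain once every entry in the group has failed. This is illustrative logic with a mocked `send()` and a hypothetical DeepSeek base URL; PicoClaw's internals may differ:

```python
import itertools

# Endpoint groups keyed by model_name; the DeepSeek URL is a placeholder.
endpoints = {
    "production": ["https://primary.api.com/v1", "https://backup.api.com/v1"],
    "deepseek/deepseek-chat": ["https://api.deepseek.example/v1"],
}
fallbacks = ["production", "deepseek/deepseek-chat"]

def send(url):
    """Stand-in for a real request; pretend the primary endpoint is down."""
    if "primary" in url:
        raise ConnectionError(url)
    return f"ok via {url}"

cycles = {name: itertools.cycle(urls) for name, urls in endpoints.items()}

def request_with_fallback():
    for name in fallbacks:
        for _ in endpoints[name]:      # try each entry in the group once
            url = next(cycles[name])
            try:
                return send(url)
            except ConnectionError:
                continue               # rotate to the next entry
    raise RuntimeError("all endpoints failed")

print(request_with_fallback())  # ok via https://backup.api.com/v1
```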

Configuration Examples

Load Balancing with Timeout

Set per-endpoint timeouts for faster failover:
{
  "model_list": [
    {
      "model_name": "fast-agent",
      "model": "groq/llama-3.3-70b-versatile",
      "api_key": "gsk-key1",
      "request_timeout": 30
    },
    {
      "model_name": "fast-agent",
      "model": "groq/llama-3.3-70b-versatile",
      "api_key": "gsk-key2",
      "request_timeout": 30
    }
  ]
}
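The effect of a per-endpoint timeout can be sketched as: give each entry a bounded window, and move to the next entry when the window expires. The helper below is a mock (with `call_endpoint` simulating one hung key), not PicoClaw's request path:

```python
import concurrent.futures
import time

entries = [
    {"api_key": "gsk-key1", "request_timeout": 0.05},
    {"api_key": "gsk-key2", "request_timeout": 0.05},
]

def call_endpoint(api_key):
    """Stand-in for a real call; gsk-key1 simulates a hung endpoint."""
    if api_key == "gsk-key1":
        time.sleep(0.3)
    return f"completion via {api_key}"

def request_with_timeouts(entries):
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(entries)) as pool:
        for entry in entries:
            future = pool.submit(call_endpoint, entry["api_key"])
            try:
                # Bound each attempt by that entry's request_timeout.
                return future.result(timeout=entry["request_timeout"])
            except concurrent.futures.TimeoutError:
                continue  # too slow: fail over to the next entry
    raise RuntimeError("all endpoints timed out")

print(request_with_timeouts(entries))  # completion via gsk-key2
```

A short timeout turns a slow endpoint into a fast failover instead of a stalled request.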

Multi-Region LiteLLM Proxy

Balance across multiple LiteLLM proxy instances:
{
  "model_list": [
    {
      "model_name": "proxy-model",
      "model": "litellm/gpt-4",
      "api_base": "http://us-proxy:4000/v1",
      "api_key": "sk-litellm-us"
    },
    {
      "model_name": "proxy-model",
      "model": "litellm/gpt-4",
      "api_base": "http://eu-proxy:4000/v1",
      "api_key": "sk-litellm-eu"
    }
  ]
}

Self-Hosted vLLM Cluster

Balance across multiple vLLM inference servers:
{
  "model_list": [
    {
      "model_name": "llama-local",
      "model": "vllm/llama-3-8b",
      "api_base": "http://gpu-node-1:8000/v1"
    },
    {
      "model_name": "llama-local",
      "model": "vllm/llama-3-8b",
      "api_base": "http://gpu-node-2:8000/v1"
    },
    {
      "model_name": "llama-local",
      "model": "vllm/llama-3-8b",
      "api_base": "http://gpu-node-3:8000/v1"
    }
  ]
}
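For a self-hosted cluster it is common to probe each node before putting it in rotation, so a dead GPU node does not receive every Nth request. A sketch with a mocked health check (a real probe might GET each node's /v1/models endpoint; node 2 is pretended to be down here):

```python
import itertools

nodes = [
    "http://gpu-node-1:8000/v1",
    "http://gpu-node-2:8000/v1",
    "http://gpu-node-3:8000/v1",
]

def is_healthy(base_url):
    """Stand-in for an HTTP probe; gpu-node-2 simulates a down node."""
    return "gpu-node-2" not in base_url

# Only healthy nodes enter the rotation.
healthy = [n for n in nodes if is_healthy(n)]
rotation = itertools.cycle(healthy)
picks = [next(rotation) for _ in range(4)]
print(picks)
```

With node 2 excluded, requests alternate between nodes 1 and 3.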

Monitoring Load Distribution

Check which endpoints are being used:
picoclaw status
This shows all configured providers and their health status.

Best Practices

Use same model across endpoints

For consistent behavior, use the same model (e.g., gpt-4) across all balanced endpoints

Set reasonable timeouts

Configure request_timeout to detect slow endpoints quickly

Monitor usage

Track which endpoints are being used and adjust distribution as needed

Test failover

Regularly test that backup endpoints work correctly

Combine with fallbacks

Use fallback chains for ultimate reliability

Consider costs

Balance load across free tiers to maximize value

Load Balancing vs Fallbacks

Feature        | Load Balancing                  | Fallback Chain
Purpose        | Distribute requests evenly      | Handle failures gracefully
Selection      | Round-robin                     | Sequential on error
Use case       | Rate limits, cost, latency      | High availability, redundancy
Configuration  | Same model_name multiple times  | fallbacks array
Use both together for maximum reliability and performance.

Troubleshooting

Load balancing not working
  • Ensure all entries have the exact same model_name
  • Check that all endpoints are configured correctly
  • Verify API keys are valid

Slow responses
  • One or more endpoints may be slow
  • Set request_timeout to fail fast and retry
  • Consider removing slow endpoints from rotation

Still hitting rate limits
  • You may need more API keys
  • Check if requests are concentrated on certain endpoints
  • Ensure round-robin is working (check logs)
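When checking logs for uneven distribution, counting selections per key should show a near-equal split. A toy check of what healthy round-robin output looks like:

```python
import itertools
from collections import Counter

keys = ["sk-key1", "sk-key2", "sk-key3"]
rotation = itertools.cycle(keys)

# Simulate 300 requests and tally which key served each one.
counts = Counter(next(rotation) for _ in range(300))
print(counts)  # each key appears exactly 100 times
```

If real logs show one key far ahead of the others, the entries likely do not share the same model_name, or some endpoints are failing out of rotation.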

Next Steps

Model Configuration

Complete guide to model_list configuration

Provider API

Set up custom endpoints and proxies

Agent Config

Configure agents with fallback chains

Providers

Understand the provider system architecture