Overview

PicoClaw supports load balancing across multiple LLM provider endpoints. This enables:
  • High availability - Automatic failover if one endpoint is down
  • Rate limit avoidance - Distribute requests across multiple API keys
  • Cost optimization - Route to cheaper endpoints or free tiers
  • Geographic distribution - Use regional endpoints for lower latency

How Load Balancing Works

When you configure multiple entries with the same model_name, PicoClaw uses round-robin selection:
{
  "model_list": [
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_key": "sk-key1"
    },
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_key": "sk-key2"
    },
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_key": "sk-key3"
    }
  ]
}
Request routing:
  1. First request → Entry 1 (sk-key1)
  2. Second request → Entry 2 (sk-key2)
  3. Third request → Entry 3 (sk-key3)
  4. Fourth request → Entry 1 (sk-key1) (round-robin restarts)
Round-robin selection happens at the time of each LLM request, distributing load evenly across all configured endpoints.
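The selection described above can be sketched as cycling through the entries that share a model_name. This is a minimal illustration of the idea, not PicoClaw's actual implementation:

```python
import itertools
from collections import defaultdict

# The model_list from the config above, as Python data.
model_list = [
    {"model_name": "gpt-4", "model": "openai/gpt-4", "api_key": "sk-key1"},
    {"model_name": "gpt-4", "model": "openai/gpt-4", "api_key": "sk-key2"},
    {"model_name": "gpt-4", "model": "openai/gpt-4", "api_key": "sk-key3"},
]

class RoundRobinRouter:
    """Cycle through all entries sharing the same model_name."""
    def __init__(self, entries):
        groups = defaultdict(list)
        for entry in entries:
            groups[entry["model_name"]].append(entry)
        # One independent rotation per model_name group.
        self._cycles = {name: itertools.cycle(g) for name, g in groups.items()}

    def pick(self, model_name):
        return next(self._cycles[model_name])

router = RoundRobinRouter(model_list)
keys = [router.pick("gpt-4")["api_key"] for _ in range(4)]
print(keys)  # ['sk-key1', 'sk-key2', 'sk-key3', 'sk-key1']
```

The fourth pick wraps back to the first entry, matching the request-routing sequence listed above.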

Use Cases

1. Multiple API Keys (Rate Limit Avoidance)

Distribute requests across multiple API keys for the same provider:
{
  "model_list": [
    {
      "model_name": "claude-sonnet",
      "model": "anthropic/claude-sonnet-4.6",
      "api_key": "sk-ant-key1"
    },
    {
      "model_name": "claude-sonnet",
      "model": "anthropic/claude-sonnet-4.6",
      "api_key": "sk-ant-key2"
    },
    {
      "model_name": "claude-sonnet",
      "model": "anthropic/claude-sonnet-4.6",
      "api_key": "sk-ant-key3"
    }
  ]
}
Benefits:
  • Avoid hitting rate limits on a single key
  • Increase effective request throughput
  • Maintain service during key rotation

2. Geographic Distribution

Use regional endpoints for lower latency:
{
  "model_list": [
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_base": "https://us-east.api.openai.com/v1",
      "api_key": "sk-us-key"
    },
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_base": "https://eu-west.api.openai.com/v1",
      "api_key": "sk-eu-key"
    },
    {
      "model_name": "gpt-4",
      "model": "openai/gpt-4",
      "api_base": "https://ap-southeast.api.openai.com/v1",
      "api_key": "sk-ap-key"
    }
  ]
}

3. Mixed Providers (Cost Optimization)

Balance between expensive and cheap providers:
{
  "model_list": [
    {
      "model_name": "smart-agent",
      "model": "openai/gpt-4",
      "api_key": "sk-openai-key"
    },
    {
      "model_name": "smart-agent",
      "model": "anthropic/claude-sonnet-4.6",
      "api_key": "sk-ant-key"
    },
    {
      "model_name": "smart-agent",
      "model": "deepseek/deepseek-chat",
      "api_key": "sk-deepseek-key"
    }
  ]
}
Result: Requests are distributed across OpenAI, Anthropic, and DeepSeek, averaging out costs. Note that output quality and style will differ between the underlying models.
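With an even round-robin split, the blended per-token cost is simply the average across the providers in rotation. A quick sketch with hypothetical prices (illustrative numbers only, not real rates):

```python
# Hypothetical per-1M-token prices -- for illustration, not actual pricing.
prices = {
    "openai/gpt-4": 30.0,
    "anthropic/claude-sonnet-4.6": 15.0,
    "deepseek/deepseek-chat": 1.0,
}

# Even round-robin distribution means each provider serves 1/3 of requests,
# so the blended cost is the plain average.
blended = sum(prices.values()) / len(prices)
print(f"${blended:.2f} per 1M tokens")  # $15.33 per 1M tokens
```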

4. Primary + Backup (High Availability)

Combine load balancing with fallback chains:
{
  "model_list": [
    {
      "model_name": "production",
      "model": "openai/gpt-4",
      "api_base": "https://primary.api.com/v1",
      "api_key": "primary-key"
    },
    {
      "model_name": "production",
      "model": "openai/gpt-4",
      "api_base": "https://backup.api.com/v1",
      "api_key": "backup-key"
    }
  ],
  "agents": {
    "defaults": {
      "model": "production",
      "fallbacks": ["production", "deepseek/deepseek-chat"]
    }
  }
}
Behavior:
  • Normal: Round-robin between primary and backup
  • If primary fails: Use backup endpoint
  • If both fail: Fallback to DeepSeek
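The interplay of the two mechanisms can be sketched as: rotate within a model_name group first, and only move down the fallbacks chain once every entry in the group has failed. This is illustrative logic with a mocked `send()` and a hypothetical DeepSeek base URL; PicoClaw's internals may differ:

```python
import itertools

# Endpoint groups keyed by model_name; the DeepSeek URL is a placeholder.
endpoints = {
    "production": ["https://primary.api.com/v1", "https://backup.api.com/v1"],
    "deepseek/deepseek-chat": ["https://api.deepseek.example/v1"],
}
fallbacks = ["production", "deepseek/deepseek-chat"]

def send(url):
    """Stand-in for a real request; pretend the primary endpoint is down."""
    if "primary" in url:
        raise ConnectionError(url)
    return f"ok via {url}"

cycles = {name: itertools.cycle(urls) for name, urls in endpoints.items()}

def request_with_fallback():
    for name in fallbacks:
        for _ in endpoints[name]:      # try each entry in the group once
            url = next(cycles[name])
            try:
                return send(url)
            except ConnectionError:
                continue               # rotate to the next entry
    raise RuntimeError("all endpoints failed")

print(request_with_fallback())  # ok via https://backup.api.com/v1
```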

Configuration Examples

Load Balancing with Timeout

Set per-endpoint timeouts for faster failover:
{
  "model_list": [
    {
      "model_name": "fast-agent",
      "model": "groq/llama-3.3-70b-versatile",
      "api_key": "gsk-key1",
      "request_timeout": 30
    },
    {
      "model_name": "fast-agent",
      "model": "groq/llama-3.3-70b-versatile",
      "api_key": "gsk-key2",
      "request_timeout": 30
    }
  ]
}
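The effect of a per-endpoint timeout can be sketched as: give each entry a bounded window, and move to the next entry when the window expires. The helper below is a mock (with `call_endpoint` simulating one hung key), not PicoClaw's request path:

```python
import concurrent.futures
import time

entries = [
    {"api_key": "gsk-key1", "request_timeout": 0.05},
    {"api_key": "gsk-key2", "request_timeout": 0.05},
]

def call_endpoint(api_key):
    """Stand-in for a real call; gsk-key1 simulates a hung endpoint."""
    if api_key == "gsk-key1":
        time.sleep(0.3)
    return f"completion via {api_key}"

def request_with_timeouts(entries):
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(entries)) as pool:
        for entry in entries:
            future = pool.submit(call_endpoint, entry["api_key"])
            try:
                # Bound each attempt by that entry's request_timeout.
                return future.result(timeout=entry["request_timeout"])
            except concurrent.futures.TimeoutError:
                continue  # too slow: fail over to the next entry
    raise RuntimeError("all endpoints timed out")

print(request_with_timeouts(entries))  # completion via gsk-key2
```

A short timeout turns a slow endpoint into a fast failover instead of a stalled request.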

Multi-Region LiteLLM Proxy

Balance across multiple LiteLLM proxy instances:
{
  "model_list": [
    {
      "model_name": "proxy-model",
      "model": "litellm/gpt-4",
      "api_base": "http://us-proxy:4000/v1",
      "api_key": "sk-litellm-us"
    },
    {
      "model_name": "proxy-model",
      "model": "litellm/gpt-4",
      "api_base": "http://eu-proxy:4000/v1",
      "api_key": "sk-litellm-eu"
    }
  ]
}

Self-Hosted vLLM Cluster

Balance across multiple vLLM inference servers:
{
  "model_list": [
    {
      "model_name": "llama-local",
      "model": "vllm/llama-3-8b",
      "api_base": "http://gpu-node-1:8000/v1"
    },
    {
      "model_name": "llama-local",
      "model": "vllm/llama-3-8b",
      "api_base": "http://gpu-node-2:8000/v1"
    },
    {
      "model_name": "llama-local",
      "model": "vllm/llama-3-8b",
      "api_base": "http://gpu-node-3:8000/v1"
    }
  ]
}
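For a self-hosted cluster it is common to probe each node before putting it in rotation, so a dead GPU node does not receive every Nth request. A sketch with a mocked health check (a real probe might GET each node's /v1/models endpoint; node 2 is pretended to be down here):

```python
import itertools

nodes = [
    "http://gpu-node-1:8000/v1",
    "http://gpu-node-2:8000/v1",
    "http://gpu-node-3:8000/v1",
]

def is_healthy(base_url):
    """Stand-in for an HTTP probe; gpu-node-2 simulates a down node."""
    return "gpu-node-2" not in base_url

# Only healthy nodes enter the rotation.
healthy = [n for n in nodes if is_healthy(n)]
rotation = itertools.cycle(healthy)
picks = [next(rotation) for _ in range(4)]
print(picks)
```

With node 2 excluded, requests alternate between nodes 1 and 3.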

Monitoring Load Distribution

Check which endpoints are being used:
picoclaw status
This shows all configured providers and their health status.

Best Practices

Use same model across endpoints

For consistent behavior, use the same model (e.g., gpt-4) across all balanced endpoints

Set reasonable timeouts

Configure request_timeout to detect slow endpoints quickly

Monitor usage

Track which endpoints are being used and adjust distribution as needed

Test failover

Regularly test that backup endpoints work correctly

Combine with fallbacks

Use fallback chains for ultimate reliability

Consider costs

Balance load across free tiers to maximize value

Load Balancing vs Fallbacks

Feature        | Load Balancing                  | Fallback Chain
Purpose        | Distribute requests evenly      | Handle failures gracefully
Selection      | Round-robin                     | Sequential on error
Use case       | Rate limits, cost, latency      | High availability, redundancy
Configuration  | Same model_name multiple times  | fallbacks array
Use both together for maximum reliability and performance.

Troubleshooting

Load balancing not working
  • Ensure all entries have the exact same model_name
  • Check that all endpoints are configured correctly
  • Verify API keys are valid

Slow responses
  • One or more endpoints may be slow
  • Set request_timeout to fail fast and retry
  • Consider removing slow endpoints from rotation

Still hitting rate limits
  • You may need more API keys
  • Check if requests are concentrated on certain endpoints
  • Ensure round-robin is working (check logs)
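When checking logs for uneven distribution, counting selections per key should show a near-equal split. A toy check of what healthy round-robin output looks like:

```python
import itertools
from collections import Counter

keys = ["sk-key1", "sk-key2", "sk-key3"]
rotation = itertools.cycle(keys)

# Simulate 300 requests and tally which key served each one.
counts = Counter(next(rotation) for _ in range(300))
print(counts)  # each key appears exactly 100 times
```

If real logs show one key far ahead of the others, the entries likely do not share the same model_name, or some endpoints are failing out of rotation.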

Next Steps

Model Configuration

Complete guide to model_list configuration

Provider API

Set up custom endpoints and proxies

Agent Config

Configure agents with fallback chains

Providers

Understand the provider system architecture