Overview

The Router provides intelligent load balancing, fallbacks, and retries across multiple LLM deployments. This guide covers router-specific configuration.

Basic Router Setup

Python Configuration

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {
                "model": "azure/gpt-4",
                "api_key": "your-key",
                "api_base": "https://your-endpoint.openai.azure.com/"
            },
            "tpm": 100000,
            "rpm": 1000
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {
                "model": "gpt-4",
                "api_key": "your-openai-key"
            },
            "tpm": 90000,
            "rpm": 900
        }
    ],
    routing_strategy="usage-based-routing"
)

# Use the router
response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

YAML Configuration (for Proxy)

model_list:
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_key: os.environ/AZURE_API_KEY
      api_base: https://your-endpoint.openai.azure.com/
    tpm: 100000
    rpm: 1000
  
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
    tpm: 90000
    rpm: 900

router_settings:
  routing_strategy: usage-based-routing
  num_retries: 3
  timeout: 30
  fallbacks:
    - gpt-4: ["gpt-3.5-turbo"]

Routing Strategies

simple-shuffle (Default)

Randomly selects from available deployments.
router = Router(
    model_list=[...],
    routing_strategy="simple-shuffle"
)
Best for: Simple load distribution without specific requirements.
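The idea behind simple-shuffle can be sketched in a few lines of plain Python (a simplified illustration, not LiteLLM's internal code). When a deployment carries an optional `weight` field, the shuffle becomes a weighted random choice; the `id` and `weight` keys below are illustrative:

```python
import random

def simple_shuffle(deployments, rng=random):
    """Pick a deployment at random, honoring an optional per-deployment weight."""
    weights = [d.get("weight", 1) for d in deployments]
    return rng.choices(deployments, weights=weights, k=1)[0]

# Illustrative deployments: "azure-gpt-4" receives ~3x the traffic of
# "openai-gpt-4", which gets the default weight of 1.
deployments = [
    {"id": "azure-gpt-4", "weight": 3},
    {"id": "openai-gpt-4"},
]
```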

usage-based-routing

Respects TPM (tokens per minute) and RPM (requests per minute) limits.
router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4", "api_key": "key1"},
            "tpm": 100000,  # 100K tokens per minute
            "rpm": 1000      # 1K requests per minute
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "azure/gpt-4", "api_key": "key2"},
            "tpm": 200000,
            "rpm": 2000
        }
    ],
    routing_strategy="usage-based-routing"
)
Best for: Respecting provider rate limits and quotas.
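Conceptually, usage-based routing filters out deployments that have exhausted their per-minute budgets and then prefers the one with the most headroom. Here is a simplified sketch of that idea (not LiteLLM's implementation; the `UsageTracker` class and its method names are invented for illustration):

```python
from collections import defaultdict

class UsageTracker:
    """Toy TPM/RPM tracker: filter out deployments over their per-minute
    limits, then pick the one with the most remaining token budget."""

    def __init__(self, deployments):
        self.deployments = deployments
        self.tokens_used = defaultdict(int)    # deployment id -> tokens this minute
        self.requests_made = defaultdict(int)  # deployment id -> requests this minute

    def record(self, dep_id, tokens):
        self.tokens_used[dep_id] += tokens
        self.requests_made[dep_id] += 1

    def pick(self):
        available = [
            d for d in self.deployments
            if self.tokens_used[d["id"]] < d["tpm"]
            and self.requests_made[d["id"]] < d["rpm"]
        ]
        if not available:
            raise RuntimeError("all deployments are at their rate limits")
        # Prefer the deployment with the most remaining token budget.
        return max(available, key=lambda d: d["tpm"] - self.tokens_used[d["id"]])
```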

latency-based-routing

Routes to the deployment with lowest latency.
router = Router(
    model_list=[...],
    routing_strategy="latency-based-routing",
    routing_strategy_args={
        "ttl": 60  # Cache latency measurements for 60 seconds
    }
)
Best for: Optimizing response time across geographic regions.
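The role of the `ttl` argument is easiest to see in a sketch: keep per-deployment latency samples, average only the ones newer than `ttl` seconds, and route to the fastest deployment. This is a simplified illustration of the idea, not LiteLLM's implementation; the class and method names are invented:

```python
import time

class LatencyRouter:
    """Route to the deployment with the lowest recent average latency;
    samples older than `ttl` seconds are ignored."""

    def __init__(self, deployments, ttl=60):
        self.deployments = deployments
        self.ttl = ttl
        self.samples = {d["id"]: [] for d in deployments}  # (timestamp, seconds)

    def record_latency(self, dep_id, seconds, now=None):
        self.samples[dep_id].append((now if now is not None else time.time(), seconds))

    def _avg(self, dep_id, now):
        fresh = [s for t, s in self.samples[dep_id] if now - t <= self.ttl]
        # An unmeasured deployment averages 0.0, so it gets tried eagerly.
        return sum(fresh) / len(fresh) if fresh else 0.0

    def pick(self, now=None):
        now = now if now is not None else time.time()
        return min(self.deployments, key=lambda d: self._avg(d["id"], now))
```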

least-busy

Routes to deployment with fewest ongoing requests.
router = Router(
    model_list=[...],
    routing_strategy="least-busy"
)
Best for: Even load distribution in high-concurrency scenarios.
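At its core, least-busy routing just tracks in-flight requests per deployment and picks the minimum. A minimal sketch (not LiteLLM's code; names are illustrative):

```python
class LeastBusyRouter:
    """Route to the deployment with the fewest in-flight requests."""

    def __init__(self, deployments):
        self.in_flight = {d: 0 for d in deployments}

    def acquire(self):
        dep = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[dep] += 1
        return dep

    def release(self, dep):
        self.in_flight[dep] -= 1
```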

cost-based-routing

Routes to the cheapest deployment.
router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4", "api_key": "key"},
            "model_info": {
                "input_cost_per_token": 0.00003,
                "output_cost_per_token": 0.00006
            }
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "azure/gpt-4", "api_key": "key"},
            "model_info": {
                "input_cost_per_token": 0.000025,
                "output_cost_per_token": 0.00005
            }
        }
    ],
    routing_strategy="cost-based-routing"
)
Best for: Cost optimization.
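The selection rule amounts to estimating each deployment's cost for the request and taking the minimum. A simplified sketch (not LiteLLM's implementation; the `cheapest_deployment` helper and `id` field are invented, but the `model_info` cost keys mirror the config above):

```python
def cheapest_deployment(deployments, input_tokens, output_tokens):
    """Pick the deployment with the lowest estimated cost for a request."""
    def cost(d):
        info = d["model_info"]
        return (input_tokens * info["input_cost_per_token"]
                + output_tokens * info["output_cost_per_token"])
    return min(deployments, key=cost)

deployments = [
    {"id": "openai", "model_info": {"input_cost_per_token": 0.00003,
                                    "output_cost_per_token": 0.00006}},
    {"id": "azure",  "model_info": {"input_cost_per_token": 0.000025,
                                    "output_cost_per_token": 0.00005}},
]
```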

Fallback Configuration

Basic Fallbacks

If a call to a model group fails, the router retries the request on its listed fallback groups in order:
router = Router(
    model_list=[
        {"model_name": "gpt-4", "litellm_params": {"model": "gpt-4"}},
        {"model_name": "gpt-3.5-turbo", "litellm_params": {"model": "gpt-3.5-turbo"}},
        {"model_name": "claude-2", "litellm_params": {"model": "claude-2"}}
    ],
    fallbacks=[
        {"gpt-4": ["gpt-3.5-turbo", "claude-2"]}
    ]
)
YAML:
model_list:
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
  
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
  
  - model_name: claude-2
    litellm_params:
      model: claude-2
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  fallbacks:
    - gpt-4: ["gpt-3.5-turbo", "claude-2"]
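The fallback cascade itself is simple: try the primary model group, and on failure walk the configured fallback list until a call succeeds. A simplified sketch of that control flow (not LiteLLM's implementation; `call` stands in for any function that returns a response or raises):

```python
def completion_with_fallbacks(call, model, fallbacks):
    """Return (model_used, response) from the first model in the fallback
    chain whose call succeeds; re-raise the last error if all fail."""
    chain = [model] + fallbacks.get(model, [])
    last_error = None
    for candidate in chain:
        try:
            return candidate, call(candidate)
        except Exception as e:
            last_error = e
    raise last_error
```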

Context Window Fallbacks

When a request exceeds a model's context window, the router retries it on a configured larger-context model:
router = Router(
    model_list=[
        {"model_name": "gpt-3.5-turbo", "litellm_params": {"model": "gpt-3.5-turbo"}},
        {"model_name": "gpt-3.5-turbo-16k", "litellm_params": {"model": "gpt-3.5-turbo-16k"}},
        {"model_name": "gpt-4", "litellm_params": {"model": "gpt-4"}},
        {"model_name": "gpt-4-32k", "litellm_params": {"model": "gpt-4-32k"}}
    ],
    context_window_fallbacks=[
        {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]},
        {"gpt-4": ["gpt-4-32k"]}
    ]
)
YAML:
litellm_settings:
  context_window_fallbacks:
    - gpt-3.5-turbo: ["gpt-3.5-turbo-16k"]
    - gpt-4: ["gpt-4-32k"]
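The decision boils down to comparing the request's token count against each model's context window and switching to the first fallback that fits. A simplified sketch (not LiteLLM's implementation; the window sizes and helper name are illustrative):

```python
# Hypothetical context window sizes, for illustration only.
CONTEXT_WINDOWS = {"gpt-3.5-turbo": 4096, "gpt-3.5-turbo-16k": 16384}

def resolve_for_context(model, prompt_tokens, cw_fallbacks):
    """Return the model itself if the prompt fits, otherwise the first
    configured fallback whose context window is large enough."""
    if prompt_tokens <= CONTEXT_WINDOWS[model]:
        return model
    for candidate in cw_fallbacks.get(model, []):
        if prompt_tokens <= CONTEXT_WINDOWS[candidate]:
            return candidate
    raise ValueError(f"no configured model fits {prompt_tokens} tokens")
```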

Retry Configuration

Basic Retries

router = Router(
    model_list=[...],
    num_retries=3,
    timeout=30,
    retry_after=5  # Wait 5s before retry
)
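The interaction of `num_retries` and `retry_after` is just a bounded retry loop with a fixed wait between attempts. A simplified sketch (not LiteLLM's implementation; the `sleep` parameter is an invented hook so the wait can be observed in tests):

```python
import time

def call_with_retries(fn, num_retries=3, retry_after=5, sleep=time.sleep):
    """Call fn, retrying up to num_retries times on any exception and
    waiting retry_after seconds between attempts."""
    for attempt in range(num_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == num_retries:
                raise
            sleep(retry_after)
```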

Per-Error Retry Policy

from litellm.router import RetryPolicy

router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        RateLimitErrorRetries=5,
        TimeoutErrorRetries=2,
        InternalServerErrorRetries=3
    )
)
YAML:
router_settings:
  retry_policy:
    RateLimitErrorRetries: 5
    TimeoutErrorRetries: 2
    InternalServerErrorRetries: 3

Per-Model-Group Retry Policy

from litellm.router import RetryPolicy

router = Router(
    model_list=[...],
    model_group_retry_policy={
        "gpt-4": RetryPolicy(RateLimitErrorRetries=10),
        "claude-2": RetryPolicy(RateLimitErrorRetries=3)
    }
)

Cooldown Configuration

Deployments that fail repeatedly are placed in a temporary cooldown and removed from rotation:
router = Router(
    model_list=[...],
    allowed_fails=3,       # Allow 3 failures
    cooldown_time=120,     # 2 minute cooldown
    disable_cooldowns=False
)
YAML:
router_settings:
  allowed_fails: 3
  cooldown_time: 120
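The cooldown mechanic can be sketched as a small failure counter: once a deployment accumulates `allowed_fails` failures, it is benched for `cooldown_time` seconds. This is a simplified illustration, not LiteLLM's implementation; the class and method names are invented:

```python
import time

class CooldownTracker:
    """Bench a deployment for cooldown_time seconds after it
    accumulates allowed_fails failures."""

    def __init__(self, allowed_fails=3, cooldown_time=120):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}
        self.cooldown_until = {}

    def record_failure(self, dep_id, now=None):
        now = now if now is not None else time.time()
        self.fail_counts[dep_id] = self.fail_counts.get(dep_id, 0) + 1
        if self.fail_counts[dep_id] >= self.allowed_fails:
            self.cooldown_until[dep_id] = now + self.cooldown_time
            self.fail_counts[dep_id] = 0  # reset for the next window

    def is_available(self, dep_id, now=None):
        now = now if now is not None else time.time()
        return now >= self.cooldown_until.get(dep_id, 0)
```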

Caching Configuration

Redis Caching

router = Router(
    model_list=[...],
    cache_responses=True,
    redis_host="localhost",
    redis_port=6379,
    redis_password="your-password",
    default_cache_time_seconds=3600  # 1 hour
)
YAML:
router_settings:
  redis_host: localhost
  redis_port: 6379
  redis_password: os.environ/REDIS_PASSWORD
  cache_responses: true

litellm_settings:
  cache: true
  cache_params:
    type: redis
    ttl: 3600

In-Memory Caching

router = Router(
    model_list=[...],
    cache_responses=True,
    cache_kwargs={
        "type": "local"
    }
)
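Conceptually, `cache_responses=True` means: before dispatching, look up the (model, messages) pair in a store and reuse any entry younger than the TTL. A simplified sketch of such a TTL cache (not LiteLLM's implementation; the class and its methods are invented for illustration):

```python
import time

class TTLCache:
    """Cache responses keyed by (model, messages) with a time-to-live."""

    def __init__(self, ttl=3600):
        self.ttl = ttl
        self.store = {}

    def _key(self, model, messages):
        return (model, str(messages))

    def get(self, model, messages, now=None):
        now = now if now is not None else time.time()
        entry = self.store.get(self._key(model, messages))
        if entry and now - entry[0] <= self.ttl:
            return entry[1]
        return None  # missing or expired

    def set(self, model, messages, response, now=None):
        now = now if now is not None else time.time()
        self.store[self._key(model, messages)] = (now, response)
```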

Model Aliases

router = Router(
    model_list=[
        {
            "model_name": "prod-gpt-4",
            "litellm_params": {"model": "gpt-4", "api_key": "key"}
        }
    ],
    model_group_alias={
        "gpt-4": "prod-gpt-4",
        "gpt4": "prod-gpt-4"
    }
)

# All of these work:
response = router.completion(model="gpt-4", messages=[...])
response = router.completion(model="gpt4", messages=[...])
response = router.completion(model="prod-gpt-4", messages=[...])
YAML:
router_settings:
  model_group_alias:
    gpt-4: prod-gpt-4
    gpt4: prod-gpt-4
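Alias resolution is a single dictionary lookup performed before routing: requested names are mapped through the alias table, and anything that still isn't a known model group is rejected. A simplified sketch (not LiteLLM's implementation; the helper name is invented):

```python
def resolve_model(requested, aliases, model_groups):
    """Map a requested model name through the alias table, then verify it
    names a real model group."""
    name = aliases.get(requested, requested)
    if name not in model_groups:
        raise KeyError(f"unknown model group: {name}")
    return name
```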

Complete Production Example

model_list:
  # GPT-4 with multiple deployments
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_base: https://eastus.openai.azure.com/
      api_key: os.environ/AZURE_KEY_EAST
      api_version: "2024-02-01"
    tpm: 100000
    rpm: 1000
    model_info:
      input_cost_per_token: 0.00003
      output_cost_per_token: 0.00006
  
  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4
      api_base: https://westus.openai.azure.com/
      api_key: os.environ/AZURE_KEY_WEST
      api_version: "2024-02-01"
    tpm: 150000
    rpm: 1500
    model_info:
      input_cost_per_token: 0.00003
      output_cost_per_token: 0.00006
  
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY
    tpm: 90000
    rpm: 900
    model_info:
      input_cost_per_token: 0.00003
      output_cost_per_token: 0.00006
  
  # Fallback models
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY
    tpm: 1000000
    rpm: 10000
  
  - model_name: claude-2
    litellm_params:
      model: claude-2
      api_key: os.environ/ANTHROPIC_API_KEY
    tpm: 100000
    rpm: 1000

router_settings:
  # Routing
  routing_strategy: latency-based-routing
  routing_strategy_args:
    ttl: 60
  
  # Model aliases (alternate names resolving to the gpt-4 group above)
  model_group_alias:
    gpt4: gpt-4
    prod-gpt-4: gpt-4
  
  # Retries
  num_retries: 3
  timeout: 30
  retry_policy:
    RateLimitErrorRetries: 5
    TimeoutErrorRetries: 2
  
  # Cooldowns
  allowed_fails: 3
  cooldown_time: 120
  
  # Caching
  redis_host: localhost
  redis_port: 6379
  redis_password: os.environ/REDIS_PASSWORD
  cache_responses: true

litellm_settings:
  # Fallbacks
  fallbacks:
    - gpt-4: ["gpt-3.5-turbo", "claude-2"]
  
  context_window_fallbacks:
    - gpt-3.5-turbo: ["gpt-3.5-turbo-16k"]
  
  # Callbacks
  success_callback: ["langfuse", "prometheus"]
  failure_callback: ["sentry"]
  
  # Settings
  set_verbose: false
  drop_params: true
  request_timeout: 300

Best Practices

  1. Use TPM/RPM limits: Always set limits to respect provider quotas
  2. Configure fallbacks: Have backup models for reliability
  3. Enable caching: Reduce costs and latency
  4. Monitor latency: Use latency-based routing in production
  5. Set appropriate timeouts: Balance responsiveness and success rate
  6. Use cooldowns: Prevent cascading failures
  7. Test retry policies: Ensure they match your use case
  8. Use model aliases: Abstract model names for easier updates
