Overview

LiteLLM provides robust fallback mechanisms to ensure high availability of your LLM applications. When a model fails or is unavailable, LiteLLM automatically retries with fallback models or deployments.

How Fallbacks Work

Fallbacks execute in order when:
  • API returns an error (rate limit, timeout, service unavailable)
  • Model deployment is down
  • Context window is exceeded
  • Content policy violations occur
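
Without fallbacks, each of these conditions surfaces as a LiteLLM exception that you would have to handle yourself. The sketch below shows the kind of manual retry loop the fallbacks parameter replaces (model names are illustrative, and the exception list is not exhaustive):

from litellm import completion
from litellm.exceptions import (
    RateLimitError,
    Timeout,
    ServiceUnavailableError,
    ContextWindowExceededError,
    ContentPolicyViolationError,
)

def complete_with_manual_fallback(messages):
    # Try each model in order, moving on when a trigger condition occurs
    for model in ["gpt-4", "gpt-3.5-turbo", "claude-2"]:
        try:
            return completion(model=model, messages=messages)
        except (RateLimitError, Timeout, ServiceUnavailableError,
                ContextWindowExceededError, ContentPolicyViolationError):
            continue  # fall back to the next model in the chain
    raise RuntimeError("All models in the fallback chain failed")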

Basic Fallback Configuration

Single Model Fallback

from litellm import completion

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    fallbacks=["gpt-3.5-turbo", "claude-2"]
)
# Tries: gpt-4 -> gpt-3.5-turbo -> claude-2

Router Fallbacks

The Router provides advanced fallback logic across multiple deployments:
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {
                "model": "gpt-4",
                "api_key": "sk-..."
            }
        },
        {
            "model_name": "gpt-3.5",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "api_key": "sk-..."
            }
        },
        {
            "model_name": "claude",
            "litellm_params": {
                "model": "claude-2",
                "api_key": "sk-ant-..."
            }
        }
    ],
    fallbacks=[
        {"gpt-4": ["gpt-3.5", "claude"]},
        {"gpt-3.5": ["claude"]}
    ],
    max_fallbacks=5  # Maximum fallback attempts
)

response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
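# On failure, tries: gpt-4 -> gpt-3.5 -> claude (per the fallbacks list)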

Fallback Types

1. Default Fallbacks

Apply to all models globally:
from litellm import Router

router = Router(
    model_list=[...],
    default_fallbacks=["gpt-3.5-turbo", "claude-2"]
)
# Any failing model will try these fallbacks
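
For example, assuming "gpt-4" is defined in model_list and has no model-specific fallbacks of its own:

# On failure, the router retries with gpt-3.5-turbo, then claude-2
response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)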

2. Model-Specific Fallbacks

Define fallbacks per model:
router = Router(
    model_list=[...],
    fallbacks=[
        {"gpt-4": ["gpt-4-turbo", "gpt-3.5-turbo"]},
        {"claude-3-opus": ["claude-3-sonnet", "claude-2"]},
        {"gemini-pro": ["gpt-3.5-turbo"]}
    ]
)

3. Context Window Fallbacks

Automatic fallback when context window is exceeded:
router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5",
            "litellm_params": {"model": "gpt-3.5-turbo"}  # 4K context
        },
        {
            "model_name": "gpt-3.5-16k",
            "litellm_params": {"model": "gpt-3.5-turbo-16k"}  # 16K context
        },
        {
            "model_name": "claude-100k",
            "litellm_params": {"model": "claude-2"}  # 100K context
        }
    ],
    context_window_fallbacks=[
        {"gpt-3.5": ["gpt-3.5-16k", "claude-100k"]}
    ]
)

# Automatically falls back if prompt exceeds 4K tokens
response = router.completion(
    model="gpt-3.5",
    messages=[{"role": "user", "content": very_long_prompt}]
)

4. Content Policy Fallbacks

Fallback when content policy violations occur:
router = Router(
    model_list=[...],
    content_policy_fallbacks=[
        {"gpt-4": ["claude-2"]},  # Claude may have different policies
        {"gpt-3.5-turbo": ["llama-2"]}
    ]
)
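
Requests look the same as any other router call; the fallback engages only when the provider rejects the content. A sketch:

# If "gpt-4" raises a ContentPolicyViolationError, the router resends
# the same messages to "claude-2" automatically
response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize this incident report"}]
)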

Advanced Fallback Configuration

Fallback with Custom Parameters

Pass different parameters to fallback models:
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    fallbacks=[
        {
            "model": "gpt-3.5-turbo",
            "temperature": 0.5,
            "max_tokens": 100
        },
        {
            "model": "claude-2",
            "temperature": 0.7
        }
    ]
)

Controlling Fallback Behavior

router = Router(
    model_list=[...],
    max_fallbacks=3,  # Maximum number of fallback attempts per request
    retry_after=5,  # Minimum seconds to wait before retrying a failed request
    allowed_fails=2,  # Failures allowed before a deployment enters cooldown
    cooldown_time=60,  # How long a failed deployment is excluded (seconds)
    disable_cooldowns=False  # Set True to turn off the cooldown mechanism
)

Fallback Policies

Allowed Fails Policy

Control when a deployment enters cooldown:
from litellm.types.router import AllowedFailsPolicy

router = Router(
    model_list=[...],
    allowed_fails=3,  # Number of failures before cooldown
    allowed_fails_policy=AllowedFailsPolicy(
        BadRequestErrorAllowedFails=0,  # Immediate cooldown for bad requests
        AuthenticationErrorAllowedFails=0,  # Immediate cooldown for auth errors
        TimeoutErrorAllowedFails=5,  # More tolerance for timeouts
        RateLimitErrorAllowedFails=2,  # Moderate tolerance for rate limits
        ContentPolicyViolationErrorAllowedFails=1
    )
)

Deployment Cooldown

When a deployment fails multiple times, it enters a cooldown period:
router = Router(
    model_list=[...],
    allowed_fails=2,  # Fail 2 times before cooldown
    cooldown_time=300,  # Cooldown for 5 minutes
    disable_cooldowns=False
)
During cooldown, the deployment is excluded from routing but can still be used as a last resort if all others fail.
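
As a sketch of how cooldowns interact with redundant deployments (API keys are placeholders), two entries can share one model_name so that a cooled-down deployment is simply skipped:

router = Router(
    model_list=[
        # Two interchangeable deployments behind the same model_name
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4", "api_key": "sk-primary-..."}
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4", "api_key": "sk-backup-..."}
        }
    ],
    allowed_fails=1,    # a single failure triggers cooldown
    cooldown_time=120   # skip the failed deployment for 2 minutes
)

# While the first deployment is cooling down, "gpt-4" requests
# are served by the second deployment
response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)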

Async Fallback Support

Fallbacks work with async operations:
import asyncio
from litellm import acompletion

async def make_request():
    response = await acompletion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}],
        fallbacks=["gpt-3.5-turbo", "claude-2"]
    )
    return response

response = asyncio.run(make_request())

Monitoring Fallbacks

Tracking Fallback Usage

import litellm
from litellm.integrations.custom_logger import CustomLogger

class FallbackLogger(CustomLogger):
    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Compare the model that answered against the model originally requested
        model_used = response_obj.model
        original_model = kwargs.get("model")

        if model_used != original_model:
            print(f"Fallback occurred: {original_model} -> {model_used}")

litellm.callbacks = [FallbackLogger()]

Response Metadata

Details about the deployment that actually served a request are available on the response object:
response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Inspect the hidden params to see which deployment served the request
print(response._hidden_params.get("model_id"))  # Actual deployment used
print(response._hidden_params.get("api_base"))  # API endpoint used

Best Practices

Fallback Strategy Recommendations

  1. Order by capability - Place most capable/expensive models first
  2. Consider cost - Fallback to cheaper alternatives when appropriate
  3. Mix providers - Diversify across OpenAI, Anthropic, Google, etc.
  4. Test thoroughly - Verify fallbacks work as expected
  5. Monitor cooldowns - Alert when deployments enter cooldown
  6. Set reasonable limits - Balance availability vs. cost

Common Patterns

High Availability Pattern

router = Router(
    model_list=[
        # Primary: Multiple GPT-4 deployments
        {"model_name": "gpt-4", "litellm_params": {...}},
        {"model_name": "gpt-4", "litellm_params": {...}},
        # Fallback: GPT-3.5 Turbo
        {"model_name": "gpt-3.5", "litellm_params": {...}},
        {"model_name": "gpt-3.5", "litellm_params": {...}},
        # Final fallback: Claude
        {"model_name": "claude", "litellm_params": {...}}
    ],
    fallbacks=[{"gpt-4": ["gpt-3.5", "claude"]}],
    max_fallbacks=10
)

Cost-Optimized Pattern

router = Router(
    model_list=[
        {"model_name": "cheap", "litellm_params": {"model": "gpt-3.5-turbo"}},
        {"model_name": "expensive", "litellm_params": {"model": "gpt-4"}}
    ],
    fallbacks=[{"cheap": ["expensive"]}]  # Only use expensive when cheap fails
)

Error Handling

import litellm
from litellm import completion

# Enable verbose logging before the call to see per-attempt details
# litellm.set_verbose = True

try:
    response = completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        fallbacks=["gpt-3.5-turbo", "claude-2"]
    )
except Exception as e:
    # Reached only when the primary model and every fallback have failed
    print(f"All models failed: {e}")