
Overview

LiteLLM automatically retries failed requests using exponential backoff. Retries help handle transient failures such as rate limits, timeouts, and temporary service disruptions.

Default Retry Behavior

By default, LiteLLM retries requests 2 times (3 total attempts including the initial request).
import litellm
from litellm import completion

# Uses default retry behavior (2 retries)
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)

Configuring Retries

Global Retry Configuration

Set retries for all requests:
import litellm
from litellm import completion

# Set global retry count
litellm.num_retries = 3

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)
# Will retry up to 3 times on failure

Per-Request Retry Configuration

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    num_retries=5  # Override global setting
)

Router Retry Configuration

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {
                "model": "gpt-4",
                "api_key": "sk-..."
            }
        }
    ],
    num_retries=3,  # Default retries for all models
    retry_after=5  # Wait 5 seconds before first retry
)

Retry Policies

Customize retry behavior based on error type:

Basic Retry Policy

from litellm import Router
from litellm.types.router import RetryPolicy

router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        TimeoutErrorRetries=5,
        RateLimitErrorRetries=3,
        InternalServerErrorRetries=2,
        ContentPolicyViolationErrorRetries=0,  # Don't retry these
        AuthenticationErrorRetries=0  # Don't retry auth errors
    )
)

Available Error Types

Configure retries for specific error types:
  • TimeoutErrorRetries - Connection/request timeouts
  • RateLimitErrorRetries - Rate limit (429) errors
  • InternalServerErrorRetries - Server errors (500, 502, 503, 504)
  • BadRequestErrorRetries - Bad request (400) errors
  • AuthenticationErrorRetries - Authentication (401, 403) errors
  • ContentPolicyViolationErrorRetries - Content filtering errors
  • UnsupportedParamsRetries - Unsupported parameter errors
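
To make the categories concrete, here is a hypothetical plain-Python helper that maps HTTP status codes onto the field names above and looks up the configured retry count. The helper and its status-to-field table are illustrative assumptions, not LiteLLM's internal dispatch:

```python
# Hypothetical helper (illustrative only, not LiteLLM internals):
# look up how many retries a policy allows for a given HTTP status code.
def retries_for_status(status: int, policy: dict) -> int:
    status_to_field = {
        408: "TimeoutErrorRetries",
        429: "RateLimitErrorRetries",
        400: "BadRequestErrorRetries",
        401: "AuthenticationErrorRetries",
        403: "AuthenticationErrorRetries",
        500: "InternalServerErrorRetries",
        502: "InternalServerErrorRetries",
        503: "InternalServerErrorRetries",
        504: "InternalServerErrorRetries",
    }
    field = status_to_field.get(status)
    return policy.get(field, 0) if field else 0  # unknown errors: fail fast

policy = {
    "TimeoutErrorRetries": 5,
    "RateLimitErrorRetries": 3,
    "InternalServerErrorRetries": 2,
    "AuthenticationErrorRetries": 0,
}
print(retries_for_status(429, policy))  # 3
print(retries_for_status(401, policy))  # 0
```

The same shape applies to a real RetryPolicy: transient errors (timeouts, 429s, 5xx) get generous counts, while client errors fail immediately.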

Model Group Retry Policies

Set different retry policies for different model groups:
from litellm import Router
from litellm.types.router import RetryPolicy

router = Router(
    model_list=[
        {"model_name": "gpt-4", "litellm_params": {...}},
        {"model_name": "gpt-3.5", "litellm_params": {...}},
        {"model_name": "claude", "litellm_params": {...}}
    ],
    # Global retry policy
    retry_policy=RetryPolicy(
        TimeoutErrorRetries=3,
        RateLimitErrorRetries=2
    ),
    # Model-specific retry policies
    model_group_retry_policy={
        "gpt-4": RetryPolicy(
            TimeoutErrorRetries=5,  # More retries for expensive model
            RateLimitErrorRetries=10
        ),
        "claude": RetryPolicy(
            TimeoutErrorRetries=2,
            RateLimitErrorRetries=1
        )
    }
)

Retry Timing

Retry After

Set minimum wait time before retrying:
router = Router(
    model_list=[...],
    retry_after=10  # Wait at least 10 seconds before retry
)

Exponential Backoff

LiteLLM uses exponential backoff automatically:
# Retry timing with exponential backoff:
# 1st retry: 0-2 seconds
# 2nd retry: 0-4 seconds  
# 3rd retry: 0-8 seconds
# 4th retry: 0-16 seconds
# etc.

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    num_retries=4
)
Exponential backoff helps avoid overwhelming rate-limited services and increases the chance of successful retries.
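
The timing windows above follow a full-jitter exponential pattern. A minimal sketch of that schedule (illustrative only, not LiteLLM's exact implementation):

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, max_delay: float = 60.0) -> float:
    # Full-jitter backoff: the nth retry waits a random amount between
    # 0 and min(base ** attempt, max_delay) seconds.
    return random.uniform(0, min(base ** attempt, max_delay))

# Matches the windows above: 0-2s, 0-4s, 0-8s, 0-16s
for attempt in range(1, 5):
    print(f"retry {attempt}: waits up to {min(2.0 ** attempt, 60.0):.0f}s")
```

The random jitter spreads concurrent clients across the window instead of having them all retry at the same instant, and the cap keeps late retries from waiting unboundedly long.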

Streaming with Retries

Retries work with streaming responses:
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
    num_retries=3
)

for chunk in response:
    print(chunk.choices[0].delta.content, end="")

Async Retries

Retries work seamlessly with async operations:
import asyncio
from litellm import acompletion

async def make_request():
    response = await acompletion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}],
        num_retries=5
    )
    return response

response = asyncio.run(make_request())

Monitoring Retries

Custom Retry Logging

from litellm.integrations import CustomLogger
import litellm

class RetryLogger(CustomLogger):
    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        print(f"Request failed: {kwargs.get('model')}")
        print(f"Exception: {kwargs.get('exception')}")
        print(f"Call ID: {kwargs.get('litellm_call_id')}")
    
    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        print("Request succeeded")

litellm.callbacks = [RetryLogger()]

Response Headers

Retry information is included in response metadata:
response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    num_retries=3
)

# Check retry headers
retry_count = response._hidden_params.get("retry_count", 0)
print(f"Number of retries: {retry_count}")

Retry vs Fallback

Understanding the Difference

  • Retries: Attempt the same model/deployment multiple times
  • Fallbacks: Switch to a different model/deployment after retries fail
  • Execution order: LiteLLM exhausts retries first, then falls back

Combined Retry and Fallback

router = Router(
    model_list=[
        {"model_name": "gpt-4", "litellm_params": {...}},
        {"model_name": "gpt-3.5", "litellm_params": {...}}
    ],
    num_retries=3,  # Retry each model 3 times
    fallbacks=[{"gpt-4": ["gpt-3.5"]}]  # Then fallback to gpt-3.5
)

# Execution flow:
# 1. Try gpt-4 (retry up to 3 times)
# 2. If all retries fail, try gpt-3.5 (retry up to 3 times)
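
This execution flow can be sketched as a toy simulation in plain Python (illustrative only, not LiteLLM internals): each model gets 1 + num_retries attempts before the next one is tried.

```python
# Toy simulation of the retry-then-fallback order.
def call_with_retries_then_fallback(models, num_retries, attempt_fn):
    last_err = None
    for model in models:
        for attempt in range(1 + num_retries):
            try:
                return attempt_fn(model, attempt)
            except Exception as err:
                last_err = err
    raise last_err  # every model and every retry exhausted

calls = []

def flaky(model, attempt):
    calls.append((model, attempt))
    if model == "gpt-4":
        raise RuntimeError("gpt-4 is down")
    return f"{model} ok"

result = call_with_retries_then_fallback(["gpt-4", "gpt-3.5"], 3, flaky)
print(result)      # gpt-3.5 ok
print(len(calls))  # 5 attempts: 4 on gpt-4, then 1 on gpt-3.5
```

Note the ordering: the fallback model is never touched until the primary model's retry budget is fully spent.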

Best Practices

  1. Set appropriate retry counts
    • More retries for critical requests
    • Fewer retries for latency-sensitive applications
  2. Configure by error type
    • Retry timeouts and rate limits aggressively
    • Don’t retry authentication or validation errors
  3. Use exponential backoff
    • Already built-in, respects API rate limits
  4. Monitor retry rates
    • High retry rates indicate underlying issues
    • Track which models/deployments need retries
  5. Combine with fallbacks
    • Use retries for transient failures
    • Use fallbacks for persistent failures
  6. Set timeouts
    • Prevent retries from taking too long
    response = completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
        num_retries=3,
        timeout=30  # Total timeout for all retries
    )
    

Common Retry Scenarios

Rate Limit Handling

from litellm import Router
from litellm.types.router import RetryPolicy

router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        RateLimitErrorRetries=10  # Retry rate limits many times
    ),
    retry_after=60  # Wait 1 minute before first retry
)

Timeout Handling

router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        TimeoutErrorRetries=5
    ),
    timeout=30,  # 30 second timeout per request
    stream_timeout=60  # 60 second timeout for streaming
)

Production Configuration

router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        TimeoutErrorRetries=3,
        RateLimitErrorRetries=5,
        InternalServerErrorRetries=2,
        ContentPolicyViolationErrorRetries=0,
        AuthenticationErrorRetries=0
    ),
    num_retries=3,
    retry_after=5,
    timeout=60
)

Disabling Retries

# Disable retries globally
litellm.num_retries = 0

# Or per request
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    num_retries=0
)

Error Handling

from litellm import completion
from litellm.exceptions import (
    Timeout,
    RateLimitError,
    ServiceUnavailableError
)

try:
    response = completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}],
        num_retries=3
    )
except RateLimitError as e:
    print(f"Rate limited after retries: {str(e)}")
except Timeout as e:
    print(f"Timeout after retries: {str(e)}")
except ServiceUnavailableError as e:
    print(f"Service unavailable after retries: {str(e)}")
except Exception as e:
    print(f"Failed after all retries: {str(e)}")
