## Overview

LiteLLM automatically retries failed requests using exponential backoff. Retries help handle transient failures such as rate limits, timeouts, and temporary service disruptions.
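The core pattern can be sketched in plain Python. This is a simplified illustration of retry-with-backoff, not LiteLLM's actual implementation; `retry_call` and `flaky_call` are hypothetical names:

```python
import time

def retry_call(fn, num_retries=2, base_delay=1.0):
    """Call fn, retrying on failure with exponential backoff.

    Simplified sketch of the retry pattern; LiteLLM's internals
    differ in details (jitter, error classification, per-error budgets).
    """
    for attempt in range(num_retries + 1):  # initial try + retries
        try:
            return fn()
        except Exception:
            if attempt == num_retries:
                raise  # retry budget exhausted: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Hypothetical flaky function: fails twice, then succeeds
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient failure")
    return "ok"

print(retry_call(flaky_call, num_retries=2, base_delay=0))  # prints "ok"
```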
## Default Retry Behavior

By default, LiteLLM retries requests 2 times (3 total attempts including the initial request).

```python
import litellm
from litellm import completion

# Uses default retry behavior (2 retries)
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
## Configuring Retries

### Global Retry Configuration

Set retries for all requests:

```python
import litellm
from litellm import completion

# Set global retry count
litellm.num_retries = 3

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)
# Will retry up to 3 times on failure
```
### Per-Request Retry Configuration

```python
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    num_retries=5  # Override global setting
)
```
### Router Retry Configuration

```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4",
            "litellm_params": {
                "model": "gpt-4",
                "api_key": "sk-..."
            }
        }
    ],
    num_retries=3,  # Default retries for all models
    retry_after=5   # Wait 5 seconds before first retry
)
```
## Retry Policies

Customize retry behavior based on error type:

### Basic Retry Policy

```python
from litellm import Router
from litellm.types.router import RetryPolicy

router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        TimeoutErrorRetries=5,
        RateLimitErrorRetries=3,
        InternalServerErrorRetries=2,
        ContentPolicyViolationErrorRetries=0,  # Don't retry content violations
        AuthenticationErrorRetries=0           # Don't retry auth errors
    )
)
```
### Available Error Types

Configure retries for specific error types:

- `TimeoutErrorRetries` - Connection/request timeouts
- `RateLimitErrorRetries` - Rate limit (429) errors
- `InternalServerErrorRetries` - Server errors (500, 502, 503, 504)
- `BadRequestErrorRetries` - Bad request (400) errors
- `AuthenticationErrorRetries` - Authentication (401, 403) errors
- `ContentPolicyViolationErrorRetries` - Content filtering errors
- `UnsupportedParamsRetries` - Unsupported parameter errors
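Conceptually, a retry policy is a lookup from error class to retry budget. The sketch below illustrates that idea in plain Python; the field names mirror `RetryPolicy`, but the `SimpleRetryPolicy` class, the built-in exception stand-ins, and the resolution logic are illustrative assumptions, not LiteLLM internals:

```python
from dataclasses import dataclass

@dataclass
class SimpleRetryPolicy:
    # Retry budgets per error class (names mirror RetryPolicy fields)
    TimeoutErrorRetries: int = 0
    RateLimitErrorRetries: int = 0
    AuthenticationErrorRetries: int = 0

def retries_for(policy, exc):
    """Resolve how many retries an exception is allowed (illustrative)."""
    mapping = {
        TimeoutError: policy.TimeoutErrorRetries,        # stand-in for litellm Timeout
        ConnectionError: policy.RateLimitErrorRetries,   # stand-in for RateLimitError
        PermissionError: policy.AuthenticationErrorRetries,  # stand-in for AuthenticationError
    }
    for exc_type, budget in mapping.items():
        if isinstance(exc, exc_type):
            return budget
    return 0  # unknown errors: don't retry

policy = SimpleRetryPolicy(TimeoutErrorRetries=5, AuthenticationErrorRetries=0)
print(retries_for(policy, TimeoutError()))     # 5
print(retries_for(policy, PermissionError()))  # 0
```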
### Model Group Retry Policies

Set different retry policies for different model groups:

```python
from litellm import Router
from litellm.types.router import RetryPolicy

router = Router(
    model_list=[
        {"model_name": "gpt-4", "litellm_params": {...}},
        {"model_name": "gpt-3.5", "litellm_params": {...}},
        {"model_name": "claude", "litellm_params": {...}}
    ],
    # Global retry policy
    retry_policy=RetryPolicy(
        TimeoutErrorRetries=3,
        RateLimitErrorRetries=2
    ),
    # Model-specific retry policies
    model_group_retry_policy={
        "gpt-4": RetryPolicy(
            TimeoutErrorRetries=5,   # More retries for expensive model
            RateLimitErrorRetries=10
        ),
        "claude": RetryPolicy(
            TimeoutErrorRetries=2,
            RateLimitErrorRetries=1
        )
    }
)
```
## Retry Timing

### Retry After

Set the minimum wait time before retrying:

```python
router = Router(
    model_list=[...],
    retry_after=10  # Wait at least 10 seconds before retry
)
```
### Exponential Backoff

LiteLLM uses exponential backoff automatically:

```python
# Retry timing with exponential backoff:
#   1st retry: 0-2 seconds
#   2nd retry: 0-4 seconds
#   3rd retry: 0-8 seconds
#   4th retry: 0-16 seconds
#   etc.
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    num_retries=4
)
```
Exponential backoff helps avoid overwhelming rate-limited services and increases the chance of successful retries.
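The widening windows in the timing comments above follow a full-jitter pattern: each retry waits a random amount up to a doubling ceiling. A minimal sketch (the `base` and `max_delay` values are illustrative assumptions, not LiteLLM's exact constants):

```python
import random

def backoff_delay(attempt, base=2.0, max_delay=60.0):
    """Full-jitter backoff: uniform random wait in [0, base * 2**(attempt-1)].

    attempt=1 -> 0-2s, attempt=2 -> 0-4s, attempt=3 -> 0-8s, ...
    Capped at max_delay so late retries never wait unboundedly long.
    """
    ceiling = min(base * (2 ** (attempt - 1)), max_delay)
    return random.uniform(0, ceiling)

for attempt in range(1, 5):
    print(f"retry {attempt}: waiting {backoff_delay(attempt):.2f}s")
```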
## Streaming with Retries

Retries work with streaming responses:

```python
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
    num_retries=3
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")  # content may be None on final chunks
```
## Async Retries

Retries work seamlessly with async operations:

```python
import asyncio
from litellm import acompletion

async def make_request():
    response = await acompletion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}],
        num_retries=5
    )
    return response

response = asyncio.run(make_request())
```
## Monitoring Retries

### Custom Retry Logging

```python
import litellm
from litellm.integrations.custom_logger import CustomLogger

class RetryLogger(CustomLogger):
    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        print(f"Request failed: {kwargs.get('model')}")
        print(f"Exception: {kwargs.get('exception')}")
        print(f"Call ID: {kwargs.get('litellm_call_id')}")

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        print("Request succeeded after retries")

litellm.callbacks = [RetryLogger()]
```

Retry information is included in response metadata:

```python
response = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    num_retries=3
)

# Check retry count recorded in hidden params
retry_count = response._hidden_params.get("retry_count", 0)
print(f"Number of retries: {retry_count}")
```
## Retry vs Fallback

### Understanding the Difference

- **Retries**: Attempt the same model/deployment multiple times
- **Fallbacks**: Switch to a different model/deployment after retries fail
- **Execution order**: LiteLLM tries retries first, then fallbacks
### Combined Retry and Fallback

```python
router = Router(
    model_list=[
        {"model_name": "gpt-4", "litellm_params": {...}},
        {"model_name": "gpt-3.5", "litellm_params": {...}}
    ],
    num_retries=3,  # Retry each model 3 times
    fallbacks=[{"gpt-4": ["gpt-3.5"]}]  # Then fall back to gpt-3.5
)

# Execution flow:
# 1. Try gpt-4 (retry up to 3 times)
# 2. If all retries fail, try gpt-3.5 (retry up to 3 times)
```
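That retries-then-fallbacks order can be simulated in a few lines of plain Python (a simplified sketch; `try_model` is a hypothetical stub, not a LiteLLM API):

```python
def call_with_retries_then_fallbacks(models, num_retries, try_model):
    """Try each model in order; exhaust retries on one model before falling back."""
    attempts = []
    for model in models:
        for attempt in range(num_retries + 1):  # initial call + retries
            attempts.append((model, attempt))
            try:
                return try_model(model), attempts
            except Exception:
                continue  # retry the same model; fall back when budget is spent
    raise RuntimeError("all models exhausted")

# Hypothetical stub: gpt-4 always fails, gpt-3.5 succeeds
def try_model(model):
    if model == "gpt-4":
        raise TimeoutError("gpt-4 unavailable")
    return f"response from {model}"

result, attempts = call_with_retries_then_fallbacks(
    ["gpt-4", "gpt-3.5"], num_retries=3, try_model=try_model
)
print(result)  # response from gpt-3.5
print(len([a for a in attempts if a[0] == "gpt-4"]))  # 4 (1 initial + 3 retries)
```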
## Best Practices

1. **Set appropriate retry counts**
   - More retries for critical requests
   - Fewer retries for latency-sensitive applications
2. **Configure by error type**
   - Retry timeouts and rate limits aggressively
   - Don't retry authentication or validation errors
3. **Use exponential backoff**
   - Already built in; it respects API rate limits
4. **Monitor retry rates**
   - High retry rates indicate underlying issues
   - Track which models/deployments need retries
5. **Combine with fallbacks**
   - Use retries for transient failures
   - Use fallbacks for persistent failures
6. **Set timeouts**
   - Prevent retries from taking too long

```python
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    num_retries=3,
    timeout=30  # Total timeout for all retries
)
```
## Common Retry Scenarios

### Rate Limit Handling

```python
from litellm import Router
from litellm.types.router import RetryPolicy

router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        RateLimitErrorRetries=10  # Retry rate limits many times
    ),
    retry_after=60  # Wait 1 minute before first retry
)
```
### Timeout Handling

```python
router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        TimeoutErrorRetries=5
    ),
    timeout=30,        # 30-second timeout per request
    stream_timeout=60  # 60-second timeout for streaming
)
```
### Production Configuration

```python
router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        TimeoutErrorRetries=3,
        RateLimitErrorRetries=5,
        InternalServerErrorRetries=2,
        ContentPolicyViolationErrorRetries=0,
        AuthenticationErrorRetries=0
    ),
    num_retries=3,
    retry_after=5,
    timeout=60
)
```
## Disabling Retries

```python
import litellm
from litellm import completion

# Disable retries globally
litellm.num_retries = 0

# Or per request
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    num_retries=0
)
```
## Error Handling

```python
from litellm import completion
from litellm.exceptions import (
    Timeout,
    RateLimitError,
    ServiceUnavailableError
)

try:
    response = completion(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}],
        num_retries=3
    )
except RateLimitError as e:
    print(f"Rate limited after retries: {e}")
except Timeout as e:
    print(f"Timeout after retries: {e}")
except Exception as e:
    print(f"Failed after all retries: {e}")
```