Overview
Helicone AI Gateway provides automatic fallback capabilities to ensure your AI applications remain reliable even when individual providers fail. When a request fails, the gateway automatically tries alternative providers in the order you specify.
How Fallbacks Work
The gateway processes fallbacks in a predictable order:
Primary Attempt
The gateway tries the first model/provider in your list.
Failure Detection
If the request fails (rate limit, timeout, service error, etc.), the gateway moves to the next option.
Automatic Retry
The gateway automatically retries with the next model/provider in your list.
Success or Exhaustion
The process continues until a request succeeds or all options are exhausted.
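The four steps above amount to a sequential try-and-advance loop. The sketch below is illustrative only (provider calls are stand-in functions, not the gateway's actual internals), but it captures the semantics: return on the first success, throw once every option is exhausted.

```typescript
type ProviderCall = () => Promise<string>;

// Try each provider in order; return the first success,
// or throw once all options are exhausted.
async function withFallbacks(attempts: ProviderCall[]): Promise<string> {
  const errors: unknown[] = [];
  for (const attempt of attempts) {
    try {
      return await attempt(); // success: stop here
    } catch (err) {
      errors.push(err); // failure: move to the next option
    }
  }
  throw new Error(`All ${attempts.length} fallback attempts failed`);
}
```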
Fallbacks work across both BYOK (Bring Your Own Key) and PTB (Pass-Through Billing) authentication methods.
Basic Fallback Configuration
Same Model, Different Providers
Route the same model through different providers:
const response = await client.chat.completions.create({
  model: "gpt-4o/openai,gpt-4o/azure,gpt-4o/deepinfra",
  messages: [{ role: "user", content: "Hello!" }],
});
Fallback chain:
OpenAI (primary)
Azure OpenAI (if OpenAI fails)
DeepInfra (if Azure fails)
Different Models
Fallback to different models:
const response = await client.chat.completions.create({
  model: "gpt-4o,gpt-4o-mini,claude-sonnet-4",
  messages: [{ role: "user", content: "Hello!" }],
});
Fallback chain:
GPT-4o (best available provider)
GPT-4o-mini (cheaper alternative)
Claude Sonnet 4 (different model family)
Cross-Provider Fallback
Fallback across different cloud providers:
const response = await client.chat.completions.create({
  model: "claude-3-7-sonnet-20250219/bedrock,claude-3-7-sonnet-20250219/anthropic,claude-3-7-sonnet-20250219/vertex",
  messages: [{ role: "user", content: "Hello!" }],
});
Fallback chain:
AWS Bedrock (primary)
Anthropic direct (if Bedrock fails)
Google Vertex AI (if Anthropic fails)
Common Fallback Patterns
Cost Optimization
Regional Resilience
Speed Optimization
BYOK + PTB
Try cheaper providers first, then fall back to premium:
const response = await client.chat.completions.create({
  model: "gpt-4o-mini/deepinfra,gpt-4o-mini/openai,gpt-4o",
  messages: [{ role: "user", content: "Hello!" }],
});
Pattern:
DeepInfra (lowest cost)
OpenAI standard (if DeepInfra unavailable)
GPT-4o (best quality, highest cost)
Fall back across regions for high availability:
const response = await client.chat.completions.create({
  model: "us.anthropic.claude-3-7-sonnet-20250219-v1:0/bedrock,eu.anthropic.claude-3-7-sonnet-20250219-v1:0/bedrock,claude-3-7-sonnet-20250219/anthropic",
  messages: [{ role: "user", content: "Hello!" }],
});
Pattern:
US region Bedrock
EU region Bedrock
Anthropic global
Try the fastest providers first:
const response = await client.chat.completions.create({
  model: "llama-3.3-70b/groq,llama-3.3-70b/deepinfra,llama-3.3-70b/together",
  messages: [{ role: "user", content: "Hello!" }],
});
Pattern:
Groq (fastest inference)
DeepInfra (fast, cheaper)
Together AI (reliable)
Fall back from BYOK to PTB automatically:
// With OpenAI BYOK key configured
const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Hello!" }],
});
Automatic pattern:
OpenAI BYOK (your key)
OpenAI PTB (Helicone billing)
Alternative providers PTB
Failure Scenarios
The gateway automatically retries on these failure types:
Rate Limits (429)
Provider rate limit exceeded
model: "gpt-4o/openai,gpt-4o/azure"
// If OpenAI returns 429, immediately tries Azure
Exception: Helicone-generated 429s (escrow failure, rate limits) bail immediately without trying fallbacks.
Authentication Errors (401, 403)
Invalid or expired provider keys
model: "gpt-4o/openai,gpt-4o/azure"
// If OpenAI BYOK key is invalid, tries Azure
Service Errors (500, 502, 503)
Provider service unavailable
model: "claude-sonnet-4/anthropic,claude-sonnet-4/bedrock"
// If Anthropic is down, tries Bedrock
Timeouts
Request timeout
model: "llama-3.3-70b/groq,llama-3.3-70b/together"
// If Groq times out, tries Together AI
Model not available for PTB
model: "special-model/bedrock,gpt-4o"
// If model is disallowed for PTB, tries next option
Advanced Fallback Strategies
Multi-Model Fallback
Combine multiple models and providers:
const response = await client.chat.completions.create({
  model: [
    "gpt-4o/openai", // Primary: GPT-4o on OpenAI
    "gpt-4o/azure", // Fallback 1: GPT-4o on Azure
    "claude-sonnet-4/anthropic", // Fallback 2: Claude on Anthropic
    "gemini-2.0-flash/google-ai-studio", // Fallback 3: Gemini
  ].join(","),
  messages: [{ role: "user", content: "Hello!" }],
});
Conditional Fallbacks
Choose fallback strategy based on context:
function getFallbackModel(priority: "cost" | "speed" | "reliability") {
  const strategies = {
    cost: "gpt-4o-mini/deepinfra,gpt-4o-mini,claude-3-haiku",
    speed: "llama-3.3-70b/groq,gpt-4o-mini/openai,gemini-2.0-flash",
    reliability: "gpt-4o/openai,gpt-4o/azure,claude-sonnet-4/anthropic,gemini-2.0-flash/vertex",
  };
  return strategies[priority];
}

const response = await client.chat.completions.create({
  model: getFallbackModel("reliability"),
  messages: [{ role: "user", content: "Hello!" }],
});
Provider Exclusions with Fallbacks
Exclude specific providers while maintaining fallbacks:
const response = await client.chat.completions.create({
  model: "!deepinfra,gpt-4o/openai,gpt-4o/azure",
  messages: [{ role: "user", content: "Hello!" }],
});
// Tries OpenAI and Azure, but never DeepInfra
Monitoring Fallbacks
Track fallback behavior in the Helicone dashboard:
Check Attempts
See all provider attempts and which one succeeded
Analyze Patterns
Which providers fail most often?
How many fallback attempts typically occur?
What’s the success rate by provider?
Error Responses
When all fallback attempts fail, the gateway returns a consolidated error:
{
  "error": {
    "message": "All fallback attempts failed",
    "type": "all_attempts_failed",
    "attempts": [
      {
        "source": "gpt-4o/openai",
        "error": "Rate limit exceeded",
        "status": 429
      },
      {
        "source": "gpt-4o/azure",
        "error": "Service unavailable",
        "status": 503
      }
    ]
  }
}
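When catching this error client-side, the attempts array tells you which providers failed and why. A minimal sketch, assuming the error body has exactly the shape shown above (field names taken from the example response; `summarizeFailure` is a hypothetical helper, not part of any SDK):

```typescript
interface FallbackAttempt {
  source: string;
  error: string;
  status: number;
}

interface ConsolidatedError {
  error: {
    message: string;
    type: string;
    attempts: FallbackAttempt[];
  };
}

// Turn the consolidated error body into a one-line summary for logging.
function summarizeFailure(body: ConsolidatedError): string {
  return body.error.attempts
    .map((a) => `${a.source} -> ${a.status} (${a.error})`)
    .join("; ");
}
```

Logging this summary alongside the request makes it easy to see whether failures cluster on one provider or span the whole chain.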
Helicone 429s bail immediately: If Helicone returns a 429 (insufficient credits or rate limit), the request fails immediately without trying fallbacks. Add credits at helicone.ai/credits.
Fallback Best Practices
Start with Native Providers
List the native provider first for best compatibility:
// Good: Native provider first
model: "gpt-4o/openai,gpt-4o/azure,gpt-4o/deepinfra"
// Avoid: Non-native provider first
model: "gpt-4o/deepinfra,gpt-4o/openai"
Balance Cost and Reliability
Mix low-cost and reliable providers:
model: "gpt-4o-mini/deepinfra,gpt-4o-mini/openai,gpt-4o"
// Try cheap first, fall back to reliable
Keep Fallback Chains Short
Two to four fallbacks are usually sufficient:
// Good: 3 fallbacks
model: "gpt-4o/openai,gpt-4o/azure,claude-sonnet-4"
// Too many: 7+ fallbacks may add latency
model: "model1,model2,model3,model4,model5,model6,model7"
Monitor Fallback Frequency
If fallbacks trigger frequently:
Check provider status
Verify BYOK keys
Consider changing primary provider
Review rate limits
Monitor at Helicone Dashboard
Test Your Fallback Strategy
Test fallbacks by:
Using invalid BYOK keys temporarily
Requesting rate-limited models
Monitoring which providers succeed
Real-World Examples
Production-Grade Fallback
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

async function robustCompletion(prompt: string) {
  try {
    const response = await client.chat.completions.create({
      model: "gpt-4o/openai,gpt-4o/azure,claude-sonnet-4/anthropic,gemini-2.0-flash/vertex",
      messages: [{ role: "user", content: prompt }],
      max_tokens: 1000,
    });
    return response.choices[0].message.content;
  } catch (error) {
    // All fallbacks failed
    console.error("All providers failed:", error);
    throw error;
  }
}
Cost-Optimized with Fallback
async function costOptimizedCompletion(prompt: string, budget: "low" | "high") {
  const models = {
    low: "gpt-4o-mini/deepinfra,gpt-4o-mini/openai",
    high: "gpt-4o/openai,claude-sonnet-4/anthropic",
  };
  const response = await client.chat.completions.create({
    model: models[budget],
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0].message.content;
}
Regional Resilience
async function regionalCompletion(prompt: string, preferredRegion: "us" | "eu") {
  const models = {
    us: "us.anthropic.claude-3-7-sonnet-20250219-v1:0/bedrock,claude-3-7-sonnet-20250219/anthropic",
    eu: "eu.anthropic.claude-3-7-sonnet-20250219-v1:0/bedrock,claude-3-7-sonnet-20250219/anthropic",
  };
  const response = await client.chat.completions.create({
    model: models[preferredRegion],
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0].message.content;
}
Streaming with Fallbacks
Fallbacks work seamlessly with streaming:
const stream = await client.chat.completions.create({
  model: "gpt-4o/openai,gpt-4o/azure,claude-sonnet-4/anthropic",
  messages: [{ role: "user", content: "Tell me a story" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
If the first provider fails, the stream automatically switches to the next provider.
Next Steps
Routing Learn more about provider routing
Getting Started Set up the AI Gateway
Monitor Requests Track fallback behavior
Browse Models Explore available models