Why multi-LLM failover?
Relying on a single LLM provider creates several risks:
Rate limits
Free tiers have strict quotas. Paid tiers can hit limits during traffic spikes.
Service outages
Even major providers experience downtime. A multi-provider setup ensures availability.
Geographic restrictions
Some AI services are unavailable in certain regions or countries.
Cost optimization
Route to cheaper providers when primary is unavailable or over budget.
This system implements a hierarchical failover pattern: Try Google Gemini models first, then fall back to Anthropic Claude if all Gemini attempts fail.
Failover architecture
The LLMService class manages multiple providers and models in priority order (app/services/llm_service.py:11):
- Gemini 2.5 Flash
- Gemini Flash Latest
- Gemini 2.5 Pro
- Claude 3 Haiku
- Claude 3.5 Sonnet
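As a sketch, this priority order might be held as a simple list of (provider, model) pairs; the exact model identifier strings in app/services/llm_service.py may differ from these illustrative ones:

```python
# Hypothetical priority table mirroring the order above; the real model IDs
# in app/services/llm_service.py may differ from these illustrative strings.
PROVIDER_PRIORITY = [
    ("google", "gemini-2.5-flash"),
    ("google", "gemini-flash-latest"),
    ("google", "gemini-2.5-pro"),
    ("anthropic", "claude-3-haiku"),
    ("anthropic", "claude-3-5-sonnet"),
]

# All Google models are tried before any Anthropic model.
providers_in_order = [provider for provider, _ in PROVIDER_PRIORITY]
```

Keeping the order in data rather than code means reordering or adding models is a one-line change.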
Implementation
The generate_answer method iterates through providers and models until one succeeds (app/services/llm_service.py:64):
- Automatic retry: If one model fails, immediately tries the next
- Error categorization: Distinguishes network errors from other failures
- Graceful degradation: Returns a user-friendly message if all providers fail
- Transparency: Response includes which provider/model generated it
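A minimal sketch of such a loop (not the repo's actual code; the callables, message text, and prefix format are illustrative):

```python
def generate_answer(prompt, providers):
    """Try each (name, call) pair in priority order; return the first success.

    `providers` is a list of (name, callable) pairs, where each callable
    takes a prompt string and either returns an answer or raises.
    """
    for name, call in providers:
        try:
            text = call(prompt)
            # Transparency: tag the response with the provider that produced it.
            return f"[{name.upper()}] {text}"
        except ConnectionError as exc:
            print(f"warning: {name} network error: {exc}")  # retry next model
        except Exception as exc:
            print(f"warning: {name} failed: {exc}")  # retry next model
    # Graceful degradation: every provider failed.
    return "Sorry, no language model is available right now. Please try again later."
```

Because network errors and other failures are caught separately, the two cases can be logged (and later alerted on) differently while both still fall through to the next model.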
Provider implementations
Google Gemini
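The Gemini code block did not survive extraction. As a stand-in, here is a minimal sketch against the public generateContent REST endpoint; the repo itself presumably uses the official SDK, and the model name is illustrative:

```python
import json
import urllib.request

GEMINI_URL = ("https://generativelanguage.googleapis.com/v1beta/"
              "models/{model}:generateContent")

def build_gemini_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a generateContent request."""
    body = json.dumps({"contents": [{"parts": [{"text": prompt}]}]}).encode()
    return urllib.request.Request(
        GEMINI_URL.format(model=model),
        data=body,
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )

def parse_gemini_response(payload: dict) -> str:
    """Pull the generated text out of a generateContent response body."""
    return payload["candidates"][0]["content"]["parts"][0]["text"]
```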
Anthropic Claude
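The Claude code block was also lost in extraction. A comparable sketch against the public Messages REST endpoint (the repo presumably uses the official anthropic SDK; the model name is illustrative):

```python
import json
import urllib.request

ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"

def build_claude_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a Messages API request."""
    body = json.dumps({
        "model": model,
        "max_tokens": 1024,  # required by the Messages API
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        ANTHROPIC_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )
```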
Error handling strategy
The system catches specific exceptions to determine whether to retry:
APIStatusError
Cause: HTTP 4xx/5xx errors from the API
Examples:
- 429 Too Many Requests (rate limit)
- 500 Internal Server Error
- 503 Service Unavailable
APIConnectionError
Cause: Network connectivity issues
Examples:
- DNS resolution failure
- Connection timeout
- SSL/TLS errors
Generic Exception
Cause: Unexpected errors
Examples:
- Invalid API key
- Malformed response
- JSON parsing errors
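To make the three categories concrete, here is a stdlib-only sketch that uses urllib's exception types as stand-ins for the SDKs' APIStatusError and APIConnectionError:

```python
import urllib.error

def categorize(exc: Exception) -> str:
    """Label a failure for logging; every category still falls through to the next model."""
    # HTTPError is a subclass of URLError, so check it first.
    if isinstance(exc, urllib.error.HTTPError):   # HTTP 4xx/5xx (cf. APIStatusError)
        return f"api_status:{exc.code}"
    if isinstance(exc, urllib.error.URLError):    # network failure (cf. APIConnectionError)
        return "api_connection"
    return "unexpected"                           # bad key, malformed response, etc.
```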
Failover example
Let’s trace a request where Gemini is rate-limited:

First attempt: Gemini 2.5 Flash
- Error: APIStatusError: 429 Too Many Requests
- Action: Print warning, continue to next model

Second attempt: Gemini Flash Latest
- Error: APIConnectionError: Connection timeout
- Action: Print warning, continue to next model

Third attempt: Gemini 2.5 Pro
- Error: APIStatusError: 503 Service Unavailable
- Action: All Google models failed, switch providers

Fourth attempt: Claude 3 Haiku
- Result: Success; the response is tagged with the Anthropic provider and model

Configuration best practices
Model ordering strategy
Order models by:
- Latency: Faster models first for better UX
- Cost: Cheaper models first to minimize expenses
- Capability: Reserve powerful models for when cheaper ones fail
Timeout configuration
The current implementation doesn’t set explicit timeouts. Consider adding them: this prevents hanging on slow providers.
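A generic, stdlib-only sketch of a per-call timeout (when available, prefer the HTTP client's own timeout setting, since the abandoned worker thread here keeps running):

```python
import concurrent.futures

def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run a provider call in a worker thread; give up after timeout_s seconds.

    Raises concurrent.futures.TimeoutError so the failover loop can treat a
    slow provider exactly like a failed one and move on to the next model.
    Note: the worker thread is abandoned, not killed; prefer the HTTP
    client's native timeout option when it has one.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args, **kwargs).result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)
```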
Circuit breaker pattern
For production systems, consider implementing circuit breakers:
- Open: After N consecutive failures, stop trying a provider temporarily
- Half-open: After cooldown period, try again
- Closed: Provider is working normally
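The three states can be sketched in a few lines; the failure threshold and cooldown below are illustrative defaults, not recommendations:

```python
import time

class CircuitBreaker:
    """Minimal per-provider circuit breaker (sketch; thresholds are illustrative)."""

    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                                        # closed: normal operation
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True                                        # half-open: let a probe through
        return False                                           # open: skip this provider

    def record_success(self):
        self.failures = 0
        self.opened_at = None                                  # close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()                  # open the circuit
```

The failover loop would keep one breaker per provider and skip any provider whose `allow_request()` returns False.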
Monitoring and alerting
Track these metrics:
- Provider success rate: Percentage of requests handled by each provider
- Failover frequency: How often failover is triggered
- Complete failure rate: Requests where all providers failed
- Latency by provider: Response time distribution
Alert when:
- Primary provider success rate drops below 95%
- Complete failure rate exceeds 1%
- Latency exceeds thresholds
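A sketch of a counter that could back those metrics and alert checks (the class and method names are illustrative, not from the repo):

```python
from collections import Counter

class FailoverMetrics:
    """Track per-provider outcomes so the alert thresholds above can be checked."""

    def __init__(self):
        self.handled = Counter()       # provider -> requests it answered
        self.failovers = 0             # requests that needed a fallback
        self.complete_failures = 0     # requests where every provider failed
        self.total = 0

    def record(self, provider, failover_used=False):
        """Record one request; pass provider=None when all providers failed."""
        self.total += 1
        if provider is None:
            self.complete_failures += 1
        else:
            self.handled[provider] += 1
        if failover_used:
            self.failovers += 1

    def success_rate(self, provider):
        return self.handled[provider] / self.total if self.total else 1.0

    def complete_failure_rate(self):
        return self.complete_failures / self.total if self.total else 0.0
```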
Adding new providers
The system is designed for easy extension. To add OpenAI as a third provider:
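The extension code block was lost in extraction. As an illustration, a third provider call against OpenAI's public Chat Completions REST endpoint could look like the following; the repo's actual extension point and the model name are assumptions:

```python
import json
import urllib.request

def build_openai_request(prompt: str, model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a Chat Completions request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Hypothetical: register the new models at the end of the priority order, e.g.
# PROVIDER_MODELS["openai"] = ["gpt-4o-mini"]
```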
The failover logic automatically handles the new provider.
Embedding failover
Embeddings require consistency: you can’t mix vectors from different models in the same search space. Options:
- Accept the risk: Gemini embeddings have high availability
- Dual embedding storage: Store embeddings from multiple models (increases storage)
- Offline fallback: Cache embeddings and serve stale results during Gemini outages
Testing failover
To test failover behavior, force the Gemini calls to fail (for example, with an invalid Gemini API key) and confirm:
- Console logs showing Gemini failures
- Automatic failover to Claude
- A response with an [ANTHROPIC - ...] prefix
Next steps
RAG pattern
Understand how multi-LLM failover works with RAG
Architecture overview
See how failover fits into the complete system
Environment setup
Configure API keys for multiple providers
Docker deployment
Deploy with failover configuration