
Why multi-LLM failover?

Relying on a single LLM provider creates several risks:

Rate limits

Free tiers have strict quotas. Paid tiers can hit limits during traffic spikes.

Service outages

Even major providers experience downtime. Multi-provider setup ensures availability.

Geographic restrictions

Some AI services are unavailable in certain regions or countries.

Cost optimization

Route to cheaper providers when the primary is unavailable or over budget.

This system implements a hierarchical failover pattern: try Google Gemini models first, then fall back to Anthropic Claude if all Gemini attempts fail.

Failover architecture

The LLMService class manages multiple providers and models in priority order (app/services/llm_service.py:11):
class LLMService:
    # Hierarchical configuration
    LLM_CONFIG = [
        {
            "provider": "google",
            "models": [
                'models/gemini-2.5-flash',      # Fastest, try first
                'models/gemini-flash-latest',   # Fallback 1
                'models/gemini-2.5-pro'         # Fallback 2 (slower but more capable)
            ]
        },
        {
            "provider": "anthropic",
            "models": [
                'claude-3-haiku-20240307',      # Fast and cheap
                'claude-3-5-sonnet-20240620'    # Most capable
            ]
        }
    ]
Failover order:
  1. Gemini 2.5 Flash
  2. Gemini Flash Latest
  3. Gemini 2.5 Pro
  4. Claude 3 Haiku
  5. Claude 3.5 Sonnet
Models are ordered by speed and cost: fast, cheap models are tried first, while slower, more capable models are held in reserve.

Implementation

The generate_answer method iterates through providers and models until one succeeds (app/services/llm_service.py:64):
@staticmethod
def generate_answer(query: str, context: str) -> str:
    prompt = (
        f"Eres un analista de Listo ERP. Basado en este contexto:\n{context}\n\n"
        f"Pregunta: {query}\nRespuesta profesional y breve:"
    )

    # Try each provider and model in sequence
    for entry in LLMService.LLM_CONFIG:
        provider = entry["provider"]
        for model_name in entry["models"]:
            try:
                # Route to appropriate provider
                if provider == "google":
                    res = LLMService._call_google(model_name, prompt)
                elif provider == "anthropic":
                    res = LLMService._call_anthropic(model_name, prompt)
                else:
                    continue  # Unknown provider; res would be undefined

                # Success! Return with provider metadata
                return f"[{provider.upper()} - {model_name}] {res}"
            
            except (APIStatusError, APIConnectionError) as e:
                # Network or API errors - try next model
                print(f"⚠️ Network/status error in {provider} ({model_name}): {e}")
                continue 
            except Exception as e:
                # Other errors - try next model
                print(f"❌ Unexpected error in {model_name}: {str(e)[:50]}")
                continue

    # All providers failed
    return "Lo sentimos, el servicio de recomendaciones no está disponible."
Key features:
  • Automatic retry: If one model fails, immediately tries the next
  • Error categorization: Distinguishes network errors from other failures
  • Graceful degradation: Returns a user-friendly message if all providers fail
  • Transparency: Response includes which provider/model generated it
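The failover loop can be exercised in isolation with stub providers. This is a minimal sketch, not the production code: every name here is illustrative, and the real implementation lives in app/services/llm_service.py.

```python
# Stub providers: one that always fails, one that always succeeds.
def call_flaky(model_name: str, prompt: str) -> str:
    raise ConnectionError(f"{model_name} unavailable")

def call_healthy(model_name: str, prompt: str) -> str:
    return f"answer from {model_name}"

# Same hierarchical shape as LLM_CONFIG, with made-up providers/models.
LLM_CONFIG = [
    {"provider": "flaky", "models": ["flaky-fast", "flaky-pro"]},
    {"provider": "healthy", "models": ["healthy-fast"]},
]
CALLERS = {"flaky": call_flaky, "healthy": call_healthy}

def generate_answer(prompt: str) -> str:
    for entry in LLM_CONFIG:
        for model_name in entry["models"]:
            try:
                res = CALLERS[entry["provider"]](model_name, prompt)
                # Success: return with provider metadata, as in the real code
                return f"[{entry['provider'].upper()} - {model_name}] {res}"
            except Exception:
                continue  # try the next model / provider
    return "Lo sentimos, el servicio de recomendaciones no está disponible."

print(generate_answer("hello"))
# -> [HEALTHY - healthy-fast] answer from healthy-fast
```

Both stub models of the flaky provider fail before the healthy provider answers, mirroring the Gemini-then-Claude path.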

Provider implementations

Google Gemini

@staticmethod
def _call_google(model_name: str, prompt: str):
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)
    return response.text
Simple synchronous call to Google’s Generative AI SDK.

Anthropic Claude

@staticmethod
def _call_anthropic(model_name: str, prompt: str):
    message = anthropic_client.messages.create(
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
        model=model_name,
    )
    # Anthropic returns content as a list of blocks
    return message.content[0].text
Uses Anthropic’s Messages API with explicit token limits.
Different providers have different response formats. The wrapper methods normalize these into consistent text strings.

Error handling strategy

The system catches specific exceptions to determine whether to retry:

APIStatusError

Cause: HTTP 4xx/5xx errors from the API. Examples:
  • 429 Too Many Requests (rate limit)
  • 500 Internal Server Error
  • 503 Service Unavailable
Action: Try next provider/model

APIConnectionError

Cause: Network connectivity issues. Examples:
  • DNS resolution failure
  • Connection timeout
  • SSL/TLS errors
Action: Try next provider/model

Exception (catch-all)

Cause: Unexpected errors. Examples:
  • Invalid API key
  • Malformed response
  • JSON parsing errors
Action: Log error and try next provider/model
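These retry decisions can be sketched as a small classifier for logging purposes. Note this is an illustration: `APIStatusError` and `APIConnectionError` stand in for the SDK's exception classes and are defined as local dummies so the sketch is self-contained.

```python
# Dummy stand-ins for the SDK exception classes.
class APIStatusError(Exception):
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

class APIConnectionError(Exception):
    pass

def classify(exc: Exception) -> str:
    """Label an exception for logging; every category still means 'try next model'."""
    if isinstance(exc, APIStatusError):
        return f"http-{exc.status_code}"   # e.g. http-429, http-503
    if isinstance(exc, APIConnectionError):
        return "network"                   # DNS, timeout, TLS
    return "unexpected"                    # bad key, parse error, ...
```

Whatever the label, the action is the same (continue to the next model); the label only makes the logs and metrics more useful.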

Failover example

Let’s trace a request where Gemini is rate-limited:
Step 1. First attempt: Gemini 2.5 Flash

res = LLMService._call_google('models/gemini-2.5-flash', prompt)
Result: APIStatusError: 429 Too Many Requests
Action: Print warning, continue to next model
Step 2. Second attempt: Gemini Flash Latest

res = LLMService._call_google('models/gemini-flash-latest', prompt)
Result: APIConnectionError: Connection timeout
Action: Print warning, continue to next model
Step 3. Third attempt: Gemini 2.5 Pro

res = LLMService._call_google('models/gemini-2.5-pro', prompt)
Result: APIStatusError: 503 Service Unavailable
Action: All Google models failed, switch providers
Step 4. Fourth attempt: Claude 3 Haiku

res = LLMService._call_anthropic('claude-3-haiku-20240307', prompt)
Result: Success!
Return: "[ANTHROPIC - claude-3-haiku-20240307] Based on the context..."
The entire failover process is transparent to the calling code. From the endpoint’s perspective:
ai_recommendation = LLMService.generate_answer(search_data.query, context)
Always returns either a recommendation or a graceful failure message.

Configuration best practices

Order models by:
  1. Latency: Faster models first for better UX
  2. Cost: Cheaper models first to minimize expenses
  3. Capability: Reserve powerful models for when cheaper ones fail
Example: Gemini Flash (fast, cheap) → Claude Haiku (fast, cheap) → Gemini Pro (slower, capable) → Claude Sonnet (most capable)
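Note that the grouped LLM_CONFIG shown earlier always exhausts every Google model before touching Anthropic. Achieving an interleaved order like this example would require a flat priority list instead. A sketch, reusing the same model identifiers:

```python
# Flat (provider, model) priority list: providers can be interleaved,
# unlike the grouped-by-provider LLM_CONFIG structure.
LLM_PRIORITY = [
    ("google", "models/gemini-2.5-flash"),        # fast, cheap
    ("anthropic", "claude-3-haiku-20240307"),     # fast, cheap
    ("google", "models/gemini-2.5-pro"),          # slower, capable
    ("anthropic", "claude-3-5-sonnet-20240620"),  # most capable
]

for provider, model_name in LLM_PRIORITY:
    ...  # route to the matching _call_* helper, as in generate_answer
```

The trade-off is a slightly less readable configuration in exchange for full control over the failover order.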
The current implementation doesn’t set explicit timeouts. Consider adding:
import httpx

# Configure Anthropic client with timeout
anthropic_client = Anthropic(
    api_key=settings.ANTHROPIC_API_KEY,
    timeout=httpx.Timeout(30.0, connect=5.0)
)
This prevents hanging on slow providers.
For production systems, consider implementing circuit breakers:
  • Open: After N consecutive failures, stop trying a provider temporarily
  • Half-open: After cooldown period, try again
  • Closed: Provider is working normally
This prevents wasting time on consistently failing providers.
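A minimal sketch of such a breaker, with one instance per provider. The class name, thresholds, and cooldown are illustrative, not part of the current codebase.

```python
import time

class CircuitBreaker:
    """Per-provider breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, max_failures: int = 3, cooldown: float = 60.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True  # closed: provider is working normally
        # open: only allow again (half-open) once the cooldown has elapsed
        return time.time() - self.opened_at >= self.cooldown

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()  # trip the breaker

    def record_success(self) -> None:
        self.failures = 0  # back to closed

# In the failover loop: skip providers whose breaker is open.
breakers = {"google": CircuitBreaker(), "anthropic": CircuitBreaker()}
```

Inside generate_answer, the loop would check `breakers[provider].allow()` before trying a provider's models and record the outcome of each call.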
Track these metrics:
  • Provider success rate: Percentage of requests handled by each provider
  • Failover frequency: How often failover is triggered
  • Complete failure rate: Requests where all providers failed
  • Latency by provider: Response time distribution
Alert when:
  • Primary provider success rate drops below 95%
  • Complete failure rate exceeds 1%
  • Latency exceeds thresholds
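The counters behind these metrics can be kept with a minimal in-process sketch; a real deployment would export them to a metrics backend such as Prometheus instead. The `record` helper and the "google is primary" assumption are illustrative.

```python
from collections import Counter
from typing import Optional

stats = Counter()

def record(provider: Optional[str]) -> None:
    """Record one request: pass the provider that answered, or None if all failed."""
    stats["requests"] += 1
    if provider is None:
        stats["complete_failures"] += 1
        return
    stats[f"served_by_{provider}"] += 1
    if provider != "google":  # assumption: google is the primary provider
        stats["failovers"] += 1
```

Dividing `served_by_<provider>`, `failovers`, and `complete_failures` by `requests` yields the success, failover, and complete-failure rates listed above.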

Adding new providers

The system is designed for easy extension. To add OpenAI as a third provider:
Step 1. Install the SDK

pip install openai
Step 2. Add to configuration

LLM_CONFIG = [
    {"provider": "google", "models": [...]},
    {"provider": "anthropic", "models": [...]},
    {
        "provider": "openai",
        "models": ['gpt-4o-mini', 'gpt-4o']
    }
]
Step 3. Implement the provider method

from openai import OpenAI

openai_client = OpenAI(api_key=settings.OPENAI_API_KEY)

@staticmethod
def _call_openai(model_name: str, prompt: str):
    response = openai_client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024
    )
    return response.choices[0].message.content
Step 4. Add routing logic

if provider == "google":
    res = LLMService._call_google(model_name, prompt)
elif provider == "anthropic":
    res = LLMService._call_anthropic(model_name, prompt)
elif provider == "openai":
    res = LLMService._call_openai(model_name, prompt)
Step 5. Update environment variables

OPENAI_API_KEY="sk-..."
The failover logic automatically handles the new provider.

Embedding failover

Important limitation: Currently, embeddings only use Google Gemini. If Gemini embeddings fail, the entire search fails.
Embeddings require consistency - you can’t mix vectors from different models in the same search space. Options:
  1. Accept the risk: Gemini embeddings have high availability
  2. Dual embedding storage: Store embeddings from multiple models (increases storage)
  3. Offline fallback: Cache embeddings and serve stale results during Gemini outages
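Option 3 can be sketched as a thin caching wrapper around the embeddings call. This is an assumption-laden illustration: the `provider_call` parameter stands in for the real Gemini embeddings function, and the cache here is a plain in-memory dict.

```python
_embedding_cache: dict[str, list[float]] = {}

def embed_with_cache(query: str, provider_call) -> list[float]:
    """Embed a query, caching results so repeats survive a provider outage.

    provider_call is the real embeddings function (e.g. the Gemini call).
    """
    try:
        vec = provider_call(query)
        _embedding_cache[query] = vec  # refresh the cache on success
        return vec
    except Exception:
        if query in _embedding_cache:
            return _embedding_cache[query]  # stale but usable
        raise  # never-seen query and the provider is down: nothing to serve
```

This only helps for queries seen before the outage; novel queries still fail, which is why option 1 (accept the risk) is often the pragmatic default.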

Testing failover

To test failover behavior, point the service at an invalid Gemini key, then make a search request:
# Temporarily break Gemini by using an invalid API key, then restart the service
export GEMINI_API_KEY="invalid_key"

# Make a search request
curl -X POST "http://localhost:8000/products/search" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "test", "limit": 3}'
You should see:
  • Console logs showing Gemini failures
  • Automatic failover to Claude
  • Response with [ANTHROPIC - ...] prefix

Next steps

RAG pattern

Understand how multi-LLM failover works with RAG

Architecture overview

See how failover fits into the complete system

Environment setup

Configure API keys for multiple providers

Docker deployment

Deploy with failover configuration