Overview

High availability (HA) ensures LiteLLM remains operational during:
  • Infrastructure failures
  • Network outages
  • Database issues
  • Provider API failures
  • Traffic spikes
Target SLAs:
  • Uptime: 99.9% (8.76 hours downtime/year)
  • Latency: P95 < 5 seconds
  • Error rate: < 0.1%
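As a sanity check, the downtime budget implied by an uptime target follows directly from the percentage. A quick calculation (the 99.9% figure from the SLA above):

```shell
# Downtime budget implied by an uptime SLA:
# (100 - uptime%) / 100 * hours in a year
uptime=99.9
awk -v u="$uptime" 'BEGIN { printf "%.2f hours/year\n", (100 - u) / 100 * 365 * 24 }'
# → 8.76 hours/year
```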

Architecture Patterns

Single Region HA

Components:
  Load Balancer (ALB/NLB)

    ├─ LiteLLM Pod 1 (AZ-1)
    ├─ LiteLLM Pod 2 (AZ-2)  
    └─ LiteLLM Pod 3 (AZ-3)

    ├─ PostgreSQL Primary (AZ-1)
    └─ PostgreSQL Replica (AZ-2)

    ├─ Redis Primary (AZ-1)
    └─ Redis Replica (AZ-2)
Benefits:
  • Zone-level fault tolerance
  • Lower latency within region
  • Simpler to manage
Limitations:
  • No region-level DR
  • Higher latency for distant users

Multi-Region Active-Passive

Region 1 (Primary - US-East):
  LiteLLM Pods (3x)
  PostgreSQL Primary
  Redis Primary

  │ Replication

  
Region 2 (Standby - US-West):
  LiteLLM Pods (0-1x, auto-start on failover)
  PostgreSQL Read Replica
  Redis Replica
Benefits:
  • Region-level DR
  • Lower cost (standby minimal)
  • Simple failover
Limitations:
  • Manual failover may be required
  • RTO: 2-5 minutes
  • RPO: 30-60 seconds

Multi-Region Active-Active

Global Load Balancer (GeoDNS)

  ├─ US-East
  │   ├─ LiteLLM Pods (3x)
  │   ├─ PostgreSQL Primary
  │   └─ Redis Primary

  ├─ EU-West  
  │   ├─ LiteLLM Pods (3x)
  │   ├─ PostgreSQL Primary
  │   └─ Redis Primary

  └─ AP-Southeast
      ├─ LiteLLM Pods (2x)
      ├─ PostgreSQL Primary  
      └─ Redis Primary

Database Replication:
  US-East ↔ EU-West ↔ AP-Southeast
  (Multi-master or primary with replicas)
Benefits:
  • Global low latency
  • No single point of failure
  • Automatic failover
  • RTO: near zero (traffic shifts automatically)
  • RPO: near zero (depends on replication mode)
Considerations:
  • Higher cost
  • Complex data consistency
  • Requires conflict resolution

Load Balancing

Application Load Balancer (ALB)

# Create target group
aws elbv2 create-target-group \
  --name litellm-tg \
  --protocol HTTP \
  --port 4000 \
  --vpc-id vpc-xxx \
  --health-check-path /health/liveliness \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3

# Create ALB
aws elbv2 create-load-balancer \
  --name litellm-alb \
  --subnets subnet-xxx subnet-yyy subnet-zzz \
  --security-groups sg-xxx \
  --scheme internet-facing

# Create listener
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:... \
  --protocol HTTPS \
  --port 443 \
  --certificates CertificateArn=arn:aws:acm:... \
  --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...
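With the target-group settings above, a failed pod keeps receiving traffic until it misses `unhealthy-threshold-count` consecutive health checks. Worth computing when tuning the interval:

```shell
# Worst-case detection time before the ALB stops routing to a failed pod:
# health-check interval × unhealthy threshold (values from the target group above)
interval=30
threshold=3
echo "$(( interval * threshold ))s to mark unhealthy"
# → 90s to mark unhealthy
```

Tightening the interval to 10 seconds cuts detection to 30 seconds, at the cost of more health-check traffic.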

Routing Strategies

Round Robin (default):
Pros:
  • Simple, even distribution
Cons:
  • Ignores backend load and latency
Least Connections:
upstream litellm {
    least_conn;
    server litellm-1:4000;
    server litellm-2:4000;
    server litellm-3:4000;
}
Latency-Based (recommended):
# NGINX "power of two choices": pick two random peers, route to the faster one
# (the least_time method requires NGINX Plus; on open-source NGINX use
# "random two least_conn;" instead)
upstream litellm {
    random two least_time=header;
    server litellm-1:4000;
    server litellm-2:4000;
    server litellm-3:4000;
}
Sticky Sessions:
upstream litellm {
    ip_hash;  # Or use cookies
    server litellm-1:4000;
    server litellm-2:4000;
    server litellm-3:4000;
}

Database High Availability

PostgreSQL Replication

# Create primary instance
aws rds create-db-instance \
  --db-instance-identifier litellm-db-primary \
  --db-instance-class db.r6g.large \
  --engine postgres \
  --engine-version 16.1 \
  --master-username litellm \
  --master-user-password YourPassword \
  --allocated-storage 100 \
  --storage-type gp3 \
  --storage-encrypted \
  --backup-retention-period 7 \
  --multi-az  # High availability

# Create read replicas
aws rds create-db-instance-read-replica \
  --db-instance-identifier litellm-db-replica-1 \
  --source-db-instance-identifier litellm-db-primary \
  --availability-zone us-east-1b

aws rds create-db-instance-read-replica \
  --db-instance-identifier litellm-db-replica-2 \
  --source-db-instance-identifier litellm-db-primary \
  --availability-zone us-east-1c

Connection Pooling

PgBouncer for connection pooling:
pgbouncer.ini
[databases]
litellm = host=postgres-primary port=5432 dbname=litellm

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# Connection pooling
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3

# Performance
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
Deploy with LiteLLM:
docker-compose.yml
services:
  pgbouncer:
    image: pgbouncer/pgbouncer:latest
    ports:
      - "6432:6432"
    volumes:
      - ./pgbouncer.ini:/etc/pgbouncer/pgbouncer.ini
      - ./userlist.txt:/etc/pgbouncer/userlist.txt
    environment:
      DATABASES: litellm=host=postgres-primary port=5432 dbname=litellm
  
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    environment:
      DATABASE_URL: postgresql://litellm:pass@pgbouncer:6432/litellm
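When sizing PgBouncer, remember that `max_client_conn` bounds client-side connections, while the actual load on Postgres is set by the pools. A back-of-envelope check using the values from the `pgbouncer.ini` above (one database, one user assumed):

```shell
# Max server-side connections PgBouncer will open per (database, user) pair:
# default_pool_size + reserve_pool_size, times the number of pairs.
# Postgres max_connections must cover this plus superuser/replication slots.
default_pool_size=25
reserve_pool_size=5
pairs=1   # one database, one user in this setup
echo "$(( (default_pool_size + reserve_pool_size) * pairs )) max server connections"
# → 30 max server connections
```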

Redis High Availability

Redis Sentinel

docker-compose.yml
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis-master-data:/data
  
  redis-replica-1:
    image: redis:7-alpine
    command: redis-server --replicaof redis-master 6379 --appendonly yes
    volumes:
      - redis-replica-1-data:/data
    depends_on:
      - redis-master
  
  redis-sentinel-1:
    image: redis:7-alpine
    command: >
      sh -c '
        echo "port 26379" > /tmp/sentinel.conf &&
        echo "sentinel monitor mymaster redis-master 6379 2" >> /tmp/sentinel.conf &&
        echo "sentinel down-after-milliseconds mymaster 5000" >> /tmp/sentinel.conf &&
        echo "sentinel parallel-syncs mymaster 1" >> /tmp/sentinel.conf &&
        echo "sentinel failover-timeout mymaster 10000" >> /tmp/sentinel.conf &&
        redis-sentinel /tmp/sentinel.conf
      '
    ports:
      - "26379:26379"
    depends_on:
      - redis-master
      - redis-replica-1
LiteLLM config with Sentinel:
config.yaml
general_settings:
  cache: true
  redis_host: os.environ/REDIS_SENTINEL_HOST
  redis_port: 26379
  redis_password: os.environ/REDIS_PASSWORD
  redis_use_sentinel: true
  redis_sentinel_master_name: mymaster
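The `os.environ/` references in the config above resolve from the proxy's environment. For the Compose setup earlier, exports along these lines would work (hostnames and password are placeholders for this example):

```shell
# Point LiteLLM at any Sentinel node; Sentinel reports the current master.
export REDIS_SENTINEL_HOST=redis-sentinel-1   # placeholder: a Sentinel service name
export REDIS_PASSWORD=changeme                # placeholder: use your real password
```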

Redis Cluster

For higher throughput:
# Create 6-node cluster (3 masters, 3 replicas)
redis-cli --cluster create \
  redis-1:6379 redis-2:6379 redis-3:6379 \
  redis-4:6379 redis-5:6379 redis-6:6379 \
  --cluster-replicas 1

Provider Failover

Multi-Provider Configuration

Configure multiple providers for the same model to ensure availability if one provider has an outage.
config.yaml
model_list:
  # OpenAI GPT-4o
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  
  # Azure OpenAI GPT-4o (fallback)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE
      api_version: "2024-02-15-preview"
  
  # Anthropic Claude (alternative)
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  
  # AWS Bedrock Claude (fallback)
  - model_name: claude-sonnet-4
    litellm_params:
      model: bedrock/anthropic.claude-sonnet-4-20250514-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

router_settings:
  routing_strategy: latency-based-routing
  allowed_fails: 3  # Mark unhealthy after 3 failures
  cooldown_time: 60  # Retry after 60 seconds
  num_retries: 2  # Retry failed requests
  retry_after: 10  # Wait 10s between retries
  timeout: 30  # Request timeout

Retry Logic

config.yaml
router_settings:
  num_retries: 3
  retry_after: 5
  timeout: 30
  
  # Retry on specific errors
  retry_policy:
    - status_code: 429  # Rate limit
      retry_after: 60
    - status_code: 500  # Internal error
      retry_after: 10
    - status_code: 503  # Service unavailable
      retry_after: 30
  
  # Fallback models
  fallbacks:
    - model: gpt-4o
      fallback_models: ["azure/gpt-4o", "claude-sonnet-4"]
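These settings bound how long a client can wait before the final failure. A back-of-envelope worst case, assuming every attempt runs to the full timeout before erroring:

```shell
# Worst-case client wait: initial attempt at full timeout, then each retry
# waits retry_after and again runs to the full timeout (values from above)
num_retries=3
retry_after=5
timeout=30
echo "worst case: $(( timeout + num_retries * (retry_after + timeout) ))s before final failure"
# → worst case: 135s before final failure
```

If that exceeds what callers tolerate, lower `timeout` or `num_retries` rather than the per-error `retry_after` values.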

Autoscaling

Kubernetes HPA

hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
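The HPA controller computes the target as `desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)`, then applies the `behavior` limits above. For example, 3 pods at 95% CPU against the 70% target:

```shell
# HPA scaling formula: ceil(currentReplicas * currentUtilization / target)
awk -v c=3 -v u=95 -v t=70 'BEGIN {
  d = c * u / t                                   # 3 * 95 / 70 ≈ 4.07
  printf "%d replicas\n", (d == int(d)) ? d : int(d) + 1
}'
# → 5 replicas
```

With the scale-up policy above (2 pods per 60s), reaching 5 replicas from 3 takes one scaling step after the 60-second stabilization window.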

KEDA (Advanced)

keda-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: litellm-scaler
spec:
  scaleTargetRef:
    name: litellm
  minReplicaCount: 3
  maxReplicaCount: 50
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    # Scale on request rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: litellm_requests_per_second
        threshold: '1000'
        query: |
          sum(rate(litellm_requests_total[2m]))
    
    # Scale on queue depth (if using queues)
    - type: rabbitmq
      metadata:
        queueName: litellm-requests
        queueLength: '50'
    
    # Scale on CPU
    - type: cpu
      metricType: Utilization
      metadata:
        value: '70'
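For the Prometheus trigger, KEDA hands the metric to the HPA as an average value, so the replica target is roughly `ceil(metricValue / threshold)`, clamped to `[minReplicaCount, maxReplicaCount]`. For example, 4200 req/s against the 1000 threshold above:

```shell
# KEDA prometheus trigger: approx. ceil(metric / threshold) replicas
awk -v m=4200 -v t=1000 'BEGIN {
  d = m / t                                       # 4200 / 1000 = 4.2
  printf "%d replicas\n", (d == int(d)) ? d : int(d) + 1
}'
# → 5 replicas
```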

Disaster Recovery

Backup Strategy

Database backups:
# Extend automated backup retention to 30 days
aws rds modify-db-instance \
  --db-instance-identifier litellm-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Manual snapshot
aws rds create-db-snapshot \
  --db-instance-identifier litellm-db \
  --db-snapshot-identifier litellm-snapshot-$(date +%Y%m%d)
Configuration backups:
# Backup config
kubectl get configmap litellm-config -o yaml > config-backup.yaml

# Backup secrets (encrypted)
kubectl get secrets litellm-secrets -o yaml | \
  openssl enc -aes-256-cbc -salt -out secrets-backup.enc
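An encrypted backup is only useful if it decrypts. Before relying on it, verify the passphrase round-trips; a minimal check with a throwaway file and passphrase (for illustration only, not your real secrets):

```shell
# Encrypt then decrypt a test file to confirm the openssl round trip works
echo "secret" > /tmp/plain.txt
openssl enc -aes-256-cbc -salt -pass pass:demo -in /tmp/plain.txt -out /tmp/backup.enc
openssl enc -aes-256-cbc -d -pass pass:demo -in /tmp/backup.enc
# → secret
```

For the real secrets backup, decrypt with `openssl enc -aes-256-cbc -d -in secrets-backup.enc` and pipe the result to `kubectl apply -f -` during recovery.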

Recovery Procedures

Database recovery:
# Restore from snapshot (AWS RDS)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier litellm-db-restored \
  --db-snapshot-identifier litellm-snapshot-20240101

# Restore from pg_dump
pg_restore -h postgres-primary -U litellm -d litellm \
  /backups/litellm_20240101.dump
Failover to standby region:
# 1. Update DNS to point to standby region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.yourdomain.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "litellm-standby.region2.elb.amazonaws.com"}]
      }
    }]
  }'

# 2. Promote standby database to primary
aws rds promote-read-replica \
  --db-instance-identifier litellm-db-replica-region2

# 3. Scale up standby region
kubectl scale deployment litellm --replicas=5 -n litellm-standby

Monitoring and Alerting

Critical Alerts

prometheus-alerts.yml
groups:
  - name: litellm-ha
    rules:
      - alert: InstanceDown
        expr: up{job="litellm"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LiteLLM instance {{ $labels.instance }} is down"
      
      - alert: HighErrorRate
        expr: |
          (sum(rate(litellm_requests_total{status!="success"}[5m])) /
           sum(rate(litellm_requests_total[5m]))) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1%"
      
      - alert: DatabaseConnectionFailed
        expr: litellm_database_connection_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection failed"
      
      - alert: AllProvidersDown
        expr: sum(litellm_model_health_status{status="healthy"}) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All LLM providers are unhealthy"

Testing HA Setup

Chaos Engineering

Simulate failures:
# Kill random pod
kubectl delete pod $(kubectl get pods -l app=litellm -o name | shuf -n1)

# Simulate network partition
kubectl exec -it litellm-xxx -- iptables -A OUTPUT -d postgres-primary -j DROP

# Overload with traffic
k6 run --vus 1000 --duration 5m load-test.js
Use Chaos Mesh (Kubernetes):
chaos-experiment.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: litellm-pod-failure
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - litellm
    labelSelectors:
      app: litellm
  duration: '30s'

Disaster Recovery Drills

Monthly DR drill checklist:
1. Simulate Primary Region Failure

# Block traffic to primary region
kubectl scale deployment litellm --replicas=0 -n production
2. Verify Automatic Failover

  • Monitor that traffic shifts to the standby region
  • Check that error rates remain < 1%
  • Verify the latency increase stays < 100 ms
3. Test Database Failover

# Promote replica
aws rds promote-read-replica --db-instance-identifier litellm-db-replica
4. Validate Application

# Run smoke tests
curl -X POST https://api.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "test"}]
  }'
5. Document Results

  • RTO achieved
  • RPO achieved
  • Issues encountered
  • Action items
6. Restore Primary

kubectl scale deployment litellm --replicas=5 -n production

Best Practices Summary

1. Design for Failure

  • Assume everything will fail
  • No single points of failure
  • Multiple availability zones/regions
  • Redundant components
2. Automate Recovery

  • Health checks at all levels
  • Automatic failover
  • Self-healing systems
  • Circuit breakers
3. Monitor Everything

  • Comprehensive metrics
  • Proactive alerting
  • Distributed tracing
  • Regular testing
4. Practice DR

  • Monthly DR drills
  • Documented runbooks
  • Automated testing
  • Post-mortem analysis

Next Steps

Monitoring

Set up comprehensive monitoring for HA

Performance

Optimize performance at scale

Security

Secure your HA deployment

Troubleshooting

Debug HA-specific issues
