Overview
High availability (HA) ensures LiteLLM remains operational during:
Infrastructure failures
Network outages
Database issues
Provider API failures
Traffic spikes
Target SLAs:
Uptime: 99.9% (8.76 hours downtime/year)
Latency: P95 < 5 seconds
Error rate: < 0.1%
Architecture Patterns
Single Region HA
Components:

```text
Load Balancer (ALB/NLB)
        │
        ├─ LiteLLM Pod 1 (AZ-1)
        ├─ LiteLLM Pod 2 (AZ-2)
        └─ LiteLLM Pod 3 (AZ-3)
        │
        ├─ PostgreSQL Primary (AZ-1)
        └─ PostgreSQL Replica (AZ-2)
        │
        ├─ Redis Primary (AZ-1)
        └─ Redis Replica (AZ-2)
```
Benefits:
Zone-level fault tolerance
Lower latency within region
Simpler to manage
Limitations:
No region-level DR
Higher latency for distant users
Multi-Region Active-Passive
```text
Region 1 (Primary - US-East):
  LiteLLM Pods (3x)
  PostgreSQL Primary
  Redis Primary
        │
        │ Replication
        ↓
Region 2 (Standby - US-West):
  LiteLLM Pods (0-1x, auto-start on failover)
  PostgreSQL Read Replica
  Redis Replica
```
Benefits:
Region-level DR
Lower cost (standby minimal)
Simple failover
Limitations:
Manual failover may be required
RTO: 2-5 minutes
RPO: 30-60 seconds
Multi-Region Active-Active
```text
Global Load Balancer (GeoDNS)
        │
        ├─ US-East
        │    ├─ LiteLLM Pods (3x)
        │    ├─ PostgreSQL Primary
        │    └─ Redis Primary
        │
        ├─ EU-West
        │    ├─ LiteLLM Pods (3x)
        │    ├─ PostgreSQL Primary
        │    └─ Redis Primary
        │
        └─ AP-Southeast
             ├─ LiteLLM Pods (2x)
             ├─ PostgreSQL Primary
             └─ Redis Primary
```

Database replication:

```text
US-East ↔ EU-West ↔ AP-Southeast
(Multi-master or primary with replicas)
```
Benefits:
Global low latency
No single point of failure
Automatic failover
RTO: 0 (instant)
RPO: 0 (no data loss)
Considerations:
Higher cost
Complex data consistency
Requires conflict resolution
Load Balancing
Application Load Balancer (ALB)
AWS:

```bash
# Create target group
aws elbv2 create-target-group \
  --name litellm-tg \
  --protocol HTTP \
  --port 4000 \
  --vpc-id vpc-xxx \
  --health-check-path /health/liveliness \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3

# Create ALB
aws elbv2 create-load-balancer \
  --name litellm-alb \
  --subnets subnet-xxx subnet-yyy subnet-zzz \
  --security-groups sg-xxx \
  --scheme internet-facing

# Create listener
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:... \
  --protocol HTTPS \
  --port 443 \
  --certificates CertificateArn=arn:aws:acm:... \
  --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...
```
GCP:

```bash
# Create health check
gcloud compute health-checks create http litellm-health \
  --port 4000 \
  --request-path /health/liveliness

# Create backend service
gcloud compute backend-services create litellm-backend \
  --protocol HTTP \
  --health-checks litellm-health \
  --global

# Add instance groups
gcloud compute backend-services add-backend litellm-backend \
  --instance-group litellm-ig-us \
  --instance-group-zone us-east1-b \
  --global

# Create URL map and HTTPS proxy
gcloud compute url-maps create litellm-lb \
  --default-service litellm-backend

gcloud compute target-https-proxies create litellm-https-proxy \
  --url-map litellm-lb \
  --ssl-certificates litellm-cert

# Create forwarding rule
gcloud compute forwarding-rules create litellm-https \
  --address litellm-ip \
  --global \
  --target-https-proxy litellm-https-proxy \
  --ports 443
```
Kubernetes Ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: litellm-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/upstream-hash-by: "$binary_remote_addr"
    nginx.ingress.kubernetes.io/load-balance: "ewma"  # Latency-based
spec:
  tls:
    - hosts:
        - api.yourdomain.com
      secretName: litellm-tls
  rules:
    - host: api.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: litellm
                port:
                  number: 4000
```
Routing Strategies
Round Robin (default):
Pros: simple, even distribution
Cons: ignores backend load and latency
Least Connections:

```nginx
upstream litellm {
    least_conn;
    server litellm-1:4000;
    server litellm-2:4000;
    server litellm-3:4000;
}
```
Latency-Based (recommended):

```nginx
# "Power of two choices", picking the backend with the lower header time.
# Note: the least_time parameter requires NGINX Plus; with open-source
# NGINX, use the ingress-nginx "ewma" balancer shown above instead.
upstream litellm {
    random two least_time=header;
    server litellm-1:4000;
    server litellm-2:4000;
    server litellm-3:4000;
}
```
Sticky Sessions:

```nginx
upstream litellm {
    ip_hash;  # Or use cookies
    server litellm-1:4000;
    server litellm-2:4000;
    server litellm-3:4000;
}
```
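The practical difference between these strategies is what state the balancer tracks. A minimal Python sketch of round-robin versus least-connections selection (backend names and counts are illustrative, not LiteLLM internals):

```python
import itertools

backends = ["litellm-1:4000", "litellm-2:4000", "litellm-3:4000"]

# Round robin: stateless rotation through the backend list
round_robin = itertools.cycle(backends)

# Least connections: route to the backend with the fewest in-flight requests
in_flight = {"litellm-1:4000": 5, "litellm-2:4000": 1, "litellm-3:4000": 3}

def pick_least_conn(counts):
    """Choose the backend with the lowest in-flight request count."""
    return min(counts, key=counts.get)

print(next(round_robin))           # → litellm-1:4000
print(pick_least_conn(in_flight))  # → litellm-2:4000
```

Round robin needs no feedback from the backends; least-connections (and EWMA) trade that simplicity for awareness of uneven load.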
Database High Availability
PostgreSQL Replication
AWS RDS:

```bash
# Create primary instance
aws rds create-db-instance \
  --db-instance-identifier litellm-db-primary \
  --db-instance-class db.r6g.large \
  --engine postgres \
  --engine-version 16.1 \
  --master-username litellm \
  --master-user-password YourPassword \
  --allocated-storage 100 \
  --storage-type gp3 \
  --storage-encrypted \
  --backup-retention-period 7 \
  --multi-az  # High availability

# Create read replicas
aws rds create-db-instance-read-replica \
  --db-instance-identifier litellm-db-replica-1 \
  --source-db-instance-identifier litellm-db-primary \
  --availability-zone us-east-1b

aws rds create-db-instance-read-replica \
  --db-instance-identifier litellm-db-replica-2 \
  --source-db-instance-identifier litellm-db-primary \
  --availability-zone us-east-1c
```
GCP Cloud SQL:

```bash
# Create HA instance (automatic failover)
gcloud sql instances create litellm-db \
  --database-version=POSTGRES_16 \
  --tier=db-custom-4-16384 \
  --region=us-central1 \
  --availability-type=REGIONAL \
  --backup \
  --backup-start-time=03:00

# Create read replicas
gcloud sql instances create litellm-db-read-1 \
  --master-instance-name=litellm-db \
  --tier=db-custom-2-8192 \
  --replica-type=READ \
  --zone=us-central1-b
```
Self-managed with Patroni (patroni.yml):

```yaml
scope: litellm
namespace: /litellm/
name: postgres-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgres-1:8008

etcd:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgres-1:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    replication:
      username: replicator
      password: RepPassword
    superuser:
      username: postgres
      password: PostgresPassword
```
Connection Pooling
PgBouncer for connection pooling:
```ini
[databases]
litellm = host=postgres-primary port=5432 dbname=litellm

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# Connection pooling
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3

# Performance
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
```
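With pool_mode = transaction, many clients share a small number of server connections. A rough sizing check (simplified model assuming one pool per database/user pair):

```python
def max_server_connections(n_pools: int, pool_size: int, reserve: int) -> int:
    """Worst-case PostgreSQL connections PgBouncer may open across its pools."""
    return n_pools * (pool_size + reserve)

# With the settings above, up to 1000 clients are served by at most
# 30 PostgreSQL connections per database/user pair.
print(max_server_connections(1, 25, 5))  # → 30
```

Keep this number comfortably below PostgreSQL's max_connections, leaving headroom for replication and admin sessions.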
Deploy with LiteLLM:
```yaml
services:
  pgbouncer:
    image: pgbouncer/pgbouncer:latest
    ports:
      - "6432:6432"
    volumes:
      - ./pgbouncer.ini:/etc/pgbouncer/pgbouncer.ini
      - ./userlist.txt:/etc/pgbouncer/userlist.txt
    environment:
      DATABASES: litellm=host=postgres-primary port=5432 dbname=litellm

  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    environment:
      DATABASE_URL: postgresql://litellm:pass@pgbouncer:6432/litellm
```
Redis High Availability
Redis Sentinel
```yaml
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis-master-data:/data

  redis-replica-1:
    image: redis:7-alpine
    command: redis-server --replicaof redis-master 6379 --appendonly yes
    volumes:
      - redis-replica-1-data:/data
    depends_on:
      - redis-master

  redis-sentinel-1:
    image: redis:7-alpine
    command: >
      sh -c '
      echo "port 26379" > /tmp/sentinel.conf &&
      echo "sentinel monitor mymaster redis-master 6379 2" >> /tmp/sentinel.conf &&
      echo "sentinel down-after-milliseconds mymaster 5000" >> /tmp/sentinel.conf &&
      echo "sentinel parallel-syncs mymaster 1" >> /tmp/sentinel.conf &&
      echo "sentinel failover-timeout mymaster 10000" >> /tmp/sentinel.conf &&
      redis-sentinel /tmp/sentinel.conf
      '
    ports:
      - "26379:26379"
    depends_on:
      - redis-master
      - redis-replica-1

# Named volumes referenced above must be declared
volumes:
  redis-master-data:
  redis-replica-1-data:
```
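The quorum of 2 in the sentinel monitor line above implies at least three Sentinels; the usual majority arithmetic, as a quick check:

```python
def min_sentinels(tolerated_failures: int) -> int:
    """To keep a majority after f Sentinel failures you need 2f + 1 Sentinels."""
    return 2 * tolerated_failures + 1

def quorum(n_sentinels: int) -> int:
    """Smallest majority of n Sentinels."""
    return n_sentinels // 2 + 1

# Surviving one Sentinel failure requires 3 Sentinels, with quorum 2
print(min_sentinels(1), quorum(3))  # → 3 2
```

In practice that means running the sentinel service above as three replicas spread across availability zones, not the single instance shown.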
LiteLLM config with Sentinel:
```yaml
general_settings:
  cache: true
  redis_host: os.environ/REDIS_SENTINEL_HOST
  redis_port: 26379
  redis_password: os.environ/REDIS_PASSWORD
  redis_use_sentinel: true
  redis_sentinel_master_name: mymaster
```
Redis Cluster
For higher throughput:
```bash
# Create 6-node cluster (3 masters, 3 replicas)
redis-cli --cluster create \
  redis-1:6379 redis-2:6379 redis-3:6379 \
  redis-4:6379 redis-5:6379 redis-6:6379 \
  --cluster-replicas 1
```
Provider Failover
Multi-Provider Configuration
Configure multiple providers for the same model to ensure availability if one provider has an outage.
```yaml
model_list:
  # OpenAI GPT-4o
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Azure OpenAI GPT-4o (fallback)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE
      api_version: "2024-02-15-preview"

  # Anthropic Claude (alternative)
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  # AWS Bedrock Claude (fallback)
  - model_name: claude-sonnet-4
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

router_settings:
  routing_strategy: latency-based-routing
  allowed_fails: 3     # Mark unhealthy after 3 failures
  cooldown_time: 60    # Retry after 60 seconds
  num_retries: 2       # Retry failed requests
  retry_after: 10      # Wait 10s between retries
  timeout: 30          # Request timeout
```
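Together, allowed_fails and cooldown_time act as a simple circuit breaker: a deployment that fails repeatedly is pulled out of rotation, then given another chance after the cooldown. A simplified model of that behavior (a sketch, not LiteLLM's actual internals):

```python
class DeploymentCooldown:
    """Simplified model of allowed_fails / cooldown_time semantics."""

    def __init__(self, allowed_fails: int = 3, cooldown_time: float = 60.0):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.failures: dict[str, int] = {}       # deployment -> consecutive fails
        self.cooling_since: dict[str, float] = {}  # deployment -> cooldown start

    def record_failure(self, deployment: str, now: float) -> None:
        self.failures[deployment] = self.failures.get(deployment, 0) + 1
        if self.failures[deployment] >= self.allowed_fails:
            self.cooling_since[deployment] = now

    def is_available(self, deployment: str, now: float) -> bool:
        start = self.cooling_since.get(deployment)
        if start is None:
            return True
        if now - start >= self.cooldown_time:
            # Cooldown expired: give the deployment another chance
            del self.cooling_since[deployment]
            self.failures[deployment] = 0
            return True
        return False

cb = DeploymentCooldown(allowed_fails=3, cooldown_time=60)
for t in (0, 1, 2):
    cb.record_failure("azure/gpt-4o", now=t)
print(cb.is_available("azure/gpt-4o", now=30))  # → False (cooling down)
print(cb.is_available("azure/gpt-4o", now=90))  # → True  (cooldown elapsed)
```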
Retry Logic
```yaml
router_settings:
  num_retries: 3
  retry_after: 5
  timeout: 30

  # Retry on specific errors
  retry_policy:
    - status_code: 429   # Rate limit
      retry_after: 60
    - status_code: 500   # Internal error
      retry_after: 10
    - status_code: 503   # Service unavailable
      retry_after: 30

  # Fallback models
  fallbacks:
    - model: gpt-4o
      fallback_models: ["azure/gpt-4o", "claude-sonnet-4"]
```
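The retry_policy above maps status codes to initial waits. If you additionally double the wait on each attempt (exponential backoff is a common choice here, not something this config enforces), the schedule looks like:

```python
RETRY_AFTER = {429: 60, 500: 10, 503: 30}  # from the retry_policy above

def backoff_schedule(status_code: int, num_retries: int, default_wait: int = 5):
    """Seconds to wait before each retry, doubling on every attempt."""
    first = RETRY_AFTER.get(status_code, default_wait)
    return [first * (2 ** attempt) for attempt in range(num_retries)]

print(backoff_schedule(429, 3))  # → [60, 120, 240]
print(backoff_schedule(404, 2))  # → [5, 10]
```

Long waits for 429s avoid hammering a rate-limited provider; short waits for 500s recover quickly from transient errors.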
Autoscaling
Kubernetes HPA
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
```
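The HPA's core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds; worth internalizing when tuning the 70% CPU target:

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int = 3, max_replicas: int = 20) -> int:
    """Core HPA scaling formula with min/max clamping."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods averaging 95% CPU against a 70% target scale out to 5
print(desired_replicas(3, 95, 70))  # → 5
# Load drops to 20%: clamped at minReplicas
print(desired_replicas(5, 20, 70))  # → 3
```

The behavior block above then rate-limits how fast the controller may move toward that desired count.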
KEDA (Advanced)
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: litellm-scaler
spec:
  scaleTargetRef:
    name: litellm
  minReplicaCount: 3
  maxReplicaCount: 50
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    # Scale on request rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: litellm_requests_per_second
        threshold: '1000'
        query: |
          sum(rate(litellm_requests_total[2m]))
    # Scale on queue depth (if using queues)
    - type: rabbitmq
      metadata:
        queueName: litellm-requests
        queueLength: '50'
    # Scale on CPU
    - type: cpu
      metricType: Utilization
      metadata:
        value: '70'
```
Disaster Recovery
Backup Strategy
Database backups:
Automated (AWS RDS):

```bash
# Automatic daily backups (retained 30 days)
aws rds modify-db-instance \
  --db-instance-identifier litellm-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Manual snapshot
aws rds create-db-snapshot \
  --db-instance-identifier litellm-db \
  --db-snapshot-identifier litellm-snapshot-$(date +%Y%m%d)
```
pg_dump script:

```bash
#!/bin/bash
# Backup script
BACKUP_DIR="/backups/postgresql"
DATE=$(date +%Y%m%d_%H%M%S)

# Full backup
pg_dump -h postgres-primary -U litellm -Fc litellm > \
  "$BACKUP_DIR/litellm_$DATE.dump"

# Upload to S3
aws s3 cp "$BACKUP_DIR/litellm_$DATE.dump" \
  s3://litellm-backups/postgresql/

# Cleanup old backups (keep 30 days)
find "$BACKUP_DIR" -name "*.dump" -mtime +30 -delete
```
Schedule with cron: `0 3 * * * /usr/local/bin/backup-litellm-db.sh`
Configuration backups:
```bash
# Backup config
kubectl get configmap litellm-config -o yaml > config-backup.yaml

# Backup secrets (encrypted)
kubectl get secrets litellm-secrets -o yaml | \
  openssl enc -aes-256-cbc -salt -out secrets-backup.enc
```
Recovery Procedures
Database recovery:
```bash
# Restore from snapshot (AWS RDS)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier litellm-db-restored \
  --db-snapshot-identifier litellm-snapshot-20240101

# Restore from pg_dump
pg_restore -h postgres-primary -U litellm -d litellm \
  /backups/litellm_20240101.dump
```
Failover to standby region:
```bash
# 1. Update DNS to point to standby region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.yourdomain.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "litellm-standby.region2.elb.amazonaws.com"}]
      }
    }]
  }'

# 2. Promote standby database to primary
aws rds promote-read-replica \
  --db-instance-identifier litellm-db-replica-region2

# 3. Scale up standby region
kubectl scale deployment litellm --replicas=5 -n litellm-standby
```
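These three steps dominate the recovery time. A rough RTO budget (the promotion and scale-up durations are illustrative assumptions, not measurements):

```python
def failover_rto_seconds(dns_ttl: int, promote: int, scale_up: int) -> int:
    """Rough RTO: clients must re-resolve DNS, the replica must be promoted,
    and standby pods must scale up. Steps may partially overlap in practice."""
    return dns_ttl + promote + scale_up

# 60s DNS TTL + ~90s replica promotion + ~120s pod scale-up ≈ 4.5 minutes,
# consistent with the 2-5 minute RTO quoted for active-passive.
print(failover_rto_seconds(60, 90, 120))  # → 270
```

Lowering the DNS TTL (already 60s in the record above) and pre-warming a small standby pod count are the cheapest levers for shrinking this number.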
Monitoring and Alerting
Critical Alerts
```yaml
groups:
  - name: litellm-ha
    rules:
      - alert: InstanceDown
        expr: up{job="litellm"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LiteLLM instance {{ $labels.instance }} is down"

      - alert: HighErrorRate
        expr: |
          (sum(rate(litellm_requests_total{status!="success"}[5m])) /
           sum(rate(litellm_requests_total[5m]))) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1%"

      - alert: DatabaseConnectionFailed
        expr: litellm_database_connection_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection failed"

      - alert: AllProvidersDown
        expr: sum(litellm_model_health_status{status="healthy"}) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All LLM providers are unhealthy"
```
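The HighErrorRate expression is just an error fraction over total throughput; the same arithmetic in plain Python, useful for sanity-checking the threshold:

```python
def error_rate(total_rps: float, success_rps: float) -> float:
    """Fraction of failing requests, mirroring the HighErrorRate rule."""
    return (total_rps - success_rps) / total_rps

# 1000 req/s with 985 succeeding → 1.5% errors: above the 1% alert threshold
rate = error_rate(1000, 985)
print(rate, rate > 0.01)  # → 0.015 True
```

Note the alert threshold (1%) is looser than the 0.1% SLA target, so the alert fires well before the error budget is exhausted for the year, not at the SLA boundary itself.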
Testing HA Setup
Chaos Engineering
Simulate failures:
```bash
# Kill random pod
kubectl delete pod $(kubectl get pods -l app=litellm -o name | shuf -n1)

# Simulate network partition
kubectl exec -it litellm-xxx -- iptables -A OUTPUT -d postgres-primary -j DROP

# Overload with traffic
k6 run --vus 1000 --duration 5m load-test.js
```
Use Chaos Mesh (Kubernetes):
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: litellm-pod-failure
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - litellm
    labelSelectors:
      app: litellm
  duration: '30s'
```
Disaster Recovery Drills
Monthly DR drill checklist:

1. Simulate primary region failure

```bash
# Block traffic to primary region
kubectl scale deployment litellm --replicas=0 -n production
```

2. Verify automatic failover
- Monitor traffic shifting to standby
- Check error rates remain < 1%
- Verify latency increase < 100ms

3. Test database failover

```bash
# Promote replica
aws rds promote-read-replica --db-instance-identifier litellm-db-replica
```

4. Validate the application

```bash
# Run smoke tests
curl -X POST https://api.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "test"}]
  }'
```

5. Document results
- RTO achieved
- RPO achieved
- Issues encountered
- Action items

6. Restore primary

```bash
kubectl scale deployment litellm --replicas=5 -n production
```
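Smoke tests should assert on response shape, not just HTTP 200. A minimal validator for the completion body returned by the curl call above (field names follow the OpenAI-compatible schema):

```python
import json

def validate_chat_response(body: str) -> bool:
    """Check a /v1/chat/completions body has at least one message choice."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    choices = data.get("choices") or []
    return bool(choices) and "message" in choices[0]

ok = '{"choices": [{"message": {"role": "assistant", "content": "hi"}}]}'
print(validate_chat_response(ok))                         # → True
print(validate_chat_response('{"error": "overloaded"}'))  # → False
```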
Best Practices Summary
Design for Failure
Assume everything will fail
No single points of failure
Multiple availability zones/regions
Redundant components
Automate Recovery
Health checks at all levels
Automatic failover
Self-healing systems
Circuit breakers
Monitor Everything
Comprehensive metrics
Proactive alerting
Distributed tracing
Regular testing
Practice DR
Monthly DR drills
Documented runbooks
Automated testing
Post-mortem analysis
Next Steps
Monitoring: set up comprehensive monitoring for HA
Performance: optimize performance at scale
Security: secure your HA deployment
Troubleshooting: debug HA-specific issues