Overview
High availability (HA) ensures LiteLLM remains operational during:
Infrastructure failures
Network outages
Database issues
Provider API failures
Traffic spikes
Target SLAs:
Uptime: 99.9% (8.76 hours downtime/year)
Latency: P95 < 5 seconds
Error rate: < 0.1%
Architecture Patterns
Single Region HA
Components:

```text
Load Balancer (ALB/NLB)
        │
        ├─ LiteLLM Pod 1 (AZ-1)
        ├─ LiteLLM Pod 2 (AZ-2)
        └─ LiteLLM Pod 3 (AZ-3)
        │
        ├─ PostgreSQL Primary (AZ-1)
        └─ PostgreSQL Replica (AZ-2)
        │
        ├─ Redis Primary (AZ-1)
        └─ Redis Replica (AZ-2)
```
Benefits:
Zone-level fault tolerance
Lower latency within region
Simpler to manage
Limitations:
No region-level DR
Higher latency for distant users
Multi-Region Active-Passive
```text
Region 1 (Primary - US-East):
  LiteLLM Pods (3x)
  PostgreSQL Primary
  Redis Primary
        │
        │ Replication
        ↓
Region 2 (Standby - US-West):
  LiteLLM Pods (0-1x, auto-start on failover)
  PostgreSQL Read Replica
  Redis Replica
```
Benefits:
Region-level DR
Lower cost (standby minimal)
Simple failover
Limitations:
Manual failover may be required
RTO: 2-5 minutes
RPO: 30-60 seconds
Multi-Region Active-Active
```text
Global Load Balancer (GeoDNS)
        │
        ├─ US-East
        │    ├─ LiteLLM Pods (3x)
        │    ├─ PostgreSQL Primary
        │    └─ Redis Primary
        │
        ├─ EU-West
        │    ├─ LiteLLM Pods (3x)
        │    ├─ PostgreSQL Primary
        │    └─ Redis Primary
        │
        └─ AP-Southeast
             ├─ LiteLLM Pods (2x)
             ├─ PostgreSQL Primary
             └─ Redis Primary
```

Database replication:

```text
US-East ↔ EU-West ↔ AP-Southeast
(Multi-master or primary with replicas)
```
Benefits:
Global low latency
No single point of failure
Automatic failover
RTO: 0 (instant)
RPO: 0 (no data loss)
Considerations:
Higher cost
Complex data consistency
Requires conflict resolution
Load Balancing
Application Load Balancer (ALB)
AWS:

```bash
# Create target group
aws elbv2 create-target-group \
  --name litellm-tg \
  --protocol HTTP \
  --port 4000 \
  --vpc-id vpc-xxx \
  --health-check-path /health/liveliness \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3

# Create ALB
aws elbv2 create-load-balancer \
  --name litellm-alb \
  --subnets subnet-xxx subnet-yyy subnet-zzz \
  --security-groups sg-xxx \
  --scheme internet-facing

# Create listener
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:... \
  --protocol HTTPS \
  --port 443 \
  --certificates CertificateArn=arn:aws:acm:... \
  --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...
```
GCP:

```bash
# Create health check
gcloud compute health-checks create http litellm-health \
  --port 4000 \
  --request-path /health/liveliness

# Create backend service
gcloud compute backend-services create litellm-backend \
  --protocol HTTP \
  --health-checks litellm-health \
  --global

# Add instance groups
gcloud compute backend-services add-backend litellm-backend \
  --instance-group litellm-ig-us \
  --instance-group-zone us-east1-b \
  --global

# Create URL map and HTTPS proxy
gcloud compute url-maps create litellm-lb \
  --default-service litellm-backend

gcloud compute target-https-proxies create litellm-https-proxy \
  --url-map litellm-lb \
  --ssl-certificates litellm-cert

# Create forwarding rule
gcloud compute forwarding-rules create litellm-https \
  --address litellm-ip \
  --global \
  --target-https-proxy litellm-https-proxy \
  --ports 443
```
Kubernetes Ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: litellm-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/upstream-hash-by: "$binary_remote_addr"
    nginx.ingress.kubernetes.io/load-balance: "ewma"  # Latency-based
spec:
  tls:
    - hosts:
        - api.yourdomain.com
      secretName: litellm-tls
  rules:
    - host: api.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: litellm
                port:
                  number: 4000
```
Routing Strategies
Round Robin (default):
Pros: simple, even distribution
Cons: ignores backend load and latency
Least Connections:

```nginx
upstream litellm {
    least_conn;
    server litellm-1:4000;
    server litellm-2:4000;
    server litellm-3:4000;
}
```
Latency-Based (recommended):

```nginx
# "Power of two choices", picking the backend with the lower header time.
# Note: the least_time parameter requires NGINX Plus; with open-source
# NGINX, use the ingress-nginx "ewma" balancer shown above instead.
upstream litellm {
    random two least_time=header;
    server litellm-1:4000;
    server litellm-2:4000;
    server litellm-3:4000;
}
```
Sticky Sessions:

```nginx
upstream litellm {
    ip_hash;  # Or use cookies
    server litellm-1:4000;
    server litellm-2:4000;
    server litellm-3:4000;
}
```
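The practical difference between these strategies is what state the balancer tracks. A minimal Python sketch of round-robin versus least-connections selection (backend names and counts are illustrative, not LiteLLM internals):

```python
import itertools

backends = ["litellm-1:4000", "litellm-2:4000", "litellm-3:4000"]

# Round robin: stateless rotation through the backend list
round_robin = itertools.cycle(backends)

# Least connections: route to the backend with the fewest in-flight requests
in_flight = {"litellm-1:4000": 5, "litellm-2:4000": 1, "litellm-3:4000": 3}

def pick_least_conn(counts):
    """Choose the backend with the lowest in-flight request count."""
    return min(counts, key=counts.get)

print(next(round_robin))           # → litellm-1:4000
print(pick_least_conn(in_flight))  # → litellm-2:4000
```

Round robin needs no feedback from the backends; least-connections (and EWMA) trade that simplicity for awareness of uneven load.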
Database High Availability
PostgreSQL Replication
AWS RDS:

```bash
# Create primary instance
aws rds create-db-instance \
  --db-instance-identifier litellm-db-primary \
  --db-instance-class db.r6g.large \
  --engine postgres \
  --engine-version 16.1 \
  --master-username litellm \
  --master-user-password YourPassword \
  --allocated-storage 100 \
  --storage-type gp3 \
  --storage-encrypted \
  --backup-retention-period 7 \
  --multi-az  # High availability

# Create read replicas
aws rds create-db-instance-read-replica \
  --db-instance-identifier litellm-db-replica-1 \
  --source-db-instance-identifier litellm-db-primary \
  --availability-zone us-east-1b

aws rds create-db-instance-read-replica \
  --db-instance-identifier litellm-db-replica-2 \
  --source-db-instance-identifier litellm-db-primary \
  --availability-zone us-east-1c
```
GCP Cloud SQL:

```bash
# Create HA instance (automatic failover)
gcloud sql instances create litellm-db \
  --database-version=POSTGRES_16 \
  --tier=db-custom-4-16384 \
  --region=us-central1 \
  --availability-type=REGIONAL \
  --backup \
  --backup-start-time=03:00

# Create read replicas
gcloud sql instances create litellm-db-read-1 \
  --master-instance-name=litellm-db \
  --tier=db-custom-2-8192 \
  --replica-type=READ \
  --zone=us-central1-b
```
Self-managed with Patroni (patroni.yml):

```yaml
scope: litellm
namespace: /litellm/
name: postgres-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: postgres-1:8008

etcd:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10

postgresql:
  listen: 0.0.0.0:5432
  connect_address: postgres-1:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    replication:
      username: replicator
      password: RepPassword
    superuser:
      username: postgres
      password: PostgresPassword
```
Connection Pooling
PgBouncer for connection pooling:
```ini
[databases]
litellm = host=postgres-primary port=5432 dbname=litellm

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# Connection pooling
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3

# Performance
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
```
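With pool_mode = transaction, many clients share a small number of server connections. A rough sizing check (simplified model assuming one pool per database/user pair):

```python
def max_server_connections(n_pools: int, pool_size: int, reserve: int) -> int:
    """Worst-case PostgreSQL connections PgBouncer may open across its pools."""
    return n_pools * (pool_size + reserve)

# With the settings above, up to 1000 clients are served by at most
# 30 PostgreSQL connections per database/user pair.
print(max_server_connections(1, 25, 5))  # → 30
```

Keep this number comfortably below PostgreSQL's max_connections, leaving headroom for replication and admin sessions.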
Deploy with LiteLLM:
```yaml
services:
  pgbouncer:
    image: pgbouncer/pgbouncer:latest
    ports:
      - "6432:6432"
    volumes:
      - ./pgbouncer.ini:/etc/pgbouncer/pgbouncer.ini
      - ./userlist.txt:/etc/pgbouncer/userlist.txt
    environment:
      DATABASES: litellm=host=postgres-primary port=5432 dbname=litellm

  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    environment:
      DATABASE_URL: postgresql://litellm:pass@pgbouncer:6432/litellm
```
Redis High Availability
Redis Sentinel
```yaml
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis-master-data:/data

  redis-replica-1:
    image: redis:7-alpine
    command: redis-server --replicaof redis-master 6379 --appendonly yes
    volumes:
      - redis-replica-1-data:/data
    depends_on:
      - redis-master

  redis-sentinel-1:
    image: redis:7-alpine
    command: >
      sh -c '
      echo "port 26379" > /tmp/sentinel.conf &&
      echo "sentinel monitor mymaster redis-master 6379 2" >> /tmp/sentinel.conf &&
      echo "sentinel down-after-milliseconds mymaster 5000" >> /tmp/sentinel.conf &&
      echo "sentinel parallel-syncs mymaster 1" >> /tmp/sentinel.conf &&
      echo "sentinel failover-timeout mymaster 10000" >> /tmp/sentinel.conf &&
      redis-sentinel /tmp/sentinel.conf
      '
    ports:
      - "26379:26379"
    depends_on:
      - redis-master
      - redis-replica-1

# Named volumes referenced above must be declared
volumes:
  redis-master-data:
  redis-replica-1-data:
```
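The quorum of 2 in the sentinel monitor line above implies at least three Sentinels; the usual majority arithmetic, as a quick check:

```python
def min_sentinels(tolerated_failures: int) -> int:
    """To keep a majority after f Sentinel failures you need 2f + 1 Sentinels."""
    return 2 * tolerated_failures + 1

def quorum(n_sentinels: int) -> int:
    """Smallest majority of n Sentinels."""
    return n_sentinels // 2 + 1

# Surviving one Sentinel failure requires 3 Sentinels, with quorum 2
print(min_sentinels(1), quorum(3))  # → 3 2
```

In practice that means running the sentinel service above as three replicas spread across availability zones, not the single instance shown.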
LiteLLM config with Sentinel:
```yaml
general_settings:
  cache: true
  redis_host: os.environ/REDIS_SENTINEL_HOST
  redis_port: 26379
  redis_password: os.environ/REDIS_PASSWORD
  redis_use_sentinel: true
  redis_sentinel_master_name: mymaster
```
Redis Cluster
For higher throughput:
```bash
# Create 6-node cluster (3 masters, 3 replicas)
redis-cli --cluster create \
  redis-1:6379 redis-2:6379 redis-3:6379 \
  redis-4:6379 redis-5:6379 redis-6:6379 \
  --cluster-replicas 1
```
Provider Failover
Multi-Provider Configuration
Configure multiple providers for the same model to ensure availability if one provider has an outage.
```yaml
model_list:
  # OpenAI GPT-4o
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  # Azure OpenAI GPT-4o (fallback)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE
      api_version: "2024-02-15-preview"

  # Anthropic Claude (alternative)
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  # AWS Bedrock Claude (fallback)
  - model_name: claude-sonnet-4
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

router_settings:
  routing_strategy: latency-based-routing
  allowed_fails: 3     # Mark unhealthy after 3 failures
  cooldown_time: 60    # Retry after 60 seconds
  num_retries: 2       # Retry failed requests
  retry_after: 10      # Wait 10s between retries
  timeout: 30          # Request timeout
```
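Together, allowed_fails and cooldown_time act as a simple circuit breaker: a deployment that fails repeatedly is pulled out of rotation, then given another chance after the cooldown. A simplified model of that behavior (a sketch, not LiteLLM's actual internals):

```python
class DeploymentCooldown:
    """Simplified model of allowed_fails / cooldown_time semantics."""

    def __init__(self, allowed_fails: int = 3, cooldown_time: float = 60.0):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.failures: dict[str, int] = {}       # deployment -> consecutive fails
        self.cooling_since: dict[str, float] = {}  # deployment -> cooldown start

    def record_failure(self, deployment: str, now: float) -> None:
        self.failures[deployment] = self.failures.get(deployment, 0) + 1
        if self.failures[deployment] >= self.allowed_fails:
            self.cooling_since[deployment] = now

    def is_available(self, deployment: str, now: float) -> bool:
        start = self.cooling_since.get(deployment)
        if start is None:
            return True
        if now - start >= self.cooldown_time:
            # Cooldown expired: give the deployment another chance
            del self.cooling_since[deployment]
            self.failures[deployment] = 0
            return True
        return False

cb = DeploymentCooldown(allowed_fails=3, cooldown_time=60)
for t in (0, 1, 2):
    cb.record_failure("azure/gpt-4o", now=t)
print(cb.is_available("azure/gpt-4o", now=30))  # → False (cooling down)
print(cb.is_available("azure/gpt-4o", now=90))  # → True  (cooldown elapsed)
```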
Retry Logic
```yaml
router_settings:
  num_retries: 3
  retry_after: 5
  timeout: 30

  # Retry on specific errors
  retry_policy:
    - status_code: 429   # Rate limit
      retry_after: 60
    - status_code: 500   # Internal error
      retry_after: 10
    - status_code: 503   # Service unavailable
      retry_after: 30

  # Fallback models
  fallbacks:
    - model: gpt-4o
      fallback_models: ["azure/gpt-4o", "claude-sonnet-4"]
```
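The retry_policy above maps status codes to initial waits. If you additionally double the wait on each attempt (exponential backoff is a common choice here, not something this config enforces), the schedule looks like:

```python
RETRY_AFTER = {429: 60, 500: 10, 503: 30}  # from the retry_policy above

def backoff_schedule(status_code: int, num_retries: int, default_wait: int = 5):
    """Seconds to wait before each retry, doubling on every attempt."""
    first = RETRY_AFTER.get(status_code, default_wait)
    return [first * (2 ** attempt) for attempt in range(num_retries)]

print(backoff_schedule(429, 3))  # → [60, 120, 240]
print(backoff_schedule(404, 2))  # → [5, 10]
```

Long waits for 429s avoid hammering a rate-limited provider; short waits for 500s recover quickly from transient errors.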
Autoscaling
Kubernetes HPA
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: litellm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: litellm
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
```
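The HPA's core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds; worth internalizing when tuning the 70% CPU target:

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int = 3, max_replicas: int = 20) -> int:
    """Core HPA scaling formula with min/max clamping."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods averaging 95% CPU against a 70% target scale out to 5
print(desired_replicas(3, 95, 70))  # → 5
# Load drops to 20%: clamped at minReplicas
print(desired_replicas(5, 20, 70))  # → 3
```

The behavior block above then rate-limits how fast the controller may move toward that desired count.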
KEDA (Advanced)
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: litellm-scaler
spec:
  scaleTargetRef:
    name: litellm
  minReplicaCount: 3
  maxReplicaCount: 50
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    # Scale on request rate
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: litellm_requests_per_second
        threshold: '1000'
        query: |
          sum(rate(litellm_requests_total[2m]))
    # Scale on queue depth (if using queues)
    - type: rabbitmq
      metadata:
        queueName: litellm-requests
        queueLength: '50'
    # Scale on CPU
    - type: cpu
      metricType: Utilization
      metadata:
        value: '70'
```
Disaster Recovery
Backup Strategy
Database backups:
Automated (AWS RDS):

```bash
# Automatic daily backups (retained 30 days)
aws rds modify-db-instance \
  --db-instance-identifier litellm-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Manual snapshot
aws rds create-db-snapshot \
  --db-instance-identifier litellm-db \
  --db-snapshot-identifier litellm-snapshot-$(date +%Y%m%d)
```
pg_dump script:

```bash
#!/bin/bash
# Backup script
BACKUP_DIR="/backups/postgresql"
DATE=$(date +%Y%m%d_%H%M%S)

# Full backup
pg_dump -h postgres-primary -U litellm -Fc litellm > \
  "$BACKUP_DIR/litellm_$DATE.dump"

# Upload to S3
aws s3 cp "$BACKUP_DIR/litellm_$DATE.dump" \
  s3://litellm-backups/postgresql/

# Cleanup old backups (keep 30 days)
find "$BACKUP_DIR" -name "*.dump" -mtime +30 -delete
```
Schedule with cron: `0 3 * * * /usr/local/bin/backup-litellm-db.sh`
Configuration backups:
```bash
# Backup config
kubectl get configmap litellm-config -o yaml > config-backup.yaml

# Backup secrets (encrypted)
kubectl get secrets litellm-secrets -o yaml | \
  openssl enc -aes-256-cbc -salt -out secrets-backup.enc
```
Recovery Procedures
Database recovery:
```bash
# Restore from snapshot (AWS RDS)
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier litellm-db-restored \
  --db-snapshot-identifier litellm-snapshot-20240101

# Restore from pg_dump
pg_restore -h postgres-primary -U litellm -d litellm \
  /backups/litellm_20240101.dump
```
Failover to standby region:
```bash
# 1. Update DNS to point to standby region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.yourdomain.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "litellm-standby.region2.elb.amazonaws.com"}]
      }
    }]
  }'

# 2. Promote standby database to primary
aws rds promote-read-replica \
  --db-instance-identifier litellm-db-replica-region2

# 3. Scale up standby region
kubectl scale deployment litellm --replicas=5 -n litellm-standby
```
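These three steps dominate the recovery time. A rough RTO budget (the promotion and scale-up durations are illustrative assumptions, not measurements):

```python
def failover_rto_seconds(dns_ttl: int, promote: int, scale_up: int) -> int:
    """Rough RTO: clients must re-resolve DNS, the replica must be promoted,
    and standby pods must scale up. Steps may partially overlap in practice."""
    return dns_ttl + promote + scale_up

# 60s DNS TTL + ~90s replica promotion + ~120s pod scale-up ≈ 4.5 minutes,
# consistent with the 2-5 minute RTO quoted for active-passive.
print(failover_rto_seconds(60, 90, 120))  # → 270
```

Lowering the DNS TTL (already 60s in the record above) and pre-warming a small standby pod count are the cheapest levers for shrinking this number.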
Monitoring and Alerting
Critical Alerts
```yaml
groups:
  - name: litellm-ha
    rules:
      - alert: InstanceDown
        expr: up{job="litellm"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LiteLLM instance {{ $labels.instance }} is down"

      - alert: HighErrorRate
        expr: |
          (sum(rate(litellm_requests_total{status!="success"}[5m])) /
           sum(rate(litellm_requests_total[5m]))) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1%"

      - alert: DatabaseConnectionFailed
        expr: litellm_database_connection_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection failed"

      - alert: AllProvidersDown
        expr: sum(litellm_model_health_status{status="healthy"}) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All LLM providers are unhealthy"
```
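The HighErrorRate expression is just an error fraction over total throughput; the same arithmetic in plain Python, useful for sanity-checking the threshold:

```python
def error_rate(total_rps: float, success_rps: float) -> float:
    """Fraction of failing requests, mirroring the HighErrorRate rule."""
    return (total_rps - success_rps) / total_rps

# 1000 req/s with 985 succeeding → 1.5% errors: above the 1% alert threshold
rate = error_rate(1000, 985)
print(rate, rate > 0.01)  # → 0.015 True
```

Note the alert threshold (1%) is looser than the 0.1% SLA target, so the alert fires well before the error budget is exhausted for the year, not at the SLA boundary itself.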
Testing HA Setup
Chaos Engineering
Simulate failures:
```bash
# Kill random pod
kubectl delete pod $(kubectl get pods -l app=litellm -o name | shuf -n1)

# Simulate network partition
kubectl exec -it litellm-xxx -- iptables -A OUTPUT -d postgres-primary -j DROP

# Overload with traffic
k6 run --vus 1000 --duration 5m load-test.js
```
Use Chaos Mesh (Kubernetes):
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: litellm-pod-failure
spec:
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - litellm
    labelSelectors:
      app: litellm
  duration: '30s'
```
Disaster Recovery Drills
Monthly DR drill checklist:

1. Simulate primary region failure

```bash
# Block traffic to primary region
kubectl scale deployment litellm --replicas=0 -n production
```

2. Verify automatic failover
- Monitor traffic shifting to standby
- Check error rates remain < 1%
- Verify latency increase < 100ms

3. Test database failover

```bash
# Promote replica
aws rds promote-read-replica --db-instance-identifier litellm-db-replica
```

4. Validate the application

```bash
# Run smoke tests
curl -X POST https://api.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "test"}]
  }'
```

5. Document results
- RTO achieved
- RPO achieved
- Issues encountered
- Action items

6. Restore primary

```bash
kubectl scale deployment litellm --replicas=5 -n production
```
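Smoke tests should assert on response shape, not just HTTP 200. A minimal validator for the completion body returned by the curl call above (field names follow the OpenAI-compatible schema):

```python
import json

def validate_chat_response(body: str) -> bool:
    """Check a /v1/chat/completions body has at least one message choice."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    choices = data.get("choices") or []
    return bool(choices) and "message" in choices[0]

ok = '{"choices": [{"message": {"role": "assistant", "content": "hi"}}]}'
print(validate_chat_response(ok))                         # → True
print(validate_chat_response('{"error": "overloaded"}'))  # → False
```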
Best Practices Summary
Design for Failure
Assume everything will fail
No single points of failure
Multiple availability zones/regions
Redundant components
Automate Recovery
Health checks at all levels
Automatic failover
Self-healing systems
Circuit breakers
Monitor Everything
Comprehensive metrics
Proactive alerting
Distributed tracing
Regular testing
Practice DR
Monthly DR drills
Documented runbooks
Automated testing
Post-mortem analysis
Next Steps
Monitoring: set up comprehensive monitoring for HA
Performance: optimize performance at scale
Security: secure your HA deployment
Troubleshooting: debug HA-specific issues