LiteLLM benchmarks (1000 RPS):

- P50 latency: 2ms (proxy overhead)
- P95 latency: 8ms (proxy overhead)
- P99 latency: 15ms (proxy overhead)
- Throughput: 10,000+ RPS per instance

Total latency = LiteLLM overhead + provider API latency. Provider latency dominates (typically 500ms-5s).
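The arithmetic can be sanity-checked directly. The proxy overhead comes from the benchmark figures above; the 1.5s provider latency is an illustrative mid-range value, not a measurement:

```python
# Illustrative: proxy overhead is a tiny fraction of total request latency.
def total_latency_ms(proxy_overhead_ms: float, provider_ms: float) -> float:
    """Total latency the client sees = proxy overhead + provider latency."""
    return proxy_overhead_ms + provider_ms

p95_proxy = 8      # ms, LiteLLM P95 overhead from the benchmarks above
provider = 1500    # ms, assumed mid-range provider response time

total = total_latency_ms(p95_proxy, provider)
print(f"total={total}ms, proxy share={p95_proxy / total:.2%}")  # proxy share is well under 1%
```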
Caching Strategy
Redis Caching
Cache identical requests to reduce provider API calls:
```yaml
general_settings:
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # Cache for 1 hour

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
    cache:
      ttl: 3600
      s_max_age: 3600  # Support s-maxage header
```
Benefits:

- Cost savings: eliminates redundant API calls
- Latency reduction: a Redis response takes < 5ms vs 1-5s from the provider
- Rate limit protection: reduces pressure on provider rate limits
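The mechanism behind "identical requests" is worth making concrete: the proxy derives a deterministic key from the request body, so two byte-identical requests hit the same Redis entry, while any wording change misses. The key format below is an illustrative sketch, not LiteLLM's exact scheme:

```python
# Sketch of exact-match cache keying: hash the canonicalized request body.
import hashlib
import json

def cache_key(model: str, messages: list) -> str:
    # sort_keys makes the serialization deterministic
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return "litellm:" + hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("gpt-4o", [{"role": "user", "content": "What is the capital of France?"}])
k2 = cache_key("gpt-4o", [{"role": "user", "content": "What is the capital of France?"}])
k3 = cache_key("gpt-4o", [{"role": "user", "content": "What's the capital city of France?"}])
print(k1 == k2, k1 == k3)  # True False - exact match only; rewording misses the cache
```

This is exactly the limitation semantic caching (next section) addresses.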
Semantic Caching
Cache similar (not just identical) requests:
```yaml
general_settings:
  cache: true
  cache_params:
    type: redis-semantic
    similarity_threshold: 0.9  # 90% similarity
    supported_call_types: ["completion", "embeddings"]

litellm_settings:
  cache_kwargs:
    semantic_similarity: true
    similarity_threshold: 0.9
```
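Under the hood, semantic caching compares embedding vectors and serves a cached response when similarity clears the configured threshold. A minimal sketch of that decision, using cosine similarity over toy stand-in vectors (real systems embed the prompts first):

```python
# Sketch: serve from cache when embedding similarity >= threshold.
import math

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

THRESHOLD = 0.9                 # matches similarity_threshold above
cached_vec = [0.9, 0.1, 0.4]    # toy embedding of the cached prompt
query_vec = [0.88, 0.12, 0.41]  # toy embedding of the incoming prompt

if cosine_similarity(cached_vec, query_vec) >= THRESHOLD:
    print("cache hit: serve stored response")
else:
    print("cache miss: call the provider")
```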
Example:
```python
# Request 1: "What is the capital of France?"
# Request 2: "What's the capital city of France?"
# → Semantic cache returns the same result (95% similar)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"cache": {"ttl": 3600}}
)

# Control caching per request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_headers={
        "Cache-Control": "max-age=3600"  # Cache for 1 hour
        # "Cache-Control": "no-cache"    # Skip cache
        # "Cache-Control": "no-store"    # Don't cache the response
    }
)
```
Redis Optimization
```yaml
redis:
  image: redis:7-alpine
  command: >
    redis-server
    --maxmemory 2gb
    --maxmemory-policy allkeys-lru
    --appendonly yes
    --tcp-backlog 511
    --timeout 0
    --tcp-keepalive 300
    --databases 16
    --save 900 1
    --save 300 10
    --save 60 10000
  volumes:
    - redis_data:/data
```
Connection Pooling
HTTP Connection Reuse
LiteLLM reuses HTTP connections to providers:
```yaml
general_settings:
  # Connection pooling (enabled by default)
  connection_pool_size: 100    # Max connections per provider
  connection_pool_timeout: 30  # Connection timeout (seconds)

router_settings:
  timeout: 60  # Request timeout (seconds)
```
Database Connection Pooling
Use PgBouncer to pool database connections:
```ini
[databases]
litellm = host=postgres port=5432 dbname=litellm

[pgbouncer]
pool_mode = transaction  ; most efficient for short-lived queries
default_pool_size = 25
reserve_pool_size = 5
max_client_conn = 1000
min_pool_size = 10

; Performance tuning
server_idle_timeout = 600
server_lifetime = 3600
query_timeout = 30
query_wait_timeout = 120
```
Deploy:
```yaml
services:
  pgbouncer:
    image: pgbouncer/pgbouncer:latest
    environment:
      DATABASES_HOST: postgres
      DATABASES_PORT: 5432
      DATABASES_DBNAME: litellm
      PGBOUNCER_POOL_MODE: transaction
      PGBOUNCER_DEFAULT_POOL_SIZE: 25
      PGBOUNCER_MAX_CLIENT_CONN: 1000
    ports:
      - "6432:6432"

  litellm:
    environment:
      DATABASE_URL: postgresql://user:pass@pgbouncer:6432/litellm
```
Async Request Processing
Worker Configuration
```bash
# Run with multiple workers (Gunicorn)
litellm --config config.yaml --port 4000 --num_workers 4
```
Or in Docker:
```dockerfile
CMD ["litellm", "--config", "/app/config.yaml", "--port", "4000", "--num_workers", "4"]
```
Worker sizing:
```bash
# Formula: (2 × CPU cores) + 1
# For a 4-core machine: (2 × 4) + 1 = 9 workers
litellm --num_workers 9
```
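The same formula can be computed from the machine it runs on, e.g. when templating a startup script:

```python
# Worker count via the (2 × cores) + 1 rule described above.
import os

def recommended_workers(cpu_cores=None) -> int:
    cores = cpu_cores or os.cpu_count() or 1
    return 2 * cores + 1

print(recommended_workers(4))  # 9 - matches the 4-core example
print(recommended_workers())   # based on this machine's core count
```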
Async Database Operations
LiteLLM uses async Prisma client for non-blocking DB operations:
```yaml
general_settings:
  database_url: postgresql://user:pass@host:5432/litellm
  database_connection_pool_limit: 100  # Async pool size
  database_connection_timeout: 30
```
Request Batching
Batch API Requests
For non-real-time workloads, use batch APIs:
```python
# Submit a batch job (client is the OpenAI SDK pointed at the proxy)
batch = client.batches.create(
    input_file_id="file-abc123",
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# Check status
status = client.batches.retrieve(batch.id)

# Benefits:
# - 50% cost reduction (OpenAI)
# - Higher throughput
# - Doesn't consume standard rate limits
```
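The `input_file_id` above refers to an uploaded JSONL file in which each line is one request with a unique `custom_id`. A small helper for building that file (the line format follows the OpenAI Batch API; prompts and IDs here are placeholders):

```python
# Build a JSONL input file for a batch job: one request object per line.
import json

def build_batch_line(custom_id: str, model: str, content: str) -> str:
    return json.dumps({
        "custom_id": custom_id,                 # unique per line; used to match results
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": content}],
        },
    })

prompts = ["Summarize document A", "Summarize document B"]
lines = [build_batch_line(f"req-{i}", "gpt-4o", p) for i, p in enumerate(prompts)]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines))
```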
Streaming Responses
Reduce time-to-first-token:
```python
# Stream tokens as they arrive
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Benefits:
# - Lower perceived latency
# - Better UX for long responses
# - Reduced memory usage
```
Load Balancing
Provider Load Balancing
Distribute load across multiple deployments:
```yaml
model_list:
  # OpenAI endpoint 1
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1

  # OpenAI endpoint 2
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY_2

  # Azure OpenAI (load distribution)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE

router_settings:
  routing_strategy: latency-based-routing
  retry_after: 5
  num_retries: 2
```
Routing strategies:

- `latency-based-routing` - routes to the fastest provider (`latency_threshold: 0.5` only switches on a >500ms difference)
- `least-busy` - routes to the deployment with the most spare capacity (`rpm_limit: 1000` per deployment)
- `simple-shuffle` - even random distribution across deployments
- `cost-based-routing` - picks the cheapest deployment
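Two of these strategies are simple enough to sketch; LiteLLM's router implements them internally, so the following only illustrates the selection logic, with made-up deployment names and counters:

```python
# Toy selection logic for simple-shuffle and least-busy routing.
import random

deployments = [
    {"name": "openai-1", "in_flight": 12},
    {"name": "openai-2", "in_flight": 3},
    {"name": "azure-1", "in_flight": 7},
]

def simple_shuffle(deps):
    """Even random distribution: pick any deployment."""
    return random.choice(deps)

def least_busy(deps):
    """Pick the deployment with the fewest requests in flight."""
    return min(deps, key=lambda d: d["in_flight"])

print(least_busy(deployments)["name"])  # openai-2
```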
Geographic Distribution
Deploy close to users:
Regions:

- US-East: api-us.example.com (~10ms for US users)
- EU-West: api-eu.example.com (~15ms for EU users)
- AP-Southeast: api-ap.example.com (~20ms for Asia users)

GeoDNS:

- Routes users to the nearest region
- Fails over to the next-closest region on failure
Resource Optimization
Container Resources
Kubernetes resource limits:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
spec:
  template:
    spec:
      containers:
        - name: litellm
          resources:
            requests:
              cpu: 500m      # 0.5 CPU
              memory: 512Mi  # 512MB
            limits:
              cpu: 2000m     # 2 CPU
              memory: 2Gi    # 2GB
```
Sizing guidelines:

| Traffic | CPU | Memory | Replicas |
|---|---|---|---|
| < 100 RPS | 500m | 512Mi | 2 |
| 100-500 RPS | 1000m | 1Gi | 3 |
| 500-1000 RPS | 2000m | 2Gi | 5 |
| 1000-5000 RPS | 2000m | 4Gi | 10-20 |
| > 5000 RPS | 4000m | 8Gi | 20+ |
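For capacity-planning scripts, the table translates directly into a lookup (thresholds copied from the table; the replica numbers for the 1000-5000 RPS band are starting points, not hard rules):

```python
# The sizing table above as a function: expected RPS -> starting resources.
def sizing(rps: int) -> dict:
    if rps < 100:
        return {"cpu": "500m", "memory": "512Mi", "replicas": 2}
    if rps < 500:
        return {"cpu": "1000m", "memory": "1Gi", "replicas": 3}
    if rps < 1000:
        return {"cpu": "2000m", "memory": "2Gi", "replicas": 5}
    if rps < 5000:
        return {"cpu": "2000m", "memory": "4Gi", "replicas": 10}  # scale up to 20
    return {"cpu": "4000m", "memory": "8Gi", "replicas": 20}      # 20+

print(sizing(2000))
```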
Memory Optimization
```yaml
general_settings:
  # Reduce memory usage
  drop_params: true  # Don't store the full request in memory

litellm_settings:
  max_tokens: 4096  # Limit response size

router_settings:
  max_parallel_requests: 100  # Limit concurrent requests
```
Monitor memory:
```bash
# Kubernetes
kubectl top pods -l app=litellm

# Docker
docker stats litellm
```

```promql
# Prometheus query
container_memory_usage_bytes{pod=~"litellm.*"}
```
Database Optimization
Query Optimization
LiteLLM maintains aggregated tables for fast queries:
```sql
-- Use daily aggregates (faster than querying raw logs)
SELECT
  date,
  SUM(spend) AS daily_spend,
  SUM(api_requests) AS requests
FROM "LiteLLM_DailyUserSpend"
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
  AND user_id = 'user-123'
GROUP BY date;

-- Instead of (slower):
SELECT
  DATE("startTime") AS date,
  SUM(spend) AS daily_spend,
  COUNT(*) AS requests
FROM "LiteLLM_SpendLogs"
WHERE "startTime" >= CURRENT_DATE - INTERVAL '30 days'
  AND "user" = 'user-123'
GROUP BY date;
```
Index Optimization
Prisma creates indexes automatically, but add custom indexes for hot queries:
```sql
-- Index for API key lookups
CREATE INDEX CONCURRENTLY idx_spend_logs_api_key_time
  ON "LiteLLM_SpendLogs" (api_key, "startTime" DESC);

-- Index for team queries
CREATE INDEX CONCURRENTLY idx_spend_logs_team_time
  ON "LiteLLM_SpendLogs" (team_id, "startTime" DESC);

-- Partial index for recent logs. Note: the predicate must be immutable,
-- so NOW() is not allowed; use a fixed cutoff and recreate it periodically.
CREATE INDEX CONCURRENTLY idx_spend_logs_recent
  ON "LiteLLM_SpendLogs" ("startTime")
  WHERE "startTime" >= '2024-01-01';
```
Partitioning
For high-volume deployments, partition large tables:
```sql
-- Partition spend logs by month
CREATE TABLE "LiteLLM_SpendLogs_2024_01"
  PARTITION OF "LiteLLM_SpendLogs"
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE "LiteLLM_SpendLogs_2024_02"
  PARTITION OF "LiteLLM_SpendLogs"
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Create partitions automatically
CREATE EXTENSION IF NOT EXISTS pg_partman;

SELECT create_parent(
  'public.LiteLLM_SpendLogs',
  'startTime',
  'native',
  'monthly'
);
```
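As an alternative to pg_partman, monthly partition DDL like the statements above can be generated from a cron job. A sketch (table naming follows the examples above):

```python
# Generate monthly partition DDL for "LiteLLM_SpendLogs".
from datetime import date

def partition_ddl(year: int, month: int) -> str:
    start = date(year, month, 1)
    # First day of the following month (handles the December -> January rollover)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"LiteLLM_SpendLogs_{start:%Y_%m}"
    return (
        f'CREATE TABLE "{name}"\n'
        f'  PARTITION OF "LiteLLM_SpendLogs"\n'
        f"  FOR VALUES FROM ('{start}') TO ('{end}');"
    )

print(partition_ddl(2024, 1))
```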
Archive Old Data
```bash
#!/bin/bash
set -euo pipefail
# Archive logs older than 90 days

# Export to CSV via \copy (pg_dump cannot filter rows, so use a query export)
psql -h postgres -U litellm -d litellm -c "\copy (SELECT * FROM \"LiteLLM_SpendLogs\" WHERE \"startTime\" < NOW() - INTERVAL '90 days') TO '/tmp/archived_logs.csv' WITH CSV HEADER"

# Upload the archive to S3
aws s3 cp /tmp/archived_logs.csv \
  s3://litellm-archives/logs/archive-$(date +%Y%m%d).csv

# Delete archived rows, then reclaim space
# (VACUUM must run in its own statement, outside a transaction block)
psql -h postgres -U litellm -d litellm -c "DELETE FROM \"LiteLLM_SpendLogs\" WHERE \"startTime\" < NOW() - INTERVAL '90 days';"
psql -h postgres -U litellm -d litellm -c "VACUUM ANALYZE \"LiteLLM_SpendLogs\";"
```
Key Metrics
```promql
# Request rate
rate(litellm_requests_total[5m])

# Latency percentiles
histogram_quantile(0.50, rate(litellm_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(litellm_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(litellm_request_duration_seconds_bucket[5m]))

# Error rate
sum(rate(litellm_requests_total{status!="success"}[5m]))
  / sum(rate(litellm_requests_total[5m]))

# Cache hit rate
sum(rate(litellm_cache_hits_total[5m]))
  / (sum(rate(litellm_cache_hits_total[5m])) + sum(rate(litellm_cache_misses_total[5m])))

# Provider latency breakdown (average per provider)
sum(rate(litellm_provider_latency_seconds_sum[5m])) by (provider)
  / sum(rate(litellm_provider_latency_seconds_count[5m])) by (provider)
```
Grafana Dashboard
Performance overview panels (dashboard JSON fragment):

```json
[
  {
    "title": "Request Rate",
    "targets": [
      { "expr": "sum(rate(litellm_requests_total[5m]))", "legendFormat": "Total RPS" }
    ]
  },
  {
    "title": "Latency (P50/P95/P99)",
    "targets": [
      { "expr": "histogram_quantile(0.50, rate(litellm_request_duration_seconds_bucket[5m]))", "legendFormat": "P50" },
      { "expr": "histogram_quantile(0.95, rate(litellm_request_duration_seconds_bucket[5m]))", "legendFormat": "P95" },
      { "expr": "histogram_quantile(0.99, rate(litellm_request_duration_seconds_bucket[5m]))", "legendFormat": "P99" }
    ]
  }
]
```
Load Testing
K6 Load Test
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },  // Ramp up to 100 VUs
    { duration: '5m', target: 100 },  // Hold at 100 VUs
    { duration: '2m', target: 500 },  // Ramp up to 500 VUs
    { duration: '5m', target: 500 },  // Hold at 500 VUs
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<5000'],  // 95% of requests < 5s
    http_req_failed: ['rate<0.01'],     // Error rate < 1%
  },
};

const API_KEY = 'sk-...';
const BASE_URL = 'https://api.example.com';

export default function () {
  const payload = JSON.stringify({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: 'Hello, how are you?',
    }],
    max_tokens: 50,
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${API_KEY}`,
    },
  };

  const res = http.post(`${BASE_URL}/v1/chat/completions`, payload, params);

  check(res, {
    'status is 200': (r) => r.status === 200,
    'has choices': (r) => JSON.parse(r.body).choices?.length > 0,
  });

  sleep(1);
}
```
Run:

```bash
k6 run loadtest.js  # assuming the script above is saved as loadtest.js
```
Locust Load Test
```python
from locust import HttpUser, task, between
import json

class LiteLLMUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        self.headers = {
            "Authorization": "Bearer sk-...",
            "Content-Type": "application/json"
        }

    @task(10)
    def chat_completion(self):
        payload = {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": "Hello"}],
            "max_tokens": 50
        }
        self.client.post(
            "/v1/chat/completions",
            data=json.dumps(payload),
            headers=self.headers
        )

    @task(1)
    def embeddings(self):
        payload = {
            "model": "text-embedding-3-small",
            "input": "Hello world"
        }
        self.client.post(
            "/v1/embeddings",
            data=json.dumps(payload),
            headers=self.headers
        )
```
Run:
```bash
locust -f locustfile.py --host https://api.example.com
```
Best Practices
- Cache aggressively - a 30-50% cache hit rate saves significant cost
- Use streaming - reduces perceived latency for long responses
- Deploy globally - route users to the nearest region
- Monitor everything - track latency, errors, cache hits, and resource usage
- Load test regularly - find bottlenecks before users do
- Right-size resources - start small, scale based on metrics
- Archive old data - keep the database lean and fast
- Use batching - for non-real-time workloads
Next Steps
- Monitoring - track performance metrics
- High Availability - scale for production traffic
- Troubleshooting - debug performance issues
- Security - optimize without compromising security