
Performance Targets

LiteLLM benchmarks (1000 RPS):
  • P50 latency: 2ms (proxy overhead)
  • P95 latency: 8ms (proxy overhead)
  • P99 latency: 15ms (proxy overhead)
  • Throughput: 10,000+ RPS per instance
Total latency = LiteLLM overhead + Provider API latency. Provider latency dominates (500ms-5s).
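To verify these targets against your own deployment, percentiles can be computed from sampled request timings with nothing but the standard library. A minimal sketch (the sample values are made up):

```python
import statistics

# Hypothetical proxy-overhead samples in milliseconds
samples_ms = [1.8, 2.1, 2.0, 2.3, 7.5, 1.9, 2.2, 14.8, 2.4, 2.0]

# quantiles(n=100) yields the 1st..99th percentile cut points
cuts = statistics.quantiles(samples_ms, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"P50={p50:.2f}ms P95={p95:.2f}ms P99={p99:.2f}ms")
```

In practice you would collect the samples from your metrics backend rather than hard-coding them.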

Caching Strategy

Redis Caching

Cache identical requests to reduce provider API calls:
config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    password: os.environ/REDIS_PASSWORD
    ttl: 3600  # Cache for 1 hour
    
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      cache:
        ttl: 3600
        s_max_age: 3600  # Support s-maxage header
Benefits:
  • Cost savings: Eliminate redundant API calls
  • Latency reduction: Redis response < 5ms vs provider 1-5s
  • Rate limit protection: Reduce pressure on provider limits
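Under the hood, exact-match caching amounts to hashing the canonical request and using the digest as the cache key. A hypothetical sketch of the idea (not LiteLLM's actual key format):

```python
import hashlib
import json

def cache_key(model: str, messages: list, **params) -> str:
    """Derive a deterministic cache key from the request payload.
    (Illustrative only; LiteLLM's real key derivation may differ.)"""
    canonical = json.dumps(
        {"model": model, "messages": messages, **params},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

k1 = cache_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
k2 = cache_key("gpt-4o", [{"role": "user", "content": "Hi"}], temperature=0)
k3 = cache_key("gpt-4o", [{"role": "user", "content": "Hi!"}], temperature=0)
print(k1 == k2, k1 == k3)  # identical requests share a key; different ones don't
```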

Semantic Caching

Cache similar (not just identical) requests:
config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis-semantic
    similarity_threshold: 0.9  # cache hit if ≥90% similar
    redis_semantic_cache_embedding_model: text-embedding-3-small  # model used to embed prompts
    supported_call_types: ["completion", "embedding"]
Example:
# Request 1: "What is the capital of France?"
# Request 2: "What's the capital city of France?"
# → Semantic cache returns same result (95% similar)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    extra_body={"cache": {"ttl": 3600}}
)
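The similarity check behind a semantic cache is typically cosine similarity between prompt embeddings, compared against the configured threshold. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.9  # matches similarity_threshold above

# Toy embeddings standing in for "What is the capital of France?" and a paraphrase
query_a = [0.82, 0.31, 0.47]
query_b = [0.80, 0.35, 0.45]

sim = cosine_similarity(query_a, query_b)
cache_hit = sim >= THRESHOLD
print(f"similarity={sim:.3f} cache_hit={cache_hit}")
```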

Cache Control Headers

# Control caching per request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_headers={
        "Cache-Control": "max-age=3600"  # Cache for 1 hour
        # "Cache-Control": "no-cache"  # Skip cache
        # "Cache-Control": "no-store"  # Don't cache response
    }
)

Redis Optimization

docker-compose.yml
redis:
  image: redis:7-alpine
  command: >
    redis-server
    --maxmemory 2gb
    --maxmemory-policy allkeys-lru
    --appendonly yes
    --tcp-backlog 511
    --timeout 0
    --tcp-keepalive 300
    --databases 16
    --save 900 1
    --save 300 10
    --save 60 10000
  volumes:
    - redis_data:/data

Connection Pooling

HTTP Connection Reuse

LiteLLM reuses HTTP connections to providers:
config.yaml
general_settings:
  # Connection pooling (enabled by default)
  connection_pool_size: 100  # Max connections per provider
  connection_pool_timeout: 30  # Connection timeout (seconds)

router_settings:
  timeout: 60  # Request timeout (seconds)

Database Connection Pooling

Use PgBouncer to pool database connections:
pgbouncer.ini
[databases]
litellm = host=postgres port=5432 dbname=litellm

[pgbouncer]
pool_mode = transaction  # Most efficient
default_pool_size = 25
reserve_pool_size = 5
max_client_conn = 1000
min_pool_size = 10

# Performance tuning
server_idle_timeout = 600
server_lifetime = 3600
query_timeout = 30
query_wait_timeout = 120
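It's worth sanity-checking that these pool settings cannot exhaust Postgres's `max_connections`. A back-of-the-envelope sketch (the `max_connections` value is an assumption, and note that PgBouncer pools are per database/user pair):

```python
# Sketch: verify PgBouncer settings against Postgres capacity.
# Numbers mirror the pgbouncer.ini above; postgres_max_connections is an assumption.
default_pool_size = 25
reserve_pool_size = 5
max_client_conn = 1000
pools = 1                        # one [databases] entry (litellm), one user
postgres_max_connections = 100   # hypothetical server setting

# Worst-case server-side connections PgBouncer may open
max_server_conns = pools * (default_pool_size + reserve_pool_size)
fan_in = max_client_conn / max_server_conns  # client multiplexing ratio

print(f"server connections: {max_server_conns}, fan-in: {fan_in:.0f}:1")
assert max_server_conns <= postgres_max_connections  # must leave headroom
```

Transaction pooling is what makes the 33:1 fan-in safe: a server connection is only held for the duration of each transaction.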
Deploy:
docker-compose.yml
services:
  pgbouncer:
    image: pgbouncer/pgbouncer:latest
    environment:
      DATABASES_HOST: postgres
      DATABASES_PORT: 5432
      DATABASES_DBNAME: litellm
      PGBOUNCER_POOL_MODE: transaction
      PGBOUNCER_DEFAULT_POOL_SIZE: 25
      PGBOUNCER_MAX_CLIENT_CONN: 1000
    ports:
      - "6432:6432"
  
  litellm:
    environment:
      DATABASE_URL: postgresql://user:pass@pgbouncer:6432/litellm

Async Request Processing

Worker Configuration

# Run with multiple workers (Gunicorn)
litellm --config config.yaml --port 4000 --num_workers 4
Or in Docker:
CMD ["litellm", "--config", "/app/config.yaml", "--port", "4000", "--num_workers", "4"]
Worker sizing:
# Formula: (2 × CPU cores) + 1
# For 4-core machine: (2 × 4) + 1 = 9 workers

litellm --num_workers 9
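The same formula, computed from the machine's reported core count:

```python
import os

def recommended_workers(cores=None):
    """Gunicorn rule of thumb: (2 x CPU cores) + 1."""
    cores = cores or os.cpu_count() or 1
    return 2 * cores + 1

print(recommended_workers(4))   # 4-core machine → 9
print(recommended_workers())    # current machine
```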

Async Database Operations

LiteLLM uses async Prisma client for non-blocking DB operations:
config.yaml
general_settings:
  database_url: postgresql://user:pass@host:5432/litellm
  database_connection_pool_limit: 100  # Async pool size
  database_connection_timeout: 30

Request Batching

Batch API Requests

For non-real-time workloads, use batch APIs:
# Submit a batch job via the OpenAI-compatible Batches API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-...")

batch = client.batches.create(
    input_file_id="file-abc123",
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# Check status
status = client.batches.retrieve(batch.id)

# Benefits:
# - 50% cost reduction (OpenAI batch pricing)
# - Higher throughput for bulk workloads
# - Separate, more generous rate limits

Streaming Responses

Reduce time-to-first-token:
# Stream tokens as they arrive
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Benefits:
# - Lower perceived latency
# - Better UX for long responses
# - Reduced memory usage

Load Balancing

Provider Load Balancing

Distribute load across multiple deployments:
config.yaml
model_list:
  # OpenAI endpoint 1
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY_1
  
  # OpenAI endpoint 2
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ/OPENAI_API_KEY_2
  
  # Azure OpenAI (load distribution)
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_key: os.environ/AZURE_API_KEY
      api_base: os.environ/AZURE_API_BASE

router_settings:
  routing_strategy: latency-based-routing
  retry_after: 5
  num_retries: 2
Routing strategies:
  • simple-shuffle: random distribution across deployments (default)
  • least-busy: route to the deployment with the fewest in-flight requests
  • usage-based-routing: route based on TPM/RPM consumption
  • latency-based-routing: route to the lowest-latency deployment

Geographic Distribution

Deploy close to users:
Regions:
  US-East:    api-us.example.com    # Latency: 10ms (US users)
  EU-West:    api-eu.example.com    # Latency: 15ms (EU users)
  AP-Southeast: api-ap.example.com  # Latency: 20ms (Asia users)

GeoDNS:
  Route users to nearest region
  Failover to next closest on failure
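Client-side, the failover logic reduces to "pick the lowest-latency healthy region". A sketch with hypothetical measured latencies (the hostnames match the table above):

```python
# Hypothetical measured round-trip latencies per region (ms)
region_latency_ms = {
    "api-us.example.com": 10,
    "api-eu.example.com": 85,
    "api-ap.example.com": 140,
}
# Hypothetical health-check results: US region is down
healthy = {
    "api-us.example.com": False,
    "api-eu.example.com": True,
    "api-ap.example.com": True,
}

def pick_region(latencies, healthy):
    """Choose the lowest-latency healthy region; fall back past unhealthy ones."""
    candidates = {r: ms for r, ms in latencies.items() if healthy.get(r)}
    if not candidates:
        raise RuntimeError("no healthy regions")
    return min(candidates, key=candidates.get)

print(pick_region(region_latency_ms, healthy))  # US is down → fails over to EU
```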

Resource Optimization

Container Resources

Kubernetes resource limits:
deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
spec:
  template:
    spec:
      containers:
        - name: litellm
          resources:
            requests:
              cpu: 500m      # 0.5 CPU
              memory: 512Mi  # 512MB
            limits:
              cpu: 2000m     # 2 CPU
              memory: 2Gi    # 2GB
Sizing guidelines:
| Traffic | CPU | Memory | Replicas |
|---|---|---|---|
| < 100 RPS | 500m | 512Mi | 2 |
| 100-500 RPS | 1000m | 1Gi | 3 |
| 500-1000 RPS | 2000m | 2Gi | 5 |
| 1000-5000 RPS | 2000m | 4Gi | 10-20 |
| > 5000 RPS | 4000m | 8Gi | 20+ |
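The table can be approximated with a simple estimator. The per-replica capacity and headroom figures below are illustrative assumptions, not benchmarks; calibrate them with your own load tests:

```python
import math

def replicas_needed(target_rps, per_replica_rps=250, headroom=0.3, min_replicas=2):
    """Estimate replica count with headroom for traffic spikes.
    per_replica_rps and headroom are illustrative assumptions."""
    needed = math.ceil(target_rps * (1 + headroom) / per_replica_rps)
    return max(needed, min_replicas)  # keep >= 2 for availability

for rps in (100, 500, 1000, 5000):
    print(rps, "RPS ->", replicas_needed(rps), "replicas")
```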

Memory Optimization

config.yaml
general_settings:
  # Reduce memory usage
  drop_params: true  # Don't store full request in memory
  
litellm_settings:
  max_tokens: 4096  # Limit response size
  
router_settings:
  max_parallel_requests: 100  # Limit concurrent requests
Monitor memory:
# Kubernetes
kubectl top pods -l app=litellm

# Docker
docker stats litellm

# Prometheus query
container_memory_usage_bytes{pod=~"litellm.*"}

Database Optimization

Query Optimization

LiteLLM maintains aggregated tables for fast queries:
-- Use daily aggregates (faster than querying raw logs)
SELECT 
  date,
  SUM(spend) as daily_spend,
  SUM(api_requests) as requests
FROM "LiteLLM_DailyUserSpend"
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
  AND user_id = 'user-123'
GROUP BY date;

-- Instead of (slower):
SELECT 
  DATE("startTime") as date,
  SUM(spend) as daily_spend,
  COUNT(*) as requests
FROM "LiteLLM_SpendLogs"
WHERE "startTime" >= CURRENT_DATE - INTERVAL '30 days'
  AND "user" = 'user-123'
GROUP BY date;

Index Optimization

Prisma creates indexes automatically, but add custom indexes for hot queries:
-- Index for API key lookups
CREATE INDEX CONCURRENTLY idx_spend_logs_api_key_time 
ON "LiteLLM_SpendLogs" (api_key, "startTime" DESC);

-- Index for team queries
CREATE INDEX CONCURRENTLY idx_spend_logs_team_time 
ON "LiteLLM_SpendLogs" (team_id, "startTime" DESC);

-- Partial index for recent logs
CREATE INDEX CONCURRENTLY idx_spend_logs_recent 
ON "LiteLLM_SpendLogs" ("startTime") 
WHERE "startTime" >= NOW() - INTERVAL '7 days';

Partitioning

For high-volume deployments, partition large tables:
-- Partition spend logs by month
-- (the parent table must have been created with PARTITION BY RANGE ("startTime"))
CREATE TABLE "LiteLLM_SpendLogs_2024_01" 
  PARTITION OF "LiteLLM_SpendLogs"
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE "LiteLLM_SpendLogs_2024_02" 
  PARTITION OF "LiteLLM_SpendLogs"
  FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Automatically create partitions
CREATE EXTENSION IF NOT EXISTS pg_partman;

SELECT create_parent(
  'public.LiteLLM_SpendLogs',
  'startTime',
  'native',
  'monthly'
);
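If you manage partitions by hand instead of using pg_partman, the monthly DDL can be generated rather than hand-written. A sketch:

```python
from datetime import date

def month_partitions(table, start, months):
    """Generate monthly PARTITION OF DDL statements for `table`,
    starting at `start` (a date on the first month to create)."""
    ddl = []
    y, m = start.year, start.month
    for _ in range(months):
        # Next month's boundary, handling the December -> January rollover
        ny, nm = (y + 1, 1) if m == 12 else (y, m + 1)
        ddl.append(
            f'CREATE TABLE "{table}_{y}_{m:02d}" '
            f'PARTITION OF "{table}" '
            f"FOR VALUES FROM ('{y}-{m:02d}-01') TO ('{ny}-{nm:02d}-01');"
        )
        y, m = ny, nm
    return ddl

for stmt in month_partitions("LiteLLM_SpendLogs", date(2024, 1, 1), 3):
    print(stmt)
```

A cron job running this ahead of time keeps future partitions available before data arrives.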

Archive Old Data

#!/bin/bash
# Archive logs older than 90 days

# Export to S3
pg_dump -h postgres -U litellm -t '"LiteLLM_SpendLogs"' \
  --where "\"startTime\" < NOW() - INTERVAL '90 days'" \
  -Fc > /tmp/archived_logs.dump

aws s3 cp /tmp/archived_logs.dump \
  s3://litellm-archives/logs/archive-$(date +%Y%m%d).dump

# Delete from database
psql -h postgres -U litellm -d litellm -c "
  DELETE FROM \"LiteLLM_SpendLogs\"
  WHERE \"startTime\" < NOW() - INTERVAL '90 days';
  VACUUM ANALYZE \"LiteLLM_SpendLogs\";
"

Monitoring Performance

Key Metrics

# Request rate
rate(litellm_requests_total[5m])

# Latency percentiles
histogram_quantile(0.50, rate(litellm_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(litellm_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(litellm_request_duration_seconds_bucket[5m]))

# Error rate
sum(rate(litellm_requests_total{status!="success"}[5m])) /
sum(rate(litellm_requests_total[5m]))

# Cache hit rate
sum(rate(litellm_cache_hits_total[5m])) /
(sum(rate(litellm_cache_hits_total[5m])) + sum(rate(litellm_cache_misses_total[5m])))

# Provider latency breakdown
sum(rate(litellm_provider_latency_seconds_sum[5m])) by (provider) /
sum(rate(litellm_provider_latency_seconds_count[5m])) by (provider)
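The ratio queries above reduce to simple counter arithmetic. A sketch with hypothetical counter deltas over one scrape window:

```python
# Hypothetical counter deltas over a 5m window
requests_total = {"success": 9_850, "failure": 150}
cache_hits, cache_misses = 3_200, 6_800

# Same ratios as the PromQL above, computed directly
error_rate = requests_total["failure"] / sum(requests_total.values())
cache_hit_rate = cache_hits / (cache_hits + cache_misses)

print(f"error rate: {error_rate:.1%}, cache hit rate: {cache_hit_rate:.1%}")
```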

Grafana Dashboard

Performance overview panel:
{
  "title": "Request Rate",
  "targets": [
    {
      "expr": "sum(rate(litellm_requests_total[5m]))",
      "legendFormat": "Total RPS"
    }
  ]
},
{
  "title": "Latency (P50/P95/P99)",
  "targets": [
    {
      "expr": "histogram_quantile(0.50, rate(litellm_request_duration_seconds_bucket[5m]))",
      "legendFormat": "P50"
    },
    {
      "expr": "histogram_quantile(0.95, rate(litellm_request_duration_seconds_bucket[5m]))",
      "legendFormat": "P95"
    },
    {
      "expr": "histogram_quantile(0.99, rate(litellm_request_duration_seconds_bucket[5m]))",
      "legendFormat": "P99"
    }
  ]
}

Load Testing

K6 Load Test

load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up to 100 VUs
    { duration: '5m', target: 100 },   // Hold 100 VUs
    { duration: '2m', target: 500 },   // Ramp up to 500 VUs
    { duration: '5m', target: 500 },   // Hold 500 VUs
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<5000'],  // 95% < 5s
    http_req_failed: ['rate<0.01'],      // Error rate < 1%
  },
};

const API_KEY = 'sk-...';
const BASE_URL = 'https://api.example.com';

export default function () {
  const payload = JSON.stringify({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: 'Hello, how are you?'
    }],
    max_tokens: 50,
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${API_KEY}`,
    },
  };

  const res = http.post(
    `${BASE_URL}/v1/chat/completions`,
    payload,
    params
  );

  check(res, {
    'status is 200': (r) => r.status === 200,
    'has choices': (r) => JSON.parse(r.body).choices?.length > 0,
  });

  sleep(1);
}
Run:
k6 run load-test.js

Locust Load Test

locustfile.py
from locust import HttpUser, task, between
import json

class LiteLLMUser(HttpUser):
    wait_time = between(1, 3)
    
    def on_start(self):
        self.headers = {
            "Authorization": "Bearer sk-...",
            "Content-Type": "application/json"
        }
    
    @task(10)
    def chat_completion(self):
        payload = {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": "Hello"}],
            "max_tokens": 50
        }
        self.client.post(
            "/v1/chat/completions",
            data=json.dumps(payload),
            headers=self.headers
        )
    
    @task(1)
    def embeddings(self):
        payload = {
            "model": "text-embedding-3-small",
            "input": "Hello world"
        }
        self.client.post(
            "/v1/embeddings",
            data=json.dumps(payload),
            headers=self.headers
        )
Run:
locust -f locustfile.py --host https://api.example.com

Performance Tuning Checklist

1. Enable Caching
  • Redis caching enabled
  • Semantic caching for similar requests
  • Cache TTL optimized
  • Cache hit rate > 30%

2. Connection Pooling
  • PgBouncer for database
  • HTTP connection reuse enabled
  • Pool sizes optimized

3. Load Balancing
  • Multiple provider deployments
  • Latency-based routing
  • Geographic distribution
  • Automatic failover

4. Resource Optimization
  • Right-sized container resources
  • Autoscaling configured
  • Memory limits set
  • Worker count optimized

5. Database Performance
  • Indexes on hot queries
  • Query optimization
  • Old data archived
  • Connection pooling

6. Monitoring
  • Latency tracking
  • Error rate monitoring
  • Resource usage dashboards
  • Load testing regularly

Best Practices

  1. Cache aggressively - 30-50% cache hit rate saves significant costs
  2. Use streaming - Reduces perceived latency for long responses
  3. Deploy globally - Route users to nearest region
  4. Monitor everything - Track latency, errors, cache hits, resource usage
  5. Load test regularly - Find bottlenecks before users do
  6. Right-size resources - Start small, scale based on metrics
  7. Archive old data - Keep database lean and fast
  8. Use batching - For non-real-time workloads

Next Steps

Monitoring

Track performance metrics

High Availability

Scale for production traffic

Troubleshooting

Debug performance issues

Security

Optimize without compromising security
