Overview
LiteLLM provides comprehensive caching to reduce API costs and improve response times by storing and reusing LLM responses. It can cache responses from completion, embedding, transcription, and other API calls, backed by any of several storage systems.
Supported Cache Types
LiteLLM supports multiple caching backends:
- Local (In-Memory) - Default, fastest for single-instance deployments
- Redis - Distributed caching with Redis or Redis Cluster
- Redis Semantic Cache - Similarity-based caching using embeddings
- Qdrant Semantic Cache - Vector-based semantic caching
- S3 - Object storage caching
- GCS - Google Cloud Storage caching
- Azure Blob - Azure Blob Storage caching
- Disk - File-system based caching
Quick Start
Basic In-Memory Caching
import litellm
from litellm import completion
from litellm.caching import Cache

# Enable local in-memory cache
litellm.cache = Cache()

# First call hits the API
response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Second call returns the cached response
response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
Redis Cache
import litellm
from litellm import completion
from litellm.caching import Cache

# Enable Redis caching
litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    password="your-password",
    ttl=3600  # Cache for 1 hour
)

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
Configuration Options
Cache Initialization
from litellm.caching import Cache

cache = Cache(
    type="redis",           # Cache type
    host="localhost",       # Redis host
    port=6379,              # Redis port
    password="password",    # Redis password
    ttl=3600,               # Time to live in seconds
    namespace="litellm",    # Cache key namespace
    mode="default_on",      # Cache mode: "default_on" or "default_off"
    supported_call_types=[  # API calls to cache
        "completion",
        "acompletion",
        "embedding",
        "aembedding",
        "transcription",
        "atranscription"
    ]
)
Redis Cluster Support
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    password="password",
    redis_startup_nodes=[
        {"host": "redis-node-1", "port": 6379},
        {"host": "redis-node-2", "port": 6379},
        {"host": "redis-node-3", "port": 6379}
    ]
)
Semantic Caching
Semantic caching uses embeddings to match similar queries, not just exact matches.
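The matching step can be pictured as a cosine-similarity comparison against the embeddings of previously cached prompts, gated by a threshold. The sketch below is illustrative only, not LiteLLM's internal implementation, and uses toy three-dimensional vectors in place of real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_embedding, cache, threshold=0.8):
    """Return the cached response whose prompt embedding is most
    similar to the query, if the similarity clears the threshold."""
    best_score, best_response = 0.0, None
    for cached_embedding, cached_response in cache:
        score = cosine_similarity(query_embedding, cached_embedding)
        if score > best_score:
            best_score, best_response = score, cached_response
    return best_response if best_score >= threshold else None

# Toy vectors standing in for real embeddings
cache = [([1.0, 0.0, 0.2], "cached answer about ML")]
print(semantic_lookup([0.9, 0.1, 0.25], cache))  # similar enough -> cache hit
print(semantic_lookup([0.0, 1.0, 0.0], cache))   # dissimilar -> None (miss)
```

Raising `similarity_threshold` toward 1.0 makes matching stricter (fewer but safer hits); lowering it increases hit rate at the risk of serving a cached answer to a genuinely different question.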
Redis Semantic Cache
import litellm
from litellm import completion
from litellm.caching import Cache

litellm.cache = Cache(
    type="redis-semantic",
    host="localhost",
    port=6379,
    password="password",
    similarity_threshold=0.8,  # 0-1 similarity score
    redis_semantic_cache_embedding_model="text-embedding-ada-002",
    redis_semantic_cache_index_name="litellm-semantic-cache"
)

# These similar queries will match
response1 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is machine learning?"}]
)

# Returns the cached response if similarity > 0.8
response2 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain machine learning"}]
)
Qdrant Semantic Cache
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="qdrant-semantic",
    qdrant_api_base="http://localhost:6333",
    qdrant_api_key="your-api-key",
    qdrant_collection_name="litellm_cache",
    similarity_threshold=0.8,
    qdrant_semantic_cache_embedding_model="text-embedding-ada-002"
)
Cloud Storage Caching
S3 Cache
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="s3",
    s3_bucket_name="litellm-cache",
    s3_region_name="us-west-2",
    s3_aws_access_key_id="your-access-key",
    s3_aws_secret_access_key="your-secret-key",
    s3_path="cache/"
)
GCS Cache
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="gcs",
    gcs_bucket_name="litellm-cache",
    gcs_path_service_account="/path/to/service-account.json",
    gcs_path="cache/"
)
Azure Blob Cache
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="azure_blob",
    azure_account_url="https://myaccount.blob.core.windows.net",
    azure_blob_container="litellm-cache"
)
Advanced Features
Cache Control
Control caching behavior per request:
# Set a custom TTL for a specific request
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={
        "ttl": 7200  # Cache for 2 hours
    }
)

# Set a maximum age for cache retrieval
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={
        "s-maxage": 600  # Only use cached entries younger than 10 minutes
    }
)

# Use a custom namespace
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={
        "namespace": "production"
    }
)
Opt-in Caching Mode
import litellm
from litellm import completion
from litellm.caching import Cache, CacheMode

# Cache is opt-in only
litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    mode=CacheMode.default_off
)

# This request is NOT cached
response1 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# This request IS cached (opt-in)
response2 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={"use-cache": True}
)
Caching Across Model Groups
Cache responses across different models in the same group:
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5",
            "litellm_params": {"model": "gpt-3.5-turbo"}
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4"}
        }
    ],
    caching_groups=[("gpt-3.5", "gpt-4")]  # Share cache across these model groups
)

# These will share cached responses
response1 = router.completion(
    model="gpt-3.5",
    messages=[{"role": "user", "content": "Hello"}]
)

response2 = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
Cache Key Generation
Cache keys are generated from:
- Model name
- Messages/input
- All API parameters (temperature, max_tokens, etc.)
- Namespace (if configured)
The cache key is hashed using SHA-256 for consistency.
Changing any parameter (even optional ones) will create a different cache key.
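The scheme above can be sketched as follows. This is an illustrative model of deterministic key generation, not LiteLLM's actual serialization logic (the real implementation differs in how parameters are normalized):

```python
import hashlib
import json

def make_cache_key(model, messages, namespace=None, **params):
    """Build a deterministic SHA-256 cache key from the model,
    messages, optional namespace, and all other API parameters."""
    payload = {
        "model": model,
        "messages": messages,
        "namespace": namespace,
        # Sort keys so identical parameters always serialize identically
        "params": dict(sorted(params.items())),
    }
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

msgs = [{"role": "user", "content": "Hello"}]
key_a = make_cache_key("gpt-4", msgs, temperature=0.0)
key_b = make_cache_key("gpt-4", msgs, temperature=0.7)

# Changing any parameter produces a different key
print(key_a != key_b)  # True
```

This is why two requests that differ only in `temperature` or `max_tokens` never share a cache entry: the serialized payloads differ, so the SHA-256 digests differ.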
Cache Type Performance
- Local: Fastest, but not shared across instances
- Redis: Good balance of speed and distribution
- Semantic: Slower due to embedding computation, but matches similar queries
- Cloud Storage (S3/GCS/Azure): Higher latency, use for long-term storage
Monitoring Cache Usage
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Check cache hit/miss in the response metadata
# (cache information is also available in logging callbacks)
print(response._hidden_params.get("cache_hit"))
Best Practices
- Use Redis for production - Enables distributed caching across multiple instances
- Set appropriate TTLs - Balance between cost savings and data freshness
- Use semantic caching for Q&A - Great for customer support and documentation queries
- Monitor cache hit rates - Track effectiveness of your caching strategy
- Use namespaces - Separate cache spaces for different environments or use cases
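As an example of the last tip, a namespace can be derived from the deployment environment so staging experiments never serve cached responses to production traffic. This is a configuration sketch; the environment-variable name and namespace values are illustrative, not part of LiteLLM's API:

```python
import os

import litellm
from litellm.caching import Cache

# Illustrative environment variable; use whatever your deployment sets
env = os.environ.get("APP_ENV", "development")

# One cache namespace per environment, e.g. "litellm-production"
litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    namespace=f"litellm-{env}",
    ttl=3600,
)
```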
Disabling Cache
import litellm

# Disable caching globally
litellm.disable_cache()

# Or drop the cache object entirely
litellm.cache = None