Overview
LiteLLM provides comprehensive caching to reduce API costs and improve response times by storing and reusing LLM responses. It can cache responses from completion, embedding, transcription, and other API calls, backed by any of several storage systems.
Supported Cache Types
LiteLLM supports multiple caching backends:
- Local (In-Memory) - Default, fastest for single-instance deployments
- Redis - Distributed caching with Redis or Redis Cluster
- Redis Semantic Cache - Similarity-based caching using embeddings
- Qdrant Semantic Cache - Vector-based semantic caching
- S3 - Object storage caching
- GCS - Google Cloud Storage caching
- Azure Blob - Azure Blob Storage caching
- Disk - File-system based caching
Quick Start
Basic In-Memory Caching
import litellm
from litellm import completion
from litellm.caching import Cache

# Enable local in-memory cache
litellm.cache = Cache()

# First call hits the API
response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Second call returns the cached response
response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
Redis Cache
import litellm
from litellm import completion
from litellm.caching import Cache

# Enable Redis caching
litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    password="your-password",
    ttl=3600  # Cache for 1 hour
)

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
Configuration Options
Cache Initialization
from litellm.caching import Cache

cache = Cache(
    type="redis",           # Cache type
    host="localhost",       # Redis host
    port=6379,              # Redis port
    password="password",    # Redis password
    ttl=3600,               # Time to live in seconds
    namespace="litellm",    # Cache key namespace
    mode="default_on",      # Cache mode: "default_on" or "default_off"
    supported_call_types=[  # API calls to cache
        "completion",
        "acompletion",
        "embedding",
        "aembedding",
        "transcription",
        "atranscription"
    ]
)
Redis Cluster Support
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    password="password",
    redis_startup_nodes=[
        {"host": "redis-node-1", "port": 6379},
        {"host": "redis-node-2", "port": 6379},
        {"host": "redis-node-3", "port": 6379}
    ]
)
Semantic Caching
Semantic caching uses embeddings to match similar queries, not just exact matches.
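The matching step can be pictured as a cosine-similarity comparison against the embeddings of previously cached prompts, gated by a threshold. The sketch below is illustrative only, not LiteLLM's internal implementation, and uses toy three-dimensional vectors in place of real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_embedding, cache, threshold=0.8):
    """Return the cached response whose prompt embedding is most
    similar to the query, if the similarity clears the threshold."""
    best_score, best_response = 0.0, None
    for cached_embedding, cached_response in cache:
        score = cosine_similarity(query_embedding, cached_embedding)
        if score > best_score:
            best_score, best_response = score, cached_response
    return best_response if best_score >= threshold else None

# Toy vectors standing in for real embeddings
cache = [([1.0, 0.0, 0.2], "cached answer about ML")]
print(semantic_lookup([0.9, 0.1, 0.25], cache))  # similar enough -> cache hit
print(semantic_lookup([0.0, 1.0, 0.0], cache))   # dissimilar -> None (miss)
```

Raising `similarity_threshold` toward 1.0 makes matching stricter (fewer but safer hits); lowering it increases hit rate at the risk of serving a cached answer to a genuinely different question.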
Redis Semantic Cache
import litellm
from litellm import completion
from litellm.caching import Cache

litellm.cache = Cache(
    type="redis-semantic",
    host="localhost",
    port=6379,
    password="password",
    similarity_threshold=0.8,  # 0-1 similarity score
    redis_semantic_cache_embedding_model="text-embedding-ada-002",
    redis_semantic_cache_index_name="litellm-semantic-cache"
)

# These similar queries will match
response1 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is machine learning?"}]
)

# Returns the cached response if similarity > 0.8
response2 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain machine learning"}]
)
Qdrant Semantic Cache
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="qdrant-semantic",
    qdrant_api_base="http://localhost:6333",
    qdrant_api_key="your-api-key",
    qdrant_collection_name="litellm_cache",
    similarity_threshold=0.8,
    qdrant_semantic_cache_embedding_model="text-embedding-ada-002"
)
Cloud Storage Caching
S3 Cache
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="s3",
    s3_bucket_name="litellm-cache",
    s3_region_name="us-west-2",
    s3_aws_access_key_id="your-access-key",
    s3_aws_secret_access_key="your-secret-key",
    s3_path="cache/"
)
GCS Cache
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="gcs",
    gcs_bucket_name="litellm-cache",
    gcs_path_service_account="/path/to/service-account.json",
    gcs_path="cache/"
)
Azure Blob Cache
import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="azure_blob",
    azure_account_url="https://myaccount.blob.core.windows.net",
    azure_blob_container="litellm-cache"
)
Advanced Features
Cache Control
Control caching behavior per request:
# Set a custom TTL for a specific request
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={
        "ttl": 7200  # Cache for 2 hours
    }
)

# Set a maximum age for cache retrieval
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={
        "s-maxage": 600  # Only use cached entries younger than 10 minutes
    }
)

# Use a custom namespace
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={
        "namespace": "production"
    }
)
Opt-in Caching Mode
import litellm
from litellm import completion
from litellm.caching import Cache, CacheMode

# Cache is opt-in only
litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    mode=CacheMode.default_off
)

# This request is NOT cached
response1 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# This request IS cached (opt-in)
response2 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={"use-cache": True}
)
Caching Across Model Groups
Cache responses across different models in the same group:
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5",
            "litellm_params": {"model": "gpt-3.5-turbo"}
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4"}
        }
    ],
    caching_groups=[("gpt-3.5", "gpt-4")]  # Share cache across these model groups
)

# These will share cached responses
response1 = router.completion(
    model="gpt-3.5",
    messages=[{"role": "user", "content": "Hello"}]
)

response2 = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)
Cache Key Generation
Cache keys are generated from:
- Model name
- Messages/input
- All API parameters (temperature, max_tokens, etc.)
- Namespace (if configured)
The cache key is hashed using SHA-256 for consistency.
Changing any parameter (even optional ones) will create a different cache key.
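The scheme above can be sketched as follows. This is an illustrative model of deterministic key generation, not LiteLLM's actual serialization logic (the real implementation differs in how parameters are normalized):

```python
import hashlib
import json

def make_cache_key(model, messages, namespace=None, **params):
    """Build a deterministic SHA-256 cache key from the model,
    messages, optional namespace, and all other API parameters."""
    payload = {
        "model": model,
        "messages": messages,
        "namespace": namespace,
        # Sort keys so identical parameters always serialize identically
        "params": dict(sorted(params.items())),
    }
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

msgs = [{"role": "user", "content": "Hello"}]
key_a = make_cache_key("gpt-4", msgs, temperature=0.0)
key_b = make_cache_key("gpt-4", msgs, temperature=0.7)

# Changing any parameter produces a different key
print(key_a != key_b)  # True
```

This is why two requests that differ only in `temperature` or `max_tokens` never share a cache entry: the serialized payloads differ, so the SHA-256 digests differ.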
Cache Type Performance
- Local: Fastest, but not shared across instances
- Redis: Good balance of speed and distribution
- Semantic: Slower due to embedding computation, but matches similar queries
- Cloud Storage (S3/GCS/Azure): Higher latency, use for long-term storage
Monitoring Cache Usage
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Check cache hit/miss in the response metadata
# (cache information is also available in logging callbacks)
print(response._hidden_params.get("cache_hit"))
Best Practices
- Use Redis for production - Enables distributed caching across multiple instances
- Set appropriate TTLs - Balance between cost savings and data freshness
- Use semantic caching for Q&A - Great for customer support and documentation queries
- Monitor cache hit rates - Track effectiveness of your caching strategy
- Use namespaces - Separate cache spaces for different environments or use cases
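As an example of the last tip, a namespace can be derived from the deployment environment so staging experiments never serve cached responses to production traffic. This is a configuration sketch; the environment-variable name and namespace values are illustrative, not part of LiteLLM's API:

```python
import os

import litellm
from litellm.caching import Cache

# Illustrative environment variable; use whatever your deployment sets
env = os.environ.get("APP_ENV", "development")

# One cache namespace per environment, e.g. "litellm-production"
litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    namespace=f"litellm-{env}",
    ttl=3600,
)
```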
Disabling Cache
import litellm

# Disable caching globally
litellm.disable_cache()

# Or drop the cache object entirely
litellm.cache = None