
Overview

LiteLLM provides comprehensive caching to reduce API costs and improve response times by storing and reusing LLM responses. Responses from completion, embedding, transcription, and other API calls can be cached across multiple backend systems.

Supported Cache Types

LiteLLM supports multiple caching backends:
  • Local (In-Memory) - Default, fastest for single-instance deployments
  • Redis - Distributed caching with Redis or Redis Cluster
  • Redis Semantic Cache - Similarity-based caching using embeddings
  • Qdrant Semantic Cache - Vector-based semantic caching
  • S3 - Object storage caching
  • GCS - Google Cloud Storage caching
  • Azure Blob - Azure Blob Storage caching
  • Disk - File-system based caching

Quick Start

Basic In-Memory Caching

import litellm
from litellm import completion
from litellm.caching import Cache

# Enable local in-memory cache
litellm.cache = Cache()

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
# First call hits the API

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
# Second call returns cached response

Redis Cache

import litellm
from litellm import completion
from litellm.caching import Cache

# Enable Redis caching
litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    password="your-password",
    ttl=3600  # Cache for 1 hour
)

response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

Configuration Options

Cache Initialization

from litellm.caching import Cache

cache = Cache(
    type="redis",  # Cache type
    host="localhost",  # Redis host
    port=6379,  # Redis port
    password="password",  # Redis password
    ttl=3600,  # Time to live in seconds
    namespace="litellm",  # Cache key namespace
    mode="default_on",  # Cache mode: "default_on" or "default_off"
    supported_call_types=[  # API calls to cache
        "completion",
        "acompletion",
        "embedding",
        "aembedding",
        "transcription",
        "atranscription"
    ]
)

Redis Cluster Support

import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    password="password",
    redis_startup_nodes=[
        {"host": "redis-node-1", "port": 6379},
        {"host": "redis-node-2", "port": 6379},
        {"host": "redis-node-3", "port": 6379}
    ]
)

Semantic Caching

Semantic caching uses embeddings to match similar queries, not just exact matches.

Redis Semantic Cache

import litellm
from litellm import completion
from litellm.caching import Cache

litellm.cache = Cache(
    type="redis-semantic",
    host="localhost",
    port=6379,
    password="password",
    similarity_threshold=0.8,  # 0-1 similarity score
    redis_semantic_cache_embedding_model="text-embedding-ada-002",
    redis_semantic_cache_index_name="litellm-semantic-cache"
)

# These similar queries will match
response1 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is machine learning?"}]
)

response2 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain machine learning"}]
)
# Returns cached response if similarity > 0.8

Qdrant Semantic Cache

import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="qdrant-semantic",
    qdrant_api_base="http://localhost:6333",
    qdrant_api_key="your-api-key",
    qdrant_collection_name="litellm_cache",
    similarity_threshold=0.8,
    qdrant_semantic_cache_embedding_model="text-embedding-ada-002"
)

Cloud Storage Caching

S3 Cache

import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="s3",
    s3_bucket_name="litellm-cache",
    s3_region_name="us-west-2",
    s3_aws_access_key_id="your-access-key",
    s3_aws_secret_access_key="your-secret-key",
    s3_path="cache/"
)

GCS Cache

import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="gcs",
    gcs_bucket_name="litellm-cache",
    gcs_path_service_account="/path/to/service-account.json",
    gcs_path="cache/"
)

Azure Blob Cache

import litellm
from litellm.caching import Cache

litellm.cache = Cache(
    type="azure_blob",
    azure_account_url="https://myaccount.blob.core.windows.net",
    azure_blob_container="litellm-cache"
)

Advanced Features

Cache Control

Control caching behavior per request:
from litellm import completion

# Set custom TTL for a specific request
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={
        "ttl": 7200  # Cache for 2 hours
    }
)

# Set max age for cache retrieval
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={
        "s-maxage": 600  # Only use cache if younger than 10 minutes
    }
)

# Custom namespace
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={
        "namespace": "production"
    }
)

Opt-in Caching Mode

import litellm
from litellm import completion
from litellm.caching import Cache, CacheMode

# Cache is opt-in only
litellm.cache = Cache(
    type="redis",
    host="localhost",
    port=6379,
    mode=CacheMode.default_off
)

# This request is NOT cached
response1 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# This request IS cached (opt-in)
response2 = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    cache={"use-cache": True}
)

Caching Across Model Groups

Cache responses across different models in the same group:
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5",
            "litellm_params": {"model": "gpt-3.5-turbo"}
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4"}
        }
    ],
    caching_groups=[("gpt-3.5", "gpt-4")]  # Share cache
)

# These will share cached responses
response1 = router.completion(
    model="gpt-3.5",
    messages=[{"role": "user", "content": "Hello"}]
)

response2 = router.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

Cache Key Generation

Cache keys are generated from:
  • Model name
  • Messages/input
  • All API parameters (temperature, max_tokens, etc.)
  • Namespace (if configured)
The cache key is hashed using SHA-256 for consistency.
Changing any parameter (even optional ones) will create a different cache key.
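The scheme above can be sketched as follows (a conceptual illustration only; `make_cache_key` is a hypothetical helper, not LiteLLM's actual implementation):

```python
import hashlib
import json

def make_cache_key(model, messages, namespace=None, **params):
    """Conceptual sketch: serialize every request parameter
    deterministically, then hash the result with SHA-256."""
    payload = {"model": model, "messages": messages, **params}
    serialized = json.dumps(payload, sort_keys=True)
    key = hashlib.sha256(serialized.encode()).hexdigest()
    return f"{namespace}:{key}" if namespace else key

# Identical requests produce identical keys...
k1 = make_cache_key("gpt-4", [{"role": "user", "content": "Hi"}])
k2 = make_cache_key("gpt-4", [{"role": "user", "content": "Hi"}])

# ...while changing any parameter (even an optional one) changes the key
k3 = make_cache_key("gpt-4", [{"role": "user", "content": "Hi"}],
                    temperature=0.5)
```

Sorting the keys before hashing makes the key independent of parameter order, which is why two otherwise-identical requests always collide on the same cache entry.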

Performance Considerations

Cache Type Performance

  • Local: Fastest, but not shared across instances
  • Redis: Good balance of speed and distribution
  • Semantic: Slower due to embedding computation, but matches similar queries
  • Cloud Storage (S3/GCS/Azure): Higher latency, use for long-term storage
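The tradeoff can be felt with a toy measurement (purely illustrative; real numbers depend on your network, backend, and embedding model):

```python
import time

store = {"key": "cached response"}

def timed(fn):
    """Return how long fn() takes, in seconds."""
    start = time.monotonic()
    fn()
    return time.monotonic() - start

# Local lookup: an in-process dict access
local_t = timed(lambda: store.get("key"))

# Distributed lookup: simulate a ~2 ms network round-trip
remote_t = timed(lambda: time.sleep(0.002))

# A local hit is typically orders of magnitude faster than a network
# hop; semantic caches add an embedding call on top of the lookup.
```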

Monitoring Cache Usage

from litellm import completion

# Check cache hit/miss in response metadata
response = completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Cache information is available in logging callbacks
print(response._hidden_params.get("cache_hit"))
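For example, a success callback can tally hits and misses over time (a sketch; it assumes LiteLLM sets `cache_hit` in the callback's `kwargs`):

```python
cache_stats = {"hits": 0, "misses": 0}

def track_cache(kwargs, completion_response, start_time, end_time):
    """Count cache hits vs. misses from the callback payload."""
    if kwargs.get("cache_hit"):
        cache_stats["hits"] += 1
    else:
        cache_stats["misses"] += 1

# Register with LiteLLM:
#   import litellm
#   litellm.success_callback = [track_cache]
```

Once registered, `cache_stats` gives a running hit rate you can export to your monitoring system.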

Best Practices

  1. Use Redis for production - Enables distributed caching across multiple instances
  2. Set appropriate TTLs - Balance between cost savings and data freshness
  3. Use semantic caching for Q&A - Great for customer support and documentation queries
  4. Monitor cache hit rates - Track effectiveness of your caching strategy
  5. Use namespaces - Separate cache spaces for different environments or use cases
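Putting several of these practices together (host, password, and namespace values are placeholders):

```python
import litellm
from litellm.caching import Cache

# Distributed Redis cache with a 1-hour TTL and an
# environment-specific namespace
litellm.cache = Cache(
    type="redis",
    host="redis.internal",     # placeholder hostname
    port=6379,
    password="your-password",  # placeholder credential
    ttl=3600,
    namespace="production",
)
```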

Disabling Cache

import litellm
from litellm.caching import disable_cache

# Disable caching globally
disable_cache()

# Or set to None
litellm.cache = None
