
Overview

LLM Gateway Core uses Redis-backed caching to avoid redundant calls to LLM providers. When a request is made, the system first checks if an identical request has been cached, returning the cached response immediately if available.

Benefits of Caching

Reduced Latency

Cache hits return in milliseconds instead of seconds

Cost Savings

Avoid paying for duplicate API calls to cloud providers

Rate Limit Protection

Reduce load on provider APIs and avoid hitting their limits

Improved Reliability

Serve cached responses even when providers are slow or down

Cache Implementation

The RedisCache class in app/core/cache.py handles all caching operations:
from typing import Any, Optional
import redis
from app.core.config import settings
from app.core.metrics import CACHE_HITS, CACHE_MISSES
from app.api.v1.schemas import ChatResponse

class RedisCache:
    def __init__(self, ttl_seconds: int = settings.CACHE_TTL_SECONDS):
        self.ttl = ttl_seconds
        self.client = redis.from_url(settings.REDIS_URL, decode_responses=True)
    
    def get(self, key: str) -> Optional[ChatResponse]:
        try:
            data = self.client.get(key)
            if not data:
                CACHE_MISSES.inc()
                return None
            
            CACHE_HITS.inc()
            # Parse back into ChatResponse
            return ChatResponse.model_validate_json(data)
        except Exception as e:
            print(f"[CACHE ERROR] get key={key}: {e}")
            CACHE_MISSES.inc()
            return None
    
    def set(self, key: str, value: ChatResponse):
        try:
            # Serialize ChatResponse to JSON
            serialized_value = value.model_dump_json()
            self.client.set(key, serialized_value, ex=self.ttl)
        except Exception as e:
            print(f"[CACHE ERROR] set key={key}: {e}")
Source: app/core/cache.py:1-33

Cache Key Generation

The cache key is generated from the request parameters using a deterministic hash function in app/core/cache_key.py:
import hashlib
import json

from app.api.v1.schemas import ChatRequest


def build_cache_key(request: ChatRequest) -> str:
    # Normalize the request into a plain dict so serialization is stable.
    normalized = {
        "messages": [m.model_dump() for m in request.messages],
        "model_hint": request.model_hint,
        "max_tokens": request.max_tokens,
    }

    raw = json.dumps(normalized, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
Source: app/core/cache_key.py:1-13

Key Components

The cache key includes:
  1. messages - The full conversation history
  2. model_hint - The routing hint (affects which provider is used)
  3. max_tokens - Token limit for the response
The key is deterministic - identical requests always generate the same cache key, ensuring cache hits for duplicate requests.

SHA-256 Hashing

The system uses SHA-256 to create a fixed-length key from the request:
  • Deterministic - Same input always produces same hash
  • Compact - 64-character hex string regardless of request size
  • Collision-resistant - Extremely unlikely for different requests to produce the same hash
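The determinism claim is easy to verify: because json.dumps(..., sort_keys=True) canonicalizes key order before hashing, two requests with the same fields produce the same key regardless of how the fields are ordered. A minimal standalone sketch (the field names mirror the normalized dict above):

```python
import hashlib
import json


def build_cache_key(normalized: dict) -> str:
    # sort_keys=True makes semantically identical dicts serialize identically.
    raw = json.dumps(normalized, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


a = {"messages": [{"role": "user", "content": "Hello"}],
     "model_hint": "fast", "max_tokens": 256}
b = {"max_tokens": 256, "model_hint": "fast",
     "messages": [{"role": "user", "content": "Hello"}]}

key_a = build_cache_key(a)
key_b = build_cache_key(b)
print(key_a == key_b)  # True: field order does not affect the key
print(len(key_a))      # 64: fixed-length hex digest regardless of request size
```

Changing any field (for example max_tokens) yields a completely different key, so requests that differ in any cached dimension never collide on the same entry.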

Cache Flow

Integration with ChatService

The cache is checked at the beginning of each request in ChatService.chat():
async def chat(self, request: ChatRequest) -> ChatResponse:
    """
    Execute a chat completion request.
    Check cache first, then route to providers with retries.
    """
    REQUEST_TOTAL.inc()
    start = time.time()
    try: 
        ACTIVE_REQUESTS.inc()
        cache_key = build_cache_key(request)
        cached_response = self.cache.get(cache_key)
        if cached_response:
            print(f"[CACHE HIT] key={cache_key}")
            return cached_response.model_copy(update={"cached": True})
        print(f"[CACHE MISS] key={cache_key}")

        providers = self.router.route(request)
        # ... provider call logic ...
        response = await self._call_provider(provider, request)
        self.cache.set(cache_key, response)
        return response
Source: app/core/service.py:39-63
Cached responses are marked with cached: true in the response payload, allowing clients to distinguish between fresh and cached responses.
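The model_copy(update={"cached": True}) pattern returns a flagged copy rather than mutating the stored object, so the cache entry itself is never changed. A stdlib sketch of the same idea using dataclasses (the project itself uses Pydantic's model_copy):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ChatResponse:
    content: str
    cached: bool = False


# Simulate a response as it sits in the cache.
stored = ChatResponse(content="Hello! How can I help you?")

# On a cache hit, return a copy flagged as cached instead of mutating
# the stored object, so the cache entry stays pristine.
hit = replace(stored, cached=True)

print(hit.cached)     # True
print(stored.cached)  # False: the stored copy was not mutated
```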

Time-to-Live (TTL)

Cached responses expire after a configurable TTL:
self.client.set(key, serialized_value, ex=self.ttl)
The TTL is set via the CACHE_TTL_SECONDS environment variable. Choose your TTL based on:
  • Short TTL (60-300s) - For rapidly changing data or when freshness is critical
  • Medium TTL (600-3600s) - Balanced approach for most use cases
  • Long TTL (3600s+) - For static content or when cost savings are paramount
Very long TTLs may serve stale responses. Consider your use case when configuring TTL.
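The expiry semantics can be illustrated without a Redis server. This in-memory sketch mimics Redis's SET key value EX ttl behavior with an injectable clock (the class and names are illustrative, not the project's API):

```python
import time
from typing import Any, Optional


class TTLCache:
    """In-memory stand-in for Redis SET key value EX ttl."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict = {}

    def set(self, key: str, value: Any) -> None:
        # Record the absolute expiry time alongside the value.
        self._store[key] = (self.clock() + self.ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazy eviction, like Redis's expired-key check
            return None
        return value


# Inject a fake clock so the demo is deterministic.
now = [0.0]
cache = TTLCache(ttl_seconds=600, clock=lambda: now[0])
cache.set("k", "v")
print(cache.get("k"))  # "v": still fresh
now[0] = 601.0
print(cache.get("k"))  # None: expired after the 600s TTL
```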

Serialization

The cache serializes ChatResponse objects to JSON:

Writing to Cache

serialized_value = value.model_dump_json()
self.client.set(key, serialized_value, ex=self.ttl)

Reading from Cache

data = self.client.get(key)
return ChatResponse.model_validate_json(data)
This approach:
  • Uses Pydantic’s built-in JSON serialization
  • Ensures type safety on deserialization
  • Handles nested objects and validation automatically
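The round trip can be sketched with the standard library alone. The project uses Pydantic's model_dump_json / model_validate_json, which additionally validates field types on the way back in; this dataclass version shows only the serialize-then-rebuild shape:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ChatResponse:
    content: str
    cached: bool = False


original = ChatResponse(content="Hello! How can I help you?")

# Writing to cache: serialize to a JSON string.
serialized = json.dumps(asdict(original))

# Reading from cache: parse the string and rebuild the typed object.
restored = ChatResponse(**json.loads(serialized))

print(restored == original)  # True: lossless round trip
```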

Error Handling

The cache implements graceful degradation:
try:
    data = self.client.get(key)
    if not data:
        CACHE_MISSES.inc()
        return None
    
    CACHE_HITS.inc()
    return ChatResponse.model_validate_json(data)
except Exception as e:
    print(f"[CACHE ERROR] get key={key}: {e}")
    CACHE_MISSES.inc()
    return None
Failure behavior:
  • Cache errors are logged but not raised
  • Cache misses are recorded in metrics
  • Request continues to provider on cache failure
  • System remains operational even if Redis is down
The cache “fails open” - if Redis is unavailable, requests still work but go directly to providers.
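The fail-open pattern boils down to catching any backend exception and treating it as a miss, so the caller always falls through to the provider. A minimal sketch of that behavior (the backend functions are stand-ins for Redis calls):

```python
from typing import Callable, Optional


def fail_open(lookup: Callable[[], Optional[str]]) -> Optional[str]:
    """Run a cache lookup; treat any backend error as a miss."""
    try:
        return lookup()
    except Exception as e:
        print(f"[CACHE ERROR] {e}")
        return None  # caller falls through to the provider


def healthy_backend() -> Optional[str]:
    return "cached-value"


def broken_backend() -> Optional[str]:
    raise ConnectionError("Redis is down")


print(fail_open(healthy_backend))  # cached-value
print(fail_open(broken_backend))   # None: the request proceeds to the provider
```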

Metrics and Observability

The cache records Prometheus metrics for monitoring:
CACHE_HITS.inc()   # Successful cache retrieval
CACHE_MISSES.inc() # Cache miss or error

Key Metrics to Monitor

  • Cache Hit Rate - CACHE_HITS / (CACHE_HITS + CACHE_MISSES)
  • Cache Errors - Logged to console for debugging
  • Provider Savings - Requests avoided by cache hits
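The hit-rate formula above is straightforward, but the zero-traffic edge case is worth handling explicitly when computing it outside Prometheus:

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there is no traffic yet."""
    total = hits + misses
    return hits / total if total else 0.0


print(cache_hit_rate(80, 20))  # 0.8
print(cache_hit_rate(0, 0))    # 0.0 (avoids division by zero)
```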

Configuration

Cache behavior is controlled by environment variables:
# Redis connection
REDIS_URL=redis://localhost:6379/0

# Cache TTL in seconds
CACHE_TTL_SECONDS=600
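The settings module presumably resolves these at startup; in plain Python the equivalent lookup with the defaults shown above would be:

```python
import os

# Defaults mirror the example configuration above; the real project
# loads these through its settings module (app/core/config.py).
REDIS_URL = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
CACHE_TTL_SECONDS = int(os.environ.get("CACHE_TTL_SECONDS", "600"))

print(REDIS_URL)
print(CACHE_TTL_SECONDS)
```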

Cache Invalidation

Currently, the cache uses TTL-based expiration only. To invalidate cache entries:

Manual Invalidation

# Flush all cache entries
redis-cli FLUSHDB

# Delete specific key
redis-cli DEL <cache_key_hash>

Selective Invalidation

To implement selective invalidation, you could extend the RedisCache class:
def delete(self, key: str):
    """Delete a specific cache entry."""
    try:
        self.client.delete(key)
    except Exception as e:
        print(f"[CACHE ERROR] delete key={key}: {e}")

def flush_all(self):
    """Clear all cache entries."""
    try:
        self.client.flushdb()
    except Exception as e:
        print(f"[CACHE ERROR] flush_all: {e}")

Best Practices

  • TTL tuning - Set TTL based on how quickly your data becomes stale. For LLM responses, 5-10 minutes is often a good balance.
  • Hit-rate monitoring - Track CACHE_HITS / (CACHE_HITS + CACHE_MISSES). A low hit rate may indicate the TTL is too short or requests are too diverse.
  • Redis persistence - Configure Redis with AOF or RDB persistence to preserve the cache across restarts and maximize hit rate.
  • Cache warming - For predictable queries, pre-populate the cache during deployment to improve initial response times.
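Cache warming can be as simple as replaying a list of known-common requests at deploy time so their responses are cached before real traffic arrives. A hypothetical sketch with an in-memory dict standing in for Redis (warm_cache and call_provider are illustrative names, not the project's API):

```python
import hashlib
import json

# In-memory stand-in for the Redis cache.
cache = {}


def build_cache_key(request: dict) -> str:
    raw = json.dumps(request, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def call_provider(request: dict) -> str:
    # Placeholder for the real provider call.
    return f"response to {request['messages'][0]['content']!r}"


def warm_cache(common_requests: list) -> int:
    """Fetch and cache responses for predictable queries; return count warmed."""
    warmed = 0
    for request in common_requests:
        key = build_cache_key(request)
        if key not in cache:
            cache[key] = call_provider(request)
            warmed += 1
    return warmed


requests = [
    {"messages": [{"role": "user", "content": "Hello"}],
     "model_hint": "fast", "max_tokens": 256},
    {"messages": [{"role": "user", "content": "Help"}],
     "model_hint": "fast", "max_tokens": 256},
]
print(warm_cache(requests))  # 2 on the first run
print(warm_cache(requests))  # 0: everything is already cached
```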

Example: Cache Hit vs Miss

First Request (Cache Miss)

curl -X POST http://localhost:8000/chat \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "model_hint": "fast"
  }'

# Response (after 2s provider call)
{
  "content": "Hello! How can I help you?",
  "cached": false
}

Second Identical Request (Cache Hit)

curl -X POST http://localhost:8000/chat \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "model_hint": "fast"
  }'

# Response (instant, <10ms)
{
  "content": "Hello! How can I help you?",
  "cached": true
}

Next Steps

Architecture

Understand how caching fits into the overall architecture

Rate Limiting

Learn about Redis-based rate limiting

Routing

See how model_hint affects cache keys

Monitoring

Set up metrics and observability
