Overview
LLM Gateway Core uses Redis-backed caching to avoid redundant calls to LLM providers. When a request is made, the system first checks whether an identical request has been cached and, if so, returns the cached response immediately.
Benefits of Caching
Reduced Latency
Cache hits return in milliseconds instead of seconds
Cost Savings
Avoid paying for duplicate API calls to cloud providers
Rate Limit Protection
Reduce load on provider APIs and avoid hitting their limits
Improved Reliability
Serve cached responses even when providers are slow or down
Cache Implementation
The RedisCache class in app/core/cache.py handles all caching operations:
app/core/cache.py:1-33
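The file reference above points at the real implementation. As a rough sketch of what such a class can look like (the method names and the injected client are assumptions, not taken from the source):

```python
import json
from typing import Any, Optional


class RedisCache:
    """Illustrative sketch of a Redis-backed response cache.

    `client` is any object exposing Redis-style get/setex
    (redis.Redis in a real deployment).
    """

    def __init__(self, client: Any, ttl_seconds: int = 600):
        self.client = client
        self.ttl_seconds = ttl_seconds

    def get(self, key: str) -> Optional[dict]:
        # Return the cached response, or None on a miss.
        raw = self.client.get(key)
        return json.loads(raw) if raw is not None else None

    def set(self, key: str, response: dict) -> None:
        # Store the response with a TTL so stale entries expire on their own.
        self.client.setex(key, self.ttl_seconds, json.dumps(response))
```

Injecting the client keeps the class testable with an in-memory stub while using a real Redis connection in production.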
Cache Key Generation
The cache key is generated from the request parameters by a deterministic hash function in app/core/cache_key.py:
app/core/cache_key.py:1-13
Key Components
The cache key includes:
- messages - The full conversation history
- model_hint - The routing hint (affects which provider is used)
- max_tokens - Token limit for the response
The key is deterministic - identical requests always generate the same cache key, ensuring cache hits for duplicate requests.
SHA-256 Hashing
The system uses SHA-256 to create a fixed-length key from the request:
- Deterministic - Same input always produces the same hash
- Compact - 64-character hex string regardless of request size
- Collision-resistant - Extremely unlikely for different requests to produce the same hash
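The key derivation described above can be sketched in a few lines; the function name is hypothetical, but the three components (messages, model_hint, max_tokens) and the SHA-256 hex digest match the description:

```python
import hashlib
import json


def make_cache_key(messages: list, model_hint: str, max_tokens: int) -> str:
    # Serialize with sorted keys so identical requests always produce
    # byte-identical JSON, and therefore the same hash.
    payload = json.dumps(
        {"messages": messages, "model_hint": model_hint, "max_tokens": max_tokens},
        sort_keys=True,
    )
    # hexdigest() yields a 64-character hex string regardless of request size.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```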
Cache Flow
Integration with ChatService
The cache is checked at the beginning of each request in ChatService.chat():
app/core/service.py:39-63
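The cited lines in app/core/service.py hold the real logic; the cache-first flow can be sketched roughly like this (the constructor signature and provider callable are assumptions):

```python
import hashlib
import json


class ChatService:
    """Illustrative sketch: check the cache, fall back to the provider."""

    def __init__(self, cache, provider):
        self.cache = cache        # object exposing get(key) / set(key, value)
        self.provider = provider  # callable performing the real LLM call

    def chat(self, messages, model_hint, max_tokens):
        # Deterministic key from the request parameters.
        key = hashlib.sha256(
            json.dumps(
                {"messages": messages, "model_hint": model_hint, "max_tokens": max_tokens},
                sort_keys=True,
            ).encode()
        ).hexdigest()

        cached = self.cache.get(key)
        if cached is not None:
            return cached  # cache hit: the provider is never called

        response = self.provider(messages, model_hint, max_tokens)
        self.cache.set(key, response)  # populate for future identical requests
        return response
```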
Time-to-Live (TTL)
Cached responses expire after a configurable TTL, set via the CACHE_TTL_SECONDS environment variable. Choose your TTL based on:
- Short TTL (60-300s) - For rapidly changing data or when freshness is critical
- Medium TTL (600-3600s) - Balanced approach for most use cases
- Long TTL (3600s+) - For static content or when cost savings are paramount
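Reading the variable might look like this (the default value here is an assumption, not taken from the source):

```python
import os

# TTL from the CACHE_TTL_SECONDS environment variable, falling back to a
# medium "balanced" default when unset (600s is an assumed default).
CACHE_TTL_SECONDS = int(os.environ.get("CACHE_TTL_SECONDS", "600"))
```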
Serialization
The cache serializes ChatResponse objects to JSON:
Writing to Cache
Reading from Cache
- Uses Pydantic’s built-in JSON serialization
- Ensures type safety on deserialization
- Handles nested objects and validation automatically
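Using Pydantic (v2) as the docs describe, the write and read paths reduce to one method call each. The ChatResponse fields below are hypothetical; the real model lives in the application:

```python
from pydantic import BaseModel


class ChatResponse(BaseModel):
    # Hypothetical fields for illustration only.
    content: str
    model: str
    cached: bool = False


def to_cache(resp: ChatResponse) -> str:
    # Writing: Pydantic renders the model (nested objects included) to JSON.
    return resp.model_dump_json()


def from_cache(raw: str) -> ChatResponse:
    # Reading: validation runs on deserialization, so a malformed cache
    # entry fails loudly instead of producing a bad object.
    return ChatResponse.model_validate_json(raw)
```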
Error Handling
The cache implements graceful degradation:
- Cache errors are logged but not raised
- Cache misses are recorded in metrics
- Request continues to provider on cache failure
- System remains operational even if Redis is down
The cache “fails open” - if Redis is unavailable, requests still work but go directly to providers.
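A minimal sketch of the fail-open pattern (the helper name is an assumption):

```python
import logging

logger = logging.getLogger("cache")


def get_cached(cache_client, key):
    """Fail-open read: any Redis error is treated as a cache miss."""
    try:
        return cache_client.get(key)
    except Exception:
        # Log and swallow: the request falls through to the provider,
        # so the system keeps working even if Redis is down.
        logger.warning("cache read failed; continuing without cache", exc_info=True)
        return None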
Metrics and Observability
The cache records Prometheus metrics for monitoring:
Key Metrics to Monitor
- Cache Hit Rate - CACHE_HITS / (CACHE_HITS + CACHE_MISSES)
- Cache Errors - Logged to console for debugging
- Provider Savings - Requests avoided by cache hits
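The hit-rate formula above can be expressed as a tiny helper. This plain-Python stand-in borrows the CACHE_HITS / CACHE_MISSES names from the docs; real code would back them with prometheus_client counters:

```python
class CacheMetrics:
    """Plain-Python stand-in for the cache's Prometheus counters."""

    def __init__(self):
        self.hits = 0    # corresponds to CACHE_HITS
        self.misses = 0  # corresponds to CACHE_MISSES

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        # CACHE_HITS / (CACHE_HITS + CACHE_MISSES), guarding division by zero.
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```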
Configuration
Cache behavior is controlled by environment variables:
Cache Invalidation
Currently, the cache uses TTL-based expiration only. To invalidate cache entries:
Manual Invalidation
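Manual invalidation comes down to standard Redis commands. A sketch, assuming the cache has its own Redis database (helper names are hypothetical):

```python
def invalidate_all(client) -> None:
    # Drop every cache entry. `client` is a Redis-style client
    # (redis.Redis in practice). FLUSHDB clears the whole database,
    # so only use it when the cache has a dedicated Redis DB.
    client.flushdb()


def invalidate_key(client, key: str) -> None:
    # Remove a single entry by its exact cache key.
    client.delete(key)
```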
Selective Invalidation
To implement selective invalidation, you could extend the RedisCache class:
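One possible shape for such an extension, sketched here with assumed names (the subclass, method, and key-prefix scheme are not from the source):

```python
class RedisCache:
    """Minimal base standing in for app/core/cache.py's RedisCache."""

    def __init__(self, client):
        self.client = client  # Redis-style client


class SelectiveRedisCache(RedisCache):
    """Hypothetical extension: invalidate every key under a prefix,
    e.g. all entries tagged with a given model_hint."""

    def invalidate_prefix(self, prefix: str) -> int:
        # scan_iter walks matching keys without blocking Redis the way
        # KEYS would; returns the number of entries removed.
        removed = 0
        for key in list(self.client.scan_iter(match=prefix + "*")):
            self.client.delete(key)
            removed += 1
        return removed
```

Prefix-based invalidation only works if cache keys embed a structured prefix; with the pure SHA-256 keys described above you would need to maintain a secondary index (e.g. a Redis set per tag) instead.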
Best Practices
Choose appropriate TTL
Set TTL based on how quickly your data becomes stale. For LLM responses, 5-10 minutes is often a good balance.
Monitor cache hit rate
Track CACHE_HITS / (CACHE_HITS + CACHE_MISSES). A low hit rate may indicate TTL is too short or requests are too diverse.
Use Redis persistence
Configure Redis with AOF or RDB persistence to preserve cache across restarts and maximize hit rate.
Consider cache warming
For predictable queries, pre-populate the cache during deployment to improve initial response times.
Example: Cache Hit vs Miss
First Request (Cache Miss)
Second Identical Request (Cache Hit)
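The two requests above can be simulated end to end. The provider call is faked with a short sleep, and all names are illustrative:

```python
import hashlib
import json
import time

cache = {}  # in-memory stand-in for Redis


def chat(messages):
    key = hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()
    if key in cache:
        return cache[key], "hit"   # millisecond path: no provider call
    time.sleep(0.05)               # stands in for a slow provider round-trip
    response = {"content": "Hello!"}
    cache[key] = response
    return response, "miss"


msgs = [{"role": "user", "content": "Say hello"}]
_, first = chat(msgs)   # first request: cache miss, pays the provider latency
_, second = chat(msgs)  # identical request: served straight from the cache
```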
Next Steps
Architecture
Understand how caching fits into the overall architecture
Rate Limiting
Learn about Redis-based rate limiting
Routing
See how model_hint affects cache keys
Monitoring
Set up metrics and observability