## Overview

LLM Gateway includes built-in response caching using Redis. Identical requests return cached responses instantly, reducing latency from seconds to milliseconds and cutting costs by avoiding redundant API calls.
## How It Works

Caching is automatic and transparent:

1. Request arrives at the gateway
2. Cache key is generated from the request payload
3. Cache is checked for an existing response
   - **If cached**: return the response immediately (< 10ms)
   - **If not cached**: forward to the provider, cache the response, return it to the client
```typescript
// packages/cache/src/cache.ts
import crypto from "node:crypto";

import { redisClient } from "./redis";

export function generateCacheKey(payload: Record<string, any>): string {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(payload))
    .digest("hex");
}

export async function getCache(key: string): Promise<any | null> {
  try {
    const cachedValue = await redisClient.get(key);
    if (!cachedValue) {
      return null;
    }
    return JSON.parse(cachedValue);
  } catch (error) {
    logger.error("Error getting cache:", error as Error);
    return null;
  }
}

export async function setCache(
  key: string,
  value: any,
  expirationSeconds: number,
): Promise<void> {
  try {
    // node-redis v4 takes the expiry as an options object
    await redisClient.set(key, JSON.stringify(value), {
      EX: expirationSeconds,
    });
  } catch (error) {
    logger.error("Error setting cache:", error as Error);
  }
}
```
## Cache Key Generation

The cache key is a SHA-256 hash of:

- Model name
- Messages (full conversation history)
- Temperature
- Max tokens
- Top P
- Frequency penalty
- Presence penalty
- Tools and tool choice
- Response format
- All other request parameters

Even a tiny change to the request (such as an added space) produces a different cache key, and therefore a cache miss.
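To make this sensitivity concrete, the snippet below computes keys for two payloads that differ only by a trailing space. The hashing scheme mirrors `generateCacheKey` above; the payloads themselves are illustrative.

```typescript
import crypto from "node:crypto";

// Same hashing scheme as generateCacheKey above.
function keyFor(payload: Record<string, any>): string {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(payload))
    .digest("hex");
}

const base = {
  model: "gpt-4o",
  messages: [{ role: "user", content: "What is Redis?" }],
  temperature: 0,
};
// Identical except for one trailing space in the message content.
const spaced = {
  ...base,
  messages: [{ role: "user", content: "What is Redis? " }],
};

const keyA = keyFor(base);
const keyB = keyFor(spaced);
// keyA !== keyB: the extra space alone forces a cache miss
```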
## Streaming Cache

Streaming responses are also cached:

```typescript
// packages/cache/src/cache.ts
interface StreamingCacheChunk {
  data: string;
  eventId: number;
  event?: string;
  timestamp: number;
}

interface StreamingCacheData {
  chunks: StreamingCacheChunk[];
  metadata: {
    model: string;
    provider: string;
    finishReason: string | null;
    totalChunks: number;
    duration: number;
    completed: boolean;
  };
}

export function generateStreamingCacheKey(
  payload: Record<string, any>,
): string {
  return `stream:${generateCacheKey(payload)}`;
}

export async function setStreamingCache(
  key: string,
  data: StreamingCacheData,
  expirationSeconds: number,
): Promise<void> {
  // node-redis v4 takes the expiry as an options object
  await redisClient.set(key, JSON.stringify(data), { EX: expirationSeconds });
}
```
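On a streaming cache hit, the stored chunks can be replayed to the client in order. The helper below is a hypothetical sketch of that replay step (not the gateway's actual code), formatting cached chunks back into Server-Sent Events frames:

```typescript
// Hypothetical replay helper: turn cached chunks back into SSE frames.
interface StreamingCacheChunk {
  data: string;
  eventId: number;
  event?: string;
  timestamp: number;
}

function toSseFrames(chunks: StreamingCacheChunk[]): string[] {
  return chunks.map((chunk) => {
    const eventLine = chunk.event ? `event: ${chunk.event}\n` : "";
    return `${eventLine}id: ${chunk.eventId}\ndata: ${chunk.data}\n\n`;
  });
}
```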
## Cache Expiration

Caches have a configurable TTL (time-to-live):

```typescript
// Default: 1 hour for non-streaming responses
const expirationSeconds = 3600;

// Streaming responses: 30 minutes
const streamingExpirationSeconds = 1800;
```

Configure cache TTL via environment variables:

- `CACHE_TTL_SECONDS` - non-streaming responses
- `STREAMING_CACHE_TTL_SECONDS` - streaming responses
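A sketch of reading these variables with the documented defaults (the parsing helper is an assumption for illustration, not gateway code):

```typescript
// Parse a TTL env var, falling back to the default when unset or invalid.
function readTtlSeconds(envValue: string | undefined, fallback: number): number {
  const parsed = Number(envValue);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

const cacheTtlSeconds = readTtlSeconds(process.env.CACHE_TTL_SECONDS, 3600);
const streamingCacheTtlSeconds = readTtlSeconds(
  process.env.STREAMING_CACHE_TTL_SECONDS,
  1800,
);
```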
## Database Caching

LLM Gateway also caches database queries:

```typescript
// apps/gateway/src/lib/cached-queries.ts
export async function findApiKeyByToken(token: string) {
  const cacheKey = `api_key:${token}`;
  const cached = await getCache(cacheKey);
  if (cached) {
    return cached;
  }
  const apiKey = await db.query.apiKey.findFirst({
    where: { token: { eq: token }, status: { eq: "active" } },
  });
  // Cache for 5 minutes (only when a matching key was found)
  if (apiKey) {
    await setCache(cacheKey, apiKey, 300);
  }
  return apiKey;
}
```

Cached queries:

- API keys - 5 minute TTL
- Projects - 5 minute TTL
- Organizations - 5 minute TTL
- Provider keys - 5 minute TTL
- Custom provider keys - 5 minute TTL
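The same read-through pattern generalizes to any of these queries. Below is a sketch of a generic helper, with an in-memory `Map` standing in for Redis so the example runs standalone; `withCache` is illustrative, not an actual gateway export:

```typescript
// In-memory stand-in for Redis, keeping the example self-contained.
const store = new Map<string, { value: string; expiresAt: number }>();

async function getCache(key: string): Promise<any | null> {
  const entry = store.get(key);
  if (!entry || entry.expiresAt < Date.now()) return null;
  return JSON.parse(entry.value);
}

async function setCache(key: string, value: any, ttlSeconds: number): Promise<void> {
  store.set(key, {
    value: JSON.stringify(value),
    expiresAt: Date.now() + ttlSeconds * 1000,
  });
}

// Read-through helper: serve from cache, otherwise load and populate.
async function withCache<T>(
  key: string,
  ttlSeconds: number,
  loader: () => Promise<T>,
): Promise<T> {
  const cached = await getCache(key);
  if (cached !== null) return cached as T;
  const fresh = await loader();
  // Only cache real results, so a missing row is re-queried next time.
  if (fresh != null) await setCache(key, fresh, ttlSeconds);
  return fresh;
}
```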
## When Caching Helps

- **Repeated Questions** - users asking the same question multiple times
- **Static Content** - generating content that doesn't change often
- **Documentation** - answering common documentation questions
- **Autocomplete** - code completion with identical contexts

## When Caching Doesn't Help

- **Unique Requests** - every request is different (user-specific data)
- **Time-Sensitive** - responses need to be fresh (news, weather)
- **Creative Tasks** - high temperature generates different outputs
- **Streaming Required** - the client needs real-time streaming, not cached chunks
## Detecting Cache Hits

Cache hits are indicated in the response:

```json
{
  "id": "chatcmpl-xyz",
  "cached": true,
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 50,
    "total_tokens": 200
  },
  "metadata": {
    "used_provider": "openai",
    "used_model": "gpt-4o"
  }
}
```
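Client code can use the `cached` flag, for example to track the observed hit rate over a batch of responses. The `GatewayResponse` shape below is a minimal assumption based on the example above:

```typescript
interface GatewayResponse {
  id?: string;
  cached?: boolean;
}

// Fraction of responses in a batch that were served from cache.
function observedHitRate(responses: GatewayResponse[]): number {
  if (responses.length === 0) return 0;
  const hits = responses.filter((r) => r.cached === true).length;
  return hits / responses.length;
}
```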
## Disabling Cache

Caching is enabled by default for all requests. To disable it globally:

```bash
# Set in environment
DISABLE_CACHE=true
```

Caching can also be controlled per organization; the check looks like this:

```typescript
// packages/db/src/queries.ts
export async function isCachingEnabled(
  organizationId: string,
): Promise<boolean> {
  if (process.env.DISABLE_CACHE === "true") {
    return false;
  }
  // Check organization settings
  const org = await db.query.organization.findFirst({
    where: { id: { eq: organizationId } },
  });
  return org?.cachingEnabled ?? true;
}
```
## Cache Hit Rates

Monitor cache effectiveness:

```sql
SELECT
  DATE(created_at) AS date,
  COUNT(*) AS total_requests,
  SUM(CASE WHEN cached THEN 1 ELSE 0 END) AS cache_hits,
  ROUND(100.0 * SUM(CASE WHEN cached THEN 1 ELSE 0 END) / COUNT(*), 2) AS hit_rate
FROM logs
WHERE organization_id = 'org_abc'
GROUP BY DATE(created_at)
ORDER BY date DESC;
```
## Latency Improvement

Typical latencies:

| Scenario | Latency |
| --- | --- |
| Cache hit | < 10ms |
| Cache miss (GPT-4o) | 800-2000ms |
| Cache miss (Claude) | 1200-3000ms |
| Cache miss (Gemini) | 600-1500ms |

Based on these figures, cache hits are roughly 60-300x faster than calling the provider API.
## Cost Savings

Cached responses don't count toward your API usage:

```typescript
if (cachedResponse) {
  // No provider API call = no cost
  // No token usage deducted
  return cachedResponse;
}
```

### Example Savings

| Cache Hit Rate | Requests/Day | Cost Without Cache | Cost With Cache | Savings |
| --- | --- | --- | --- | --- |
| 30% | 10,000 | $50.00 | $35.00 | $15.00/day |
| 50% | 10,000 | $50.00 | $25.00 | $25.00/day |
| 70% | 10,000 | $50.00 | $15.00 | $35.00/day |
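The rows above follow from a simple model: every cache hit avoids a full-price provider call, so cost scales with the miss rate. This is an idealization (it ignores Redis hosting costs, for instance):

```typescript
// Daily cost under the idealized model: only cache misses are billed.
function costWithCache(baseDailyCostUsd: number, hitRate: number): number {
  return baseDailyCostUsd * (1 - hitRate);
}
```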
## Redis Configuration

LLM Gateway uses Redis for caching:

```typescript
// packages/cache/src/redis.ts
import { createClient } from "redis";

export const redisClient = createClient({
  url: process.env.REDIS_URL || "redis://localhost:6379",
  socket: {
    connectTimeout: 10000,
    reconnectStrategy: (retries) => {
      if (retries > 10) {
        return new Error("Redis reconnect failed");
      }
      return Math.min(retries * 100, 3000);
    },
  },
});
```
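The `reconnectStrategy` above retries with a linear backoff (100ms per attempt) and gives up after 10 retries. A standalone copy of that logic makes the schedule easy to inspect:

```typescript
// Standalone copy of the reconnectStrategy backoff from redis.ts.
function reconnectDelay(retries: number): number | Error {
  if (retries > 10) {
    return new Error("Redis reconnect failed");
  }
  return Math.min(retries * 100, 3000);
}
// Delays: 100ms, 200ms, ..., 1000ms, then a terminal error on retry 11.
```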
## Self-Hosting

When self-hosting, configure Redis:

```yaml
# docker-compose.yml
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru

volumes:
  redis-data:
```
## Cache Invalidation

Caches expire automatically based on their TTL. For manual invalidation:

```bash
# Clear all caches
redis-cli FLUSHDB

# Clear a specific cache key
redis-cli DEL "stream:abc123..."

# Clear all streaming caches
redis-cli --scan --pattern "stream:*" | xargs redis-cli DEL
```
## Best Practices

- **Use a Lower Temperature** - temperature = 0 makes responses as deterministic as possible
- **Normalize Inputs** - trim whitespace and normalize formatting
- **Monitor Hit Rates** - track cache effectiveness in analytics
- **Set an Appropriate TTL** - balance freshness vs. cache hits
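For the normalization tip, something as small as trimming and collapsing whitespace before sending a request raises the odds that equivalent prompts share a cache key. A hypothetical client-side helper (not part of the gateway):

```typescript
// Hypothetical client-side normalizer: trim and collapse whitespace so
// trivially different prompts hash to the same cache key.
function normalizePrompt(text: string): string {
  return text.trim().replace(/\s+/g, " ");
}
```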
## Limitations

- **Cache size** - limited by available Redis memory
- **Cache misses** - any parameter change invalidates the cache key
- **No partial matching** - an exact match is required
- **Test mode** - caching is disabled when `NODE_ENV=test`