The Portkey AI Gateway includes built-in caching to reduce latency and API costs by storing and reusing LLM responses. The caching system supports both simple and semantic caching modes.
## How Caching Works

When caching is enabled, the gateway:

1. Generates a cache key from the request body and URL
2. Checks for a cached response before making the provider request
3. Returns the cached response if found and not expired
4. Makes the provider request on a cache miss
5. Stores the response in cache for future requests
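The steps above amount to a read-through cache. A minimal sketch (hypothetical names, not the gateway's actual middleware API):

```typescript
// Read-through cache sketch: check, miss, call provider, store.
type CacheEntry = { responseBody: string; expiresAt: number };
const cache = new Map<string, CacheEntry>();

async function handleRequest(
  key: string,
  callProvider: () => Promise<string>,
  maxAgeMs: number
): Promise<{ body: string; status: 'HIT' | 'MISS' }> {
  const entry = cache.get(key);
  if (entry && entry.expiresAt > Date.now()) {
    return { body: entry.responseBody, status: 'HIT' }; // step 3: serve cached
  }
  const body = await callProvider(); // step 4: miss, so call the provider
  cache.set(key, { responseBody: body, expiresAt: Date.now() + maxAgeMs }); // step 5
  return { body, status: 'MISS' };
}
```

The second identical request within the TTL is served from the map without touching the provider.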
## Cache Configuration

### Enable Caching

Caching can be enabled globally or per-request.

Global configuration (`conf.json`):
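A minimal example, assuming the top-level boolean flag implied by the `conf.cache === true` check in the middleware registration shown later on this page:

```json
{
  "cache": true
}
```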
Per-request configuration:

```json
{
  "provider": "openai",
  "api_key": "sk-...",
  "cache": {
    "mode": "simple",
    "max_age": 3600000
  }
}
```
## Cache Modes

### Simple Caching

Caches based on an exact request match:

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 3600000 // 1 hour in milliseconds
  }
}
```

How it works:

- The request body and URL are hashed to create a cache key
- An exact match is required for a cache hit
- The TTL (time-to-live) is specified by `max_age`
- The default TTL is 24 hours (86,400,000 ms) if not specified
### Semantic Caching

Semantic caching is available in the hosted and enterprise versions. The open-source gateway currently supports simple caching only.

Semantic caching matches requests with similar meaning:

```json
{
  "cache": {
    "mode": "semantic",
    "max_age": 7200000 // 2 hours
  }
}
```
## Cache Key Generation

The cache key is generated using SHA-256 hashing:

```typescript
const getCacheKey = async (requestBody: any, url: string) => {
  const stringToHash = `${JSON.stringify(requestBody)}-${url}`;
  const encoded = new TextEncoder().encode(stringToHash);
  const hash = await crypto.subtle.digest({ name: 'SHA-256' }, encoded);
  return Array.from(new Uint8Array(hash))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
};
```
Source: src/middlewares/cache/index.ts:14-26
What’s included in the key:

- Complete request body (model, messages, temperature, etc.)
- Request URL (provider endpoint)
- All request parameters

What’s excluded:

- Request headers (except values that also appear in the body)
- Timestamps
- User metadata
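To see the exact-match behaviour in practice, here is a synchronous sketch of the same scheme using Node's built-in `crypto` module (the gateway itself uses the async Web Crypto API shown above):

```typescript
import { createHash } from 'node:crypto';

// Same scheme: SHA-256 over "<JSON body>-<url>", hex-encoded.
function cacheKey(requestBody: unknown, url: string): string {
  return createHash('sha256')
    .update(`${JSON.stringify(requestBody)}-${url}`)
    .digest('hex');
}

const url = 'https://api.openai.com/v1/chat/completions';
const a = cacheKey({ model: 'gpt-4', temperature: 0 }, url);
const b = cacheKey({ model: 'gpt-4', temperature: 0 }, url);
const c = cacheKey({ model: 'gpt-4', temperature: 0.1 }, url);
// a === b (identical requests), a !== c (any parameter change misses)
```

Note that `JSON.stringify` is key-order sensitive, so two semantically identical bodies whose keys are serialized in a different order will hash to different keys.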
## Cache Storage

### In-Memory Cache

The default implementation uses in-memory storage:

```typescript
const inMemoryCache: Record<string, CacheEntry> = {};

interface CacheEntry {
  responseBody: string; // Serialized JSON response
  maxAge: number;       // Expiration timestamp
}
```
Source: src/middlewares/cache/index.ts:3
Characteristics:

- Fast access (no network calls)
- Lost on gateway restart
- Not shared across gateway instances
- Limited by available memory
### Redis Cache

Redis caching is available in the Node.js runtime when `REDIS_CONNECTION_STRING` is set. For production deployments, use Redis for persistent, distributed caching:

```sh
REDIS_CONNECTION_STRING=redis://localhost:6379 npm start
```

Redis benefits:

- Persistent across restarts
- Shared across multiple gateway instances
- Configurable eviction policies
- Support for large cache sizes
Source: src/index.ts:49-51
## Cache Lifecycle

### Retrieving from Cache

```typescript
const getFromCache = async (
  env: any,
  requestHeaders: any,
  requestBody: any,
  url: string,
  organisationId: string,
  cacheMode: string,
  cacheMaxAge: number | null
) => {
  // Check for the force-refresh header
  if ('x-portkey-cache-force-refresh' in requestHeaders) {
    return [null, CACHE_STATUS.REFRESH, null];
  }

  const cacheKey = await getCacheKey(requestBody, url);
  if (cacheKey in inMemoryCache) {
    const cacheObject = inMemoryCache[cacheKey];
    // Evict and miss if the entry has expired
    if (cacheObject.maxAge && cacheObject.maxAge < Date.now()) {
      delete inMemoryCache[cacheKey];
      return [null, CACHE_STATUS.MISS, null];
    }
    return [cacheObject.responseBody, CACHE_STATUS.HIT, cacheKey];
  }
  return [null, CACHE_STATUS.MISS, null];
};
```
Source: src/middlewares/cache/index.ts:29-58
### Storing in Cache

```typescript
const putInCache = async (
  env: any,
  requestHeaders: any,
  requestBody: any,
  responseBody: any,
  url: string,
  organisationId: string,
  cacheMode: string | null,
  cacheMaxAge: number | null
) => {
  // Don't cache streaming responses
  if (requestBody.stream) {
    return;
  }

  const cacheKey = await getCacheKey(requestBody, url);
  inMemoryCache[cacheKey] = {
    responseBody: JSON.stringify(responseBody),
    // maxAge stores the absolute expiry timestamp: now + TTL (default 24 hours)
    maxAge: Date.now() + (cacheMaxAge || 24 * 60 * 60 * 1000),
  };
};
```
Source: src/middlewares/cache/index.ts:60-82
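Putting the two halves together, a simplified round trip with lazy eviction on read (hypothetical helper names, mirroring the `getFromCache`/`putInCache` logic above):

```typescript
type Entry = { responseBody: string; maxAge: number }; // maxAge = expiry timestamp
const store: Record<string, Entry> = {};

function put(key: string, body: unknown, ttlMs = 24 * 60 * 60 * 1000): void {
  store[key] = { responseBody: JSON.stringify(body), maxAge: Date.now() + ttlMs };
}

function get(key: string): [string | null, 'HIT' | 'MISS'] {
  const entry = store[key];
  if (!entry) return [null, 'MISS'];
  if (entry.maxAge < Date.now()) {
    delete store[key]; // lazy eviction: expired entries are removed on read
    return [null, 'MISS'];
  }
  return [entry.responseBody, 'HIT'];
}

put('k1', { id: 'chatcmpl-1' });          // fresh entry
const [hit] = get('k1');                   // served from cache
put('k2', { id: 'chatcmpl-2' }, -1);       // TTL already elapsed
const [, expired] = get('k2');             // evicted on read, reported as MISS
```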
## Cache Status

The gateway tracks cache operations with status codes:

```typescript
const CACHE_STATUS = {
  HIT: 'HIT',                     // Response served from cache
  SEMANTIC_HIT: 'SEMANTIC HIT',   // Semantic match found
  MISS: 'MISS',                   // No cache entry found
  SEMANTIC_MISS: 'SEMANTIC MISS', // No semantic match
  REFRESH: 'REFRESH',             // Force refresh requested
  DISABLED: 'DISABLED',           // Caching disabled
};
```
Source: src/middlewares/cache/index.ts:5-12
## Cache Invalidation

### Force Refresh

Bypass the cache and fetch a fresh response:

```sh
curl https://gateway.portkey.ai/v1/chat/completions \
  -H "x-portkey-cache-force-refresh: true" \
  -H "x-portkey-config: {...}" \
  -d '{...}'
```

This returns a fresh response and updates the cache.
### Expiration-based

Entries automatically expire when `max_age` is reached:

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 1800000 // 30 minutes
  }
}
```
### Manual Invalidation

For the in-memory cache, entries are removed on:

- Gateway restart
- The expiration check during retrieval

For the Redis cache:

- Configure TTLs in Redis
- Use Redis commands to flush keys
- Implement custom invalidation logic
## Streaming and Caching

Streaming responses cannot be cached; the cache middleware automatically skips streaming requests:

```typescript
if (requestBody.stream) {
  // Does not support caching of streams
  return;
}
```

Reason: streaming responses are sent incrementally as Server-Sent Events (SSE), making them incompatible with response caching.

Source: src/middlewares/cache/index.ts:70-73
## Cache TTL Calculation

The effective TTL is calculated as:

```typescript
const effectiveTTL = cacheMaxAge || 24 * 60 * 60 * 1000;
const expirationTime = Date.now() + effectiveTTL;
```

- Default TTL: 24 hours (86,400,000 milliseconds)
- Minimum TTL: no minimum enforced
- Maximum TTL: no maximum enforced (be mindful of stale data)
Source: src/middlewares/cache/index.ts:107-108
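The same defaulting logic as a small helper (hypothetical name; note it uses `??` so an explicit `0` would be honoured, whereas the `||` in the snippet above would fall back to the default for `0`):

```typescript
const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // 86,400,000 ms

// Returns the absolute expiry timestamp for a cache entry.
function expirationTime(cacheMaxAge: number | null, now = Date.now()): number {
  return now + (cacheMaxAge ?? DEFAULT_TTL_MS);
}

const explicit = expirationTime(3_600_000, 0); // 1-hour TTL from t=0
const fallback = expirationTime(null, 0);      // default 24-hour TTL from t=0
```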
## Cache Middleware Integration

The cache middleware is registered in the request pipeline:

```typescript
if (conf.cache === true) {
  app.use('*', memoryCache());
}
```

Source: src/index.ts:108-110

The middleware:

- Runs on all routes (`'*'`)
- Executes before request handlers
- Stores responses after successful completion
## Cache Hit Rate

To maximize cache effectiveness:

- Use consistent request parameters
- Avoid random seeds or IDs in requests
- Group similar requests
- Set appropriate TTLs
## Memory Usage

Monitor memory consumption. A rough per-entry size estimate:

```typescript
// Approximate size of each cache entry, in bytes
const entrySize =
  JSON.stringify(responseBody).length + // Serialized response
  64 +                                  // Cache key (SHA-256 hex)
  8;                                    // maxAge timestamp
```

Best practices:

- Set reasonable `max_age` values
- Use Redis for large caches
- Monitor cache size metrics
- Implement cache size limits
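A cache size limit can be as simple as evicting the oldest entry once a cap is reached. A minimal sketch (the gateway's in-memory cache has no such cap built in; this is an illustration only):

```typescript
const MAX_ENTRIES = 2; // tiny cap for illustration
const lru = new Map<string, string>(); // Map preserves insertion order

function putBounded(key: string, body: string): void {
  if (lru.has(key)) lru.delete(key); // re-inserting refreshes its position
  lru.set(key, body);
  while (lru.size > MAX_ENTRIES) {
    const oldest = lru.keys().next().value as string;
    lru.delete(oldest); // evict the oldest entry
  }
}

putBounded('a', '1');
putBounded('b', '2');
putBounded('c', '3'); // cap exceeded: 'a' is evicted
```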
## Latency Impact

- Cache hit: ~0.1-1 ms (in-memory) or ~1-5 ms (Redis)
- Cache miss: full provider latency plus cache write time
- Cache key generation: ~0.1 ms (SHA-256 hash)
## Use Cases

### Repeated Queries

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 3600000 // 1 hour
  }
}
```

Ideal for:

- Frequently asked questions
- Product descriptions
- Knowledge base queries
- Code generation for common patterns
### Development/Testing

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 86400000 // 24 hours
  }
}
```

Benefits:

- Faster test runs
- Reduced API costs during development
- Consistent responses for testing
### Short TTL for Dynamic Content

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 300000 // 5 minutes
  }
}
```

Use for:

- News summaries
- Real-time data with some staleness tolerance
- Load reduction during traffic spikes
## Combining with Other Features

### Cache + Fallback

```json
{
  "cache": { "mode": "simple", "max_age": 3600000 },
  "strategy": { "mode": "fallback" },
  "targets": [
    { "provider": "openai", "api_key": "sk-..." },
    { "provider": "anthropic", "api_key": "sk-ant-..." }
  ]
}
```

The cache is checked before routing, so cached responses bypass the entire routing logic.
### Cache + Load Balancing

```json
{
  "cache": { "mode": "simple", "max_age": 1800000 },
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "provider": "openai", "api_key": "sk-1", "weight": 0.5 },
    { "provider": "openai", "api_key": "sk-2", "weight": 0.5 }
  ]
}
```

Cache hits save costs across all providers.
## Best Practices

- **Set appropriate TTLs.** Balance freshness against cost savings based on your use case.
- **Use Redis for production.** Deploy Redis for persistent, distributed caching at scale.
- **Monitor cache metrics.** Track hit rates, memory usage, and cost savings.
- **Disable for real-time content.** Don’t cache time-sensitive or user-specific responses.
## Next Steps

- **Configs**: Learn how to configure caching in gateway configs.
- **Routing**: Understand how caching interacts with routing.
- **Load Balancing**: Combine caching with load balancing for optimal performance.
- **Deployment**: Set up Redis caching in production deployments.