The Portkey AI Gateway includes built-in caching to reduce latency and API costs by storing and reusing LLM responses. The caching system supports both simple and semantic caching modes.
## How Caching Works

When caching is enabled, the gateway:

1. Generates a cache key from the request body and URL
2. Checks for a cached response before making the provider request
3. Returns the cached response if found and not expired
4. Makes the provider request on a cache miss
5. Stores the response in cache for future requests
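The steps above amount to a read-through cache. A minimal sketch (hypothetical names, not the gateway's actual middleware API):

```typescript
// Read-through cache sketch: check, miss, call provider, store.
type CacheEntry = { responseBody: string; expiresAt: number };
const cache = new Map<string, CacheEntry>();

async function handleRequest(
  key: string,
  callProvider: () => Promise<string>,
  maxAgeMs: number
): Promise<{ body: string; status: 'HIT' | 'MISS' }> {
  const entry = cache.get(key);
  if (entry && entry.expiresAt > Date.now()) {
    return { body: entry.responseBody, status: 'HIT' }; // step 3: serve cached
  }
  const body = await callProvider(); // step 4: miss, so call the provider
  cache.set(key, { responseBody: body, expiresAt: Date.now() + maxAgeMs }); // step 5
  return { body, status: 'MISS' };
}
```

The second identical request within the TTL is served from the map without touching the provider.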
## Cache Configuration

### Enable Caching

Caching can be enabled globally or per-request.

Global configuration (`conf.json`):
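A minimal example, assuming the top-level boolean flag implied by the `conf.cache === true` check in the middleware registration shown later on this page:

```json
{
  "cache": true
}
```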
Per-request configuration:

```json
{
  "provider": "openai",
  "api_key": "sk-...",
  "cache": {
    "mode": "simple",
    "max_age": 3600000
  }
}
```
## Cache Modes

### Simple Caching

Caches based on an exact request match:

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 3600000 // 1 hour in milliseconds
  }
}
```

How it works:

- The request body and URL are hashed to create a cache key
- An exact match is required for a cache hit
- The TTL (time-to-live) is specified by `max_age`
- The default TTL is 24 hours (86,400,000 ms) if not specified
### Semantic Caching

Semantic caching is available in the hosted and enterprise versions. The open-source gateway currently supports simple caching only.

Semantic caching matches requests with similar meaning:

```json
{
  "cache": {
    "mode": "semantic",
    "max_age": 7200000 // 2 hours
  }
}
```
## Cache Key Generation

The cache key is generated using SHA-256 hashing:

```typescript
const getCacheKey = async (requestBody: any, url: string) => {
  const stringToHash = `${JSON.stringify(requestBody)}-${url}`;
  const encoded = new TextEncoder().encode(stringToHash);
  const hash = await crypto.subtle.digest({ name: 'SHA-256' }, encoded);
  return Array.from(new Uint8Array(hash))
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
};
```
Source: src/middlewares/cache/index.ts:14-26
What’s included in the key:

- Complete request body (model, messages, temperature, etc.)
- Request URL (provider endpoint)
- All request parameters

What’s excluded:

- Request headers (except values that also appear in the body)
- Timestamps
- User metadata
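To see the exact-match behaviour in practice, here is a synchronous sketch of the same scheme using Node's built-in `crypto` module (the gateway itself uses the async Web Crypto API shown above):

```typescript
import { createHash } from 'node:crypto';

// Same scheme: SHA-256 over "<JSON body>-<url>", hex-encoded.
function cacheKey(requestBody: unknown, url: string): string {
  return createHash('sha256')
    .update(`${JSON.stringify(requestBody)}-${url}`)
    .digest('hex');
}

const url = 'https://api.openai.com/v1/chat/completions';
const a = cacheKey({ model: 'gpt-4', temperature: 0 }, url);
const b = cacheKey({ model: 'gpt-4', temperature: 0 }, url);
const c = cacheKey({ model: 'gpt-4', temperature: 0.1 }, url);
// a === b (identical requests), a !== c (any parameter change misses)
```

Note that `JSON.stringify` is key-order sensitive, so two semantically identical bodies whose keys are serialized in a different order will hash to different keys.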
## Cache Storage

### In-Memory Cache

The default implementation uses in-memory storage:

```typescript
const inMemoryCache: Record<string, CacheEntry> = {};

interface CacheEntry {
  responseBody: string; // Serialized JSON response
  maxAge: number;       // Expiration timestamp
}
```
Source: src/middlewares/cache/index.ts:3
Characteristics:

- Fast access (no network calls)
- Lost on gateway restart
- Not shared across gateway instances
- Limited by available memory
### Redis Cache

Redis caching is available in the Node.js runtime when `REDIS_CONNECTION_STRING` is set. For production deployments, use Redis for persistent, distributed caching:

```sh
REDIS_CONNECTION_STRING=redis://localhost:6379 npm start
```

Redis benefits:

- Persistent across restarts
- Shared across multiple gateway instances
- Configurable eviction policies
- Support for large cache sizes
Source: src/index.ts:49-51
## Cache Lifecycle

### Retrieving from Cache

```typescript
const getFromCache = async (
  env: any,
  requestHeaders: any,
  requestBody: any,
  url: string,
  organisationId: string,
  cacheMode: string,
  cacheMaxAge: number | null
) => {
  // Check for the force-refresh header
  if ('x-portkey-cache-force-refresh' in requestHeaders) {
    return [null, CACHE_STATUS.REFRESH, null];
  }

  const cacheKey = await getCacheKey(requestBody, url);
  if (cacheKey in inMemoryCache) {
    const cacheObject = inMemoryCache[cacheKey];
    // Evict and miss if the entry has expired
    if (cacheObject.maxAge && cacheObject.maxAge < Date.now()) {
      delete inMemoryCache[cacheKey];
      return [null, CACHE_STATUS.MISS, null];
    }
    return [cacheObject.responseBody, CACHE_STATUS.HIT, cacheKey];
  }
  return [null, CACHE_STATUS.MISS, null];
};
```
Source: src/middlewares/cache/index.ts:29-58
### Storing in Cache

```typescript
const putInCache = async (
  env: any,
  requestHeaders: any,
  requestBody: any,
  responseBody: any,
  url: string,
  organisationId: string,
  cacheMode: string | null,
  cacheMaxAge: number | null
) => {
  // Don't cache streaming responses
  if (requestBody.stream) {
    return;
  }

  const cacheKey = await getCacheKey(requestBody, url);
  inMemoryCache[cacheKey] = {
    responseBody: JSON.stringify(responseBody),
    // maxAge stores the absolute expiry timestamp: now + TTL (default 24 hours)
    maxAge: Date.now() + (cacheMaxAge || 24 * 60 * 60 * 1000),
  };
};
```
Source: src/middlewares/cache/index.ts:60-82
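Putting the two halves together, a simplified round trip with lazy eviction on read (hypothetical helper names, mirroring the `getFromCache`/`putInCache` logic above):

```typescript
type Entry = { responseBody: string; maxAge: number }; // maxAge = expiry timestamp
const store: Record<string, Entry> = {};

function put(key: string, body: unknown, ttlMs = 24 * 60 * 60 * 1000): void {
  store[key] = { responseBody: JSON.stringify(body), maxAge: Date.now() + ttlMs };
}

function get(key: string): [string | null, 'HIT' | 'MISS'] {
  const entry = store[key];
  if (!entry) return [null, 'MISS'];
  if (entry.maxAge < Date.now()) {
    delete store[key]; // lazy eviction: expired entries are removed on read
    return [null, 'MISS'];
  }
  return [entry.responseBody, 'HIT'];
}

put('k1', { id: 'chatcmpl-1' });          // fresh entry
const [hit] = get('k1');                   // served from cache
put('k2', { id: 'chatcmpl-2' }, -1);       // TTL already elapsed
const [, expired] = get('k2');             // evicted on read, reported as MISS
```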
## Cache Status

The gateway tracks cache operations with status codes:

```typescript
const CACHE_STATUS = {
  HIT: 'HIT',                     // Response served from cache
  SEMANTIC_HIT: 'SEMANTIC HIT',   // Semantic match found
  MISS: 'MISS',                   // No cache entry found
  SEMANTIC_MISS: 'SEMANTIC MISS', // No semantic match
  REFRESH: 'REFRESH',             // Force refresh requested
  DISABLED: 'DISABLED',           // Caching disabled
};
```
Source: src/middlewares/cache/index.ts:5-12
## Cache Invalidation

### Force Refresh

Bypass the cache and fetch a fresh response:

```sh
curl https://gateway.portkey.ai/v1/chat/completions \
  -H "x-portkey-cache-force-refresh: true" \
  -H "x-portkey-config: {...}" \
  -d '{...}'
```

This returns a fresh response and updates the cache.
### Expiration-based

Entries automatically expire when `max_age` is reached:

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 1800000 // 30 minutes
  }
}
```
### Manual Invalidation

For the in-memory cache, entries are removed on:

- Gateway restart
- The expiration check during retrieval

For the Redis cache:

- Configure TTLs in Redis
- Use Redis commands to flush keys
- Implement custom invalidation logic
## Streaming and Caching

Streaming responses cannot be cached; the cache middleware automatically skips streaming requests:

```typescript
if (requestBody.stream) {
  // Does not support caching of streams
  return;
}
```

Reason: streaming responses are sent incrementally as Server-Sent Events (SSE), making them incompatible with response caching.

Source: src/middlewares/cache/index.ts:70-73
## Cache TTL Calculation

The effective TTL is calculated as:

```typescript
const effectiveTTL = cacheMaxAge || 24 * 60 * 60 * 1000;
const expirationTime = Date.now() + effectiveTTL;
```

- Default TTL: 24 hours (86,400,000 milliseconds)
- Minimum TTL: no minimum enforced
- Maximum TTL: no maximum enforced (be mindful of stale data)
Source: src/middlewares/cache/index.ts:107-108
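The same defaulting logic as a small helper (hypothetical name; note it uses `??` so an explicit `0` would be honoured, whereas the `||` in the snippet above would fall back to the default for `0`):

```typescript
const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // 86,400,000 ms

// Returns the absolute expiry timestamp for a cache entry.
function expirationTime(cacheMaxAge: number | null, now = Date.now()): number {
  return now + (cacheMaxAge ?? DEFAULT_TTL_MS);
}

const explicit = expirationTime(3_600_000, 0); // 1-hour TTL from t=0
const fallback = expirationTime(null, 0);      // default 24-hour TTL from t=0
```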
## Cache Middleware Integration

The cache middleware is registered in the request pipeline:

```typescript
if (conf.cache === true) {
  app.use('*', memoryCache());
}
```

Source: src/index.ts:108-110

The middleware:

- Runs on all routes (`'*'`)
- Executes before request handlers
- Stores responses after successful completion
## Cache Hit Rate

To maximize cache effectiveness:

- Use consistent request parameters
- Avoid random seeds or IDs in requests
- Group similar requests
- Set appropriate TTLs
## Memory Usage

Monitor memory consumption. A rough per-entry size estimate:

```typescript
// Approximate size of each cache entry, in bytes
const entrySize =
  JSON.stringify(responseBody).length + // Serialized response
  64 +                                  // Cache key (SHA-256 hex)
  8;                                    // maxAge timestamp
```

Best practices:

- Set reasonable `max_age` values
- Use Redis for large caches
- Monitor cache size metrics
- Implement cache size limits
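A cache size limit can be as simple as evicting the oldest entry once a cap is reached. A minimal sketch (the gateway's in-memory cache has no such cap built in; this is an illustration only):

```typescript
const MAX_ENTRIES = 2; // tiny cap for illustration
const lru = new Map<string, string>(); // Map preserves insertion order

function putBounded(key: string, body: string): void {
  if (lru.has(key)) lru.delete(key); // re-inserting refreshes its position
  lru.set(key, body);
  while (lru.size > MAX_ENTRIES) {
    const oldest = lru.keys().next().value as string;
    lru.delete(oldest); // evict the oldest entry
  }
}

putBounded('a', '1');
putBounded('b', '2');
putBounded('c', '3'); // cap exceeded: 'a' is evicted
```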
## Latency Impact

- Cache hit: ~0.1-1 ms (in-memory) or ~1-5 ms (Redis)
- Cache miss: full provider latency plus cache write time
- Cache key generation: ~0.1 ms (SHA-256 hash)
## Use Cases

### Repeated Queries

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 3600000 // 1 hour
  }
}
```

Ideal for:

- Frequently asked questions
- Product descriptions
- Knowledge base queries
- Code generation for common patterns
### Development/Testing

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 86400000 // 24 hours
  }
}
```

Benefits:

- Faster test runs
- Reduced API costs during development
- Consistent responses for testing
### Short TTL for Dynamic Content

```json
{
  "cache": {
    "mode": "simple",
    "max_age": 300000 // 5 minutes
  }
}
```

Use for:

- News summaries
- Real-time data with some staleness tolerance
- Load reduction during traffic spikes
## Combining with Other Features

### Cache + Fallback

```json
{
  "cache": { "mode": "simple", "max_age": 3600000 },
  "strategy": { "mode": "fallback" },
  "targets": [
    { "provider": "openai", "api_key": "sk-..." },
    { "provider": "anthropic", "api_key": "sk-ant-..." }
  ]
}
```

The cache is checked before routing, so cached responses bypass the entire routing logic.
### Cache + Load Balancing

```json
{
  "cache": { "mode": "simple", "max_age": 1800000 },
  "strategy": { "mode": "loadbalance" },
  "targets": [
    { "provider": "openai", "api_key": "sk-1", "weight": 0.5 },
    { "provider": "openai", "api_key": "sk-2", "weight": 0.5 }
  ]
}
```

Cache hits save costs across all providers.
## Best Practices

- **Set appropriate TTLs.** Balance freshness against cost savings based on your use case.
- **Use Redis for production.** Deploy Redis for persistent, distributed caching at scale.
- **Monitor cache metrics.** Track hit rates, memory usage, and cost savings.
- **Disable for real-time content.** Don’t cache time-sensitive or user-specific responses.
## Next Steps

- **Configs**: Learn how to configure caching in gateway configs.
- **Routing**: Understand how caching interacts with routing.
- **Load Balancing**: Combine caching with load balancing for optimal performance.
- **Deployment**: Set up Redis caching in production deployments.