
Overview

LLM Gateway includes built-in response caching using Redis. Identical requests return cached responses instantly, reducing latency from seconds to milliseconds and cutting costs by avoiding redundant API calls.

How It Works

Caching is automatic and transparent:
  1. Request arrives at the gateway
  2. Cache key is generated from the request payload
  3. Cache is checked for existing response
  4. If cached: Return response immediately (< 10ms)
  5. If not cached: Forward to provider, cache response, return to client
packages/cache/src/cache.ts
import crypto from "node:crypto";

// A request's cache key is a SHA-256 hash of its full JSON payload
export function generateCacheKey(payload: Record<string, any>): string {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(payload))
    .digest("hex");
}

export async function getCache(key: string): Promise<any | null> {
  try {
    const cachedValue = await redisClient.get(key);
    if (!cachedValue) {
      return null;
    }
    return JSON.parse(cachedValue);
  } catch (error) {
    logger.error("Error getting cache:", error as Error);
    return null;
  }
}

export async function setCache(
  key: string,
  value: any,
  expirationSeconds: number,
): Promise<void> {
  try {
    await redisClient.set(key, JSON.stringify(value), { EX: expirationSeconds });
  } catch (error) {
    logger.error("Error setting cache:", error as Error);
  }
}
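The lookup flow above (steps 2-5) can be sketched end to end. This is a minimal in-memory illustration, not the gateway's actual code: a Map stands in for Redis, and callProvider is a hypothetical stand-in for the upstream API call.

```typescript
// In-memory stand-in for Redis (illustration only)
const store = new Map<string, string>();

// Hypothetical upstream call; the real gateway forwards to the provider here
async function callProvider(payload: Record<string, any>): Promise<object> {
  return { id: "chatcmpl-demo", model: payload.model };
}

async function getOrFetch(
  key: string,
  payload: Record<string, any>,
): Promise<{ cached: boolean; response: object }> {
  const hit = store.get(key); // step 3: check the cache
  if (hit) {
    return { cached: true, response: JSON.parse(hit) }; // step 4: instant return
  }
  const response = await callProvider(payload); // step 5: forward to provider...
  store.set(key, JSON.stringify(response)); // ...cache the response...
  return { cached: false, response }; // ...and return it to the client
}
```

Calling getOrFetch twice with the same key misses on the first call and hits on the second.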

Cache Key Generation

The cache key is a SHA-256 hash of:
  • Model name
  • Messages (full conversation history)
  • Temperature
  • Max tokens
  • Top P
  • Frequency penalty
  • Presence penalty
  • Tools and tool choice
  • Response format
  • All other request parameters
Even a tiny change to the request (like adding a single space) results in a different cache key and a cache miss.
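To see this sensitivity in practice, here is a small standalone sketch; it re-declares the same hashing scheme as generateCacheKey so it runs on its own:

```typescript
import { createHash } from "node:crypto";

// Same hashing scheme as generateCacheKey: SHA-256 over the serialized payload
function generateCacheKey(payload: Record<string, any>): string {
  return createHash("sha256").update(JSON.stringify(payload)).digest("hex");
}

const keyA = generateCacheKey({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
});
const keyB = generateCacheKey({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello " }], // one trailing space
});

// keyA and keyB are entirely different 64-character hex digests,
// so the second request is a cache miss
```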

Streaming Cache

Streaming responses are also cached:
packages/cache/src/cache.ts
interface StreamingCacheChunk {
  data: string;
  eventId: number;
  event?: string;
  timestamp: number;
}

interface StreamingCacheData {
  chunks: StreamingCacheChunk[];
  metadata: {
    model: string;
    provider: string;
    finishReason: string | null;
    totalChunks: number;
    duration: number;
    completed: boolean;
  };
}

export function generateStreamingCacheKey(
  payload: Record<string, any>,
): string {
  return `stream:${generateCacheKey(payload)}`;
}

export async function setStreamingCache(
  key: string,
  data: StreamingCacheData,
  expirationSeconds: number,
): Promise<void> {
  await redisClient.set(key, JSON.stringify(data), { EX: expirationSeconds });
}
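One way to serve a cached stream is to replay its chunks in their original order. A hedged sketch (replayChunks is illustrative, not a gateway export; the chunk shape is re-declared so the snippet runs standalone):

```typescript
interface StreamingCacheChunk {
  data: string;
  eventId: number;
  event?: string;
  timestamp: number;
}

// Return cached chunk payloads ordered by eventId, ready to re-emit as SSE
function replayChunks(chunks: StreamingCacheChunk[]): string[] {
  return [...chunks]
    .sort((a, b) => a.eventId - b.eventId)
    .map((chunk) => chunk.data);
}

const replayed = replayChunks([
  { data: "world", eventId: 2, timestamp: 2 },
  { data: "hello ", eventId: 1, timestamp: 1 },
]).join("");
```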

Cache Expiration

Cached responses have a configurable TTL (time-to-live):
// Default: 1 hour for non-streaming
const expirationSeconds = 3600;

// Streaming responses: 30 minutes
const streamingExpirationSeconds = 1800;
Configure cache TTL via environment variables:
  • CACHE_TTL_SECONDS - Non-streaming responses
  • STREAMING_CACHE_TTL_SECONDS - Streaming responses
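A sketch of how these variables might be resolved, falling back to the defaults above (resolveTtl is an illustrative helper, not a gateway export):

```typescript
// Parse a TTL from the environment, ignoring unset, non-numeric,
// zero, or negative values
function resolveTtl(envValue: string | undefined, fallbackSeconds: number): number {
  const parsed = Number(envValue);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackSeconds;
}

const cacheTtl = resolveTtl(process.env.CACHE_TTL_SECONDS, 3600);
const streamingCacheTtl = resolveTtl(process.env.STREAMING_CACHE_TTL_SECONDS, 1800);
```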

Database Caching

LLM Gateway also caches database queries:
apps/gateway/src/lib/cached-queries.ts
export async function findApiKeyByToken(token: string) {
  const cacheKey = `api_key:${token}`;
  const cached = await getCache(cacheKey);
  
  if (cached) {
    return cached;
  }
  
  const apiKey = await db.query.apiKey.findFirst({
    where: { token: { eq: token }, status: { eq: "active" } }
  });
  
  // Cache for 5 minutes; only cache found keys so a newly created
  // key isn't masked by a cached "not found" result
  if (apiKey) {
    await setCache(cacheKey, apiKey, 300);
  }
  
  return apiKey;
}
Cached queries:
  • API keys - 5 minute TTL
  • Projects - 5 minute TTL
  • Organizations - 5 minute TTL
  • Provider keys - 5 minute TTL
  • Custom provider keys - 5 minute TTL
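All of these follow the same look-aside pattern as findApiKeyByToken. A generic sketch of that pattern, using an in-memory Map in place of getCache/setCache (cachedQuery and loadFn are illustrative names, not gateway exports):

```typescript
const queryCache = new Map<string, { value: unknown; expiresAt: number }>();

async function cachedQuery<T>(
  key: string,
  ttlSeconds: number,
  loadFn: () => Promise<T>,
): Promise<T> {
  const entry = queryCache.get(key);
  if (entry && entry.expiresAt > Date.now()) {
    return entry.value as T; // hit: skip the database entirely
  }
  const value = await loadFn();
  if (value != null) {
    // Only cache non-empty results, so a missing row isn't pinned for the TTL
    queryCache.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
  return value;
}
```

For example, a project lookup would become `cachedQuery(`project:${id}`, 300, () => loadProject(id))`.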

When Caching Helps

  • Repeated questions - Users asking the same question multiple times
  • Static content - Generating content that doesn’t change often
  • Documentation - Answering common documentation questions
  • Autocomplete - Code completion with identical contexts

When Caching Doesn’t Help

  • Unique requests - Every request is different (user-specific data)
  • Time-sensitive content - Responses need to be fresh (news, weather)
  • Creative tasks - High temperature generates different outputs
  • Streaming required - Client needs real-time streaming, not cached chunks

Detecting Cache Hits

Cache hits are indicated in the response:
{
  "id": "chatcmpl-xyz",
  "cached": true,
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 50,
    "total_tokens": 200
  },
  "metadata": {
    "used_provider": "openai",
    "used_model": "gpt-4o"
  }
}
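Clients can read this field to track their own hit rate. A small illustrative helper (hitRate is not part of the gateway API):

```typescript
interface GatewayResponse {
  cached?: boolean;
}

// Fraction of responses served from cache, based on the `cached` flag above
function hitRate(responses: GatewayResponse[]): number {
  if (responses.length === 0) return 0;
  const hits = responses.filter((r) => r.cached === true).length;
  return hits / responses.length;
}
```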

Disabling Cache

Caching is enabled by default for all requests. To disable:
# Set in environment
DISABLE_CACHE=true
Caching can also be toggled per organization; the gateway checks both the environment flag and the organization setting:
packages/db/src/queries.ts
export async function isCachingEnabled(
  organizationId: string
): Promise<boolean> {
  if (process.env.DISABLE_CACHE === "true") {
    return false;
  }
  
  // Check organization settings
  const org = await db.query.organization.findFirst({
    where: { id: { eq: organizationId } }
  });
  
  return org?.cachingEnabled ?? true;
}

Cache Performance

Cache Hit Rates

Monitor cache effectiveness:
SELECT 
  DATE(created_at) as date,
  COUNT(*) as total_requests,
  SUM(CASE WHEN cached THEN 1 ELSE 0 END) as cache_hits,
  ROUND(100.0 * SUM(CASE WHEN cached THEN 1 ELSE 0 END) / COUNT(*), 2) as hit_rate
FROM logs
WHERE organization_id = 'org_abc'
GROUP BY DATE(created_at)
ORDER BY date DESC;

Latency Improvement

Typical latencies:
  Scenario               Latency
  Cache hit              < 10ms
  Cache miss (GPT-4o)    800-2000ms
  Cache miss (Claude)    1200-3000ms
  Cache miss (Gemini)    600-1500ms
Cache hits are 100-300x faster than calling the provider API.

Cost Savings

Cached responses don’t count toward your API usage:
if (cachedResponse) {
  // No provider API call = no cost
  // No token usage deducted
  return cachedResponse;
}

Example Savings

  Cache Hit Rate   Requests/Day   Cost Without Cache   Cost With Cache   Savings
  30%              10,000         $50.00               $35.00            $15.00/day
  50%              10,000         $50.00               $25.00            $25.00/day
  70%              10,000         $50.00               $15.00            $35.00/day
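These rows follow from a simple relationship: cached requests cost nothing, so daily savings equal the daily cost without caching multiplied by the hit rate. As a sketch:

```typescript
// savings = costWithoutCache * hitRate;
// costWithCache = costWithoutCache - savings
function dailySavings(hitRate: number, costWithoutCache: number): number {
  return costWithoutCache * hitRate;
}

// e.g. a 30% hit rate on $50/day of traffic saves $15/day,
// leaving $35/day - matching the first row of the table
```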

Redis Configuration

LLM Gateway uses Redis for caching:
packages/cache/src/redis.ts
import { createClient } from "redis";

export const redisClient = createClient({
  url: process.env.REDIS_URL || "redis://localhost:6379",
  socket: {
    connectTimeout: 10000,
    reconnectStrategy: (retries) => {
      if (retries > 10) {
        return new Error("Redis reconnect failed");
      }
      // Back off linearly, capped at 3 seconds between attempts
      return Math.min(retries * 100, 3000);
    },
  },
});

// Note: node-redis clients don't connect automatically - call
// `await redisClient.connect()` during startup

Self-Hosting

When self-hosting, configure Redis:
# docker-compose.yml
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru

volumes:
  redis-data:

Cache Invalidation

Cached entries expire automatically based on their TTL. To invalidate manually:
# Clear all caches
redis-cli FLUSHDB

# Clear specific cache key
redis-cli DEL "stream:abc123..."

# Clear all streaming caches
redis-cli --scan --pattern "stream:*" | xargs redis-cli DEL

Best Practices

  • Use lower temperature - Temperature = 0 ensures deterministic responses
  • Normalize inputs - Trim whitespace and normalize formatting
  • Monitor hit rates - Track cache effectiveness in analytics
  • Set appropriate TTLs - Balance freshness vs. cache hits
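The "Normalize inputs" tip can be sketched as a pre-processing step applied before a request is sent (normalizeMessages is illustrative, not a gateway helper):

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

// Trim and collapse whitespace so cosmetically different prompts
// hash to the same cache key
function normalizeMessages(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((message) => ({
    role: message.role,
    content: message.content.trim().replace(/\s+/g, " "),
  }));
}
```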

Limitations

  • Cache size - Limited by available Redis memory
  • Cache misses - Any parameter change invalidates cache
  • No partial matching - Exact match required
  • Test mode - Caching disabled when NODE_ENV=test
