
Overview

LLM Gateway includes built-in response caching using Redis. Identical requests return cached responses instantly, reducing latency from seconds to milliseconds and cutting costs by avoiding redundant API calls.

How It Works

Caching is automatic and transparent:
  1. Request arrives at the gateway
  2. Cache key is generated from the request payload
  3. Cache is checked for existing response
  4. If cached: Return response immediately (< 10ms)
  5. If not cached: Forward to provider, cache response, return to client
packages/cache/src/cache.ts
import crypto from "node:crypto";

// A request's cache key is a SHA-256 hash of its full JSON payload
export function generateCacheKey(payload: Record<string, any>): string {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(payload))
    .digest("hex");
}

export async function getCache(key: string): Promise<any | null> {
  try {
    const cachedValue = await redisClient.get(key);
    if (!cachedValue) {
      return null;
    }
    return JSON.parse(cachedValue);
  } catch (error) {
    logger.error("Error getting cache:", error as Error);
    return null;
  }
}

export async function setCache(
  key: string,
  value: any,
  expirationSeconds: number,
): Promise<void> {
  try {
    await redisClient.set(key, JSON.stringify(value), { EX: expirationSeconds });
  } catch (error) {
    logger.error("Error setting cache:", error as Error);
  }
}
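The lookup flow above (steps 2-5) can be sketched end to end. This is a minimal in-memory illustration, not the gateway's actual code: a Map stands in for Redis, and callProvider is a hypothetical stand-in for the upstream API call.

```typescript
// In-memory stand-in for Redis (illustration only)
const store = new Map<string, string>();

// Hypothetical upstream call; the real gateway forwards to the provider here
async function callProvider(payload: Record<string, any>): Promise<object> {
  return { id: "chatcmpl-demo", model: payload.model };
}

async function getOrFetch(
  key: string,
  payload: Record<string, any>,
): Promise<{ cached: boolean; response: object }> {
  const hit = store.get(key); // step 3: check the cache
  if (hit) {
    return { cached: true, response: JSON.parse(hit) }; // step 4: instant return
  }
  const response = await callProvider(payload); // step 5: forward to provider...
  store.set(key, JSON.stringify(response)); // ...cache the response...
  return { cached: false, response }; // ...and return it to the client
}
```

Calling getOrFetch twice with the same key misses on the first call and hits on the second.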

Cache Key Generation

The cache key is a SHA-256 hash of:
  • Model name
  • Messages (full conversation history)
  • Temperature
  • Max tokens
  • Top P
  • Frequency penalty
  • Presence penalty
  • Tools and tool choice
  • Response format
  • All other request parameters
Even a tiny change to the request (like adding a single space) results in a different cache key and a cache miss.
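To see this sensitivity in practice, here is a small standalone sketch; it re-declares the same hashing scheme as generateCacheKey so it runs on its own:

```typescript
import { createHash } from "node:crypto";

// Same hashing scheme as generateCacheKey: SHA-256 over the serialized payload
function generateCacheKey(payload: Record<string, any>): string {
  return createHash("sha256").update(JSON.stringify(payload)).digest("hex");
}

const keyA = generateCacheKey({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello" }],
});
const keyB = generateCacheKey({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Hello " }], // one trailing space
});

// keyA and keyB are entirely different 64-character hex digests,
// so the second request is a cache miss
```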

Streaming Cache

Streaming responses are also cached:
packages/cache/src/cache.ts
interface StreamingCacheChunk {
  data: string;
  eventId: number;
  event?: string;
  timestamp: number;
}

interface StreamingCacheData {
  chunks: StreamingCacheChunk[];
  metadata: {
    model: string;
    provider: string;
    finishReason: string | null;
    totalChunks: number;
    duration: number;
    completed: boolean;
  };
}

export function generateStreamingCacheKey(
  payload: Record<string, any>,
): string {
  return `stream:${generateCacheKey(payload)}`;
}

export async function setStreamingCache(
  key: string,
  data: StreamingCacheData,
  expirationSeconds: number,
): Promise<void> {
  await redisClient.set(key, JSON.stringify(data), { EX: expirationSeconds });
}
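One way to serve a cached stream is to replay its chunks in their original order. A hedged sketch (replayChunks is illustrative, not a gateway export; the chunk shape is re-declared so the snippet runs standalone):

```typescript
interface StreamingCacheChunk {
  data: string;
  eventId: number;
  event?: string;
  timestamp: number;
}

// Return cached chunk payloads ordered by eventId, ready to re-emit as SSE
function replayChunks(chunks: StreamingCacheChunk[]): string[] {
  return [...chunks]
    .sort((a, b) => a.eventId - b.eventId)
    .map((chunk) => chunk.data);
}

const replayed = replayChunks([
  { data: "world", eventId: 2, timestamp: 2 },
  { data: "hello ", eventId: 1, timestamp: 1 },
]).join("");
```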

Cache Expiration

Cached responses have a configurable TTL (time-to-live):
// Default: 1 hour for non-streaming
const expirationSeconds = 3600;

// Streaming responses: 30 minutes
const streamingExpirationSeconds = 1800;
Configure cache TTL via environment variables:
  • CACHE_TTL_SECONDS - Non-streaming responses
  • STREAMING_CACHE_TTL_SECONDS - Streaming responses
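A sketch of how these variables might be resolved, falling back to the defaults above (resolveTtl is an illustrative helper, not a gateway export):

```typescript
// Parse a TTL from the environment, ignoring unset, non-numeric,
// zero, or negative values
function resolveTtl(envValue: string | undefined, fallbackSeconds: number): number {
  const parsed = Number(envValue);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackSeconds;
}

const cacheTtl = resolveTtl(process.env.CACHE_TTL_SECONDS, 3600);
const streamingCacheTtl = resolveTtl(process.env.STREAMING_CACHE_TTL_SECONDS, 1800);
```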

Database Caching

LLM Gateway also caches database queries:
apps/gateway/src/lib/cached-queries.ts
export async function findApiKeyByToken(token: string) {
  const cacheKey = `api_key:${token}`;
  const cached = await getCache(cacheKey);
  
  if (cached) {
    return cached;
  }
  
  const apiKey = await db.query.apiKey.findFirst({
    where: { token: { eq: token }, status: { eq: "active" } }
  });
  
  // Cache for 5 minutes; only cache found keys so a newly created
  // key isn't masked by a cached "not found" result
  if (apiKey) {
    await setCache(cacheKey, apiKey, 300);
  }
  
  return apiKey;
}
Cached queries:
  • API keys - 5 minute TTL
  • Projects - 5 minute TTL
  • Organizations - 5 minute TTL
  • Provider keys - 5 minute TTL
  • Custom provider keys - 5 minute TTL
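All of these follow the same look-aside pattern as findApiKeyByToken. A generic sketch of that pattern, using an in-memory Map in place of getCache/setCache (cachedQuery and loadFn are illustrative names, not gateway exports):

```typescript
const queryCache = new Map<string, { value: unknown; expiresAt: number }>();

async function cachedQuery<T>(
  key: string,
  ttlSeconds: number,
  loadFn: () => Promise<T>,
): Promise<T> {
  const entry = queryCache.get(key);
  if (entry && entry.expiresAt > Date.now()) {
    return entry.value as T; // hit: skip the database entirely
  }
  const value = await loadFn();
  if (value != null) {
    // Only cache non-empty results, so a missing row isn't pinned for the TTL
    queryCache.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
  return value;
}
```

For example, a project lookup would become `cachedQuery(`project:${id}`, 300, () => loadProject(id))`.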

When Caching Helps

  • Repeated questions - Users asking the same question multiple times
  • Static content - Generating content that doesn’t change often
  • Documentation - Answering common documentation questions
  • Autocomplete - Code completion with identical contexts

When Caching Doesn’t Help

  • Unique requests - Every request is different (user-specific data)
  • Time-sensitive content - Responses need to be fresh (news, weather)
  • Creative tasks - High temperature generates different outputs
  • Streaming required - Client needs real-time streaming, not cached chunks

Detecting Cache Hits

Cache hits are indicated in the response:
{
  "id": "chatcmpl-xyz",
  "cached": true,
  "usage": {
    "prompt_tokens": 150,
    "completion_tokens": 50,
    "total_tokens": 200
  },
  "metadata": {
    "used_provider": "openai",
    "used_model": "gpt-4o"
  }
}
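Clients can read this field to track their own hit rate. A small illustrative helper (hitRate is not part of the gateway API):

```typescript
interface GatewayResponse {
  cached?: boolean;
}

// Fraction of responses served from cache, based on the `cached` flag above
function hitRate(responses: GatewayResponse[]): number {
  if (responses.length === 0) return 0;
  const hits = responses.filter((r) => r.cached === true).length;
  return hits / responses.length;
}
```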

Disabling Cache

Caching is enabled by default for all requests. To disable:
# Set in environment
DISABLE_CACHE=true
Caching can also be toggled per organization; the gateway checks both the environment flag and the organization setting:
packages/db/src/queries.ts
export async function isCachingEnabled(
  organizationId: string
): Promise<boolean> {
  if (process.env.DISABLE_CACHE === "true") {
    return false;
  }
  
  // Check organization settings
  const org = await db.query.organization.findFirst({
    where: { id: { eq: organizationId } }
  });
  
  return org?.cachingEnabled ?? true;
}

Cache Performance

Cache Hit Rates

Monitor cache effectiveness:
SELECT 
  DATE(created_at) as date,
  COUNT(*) as total_requests,
  SUM(CASE WHEN cached THEN 1 ELSE 0 END) as cache_hits,
  ROUND(100.0 * SUM(CASE WHEN cached THEN 1 ELSE 0 END) / COUNT(*), 2) as hit_rate
FROM logs
WHERE organization_id = 'org_abc'
GROUP BY DATE(created_at)
ORDER BY date DESC;

Latency Improvement

Typical latencies:
  Scenario               Latency
  Cache hit              < 10ms
  Cache miss (GPT-4o)    800-2000ms
  Cache miss (Claude)    1200-3000ms
  Cache miss (Gemini)    600-1500ms
Cache hits are 100-300x faster than calling the provider API.

Cost Savings

Cached responses don’t count toward your API usage:
if (cachedResponse) {
  // No provider API call = no cost
  // No token usage deducted
  return cachedResponse;
}

Example Savings

  Cache Hit Rate   Requests/Day   Cost Without Cache   Cost With Cache   Savings
  30%              10,000         $50.00               $35.00            $15.00/day
  50%              10,000         $50.00               $25.00            $25.00/day
  70%              10,000         $50.00               $15.00            $35.00/day
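These rows follow from a simple relationship: cached requests cost nothing, so daily savings equal the daily cost without caching multiplied by the hit rate. As a sketch:

```typescript
// savings = costWithoutCache * hitRate;
// costWithCache = costWithoutCache - savings
function dailySavings(hitRate: number, costWithoutCache: number): number {
  return costWithoutCache * hitRate;
}

// e.g. a 30% hit rate on $50/day of traffic saves $15/day,
// leaving $35/day - matching the first row of the table
```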

Redis Configuration

LLM Gateway uses Redis for caching:
packages/cache/src/redis.ts
import { createClient } from "redis";

export const redisClient = createClient({
  url: process.env.REDIS_URL || "redis://localhost:6379",
  socket: {
    connectTimeout: 10000,
    reconnectStrategy: (retries) => {
      if (retries > 10) {
        return new Error("Redis reconnect failed");
      }
      // Back off linearly, capped at 3 seconds between attempts
      return Math.min(retries * 100, 3000);
    },
  },
});

// Note: node-redis clients don't connect automatically - call
// `await redisClient.connect()` during startup

Self-Hosting

When self-hosting, configure Redis:
# docker-compose.yml
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru

volumes:
  redis-data:

Cache Invalidation

Cached entries expire automatically based on their TTL. To invalidate manually:
# Clear all caches
redis-cli FLUSHDB

# Clear specific cache key
redis-cli DEL "stream:abc123..."

# Clear all streaming caches
redis-cli --scan --pattern "stream:*" | xargs redis-cli DEL

Best Practices

  • Use lower temperature - Temperature = 0 ensures deterministic responses
  • Normalize inputs - Trim whitespace and normalize formatting
  • Monitor hit rates - Track cache effectiveness in analytics
  • Set appropriate TTLs - Balance freshness vs. cache hits
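The "Normalize inputs" tip can be sketched as a pre-processing step applied before a request is sent (normalizeMessages is illustrative, not a gateway helper):

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

// Trim and collapse whitespace so cosmetically different prompts
// hash to the same cache key
function normalizeMessages(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((message) => ({
    role: message.role,
    content: message.content.trim().replace(/\s+/g, " "),
  }));
}
```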

Limitations

  • Cache size - Limited by available Redis memory
  • Cache misses - Any parameter change invalidates cache
  • No partial matching - Exact match required
  • Test mode - Caching disabled when NODE_ENV=test
