LLM Caching

When developing and testing LLM applications, you often make the same requests repeatedly during debugging and iteration. Helicone caching stores complete responses on Cloudflare’s edge network, eliminating redundant API calls and reducing both latency and costs.

Looking for provider-level caching? Helicone also supports prompt caching directly on provider servers (OpenAI, Anthropic, etc.) for reduced token costs.

Why Helicone Caching

Save Money

Avoid repeated charges for identical requests while testing and debugging

Instant Responses

Serve cached responses immediately instead of waiting for LLM providers

Handle Traffic Spikes

Protect against rate limits and maintain performance during high usage

How It Works

Helicone’s caching system stores LLM responses on Cloudflare’s edge network, providing globally distributed, low-latency access to cached data.

Cache Key Generation

Helicone generates unique cache keys by hashing:

Cache seed - Optional namespace identifier (if specified)
Request URL - The full endpoint URL
Request body - Complete request payload including all parameters
Relevant headers - Authorization and cache-specific headers
Bucket index - For multi-response caching

Any change in these components creates a new cache entry:

// ✅ Cache hit - identical requests
const request1 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] };
const request2 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] };

// ❌ Cache miss - different content  
const request3 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hi" }] };

// ❌ Cache miss - different parameters
const request4 = { model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }], temperature: 0.5 };
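To make the hashing idea concrete, here is a minimal sketch of how a key could be derived from the components listed above. It is an illustration only, not Helicone's actual implementation: the `CacheKeyParts` type, the `buildCacheKey` function, the choice of SHA-256, and the example URL path are all assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape for the key components; not Helicone's internal type.
interface CacheKeyParts {
  seed?: string;
  url: string;
  body: string;
  headers: Record<string, string>;
  bucketIndex: number;
}

// Hash every component into one hex digest. Changing any field changes the key.
function buildCacheKey(parts: CacheKeyParts): string {
  // Sort header names so insertion order does not affect the key.
  const headerPairs = Object.keys(parts.headers)
    .sort()
    .map((name) => [name.toLowerCase(), parts.headers[name]]);
  const material = JSON.stringify([
    parts.seed ?? "",
    parts.url,
    parts.body,
    headerPairs,
    parts.bucketIndex,
  ]);
  return createHash("sha256").update(material).digest("hex");
}

// Illustrative values; the URL path is an assumption.
const base = {
  url: "https://ai-gateway.helicone.ai/chat/completions",
  headers: { "Helicone-Cache-Enabled": "true" },
  bucketIndex: 0,
};
const hello = JSON.stringify({ model: "gpt-4o-mini", messages: [{ role: "user", content: "Hello" }] });
const hi = JSON.stringify({ model: "gpt-4o-mini", messages: [{ role: "user", content: "Hi" }] });

const keyA = buildCacheKey({ ...base, body: hello });
const keyB = buildCacheKey({ ...base, body: hello });
const keyC = buildCacheKey({ ...base, body: hi });

console.log(keyA === keyB); // true
console.log(keyA === keyC); // false
```

In this sketch the body is hashed as a raw string, so two payloads that differ only in property order would produce different keys.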

Cache Storage

Responses are stored in Cloudflare Workers KV (key-value store)
Distributed across 300+ global edge locations
Automatic replication and failover
No impact on your infrastructure

Quick Start

Enable caching

Add the Helicone-Cache-Enabled header to your requests:

{
  "Helicone-Cache-Enabled": "true"
}

Make your request

Execute your LLM request - the first call will be cached:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

const response = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true"
    }
  }
);

Verify caching works

Make the same request again - it should return instantly from cache:

// This exact same request will return a cached response
const cachedResponse = await client.chat.completions.create(
  {
    model: "gpt-4o-mini", 
    messages: [{ role: "user", content: "Hello world" }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true"
    }
  }
);

Configuration

Helicone-Cache-Enabled

string

required

Enable or disable caching for the request.
Example: "true" to enable caching

Cache-Control

string

Set cache duration using standard HTTP cache control directives.
Default: "max-age=604800" (7 days)
Example: "max-age=3600" for 1 hour cache

Helicone-Cache-Bucket-Max-Size

string

Number of different responses to store for the same request. Useful for non-deterministic prompts.
Default: "1" (single response cached)
Example: "3" to cache up to 3 different responses

Helicone-Cache-Seed

string

Create separate cache namespaces for different users or contexts.
Example: "user-123" to maintain user-specific cache

Helicone-Cache-Ignore-Keys

string

Comma-separated JSON keys to exclude from cache key generation.
Example: "request_id,timestamp" to ignore these fields when generating cache keys

All header values must be strings. For example, "Helicone-Cache-Bucket-Max-Size": "10".
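One way to keep every value a string is to build the headers through a small helper. The sketch below is a convenience wrapper, not part of any Helicone SDK; `CacheOptions` and `buildCacheHeaders` are hypothetical names.

```typescript
// Hypothetical options object; each field maps to one header described above.
interface CacheOptions {
  maxAgeSeconds?: number;
  bucketMaxSize?: number;
  seed?: string;
  ignoreKeys?: string[];
}

// Convert typed options into string-valued headers, always enabling caching.
function buildCacheHeaders(options: CacheOptions = {}): Record<string, string> {
  const headers: Record<string, string> = { "Helicone-Cache-Enabled": "true" };
  if (options.maxAgeSeconds !== undefined) {
    headers["Cache-Control"] = `max-age=${options.maxAgeSeconds}`;
  }
  if (options.bucketMaxSize !== undefined) {
    headers["Helicone-Cache-Bucket-Max-Size"] = String(options.bucketMaxSize);
  }
  if (options.seed !== undefined) {
    headers["Helicone-Cache-Seed"] = options.seed;
  }
  if (options.ignoreKeys !== undefined && options.ignoreKeys.length > 0) {
    headers["Helicone-Cache-Ignore-Keys"] = options.ignoreKeys.join(",");
  }
  return headers;
}

const cacheHeaders = buildCacheHeaders({
  maxAgeSeconds: 3600,
  bucketMaxSize: 3,
  seed: "user-123",
  ignoreKeys: ["request_id", "timestamp"],
});
console.log(cacheHeaders);
```

The returned object can be passed as `headers` on a single request or as `defaultHeaders` when constructing the client.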

Examples

Development Testing

Avoid repeated charges while debugging and iterating on prompts:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=86400" // Cache for 1 day during development
  },
});

// This request will be cached - works with any model
const response = await client.chat.completions.create({
  model: "gpt-4o-mini",  // or "claude-3.5-sonnet", "gemini-2.5-flash", etc.
  messages: [{ role: "user", content: "Explain quantum computing" }]
});

// Subsequent identical requests return cached response instantly

User-Specific Caching

Cache responses separately for different users or contexts:

const userId = "user-123";

const response = await client.chat.completions.create(
  {
    model: "claude-3.5-sonnet",
    messages: [{ 
      role: "user", 
      content: "What are my account settings?" 
    }]
  },
  {
    headers: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Seed": userId,           // User-specific cache
      "Cache-Control": "max-age=3600"          // Cache for 1 hour
    }
  }
);

// Each user gets their own cached responses

Helicone Dashboard showing the number of cache hits, cost, and time saved.

Understanding Caching

Cache Response Headers

Check cache status by examining response headers:

// Use .withResponse() to get both the parsed completion and the raw HTTP response
const { data: chatCompletion, response: rawResponse } = await client.chat.completions
  .create(
    { /* your request */ },
    {
      headers: { "Helicone-Cache-Enabled": "true" }
    }
  )
  .withResponse();

const cacheStatus = rawResponse.headers.get("Helicone-Cache");
console.log(cacheStatus); // "HIT" or "MISS"

const bucketIndex = rawResponse.headers.get("Helicone-Cache-Bucket-Idx");
console.log(bucketIndex); // Index of cached response used
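If you check these headers in several places, a small parser keeps the logic in one spot. The following is a sketch under a few assumptions: `HeaderReader`, `readCacheStatus`, and `stubHeaders` are hypothetical names, and a missing `Helicone-Cache` header is treated as a miss. The stub stands in for the headers of a real response so the example runs without a network call.

```typescript
// Minimal structural type so the helper accepts a fetch Headers object or a test stub.
interface HeaderReader {
  get(name: string): string | null;
}

interface CacheStatus {
  hit: boolean;
  bucketIndex: number | null;
}

// Interpret the two cache response headers; anything other than "HIT" counts as a miss.
function readCacheStatus(headers: HeaderReader): CacheStatus {
  const status = headers.get("Helicone-Cache");
  const rawIndex = headers.get("Helicone-Cache-Bucket-Idx");
  const parsed = rawIndex === null ? NaN : Number.parseInt(rawIndex, 10);
  return {
    hit: status !== null && status.toUpperCase() === "HIT",
    bucketIndex: Number.isNaN(parsed) ? null : parsed,
  };
}

// Case-insensitive stub that mimics how HTTP header lookups behave.
function stubHeaders(values: Record<string, string>): HeaderReader {
  return { get: (name) => values[name.toLowerCase()] ?? null };
}

const hitHeaders = stubHeaders({ "helicone-cache": "HIT", "helicone-cache-bucket-idx": "2" });
console.log(readCacheStatus(hitHeaders)); // { hit: true, bucketIndex: 2 }
```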

Cache Duration

Set how long responses stay cached using the Cache-Control header:

{
  "Cache-Control": "max-age=3600"  // 1 hour
}

Common durations:

1 hour: max-age=3600
1 day: max-age=86400
7 days: max-age=604800 (default)
30 days: max-age=2592000

Maximum cache duration is 365 days (max-age=31536000)
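Rather than hard-coding second counts, you can compute the header value. This is a sketch, assuming a hypothetical `cacheControl` helper; clamping to the 365-day maximum on the client side is a design choice made here, not documented Helicone behavior.

```typescript
const SECONDS_PER_HOUR = 3600;
const SECONDS_PER_DAY = 86400;
const MAX_CACHE_SECONDS = 365 * SECONDS_PER_DAY; // 31536000, the maximum stated above

// Build a Cache-Control value from days and/or hours, clamped to the maximum.
function cacheControl(duration: { days?: number; hours?: number }): string {
  const requested =
    (duration.days ?? 0) * SECONDS_PER_DAY + (duration.hours ?? 0) * SECONDS_PER_HOUR;
  if (!Number.isFinite(requested) || requested <= 0) {
    throw new RangeError("cache duration must be a positive amount of time");
  }
  const seconds = Math.min(Math.floor(requested), MAX_CACHE_SECONDS);
  return `max-age=${seconds}`;
}

console.log(cacheControl({ hours: 1 }));  // max-age=3600
console.log(cacheControl({ days: 7 }));   // max-age=604800
console.log(cacheControl({ days: 400 })); // max-age=31536000
```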

Cache Buckets

Control how many different responses are stored for the same request:

{
  "Helicone-Cache-Bucket-Max-Size": "3"
}

With a bucket size of 3, the first three identical requests each go to the provider and fill the bucket. After that, the same request returns one of the 3 cached responses, chosen at random:

openai.completion("give me a random number") -> "42"  # Cache Miss
openai.completion("give me a random number") -> "47"  # Cache Miss  
openai.completion("give me a random number") -> "17"  # Cache Miss

openai.completion("give me a random number") -> "42" | "47" | "17"  # Cache Hit

Behavior by bucket size:

Size 1 (default): Same request always returns same cached response (deterministic)
Size > 1: Same request can return different cached responses (useful for creative prompts)
Response chosen randomly from bucket

Maximum bucket size is 20. Enterprise plans support larger buckets.
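The bucket behavior can be modeled with a small in-memory class. This is a simplified simulation of the behavior described in this section, not Helicone's implementation; `BucketCache`, `lookup`, and the injectable `random` source are hypothetical and exist only to make the demo repeatable.

```typescript
type Lookup = { status: "HIT" | "MISS"; response: string };

// In-memory model: each key keeps up to maxSize responses.
class BucketCache {
  private readonly buckets = new Map<string, string[]>();
  private readonly maxSize: number;
  private readonly random: () => number;

  constructor(maxSize: number, random: () => number = Math.random) {
    this.maxSize = maxSize;
    this.random = random;
  }

  // While the bucket has room, fetch a fresh response (MISS) and store it.
  // Once it is full, return a randomly chosen stored response (HIT).
  lookup(key: string, fetchFresh: () => string): Lookup {
    const bucket = this.buckets.get(key) ?? [];
    if (bucket.length < this.maxSize) {
      const response = fetchFresh();
      bucket.push(response);
      this.buckets.set(key, bucket);
      return { status: "MISS", response };
    }
    const index = Math.floor(this.random() * bucket.length);
    return { status: "HIT", response: bucket[index] };
  }
}

// Fixed "random" source (always 0) so the demo output is repeatable.
const demoCache = new BucketCache(3, () => 0);
const freshAnswers = ["42", "47", "17"];
let providerCalls = 0;
for (let i = 0; i < 4; i++) {
  const result = demoCache.lookup("give me a random number", () => freshAnswers[providerCalls++]);
  console.log(result.status, result.response);
}
// MISS 42
// MISS 47
// MISS 17
// HIT 42
```

With the default `Math.random` source, the fourth call would return any one of the three stored answers.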

Cache Seeds

Create separate cache namespaces using seeds:

{
  "Helicone-Cache-Seed": "user-123"
}

Different seeds maintain separate cache states:

# Seed: "user-123"
openai.completion("random number") -> "42"
openai.completion("random number") -> "42"  # Same response

# Seed: "user-456"  
openai.completion("random number") -> "17"  # Different response
openai.completion("random number") -> "17"  # Consistent per seed

Change the seed value to effectively clear your cache for testing.
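One simple way to apply this tip is to put a version number in the seed and bump it whenever you want a fresh cache. The sketch below assumes hypothetical `versionedSeed` and `seedHeaders` helpers and an arbitrary `namespace:vN` naming scheme; any distinct string works as a seed.

```typescript
// Combine a namespace (for example a user ID) with a version counter.
function versionedSeed(namespace: string, version: number): string {
  return `${namespace}:v${version}`;
}

// Build the cache headers for a given namespace and version.
function seedHeaders(namespace: string, version: number): Record<string, string> {
  return {
    "Helicone-Cache-Enabled": "true",
    "Helicone-Cache-Seed": versionedSeed(namespace, version),
  };
}

console.log(seedHeaders("user-123", 1)["Helicone-Cache-Seed"]); // user-123:v1
console.log(seedHeaders("user-123", 2)["Helicone-Cache-Seed"]); // user-123:v2
```

Requests sent with version 2 no longer match entries stored under version 1, so they behave as if the cache were empty. The old entries are not deleted by this; they simply stop being matched.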

Cost Savings

Caching can dramatically reduce your LLM costs:

Example Savings

Development

Scenario: Testing a feature 100 times during development

Without caching: 100 requests × $0.002 = $0.20
With caching: 1 request × $0.002 + 99 cached = $0.002
Savings: 99%
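The arithmetic above generalizes to any request mix. Here is a small sketch, assuming a hypothetical `estimateSavings` function, a flat per-request provider cost, and that every repeated request is served from cache at no provider charge.

```typescript
interface SavingsEstimate {
  costWithoutCache: number;
  costWithCache: number;
  savingsPercent: number;
}

// Only unique requests reach the provider; repeats are assumed to be cache hits.
function estimateSavings(
  totalRequests: number,
  uniqueRequests: number,
  costPerRequest: number,
): SavingsEstimate {
  const costWithoutCache = totalRequests * costPerRequest;
  const costWithCache = uniqueRequests * costPerRequest;
  const savingsPercent =
    totalRequests === 0 ? 0 : (1 - uniqueRequests / totalRequests) * 100;
  return { costWithoutCache, costWithCache, savingsPercent };
}

const estimate = estimateSavings(100, 1, 0.002);
console.log(estimate.costWithoutCache.toFixed(3)); // 0.200
console.log(estimate.costWithCache.toFixed(3));    // 0.002
console.log(estimate.savingsPercent.toFixed(0));   // 99
```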

Monitoring Savings

Track your cache performance in the Helicone dashboard:

Cache hit rate - Percentage of requests served from cache
Cost savings - Total dollars saved from cached responses
Time saved - Cumulative latency reduction
Top cached requests - Most frequently cached queries
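If you also want a quick local check alongside the dashboard, the hit rate can be computed from the `Helicone-Cache` header values you collect from responses. This is a sketch with a hypothetical `cacheHitRate` helper; a missing header (`null`) is counted as a miss here, which is an assumption.

```typescript
// Fraction of recorded responses whose Helicone-Cache header was "HIT".
function cacheHitRate(statuses: Array<string | null>): number {
  if (statuses.length === 0) {
    return 0;
  }
  const hits = statuses.filter(
    (status) => status !== null && status.toUpperCase() === "HIT",
  ).length;
  return hits / statuses.length;
}

console.log(cacheHitRate(["MISS", "HIT", "HIT", "HIT"])); // 0.75
```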

Related Features

Prompt Caching

Cache prompts on provider servers for reduced token costs and faster processing

Custom Properties

Add metadata to cached requests for better filtering and analysis

Rate Limiting

Control request frequency and combine with caching for cost optimization

User Metrics

Track cache hit rates and savings per user or application

Questions?

If you have any questions or need help, please reach out to us:

Join our Discord community
Email us at [email protected]
Check out our GitHub repository
