Looking for provider-level caching? Helicone also supports prompt caching directly on provider servers (OpenAI, Anthropic, etc.) for reduced token costs.
Why Helicone Caching
Save Money
Avoid repeated charges for identical requests while testing and debugging
Instant Responses
Serve cached responses immediately instead of waiting for LLM providers
Handle Traffic Spikes
Protect against rate limits and maintain performance during high usage
How It Works
Helicone’s caching system stores LLM responses on Cloudflare’s edge network, providing globally distributed, low-latency access to cached data.Cache Key Generation
Helicone generates unique cache keys by hashing:- Cache seed - Optional namespace identifier (if specified)
- Request URL - The full endpoint URL
- Request body - Complete request payload including all parameters
- Relevant headers - Authorization and cache-specific headers
- Bucket index - For multi-response caching
Cache Storage
- Responses are stored in Cloudflare Workers KV (key-value store)
- Distributed across 300+ global edge locations
- Automatic replication and failover
- No impact on your infrastructure
Quick Start
Configuration
Enable or disable caching for the request.Example:
"true" to enable cachingSet cache duration using standard HTTP cache control directives.Default:
"max-age=604800" (7 days)Example: "max-age=3600" for 1 hour cacheNumber of different responses to store for the same request. Useful for non-deterministic prompts.Default:
"1" (single response cached)Example: "3" to cache up to 3 different responsesCreate separate cache namespaces for different users or contexts.Example:
"user-123" to maintain user-specific cacheComma-separated JSON keys to exclude from cache key generation.Example:
"request_id,timestamp" to ignore these fields when generating cache keysAll header values must be strings. For example,
"Helicone-Cache-Bucket-Max-Size": "10".Examples
- Development Testing
- User-Specific Caching
Avoid repeated charges while debugging and iterating on prompts:
Understanding Caching
Cache Response Headers
Check cache status by examining response headers:Cache Duration
Set how long responses stay cached using theCache-Control header:
- 1 hour:
max-age=3600 - 1 day:
max-age=86400 - 7 days:
max-age=604800(default) - 30 days:
max-age=2592000
Maximum cache duration is 365 days (
max-age=31536000)Cache Buckets
Control how many different responses are stored for the same request:- Size 1 (default): Same request always returns same cached response (deterministic)
- Size > 1: Same request can return different cached responses (useful for creative prompts)
- Response chosen randomly from bucket
Maximum bucket size is 20. Enterprise plans support larger buckets.
Cache Seeds
Create separate cache namespaces using seeds:Cost Savings
Caching can dramatically reduce your LLM costs:Example Savings
- Development
- FAQ Bot
- Testing Suite
Scenario: Testing a feature 100 times during development
- Without caching: 100 requests × 0.20**
- With caching: 1 request × 0.002**
- Savings: 99%
Monitoring Savings
Track your cache performance in the Helicone dashboard:- Cache hit rate - Percentage of requests served from cache
- Cost savings - Total dollars saved from cached responses
- Time saved - Cumulative latency reduction
- Top cached requests - Most frequently cached queries
Related Features
Prompt Caching
Cache prompts on provider servers for reduced token costs and faster processing
Custom Properties
Add metadata to cached requests for better filtering and analysis
Rate Limiting
Control request frequency and combine with caching for cost optimization
User Metrics
Track cache hit rates and savings per user or application
Questions?
If you have any questions or need help, please reach out to us:- Join our Discord community
- Email us at [email protected]
- Check out our GitHub repository