Prompt caching allows LLM providers to cache parts of your prompt that don’t change between requests, significantly reducing costs and latency for repeated queries.
How Prompt Caching Works
When you send a request to an LLM, providers like Anthropic can cache specific message segments. On subsequent requests with the same cached content, you:
- Pay reduced rates for cached tokens (often 90% cheaper)
- Experience faster response times (cached content is pre-processed)
- Reduce computational overhead
Providers supporting prompt caching:
- Anthropic: Claude models with cache_control metadata
- OpenAI: GPT models with automatic caching
- Google AI: Gemini models with caching support
- Vertex AI: Google Cloud models with caching
AWS Bedrock does not currently report cached token usage, though some underlying models may use caching internally.
Enabling Prompt Caching in BAML
Caching is configured using role metadata in your BAML prompts. Here’s how to enable it:
Step 1: Configure Your Client
Add allowed_role_metadata to your client configuration:
```baml
client<llm> AnthropicClient {
  provider "anthropic"
  options {
    model "claude-sonnet-4-5-20250929"
    allowed_role_metadata ["cache_control"]
  }
}
```
Step 2: Add Cache Control to Messages
Use the _.role() function to add cache control metadata:
```baml
function AnalyzeBook(book: string, question: string) -> string {
  client AnthropicClient
  prompt #"
    {{ _.role("user") }}
    {{ book }}
    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    {{ question }}
  "#
}
```
This creates two user messages:
- First message: Contains the book content (will be cached)
- Second message: Contains the question (marked with cache control)
The cache control metadata tells the provider that everything up to this point should be cached. Place it after the content you want to cache.
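On the wire this corresponds to an Anthropic Messages API request with cache_control attached to a content block. A rough sketch of the payload the prompt above maps to (the exact JSON BAML emits may differ in detail):

```python
# Sketch of the Anthropic Messages API payload for the prompt above.
# Placeholder strings stand in for the real template values.
book = "<full book text>"
question = "<question about the book>"

payload = {
    "model": "claude-sonnet-4-5-20250929",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": book}]},
        {
            "role": "user",
            "content": [{
                "type": "text",
                "text": question,
                # Cache breakpoint: the provider caches the prompt prefix
                # ending here on the first request and reuses it afterwards.
                "cache_control": {"type": "ephemeral"},
            }],
        },
    ],
}
```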
Real-World Examples
Analyzing Large Documents
Cache document content while varying questions:
```baml
class Analysis {
  themes string[]
  characters string[]
  summary string
}

function AnalyzeLiterature(document: string, focus: string) -> Analysis {
  client AnthropicClient
  prompt #"
    {{ _.role("user") }}
    Here is the complete text to analyze:
    {{ document }}
    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    Focus your analysis on: {{ focus }}
    {{ ctx.output_format }}
  "#
}
```
Usage:
```python
from baml_client import b

# First call - caches document
result1 = await b.AnalyzeLiterature(
    document=pride_and_prejudice_text,
    focus="character development"
)

# Second call - reuses cached document
result2 = await b.AnalyzeLiterature(
    document=pride_and_prejudice_text,  # Same document
    focus="social commentary"           # Different focus
)
```
Multi-Turn Conversations with Context
Cache conversation history and system context:
```baml
class Message {
  role string
  content string
}

class Response {
  answer string
  confidence float
}

function ChatWithContext(
  system_context: string,
  conversation_history: Message[],
  user_message: string
) -> Response {
  client AnthropicClient
  prompt #"
    {{ _.role("system") }}
    {{ system_context }}
    {{ _.role("user") }}
    Previous conversation:
    {% for msg in conversation_history %}
    {{ msg.role }}: {{ msg.content }}
    {% endfor %}
    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    {{ user_message }}
    {{ ctx.output_format }}
  "#
}
```
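One reason this layout caches well: the conversation history is append-only, so each turn's rendered prompt is a prefix of the next turn's, which is exactly the property prefix-based caches exploit. A plain-Python illustration (no BAML involved; render_history is a stand-in for the Jinja loop above):

```python
# Illustrates why append-only history is cache-friendly: each turn's rendered
# prompt is a strict prefix of the next turn's, so cached prefixes stay valid.
def render_history(history: list[dict]) -> str:
    return "".join(f"{m['role']}: {m['content']}\n" for m in history)

history = [{"role": "user", "content": "Hi"}]
turn1 = render_history(history)

history.append({"role": "assistant", "content": "Hello! How can I help?"})
history.append({"role": "user", "content": "Summarize our chat."})
turn2 = render_history(history)

# The old prefix is unchanged, so a prefix cache can reuse it.
assert turn2.startswith(turn1)
```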
RAG with Cached Context
Cache retrieved documents while varying queries:
```baml
class DocumentChunk {
  title string
  content string
}

class Answer {
  response string
  sources int[] // 1-based indices of relevant chunks (loop.index starts at 1)
  confidence float
}

function AnswerFromDocs(
  docs: DocumentChunk[],
  query: string
) -> Answer {
  client AnthropicClient
  prompt #"
    {{ _.role("user") }}
    Reference Documents:
    {% for doc in docs %}
    [{{ loop.index }}] {{ doc.title }}
    {{ doc.content }}
    {% endfor %}
    {{ _.role("user", cache_control={"type": "ephemeral"}) }}
    Query: {{ query }}
    Provide an answer based on the reference documents above.
    {{ ctx.output_format }}
  "#
}
```
Monitoring Cache Performance
Use Collectors to track cache hits and savings:
```python
from baml_client import b
from baml_py import Collector

async def analyze_with_caching():
    collector = Collector(name="cache-monitor")

    # First call
    await b.AnalyzeLiterature(
        document=large_doc,
        focus="themes",
        baml_options={"collector": collector}
    )

    # Second call with same doc
    await b.AnalyzeLiterature(
        document=large_doc,
        focus="characters",
        baml_options={"collector": collector}
    )

    # Analyze cache performance
    for i, log in enumerate(collector.logs):
        print(f"\nCall {i + 1}:")
        print(f"  Input tokens: {log.usage.input_tokens}")
        print(f"  Cached tokens: {log.usage.cached_input_tokens or 0}")
        print(f"  Output tokens: {log.usage.output_tokens}")

        # Calculate savings
        if log.usage.cached_input_tokens:
            cache_pct = (log.usage.cached_input_tokens / log.usage.input_tokens) * 100
            print(f"  Cache hit rate: {cache_pct:.1f}%")
```
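To turn those token counts into dollar figures, a small helper can apply per-token rates. The function below is hypothetical (not part of baml_py), and the default rates are the Claude Sonnet 4.5 example prices used later on this page; adjust them for your model:

```python
def estimate_cost(input_tokens: int, cached_tokens: int, output_tokens: int,
                  in_rate: float = 3.00, cached_rate: float = 0.30,
                  out_rate: float = 15.00) -> float:
    """Estimate request cost in dollars; rates are $ per million tokens.

    cached_tokens is assumed to be a subset of input_tokens, matching how
    the collector snippet above computes its cache-hit percentage.
    """
    uncached = input_tokens - cached_tokens
    return (uncached * in_rate
            + cached_tokens * cached_rate
            + output_tokens * out_rate) / 1_000_000

# Example: a 50k-token document fully cached, plus 200 new input
# tokens and 200 output tokens.
cost = estimate_cost(input_tokens=50_200, cached_tokens=50_000, output_tokens=200)
print(f"Estimated cost: ${cost:.4f}")
```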
Verifying Cache Requests
Use the VSCode Playground to verify your cache configuration:
- Open your BAML function in VSCode
- Run it in the Playground
- Switch from “Prompt Review” to “Raw cURL” view
- Verify the cache_control metadata is present in the request
Example output:
```json
{
  "model": "claude-sonnet-4-5-20250929",
  "messages": [
    {
      "role": "user",
      "content": [{ "type": "text", "text": "<document content>" }]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "<question>",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    }
  ]
}
```
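If you copy the request body out of the Raw cURL view, a short script can confirm the marker survived, regardless of whether cache_control is nested at the message or content-block level (raw_body below is a stand-in for the JSON you copied):

```python
import json

def find_cache_markers(node):
    """Recursively collect every cache_control value in a JSON body."""
    found = []
    if isinstance(node, dict):
        if "cache_control" in node:
            found.append(node["cache_control"])
        for value in node.values():
            found.extend(find_cache_markers(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_cache_markers(item))
    return found

# Stand-in for the body copied from the Raw cURL view.
raw_body = '{"messages": [{"role": "user", "content": [{"type": "text", "text": "<question>", "cache_control": {"type": "ephemeral"}}]}]}'

markers = find_cache_markers(json.loads(raw_body))
assert markers == [{"type": "ephemeral"}], "no cache_control marker found"
```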
Best Practices
- Cache Large, Static Content: Cache documentation, large documents, or system prompts that don’t change
- Place Cache Markers Strategically: Put cache_control after the content you want cached, not before
- Monitor Cache Effectiveness: Use Collectors to track cache hit rates and cost savings
- Consider TTL: Anthropic’s ephemeral cache lasts ~5 minutes; plan request timing accordingly
- Balance Cache Size: Caching works best with substantial content; Anthropic only caches prompt prefixes of roughly 1,000+ tokens on Sonnet-class models
- Provider Differences: Different providers have different caching mechanisms; test thoroughly
Cost Savings Example
For Anthropic Claude Sonnet 4.5:
- Regular input tokens: $3.00 per million
- Cached input tokens: $0.30 per million (90% cheaper)
If analyzing a 50,000 token document with 10 different questions:
- Without caching: 500,000 tokens × $3.00/M = $1.50
- With caching: 50,000 × $3.00/M + 450,000 × $0.30/M = $0.15 + $0.135 = $0.285
- Savings: $1.215 (81% reduction)
Note that Anthropic also charges a roughly 25% premium on cache writes, so the first call costs closer to $0.19 than $0.15; the overall savings still approach 80%.
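The arithmetic can be checked in a few lines (this sketch uses the example prices above and, like the bullets, ignores Anthropic's cache-write premium):

```python
PRICE_INPUT = 3.00 / 1_000_000    # $ per regular input token
PRICE_CACHED = 0.30 / 1_000_000   # $ per cached input token (90% cheaper)

doc_tokens = 50_000
questions = 10

# Without caching, every question re-sends the full document at full price.
without = questions * doc_tokens * PRICE_INPUT

# With caching, only the first question pays full price for the document.
with_cache = doc_tokens * PRICE_INPUT + (questions - 1) * doc_tokens * PRICE_CACHED

savings = without - with_cache
print(f"${without:.3f} vs ${with_cache:.3f}: save ${savings:.3f} ({savings / without:.0%})")
```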
Limitations
- Cache Duration: Ephemeral caches expire after ~5 minutes of inactivity
- Provider-Specific: Implementation varies by provider
- Metadata Requirements: Must configure allowed_role_metadata for each client
- No Cross-Request Guarantees: Cache hits aren’t guaranteed across different sessions