The LangSmith SDK includes built-in prompt caching with automatic background refresh. This reduces latency and API calls when repeatedly using the same prompts.

How it works

The prompt cache uses an LRU (Least Recently Used) strategy with:
  • In-memory storage - Fast access without network calls
  • TTL-based expiration - Entries refresh after a configurable time
  • Background refresh - Stale entries update automatically without blocking
  • Stale-while-revalidate - Returns cached data immediately while refreshing in background
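The strategy above can be sketched as a toy LRU cache with TTL (a simplified illustration only, not the SDK's actual implementation):

```python
import time
from collections import OrderedDict

class MiniPromptCache:
    """Toy LRU cache with TTL, illustrating the strategy above.

    get() returns (value, status) where status is "hit", "miss", or "stale".
    """

    def __init__(self, max_size=100, ttl_seconds=300):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()  # key -> (value, stored_at)

    def get(self, key, fetch):
        now = time.monotonic()
        if key in self._entries:
            value, stored_at = self._entries[key]
            self._entries.move_to_end(key)  # mark as most recently used
            if self.ttl_seconds is not None and now - stored_at > self.ttl_seconds:
                # Stale entry: return it immediately; the real cache would
                # also schedule a background refresh here.
                return value, "stale"
            return value, "hit"
        value = fetch(key)  # cache miss: fetch synchronously
        self._entries[key] = (value, now)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
        return value, "miss"
```

With `max_size=2`, for example, pulling a third unique prompt evicts the least recently used entry, and an entry older than `ttl_seconds` is still returned instantly while flagged as stale.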

Basic usage

Prompt caching is enabled by default when using the LangSmith client:
from langsmith import Client

client = Client()

# First call fetches from API
prompt = client.pull_prompt("my-prompt")

# Subsequent calls return cached version (if within TTL)
prompt = client.pull_prompt("my-prompt")  # Instant - from cache
prompt = client.pull_prompt("my-prompt")  # Instant - from cache

Configuring the cache

Customize cache behavior globally:
from langsmith import configure_global_prompt_cache

# Configure before creating clients
configure_global_prompt_cache(
    max_size=200,              # Maximum entries (default: 100)
    ttl_seconds=7200,          # 2 hours before refresh (default: 300)
    refresh_interval_seconds=120  # Check for stale entries every 2 min
)

Using a custom cache instance

Create and manage your own cache:
from langsmith import Client
from langsmith.prompt_cache import PromptCache

# Create custom cache
cache = PromptCache(
    max_size=50,
    ttl_seconds=600,  # 10 minutes
    refresh_interval_seconds=60
)

# Pass to client
client = Client(prompt_cache=cache)

# Use normally
prompt = client.pull_prompt("my-prompt")

# Clean up when done
cache.shutdown()

Disabling the cache

Set max_size to 0 to disable caching:
from langsmith import configure_global_prompt_cache

# Disable globally
configure_global_prompt_cache(max_size=0)

Offline mode

Use infinite TTL for offline/disconnected environments:
from langsmith import Client
from langsmith.prompt_cache import PromptCache

# Create cache with infinite TTL (never expires)
cache = PromptCache(ttl_seconds=None)

# Pre-load prompts from file
cache.load("/path/to/prompts.json")

client = Client(prompt_cache=cache)

# Use cached prompts without network
prompt = client.pull_prompt("my-prompt")  # Works offline

Saving and loading cache

Persist the cache to disk for offline use:
from langsmith import Client

client = Client()

# Use prompts to populate cache
client.pull_prompt("prompt-1")
client.pull_prompt("prompt-2")
client.pull_prompt("prompt-3")

# Save cache to file
client.prompt_cache.dump("/path/to/cache.json")

# Later: Load cache from file
from langsmith.prompt_cache import PromptCache

cache = PromptCache(ttl_seconds=None)  # Offline mode
loaded_count = cache.load("/path/to/cache.json")
print(f"Loaded {loaded_count} prompts")

client = Client(prompt_cache=cache)

Cache metrics

Monitor cache performance:
from langsmith import Client

client = Client()

# Use the cache
client.pull_prompt("my-prompt")  # Miss
client.pull_prompt("my-prompt")  # Hit
client.pull_prompt("my-prompt")  # Hit

# Check metrics
metrics = client.prompt_cache.metrics
print(f"Hits: {metrics.hits}")
print(f"Misses: {metrics.misses}")
print(f"Hit rate: {metrics.hit_rate:.2%}")
print(f"Total requests: {metrics.total_requests}")
print(f"Refreshes: {metrics.refreshes}")
print(f"Refresh errors: {metrics.refresh_errors}")

# Reset metrics
client.prompt_cache.reset_metrics()

Invalidating entries

Manually remove entries from cache:
from langsmith import Client

client = Client()

# Invalidate specific prompt
client.prompt_cache.invalidate("owner/my-prompt:latest")

# Clear entire cache
client.prompt_cache.clear()

Background refresh behavior

When a cached entry becomes stale:
  1. The cached value is returned immediately (no blocking)
  2. A background task refreshes the entry from the API
  3. The next request gets the updated value
This “stale-while-revalidate” pattern ensures:
  • No latency increase when cache is stale
  • Always eventually consistent with latest prompt version
  • Automatic recovery from API errors (keeps stale data)
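The three steps above can be sketched with a background thread (an illustrative sketch, not the SDK's internals; note that a fetch error simply keeps the stale data, matching the last bullet):

```python
import threading
import time

class StaleWhileRevalidate:
    """Toy stale-while-revalidate cache: stale reads never block."""

    def __init__(self, fetch, ttl_seconds=300):
        self._fetch = fetch          # callable: key -> fresh value
        self._ttl = ttl_seconds
        self._data = {}              # key -> (value, stored_at)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
        if entry is None:
            # Cache miss: fetch synchronously (blocks only this once).
            value = self._fetch(key)
            with self._lock:
                self._data[key] = (value, time.monotonic())
            return value
        value, stored_at = entry
        if time.monotonic() - stored_at > self._ttl:
            # Stale: refresh in the background, return the old value now.
            threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
        return value

    def _refresh(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            return  # API error: keep serving the stale entry
        with self._lock:
            self._data[key] = (value, time.monotonic())
```

A stale `get()` returns the old value immediately and the next request (after the refresh completes) sees the updated one.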

Best practices

1. Set appropriate TTL

Balance freshness against performance:
  • Short TTL (1-5 min): Frequently changing prompts
  • Medium TTL (10-30 min): Stable prompts with occasional updates
  • Long TTL (1+ hours): Static prompts
  • Infinite TTL: Offline/development mode

2. Configure max size based on usage

Estimate the maximum number of unique prompts you use:
# If you use ~500 unique prompts
configure_global_prompt_cache(max_size=600)

3. Monitor cache metrics

Check the hit rate periodically:
if client.prompt_cache.metrics.hit_rate < 0.8:
    print("Consider increasing max_size or ttl_seconds")

4. Use dump/load for CI/CD

Pre-populate caches in deployment:
# Build time: dump cache
python scripts/cache_prompts.py --dump /app/prompts.json

# Runtime: load cache
export PROMPT_CACHE_FILE=/app/prompts.json
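The scripts/cache_prompts.py helper referenced above is not part of the SDK; a minimal version, using only the pull_prompt and dump APIs shown earlier, might look like this (the script name and the PROMPTS list are assumptions for illustration):

```python
# scripts/cache_prompts.py -- hypothetical build-time helper that warms
# the prompt cache and persists it for offline/runtime loading.
import argparse

PROMPTS = ["prompt-1", "prompt-2", "prompt-3"]  # prompts your app uses

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Warm and dump the prompt cache")
    parser.add_argument("--dump", required=True, help="path to write the cache file")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    from langsmith import Client  # imported lazily so the module parses without the SDK

    client = Client()
    for name in PROMPTS:
        client.pull_prompt(name)         # populate the in-memory cache
    client.prompt_cache.dump(args.dump)  # persist it for runtime loading

# When invoked as a script, call main() here.
```

At runtime, a startup hook can then read the file path from an environment variable and pass a pre-loaded PromptCache (with ttl_seconds=None) to the Client, as shown in the offline mode example.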
    

Async clients (Python)

For async Python clients, use AsyncPromptCache:
import asyncio
from langsmith import AsyncClient
from langsmith.prompt_cache import AsyncPromptCache, configure_global_async_prompt_cache

# Configure async cache globally
await configure_global_async_prompt_cache(
    max_size=100,
    ttl_seconds=300
)

# Or use custom instance
cache = AsyncPromptCache(max_size=100, ttl_seconds=300)
await cache.start()  # Start background refresh task

client = AsyncClient(prompt_cache=cache)

# Use normally
prompt = await client.pull_prompt("my-prompt")

# Clean up
await cache.stop()