The LangSmith SDK includes built-in prompt caching with automatic background refresh. This reduces latency and API calls when repeatedly using the same prompts.

How it works

The prompt cache uses an LRU (Least Recently Used) strategy with:
  • In-memory storage - Fast access without network calls
  • TTL-based expiration - Entries refresh after a configurable time
  • Background refresh - Stale entries update automatically without blocking
  • Stale-while-revalidate - Returns cached data immediately while refreshing in background
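The strategy above can be sketched as a toy LRU cache with TTL (a simplified illustration only, not the SDK's actual implementation):

```python
import time
from collections import OrderedDict

class MiniPromptCache:
    """Toy LRU cache with TTL, illustrating the strategy above.

    get() returns (value, status) where status is "hit", "miss", or "stale".
    """

    def __init__(self, max_size=100, ttl_seconds=300):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()  # key -> (value, stored_at)

    def get(self, key, fetch):
        now = time.monotonic()
        if key in self._entries:
            value, stored_at = self._entries[key]
            self._entries.move_to_end(key)  # mark as most recently used
            if self.ttl_seconds is not None and now - stored_at > self.ttl_seconds:
                # Stale entry: return it immediately; the real cache would
                # also schedule a background refresh here.
                return value, "stale"
            return value, "hit"
        value = fetch(key)  # cache miss: fetch synchronously
        self._entries[key] = (value, now)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
        return value, "miss"
```

With `max_size=2`, for example, pulling a third unique prompt evicts the least recently used entry, and an entry older than `ttl_seconds` is still returned instantly while flagged as stale.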

Basic usage

Prompt caching is enabled by default when using the LangSmith client:
from langsmith import Client

client = Client()

# First call fetches from API
prompt = client.pull_prompt("my-prompt")

# Subsequent calls return cached version (if within TTL)
prompt = client.pull_prompt("my-prompt")  # Instant - from cache
prompt = client.pull_prompt("my-prompt")  # Instant - from cache

Configuring the cache

Customize cache behavior globally:
from langsmith import configure_global_prompt_cache

# Configure before creating clients
configure_global_prompt_cache(
    max_size=200,              # Maximum entries (default: 100)
    ttl_seconds=7200,          # 2 hours before refresh (default: 300)
    refresh_interval_seconds=120  # Check for stale entries every 2 min
)

Using a custom cache instance

Create and manage your own cache:
from langsmith import Client
from langsmith.prompt_cache import PromptCache

# Create custom cache
cache = PromptCache(
    max_size=50,
    ttl_seconds=600,  # 10 minutes
    refresh_interval_seconds=60
)

# Pass to client
client = Client(prompt_cache=cache)

# Use normally
prompt = client.pull_prompt("my-prompt")

# Clean up when done
cache.shutdown()

Disabling the cache

Set max_size to 0 to disable caching:
from langsmith import configure_global_prompt_cache

# Disable globally
configure_global_prompt_cache(max_size=0)

Offline mode

Use infinite TTL for offline/disconnected environments:
from langsmith import Client
from langsmith.prompt_cache import PromptCache

# Create cache with infinite TTL (never expires)
cache = PromptCache(ttl_seconds=None)

# Pre-load prompts from file
cache.load("/path/to/prompts.json")

client = Client(prompt_cache=cache)

# Use cached prompts without network
prompt = client.pull_prompt("my-prompt")  # Works offline

Saving and loading cache

Persist the cache to disk for offline use:
from langsmith import Client

client = Client()

# Use prompts to populate cache
client.pull_prompt("prompt-1")
client.pull_prompt("prompt-2")
client.pull_prompt("prompt-3")

# Save cache to file
client.prompt_cache.dump("/path/to/cache.json")

# Later: Load cache from file
from langsmith.prompt_cache import PromptCache

cache = PromptCache(ttl_seconds=None)  # Offline mode
loaded_count = cache.load("/path/to/cache.json")
print(f"Loaded {loaded_count} prompts")

client = Client(prompt_cache=cache)

Cache metrics

Monitor cache performance:
from langsmith import Client

client = Client()

# Use the cache
client.pull_prompt("my-prompt")  # Miss
client.pull_prompt("my-prompt")  # Hit
client.pull_prompt("my-prompt")  # Hit

# Check metrics
metrics = client.prompt_cache.metrics
print(f"Hits: {metrics.hits}")
print(f"Misses: {metrics.misses}")
print(f"Hit rate: {metrics.hit_rate:.2%}")
print(f"Total requests: {metrics.total_requests}")
print(f"Refreshes: {metrics.refreshes}")
print(f"Refresh errors: {metrics.refresh_errors}")

# Reset metrics
client.prompt_cache.reset_metrics()

Invalidating entries

Manually remove entries from cache:
from langsmith import Client

client = Client()

# Invalidate specific prompt
client.prompt_cache.invalidate("owner/my-prompt:latest")

# Clear entire cache
client.prompt_cache.clear()

Background refresh behavior

When a cached entry becomes stale:
  1. The cached value is returned immediately (no blocking)
  2. A background task refreshes the entry from the API
  3. The next request gets the updated value
This “stale-while-revalidate” pattern ensures:
  • No latency increase when cache is stale
  • Always eventually consistent with latest prompt version
  • Automatic recovery from API errors (keeps stale data)
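The three steps above can be sketched with a background thread (an illustrative sketch, not the SDK's internals; note that a fetch error simply keeps the stale data, matching the last bullet):

```python
import threading
import time

class StaleWhileRevalidate:
    """Toy stale-while-revalidate cache: stale reads never block."""

    def __init__(self, fetch, ttl_seconds=300):
        self._fetch = fetch          # callable: key -> fresh value
        self._ttl = ttl_seconds
        self._data = {}              # key -> (value, stored_at)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
        if entry is None:
            # Cache miss: fetch synchronously (blocks only this once).
            value = self._fetch(key)
            with self._lock:
                self._data[key] = (value, time.monotonic())
            return value
        value, stored_at = entry
        if time.monotonic() - stored_at > self._ttl:
            # Stale: refresh in the background, return the old value now.
            threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
        return value

    def _refresh(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            return  # API error: keep serving the stale entry
        with self._lock:
            self._data[key] = (value, time.monotonic())
```

A stale `get()` returns the old value immediately and the next request (after the refresh completes) sees the updated one.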

Best practices

1. Set appropriate TTL

Balance freshness against performance:
  • Short TTL (1-5 min): Frequently changing prompts
  • Medium TTL (10-30 min): Stable prompts with occasional updates
  • Long TTL (1+ hours): Static prompts
  • Infinite TTL: Offline/development mode

2. Configure max size based on usage

Estimate the maximum number of unique prompts you use:
# If you use ~500 unique prompts
configure_global_prompt_cache(max_size=600)

3. Monitor cache metrics

Check the hit rate periodically:
if client.prompt_cache.metrics.hit_rate < 0.8:
    print("Consider increasing max_size or ttl_seconds")

4. Use dump/load for CI/CD

Pre-populate caches in deployment:
# Build time: dump cache
python scripts/cache_prompts.py --dump /app/prompts.json

# Runtime: load cache
export PROMPT_CACHE_FILE=/app/prompts.json
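The scripts/cache_prompts.py helper referenced above is not part of the SDK; a minimal version, using only the pull_prompt and dump APIs shown earlier, might look like this (the script name and the PROMPTS list are assumptions for illustration):

```python
# scripts/cache_prompts.py -- hypothetical build-time helper that warms
# the prompt cache and persists it for offline/runtime loading.
import argparse

PROMPTS = ["prompt-1", "prompt-2", "prompt-3"]  # prompts your app uses

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Warm and dump the prompt cache")
    parser.add_argument("--dump", required=True, help="path to write the cache file")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    from langsmith import Client  # imported lazily so the module parses without the SDK

    client = Client()
    for name in PROMPTS:
        client.pull_prompt(name)         # populate the in-memory cache
    client.prompt_cache.dump(args.dump)  # persist it for runtime loading

# When invoked as a script, call main() here.
```

At runtime, a startup hook can then read the file path from an environment variable and pass a pre-loaded PromptCache (with ttl_seconds=None) to the Client, as shown in the offline mode example.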
    

Async clients (Python)

For async Python clients, use AsyncPromptCache:
import asyncio
from langsmith import AsyncClient
from langsmith.prompt_cache import AsyncPromptCache, configure_global_async_prompt_cache

# Configure async cache globally
await configure_global_async_prompt_cache(
    max_size=100,
    ttl_seconds=300
)

# Or use custom instance
cache = AsyncPromptCache(max_size=100, ttl_seconds=300)
await cache.start()  # Start background refresh task

client = AsyncClient(prompt_cache=cache)

# Use normally
prompt = await client.pull_prompt("my-prompt")

# Clean up
await cache.stop()