Why caching matters
LLM API calls are typically:
- Expensive (per-token pricing)
- Slow (network latency + generation time)
- Rate-limited (requests per minute)
Caching LLM responses helps by:
- Reducing API costs by 90%+ on re-runs
- Speeding up development iterations
- Avoiding rate limit issues
- Enabling incremental workflow development
How caching works
GraphRAG caches LLM responses based on:
- Model instance name - Different workflows use different cache partitions
- Input prompt - Exact prompt text including all parameters
- Model configuration - Model name and generation parameters
On each LLM call:
- GraphRAG checks the cache for a matching entry
- If found, returns the cached response immediately
- If not found, calls the LLM and stores the response
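The lookup flow above can be sketched in Python. This is illustrative only — GraphRAG's real cache is internal, and the class and method names here are assumptions:

```python
import hashlib
import json


class LLMCache:
    """Minimal sketch of response caching keyed on the model instance,
    prompt text, and generation parameters (illustrative, not GraphRAG's API)."""

    def __init__(self):
        self._store = {}

    def _key(self, model_instance_name, prompt, params):
        # Any change to the prompt or parameters yields a new key, i.e. a cache miss.
        payload = json.dumps(
            {"model": model_instance_name, "prompt": prompt, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_call(self, model_instance_name, prompt, params, call_llm):
        key = self._key(model_instance_name, prompt, params)
        if key in self._store:          # hit: return the stored response immediately
            return self._store[key]
        response = call_llm(prompt)     # miss: call the LLM...
        self._store[key] = response     # ...and store the response for next time
        return response
```

Because the key covers the prompt and parameters, an identical re-run is served entirely from the cache, while any edit to a prompt or setting transparently falls through to the LLM.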
Cache types
JSON cache (recommended)
Stores responses in JSON files, persisted across runs.

Memory cache
Stores responses in memory, lost when the process ends.

No cache
Disable caching entirely. Only disable caching if you need guaranteed fresh responses or are running in a stateless environment.
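These options map to the `cache` section of `settings.yaml`. A hedged sketch — the exact `type` values and field names vary across GraphRAG versions, so verify against the settings reference:

```yaml
cache:
  # "json" (file-backed, recommended), "memory", or "none"
  type: json
  # where cache files are written, relative to the project root
  base_dir: "cache"
```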
Cache storage backends
File storage (default)
Best for local development:
- Simple and fast
- Easy to inspect and debug
- No external dependencies
- Works offline
Tradeoffs:
- Not shareable across machines
- Requires disk space
Azure Blob Storage
Best for team collaboration:
- Share cache across team members
- Persist cache in cloud
- Automatic backup and versioning
- Scale to large datasets
Tradeoffs:
- Requires Azure subscription
- Network latency for cache lookups
- Storage costs
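With the blob backend, the `cache` section points at an Azure Storage container. The field and container names below are assumptions based on GraphRAG's configuration schema — check the settings reference for your version:

```yaml
cache:
  type: blob
  # or use a storage account URL with managed identity instead of a connection string
  connection_string: "${AZURE_STORAGE_CONNECTION_STRING}"
  container_name: "graphrag-cache"  # assumed container name
  base_dir: "cache"
```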
Azure Cosmos DB
Best for high-scale production:
- Global distribution
- High availability
- Advanced query capabilities
- Automatic indexing
Tradeoffs:
- Higher costs than Blob Storage
- More complex setup
- Overkill for most use cases
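A hedged Cosmos DB configuration sketch — account URL and container name are placeholders, and field names should be verified against the settings reference:

```yaml
cache:
  type: cosmosdb
  cosmosdb_account_url: "https://<account>.documents.azure.com:443/"  # placeholder
  container_name: "graphrag-cache"  # assumed container name
  base_dir: "cache"
```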
Cache partitioning
GraphRAG partitions the cache by model_instance_name to keep different workflow steps separate. Each model_instance_name creates a separate cache partition, allowing you to clear or preserve specific workflow caches independently.

Managing the cache
When cache is used
Cache hits occur when:
- Re-running indexing with same data and prompts
- Testing different downstream workflows
- Iterating on post-extraction processing
- Resuming interrupted indexing runs
When cache is bypassed
Cache misses occur when:
- Input data changes
- Prompts are modified
- Model configuration changes
- New documents are added
- model_instance_name changes
Clearing the cache
Clear cache when you need fresh results:
- File cache
- Blob cache
- Cosmos DB cache
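Hedged examples for each backend, assuming the default `base_dir` of `cache` and the Azure CLI; account, resource group, and container names are placeholders to adjust for your setup:

```shell
# File cache: remove the local cache directory (recreated on the next run)
rm -rf ./cache

# Blob cache (assumed account/container names):
# az storage blob delete-batch --account-name <account> --source graphrag-cache

# Cosmos DB cache (assumed account/database/container names):
# az cosmosdb sql container delete --account-name <account> \
#   --resource-group <rg> --database-name graphrag --name graphrag-cache
```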
Cache optimization strategies
Strategy 1: Persistent cache for development
Use file-based JSON cache during development:
- Fast local access
- Survive process restarts
- Easy to inspect and debug
Strategy 2: Shared cache for teams
Use Azure Blob Storage for team collaboration:
- Share cache across team members
- Reduce duplicate API calls
- Save collective API costs
Strategy 3: Separate caches per experiment
Use different cache directories for different experiments:
- Compare different approaches
- Preserve baseline results
- Easy rollback to previous configs
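One way to do this with the file-backed cache is a per-experiment `base_dir`; the file name and paths below are illustrative:

```yaml
# settings.experiment-a.yaml
cache:
  type: json
  base_dir: "cache/experiment-a"
```

Running each experiment against its own configuration keeps baseline responses cached while new variants populate their own partition.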
Strategy 4: Disable cache for production
Disable caching in production if you need guaranteed fresh results:
- No stale data
- Predictable behavior
- No cache management overhead
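Disabling is a one-line change in `settings.yaml` (same hedged schema caveat as above):

```yaml
cache:
  type: none
```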
Cost considerations
Cache effectiveness varies by workflow:

Entity extraction (extract_graph)
High cache value:
- Most expensive operation
- Processes every text chunk through LLM
- Multiple gleaning passes increase costs
- Cache saves 80-95% of extraction costs on re-runs
Description summarization (summarize_descriptions)
Medium cache value:
- Moderate costs
- Summarizes entity descriptions
- Fewer calls than extraction
- Cache saves 70-90% on re-runs
Community reports (community_reports)
Medium cache value:
- Report generation costs
- Generates one report per community
- Can be expensive for large graphs
- Cache saves 80-95% on re-runs
Claim extraction (extract_claims)
High cache value (if enabled):
- Similar to entity extraction
- Similar to entity extraction
- Disabled by default
- Cache essential when tuning claim prompts
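As a rough illustration of the savings ranges above — the dollar amount and hit rate here are assumptions, not measurements:

```python
def rerun_cost(full_cost, cache_hit_rate):
    """API cost of a re-run when a fraction of LLM calls hit the cache."""
    return full_cost * (1.0 - cache_hit_rate)

# Assumed numbers: a $100 extraction pass, re-run with a 90% cache hit rate
print(round(rerun_cost(100.0, 0.90), 2))  # -> 10.0
```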
Example: Development workflow
Optimal caching strategy for development:

Iterate on prompts
- Modify prompts in the prompts/ directory
- Clear the specific cache partition
- Re-run indexing - only re-extracts, reuses the downstream cache

Tune downstream workflows
- Modify chunking, clustering, or other settings
- Keep the extraction cache intact
- Re-run indexing - reuses the expensive extraction cache
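A hedged sketch of the prompt-iteration loop, assuming the default file cache and an extraction partition directory named `extract_graph` (your partition names follow your configured model_instance_name values):

```shell
# 1. Edit prompts in the prompts/ directory, then clear only the extraction partition
rm -rf ./cache/extract_graph

# 2. Re-run indexing: extraction re-runs, downstream cached responses are reused
# graphrag index --root .
```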
Troubleshooting
Cache not being used
Prompt changes
Even small prompt modifications invalidate cache. Ensure prompts are identical for cache hits.
Model configuration changes
Changing model, temperature, or other parameters bypasses the cache.

Input data changes
Modified source documents generate different prompts, missing cache.
Cache path issues
Verify cache directory exists and has write permissions.
Cache growing too large
Clear old experiments
Remove cache directories for completed experiments:
Use selective caching
Only cache expensive operations:
Compress old cache
Archive and compress inactive cache:
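For example, with the default file cache (paths and the archive name are placeholders):

```shell
# Archive an inactive cache directory, then delete the originals to free disk space
if [ -d ./cache ]; then
  tar czf cache-archive.tar.gz ./cache && rm -rf ./cache
fi
```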
Best practices
Next steps
- Storage - Learn about other storage configurations
- Settings reference - Complete configuration options
- LLM models - Configure language models
- Prompt tuning - Optimize prompts with caching