Before you start
Initial testing strategy
Prompt tuning
When to tune prompts
- New domain: Medical, legal, scientific, business data
- Specialized terminology: Industry-specific jargon or concepts
- Non-English content: Different language or mixed languages
- Specific entity types: You know what entities matter for your use case
Tuning workflow
Review generated prompts
Check `./project/prompts/` for:
- Entity types discovered
- Example extractions
- Domain-specific language
Customize if needed
Edit prompts to:
- Add missing entity types
- Adjust extraction instructions
- Improve examples
Prompt tuning parameters
Selection methods
Random (default):
- Fast and simple
- Good for uniform datasets
- Use with `--limit 15-20`

Top:
- Uses first N documents
- Good when documents are pre-sorted
- Use with `--limit 15-20`

Auto:
- Uses k-means clustering
- Selects representative documents
- Use with `--n-subset-max 300` and `--k 15`
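The auto selection settings above map onto the `graphrag prompt-tune` CLI roughly as follows. This is a sketch only; flag names can vary between GraphRAG versions, so check `graphrag prompt-tune --help`.

```shell
# Sketch: auto selection via k-means over a capped document subset
graphrag prompt-tune --root ./project \
  --selection-method auto \
  --n-subset-max 300 \
  --k 15
```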
Domain specification
Be specific but not overly narrow.

✓ Good:
- “medical research papers”
- “corporate financial reports”
- “legal contracts and agreements”

✗ Too broad:
- “science”
- “business”

✗ Too narrow:
- “phase 3 clinical trials for oncology drugs”
Language settings

Specify the primary language of your content. For multilingual datasets, choose the dominant language.
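Domain and language are both passed on the `graphrag prompt-tune` command line. A hedged example invocation (flag names may differ slightly by version):

```shell
graphrag prompt-tune --root ./project \
  --domain "corporate financial reports" \
  --language english \
  --limit 15
```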
Configuration optimization
Model selection
Choose models based on your requirements: development, production, or high-end.

Development:
- Goal: Fast iteration, low cost
- Cost: ~$0.05-0.15 per 1000 documents
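Model choice lives under the LLM block of `settings.yaml`. A sketch only; key names differ across GraphRAG versions, so verify against the file that `graphrag init` generated:

```yaml
llm:
  type: openai_chat
  model: gpt-4o-mini   # development: fast iteration, low cost
  # model: gpt-4o      # production: higher quality, higher cost
```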
Chunking configuration
Optimize chunking for your document structure in `settings.yaml`:
| Document Type | Chunk Size | Overlap | Rationale |
|---|---|---|---|
| Short articles | 200 | 50 | Preserve complete thoughts |
| Long reports | 300-400 | 100 | Balance context and granularity |
| Technical docs | 400-500 | 100-150 | Keep technical concepts together |
| Transcripts | 300 | 100 | Natural conversation flow |
| Legal documents | 500 | 150 | Maintain clause integrity |
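A chunking block matching the table above might look like this in `settings.yaml` (assuming the default schema; verify key names for your version):

```yaml
chunks:
  size: 300     # tokens per chunk (long reports, per the table above)
  overlap: 100  # tokens shared between consecutive chunks
```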
Entity extraction settings
Configure the number of extraction passes (gleanings) in `settings.yaml`:
- 0: Fastest, cheapest, lower recall
- 1: Recommended balance (default)
- 2+: Highest quality, expensive, diminishing returns
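As a sketch (key names may vary by version), the gleaning count is set like this:

```yaml
entity_extraction:
  max_gleanings: 1  # extra extraction passes after the first; 0 disables
```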
Community detection
Set the community report length in `settings.yaml`:
- 500-1000: Brief summaries, lower cost
- 1500: Recommended default, balanced detail
- 2000-3000: Comprehensive reports, higher cost
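A corresponding `settings.yaml` fragment (hedged; check your version's schema):

```yaml
community_reports:
  max_length: 1500  # target tokens per community summary (default)
```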
Rate limiting
Set appropriate rate limits in `settings.yaml` to avoid throttling:
Limits differ between OpenAI and Azure OpenAI; check your provider's quota.

OpenAI free tier:
- 3 RPM, 40,000 TPM (GPT-4)
- 5 RPM, 100,000 TPM (GPT-3.5)

Higher OpenAI tiers:
- 500 RPM, 80,000 TPM (GPT-4o)
- 3,500 RPM, 200,000 TPM (GPT-3.5)
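For example, staying at roughly 90% of a 500 RPM / 80,000 TPM quota could look like this (key names are version-dependent; treat as a sketch):

```yaml
llm:
  requests_per_minute: 450   # ~90% of 500 RPM
  tokens_per_minute: 72000   # ~90% of 80,000 TPM
  concurrent_requests: 25
```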
Cost management
Estimate costs before indexing
Run a test with a small sample and extrapolate.

Cost reduction strategies
Enable caching
Prevents redundant LLM calls during re-indexing
Larger chunks
Fewer chunks = fewer LLM calls
Reduce gleanings
Each pass costs more
Use cheaper models
For development and testing
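The estimate-then-extrapolate approach above can be sketched in a few lines (the sample figures below are placeholders, not measured costs):

```python
def estimate_full_cost(sample_docs: int, sample_cost_usd: float,
                       total_docs: int) -> float:
    """Linearly extrapolate indexing cost from a small sample run.

    Assumes cost scales with document count, which holds when
    documents are similar in length.
    """
    return sample_cost_usd / sample_docs * total_docs

# Example: a 10-document test run that cost $0.25
print(estimate_full_cost(10, 0.25, 1_000))  # 25.0
```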
Cost tracking
Monitor spending:
- OpenAI: Check usage at platform.openai.com/usage
- Azure: Monitor costs in Azure Portal → Cost Management
- Local logs: Track token counts in GraphRAG logs
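Caching, the first cost-reduction strategy above, is configured in `settings.yaml`. A minimal sketch, assuming the file-based cache type:

```yaml
cache:
  type: file
  base_dir: cache
```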
Query optimization
Choose the right search method
- Global Search
- Local Search
- DRIFT Search
- Basic Search
Global Search is best for:
- Dataset-wide questions
- Theme identification
- Summarization
- Trend analysis

Example queries:
- “What are the main themes?”
- “Summarize the key findings”
- “What trends appear across documents?”
Community level selection
- Level 0: Entire dataset (very broad, expensive)
- Level 1: Major themes (broad summaries)
- Level 2: Recommended default (balanced granularity)
- Level 3+: Fine-grained details (more specific)
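Community level is typically passed at query time. A hedged CLI sketch (flag names may differ by GraphRAG version):

```shell
graphrag query --root ./project \
  --method global \
  --community-level 2 \
  --query "What are the main themes?"
```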
Response type optimization
Guide the format of responses (for example, a single paragraph, multiple paragraphs, or a list).

Data preparation
Document formatting
Supported formats
GraphRAG supports:
- Plain text (`.txt`)
- Markdown (`.md`)
- CSV (`.csv`)
- Other formats via custom loaders
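The input format is declared in `settings.yaml`; a sketch assuming the default file loader (check your version's schema):

```yaml
input:
  type: file
  file_type: text   # or: csv
  base_dir: input
```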
Document structure
Well-structured documents yield better results: clear headings, consistent sections, and clean text.

✗ Poor structure:
- No headings or sections
- Mixed formatting
- Excessive special characters
- Malformed text from PDF extraction
Metadata inclusion
Include relevant metadata in documents; GraphRAG can extract entities from metadata.
Data cleaning
Clean your data before indexing.

Storage and scalability
Local vs. cloud storage
Local (File) is best for:
- Development
- Small datasets (<10K documents)
- Testing

Cloud (Azure Blob) is better suited to production workloads and larger datasets.
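Switching between the two is a `settings.yaml` change. A sketch (Azure Blob additionally needs connection and container settings; key names vary by version):

```yaml
storage:
  type: file    # use: blob for Azure Blob Storage
  base_dir: output
```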
Large dataset handling
For datasets with more than 10,000 documents, prefer cloud storage (see above).

Workflow best practices
Development workflow
Version control
Track your GraphRAG configuration (e.g. `settings.yaml` and tuned prompts) in version control; keep API keys and caches out of the repository.

Monitoring and debugging
Enable verbose logging during development and monitor:
- Token usage
- API errors
- Extraction quality
- Processing time
Common pitfalls
Not running prompt tuning

Problem: Generic prompts produce poor extractions
Solution: Always run `graphrag prompt-tune` for domain-specific data

Skipping small-scale testing

Problem: Expensive mistakes on full dataset
Solution: Test with 5-10 documents first, validate results, then scale

Ignoring rate limits

Problem: API throttling, failed indexing
Solution: Configure rate limits to 90% of your quota

Disabling cache

Problem: Redundant API calls cost money
Solution: Always enable caching for development

Using wrong search method

Problem: Poor query results
Solution: Match search method to query type (see query optimization)

Poor document quality

Problem: Garbage in, garbage out
Solution: Clean and structure documents before indexing
Performance benchmarks
Typical indexing performance:

| Documents | Model | Chunks | Time | Cost |
|---|---|---|---|---|
| 100 | GPT-4o | ~2,000 | 15-30 min | $1-3 |
| 1,000 | GPT-4o | ~20,000 | 2-4 hours | $10-30 |
| 10,000 | GPT-4o | ~200,000 | 20-40 hours | $100-300 |
Next steps
CLI usage
Master the command-line interface
Configuration
Deep dive into settings
Prompt tuning
Optimize prompts for your domain
Migration guide
Upgrade between versions