This guide covers proven strategies for getting the best results from GraphRAG, including configuration optimization, cost management, and workflow recommendations.

Before you start

GraphRAG indexing can be expensive. Always start small, test thoroughly, and understand costs before scaling to production datasets.

Initial testing strategy

1. Start with a small dataset

Begin with 5-10 representative documents:
# Copy a small sample to test
mkdir ./test-project/input
cp ~/documents/sample*.txt ./test-project/input/
2. Use affordable models for testing

During development, use cost-effective models:
settings.yaml
llm:
  model: gpt-3.5-turbo  # or gpt-4o-mini

embedding:
  model: text-embedding-3-small
3. Enable caching

Always enable caching to avoid redundant API calls:
settings.yaml
cache:
  type: file
  base_dir: ./cache
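The value of the cache is easiest to see in miniature. The sketch below illustrates the general idea of file-based response caching (hash the request, reuse the stored response on a repeat call); it is not GraphRAG's actual cache format, and `cached_call` is a hypothetical helper:

```python
import hashlib
import json
from pathlib import Path

def cached_call(cache_dir: str, prompt: str, call_fn):
    """Reuse a stored response for an identical prompt; otherwise call
    the API once and save the result for next time."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = cache / f"{key}.json"
    if path.exists():               # cache hit: no API call, no cost
        return json.loads(path.read_text())["response"]
    response = call_fn(prompt)      # cache miss: pay for the call
    path.write_text(json.dumps({"response": response}))
    return response
```

Re-running an index with an unchanged configuration then skips every previously answered extraction prompt.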
4. Run dry-run first

Validate configuration before indexing:
graphrag index --root ./test-project --dry-run --verbose

Prompt tuning

Always run prompt tuning before indexing your full dataset. Generic prompts rarely yield optimal results for domain-specific data.

When to tune prompts

  • New domain: Medical, legal, scientific, business data
  • Specialized terminology: Industry-specific jargon or concepts
  • Non-English content: Different language or mixed languages
  • Specific entity types: You know what entities matter for your use case

Tuning workflow

1. Prepare representative data

Select documents that represent your full dataset:
# Random sampling
ls ~/all-documents/*.txt | shuf -n 20 | xargs -I {} cp {} ./project/input/
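The shuf pipeline above is Unix-only. A cross-platform equivalent can be sketched in Python (the seeded shuffle is an added convenience for reproducible samples; directory paths are placeholders):

```python
import random
import shutil
from pathlib import Path

def sample_documents(src_dir: str, dst_dir: str, n: int = 20, seed: int = 0) -> list:
    """Copy a random sample of n .txt files from src_dir into dst_dir."""
    files = sorted(Path(src_dir).glob("*.txt"))
    random.Random(seed).shuffle(files)   # seeded so samples are repeatable
    picked = files[:n]
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for f in picked:
        shutil.copy(f, dst / f.name)
    return [f.name for f in picked]
```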
2. Run prompt tuning

graphrag prompt-tune \
  --root ./project \
  --domain "medical research" \
  --selection-method auto \
  --n-subset-max 300 \
  --language "English"
For large datasets, use --selection-method auto with k-means clustering.
3. Review generated prompts

Check ./project/prompts/ for:
  • Entity types discovered
  • Example extractions
  • Domain-specific language
4. Customize if needed

Edit prompts to:
  • Add missing entity types
  • Adjust extraction instructions
  • Improve examples
5. Test on sample data

Run indexing on a small sample to validate prompt quality:
graphrag index --root ./project --verbose

Prompt tuning parameters

Selection methods:

Random (default):
  • Fast and simple
  • Good for uniform datasets
  • Use with --limit 15-20
Top:
  • Uses first N documents
  • Good when documents are pre-sorted
  • Use with --limit 15-20
Auto (recommended for large datasets):
  • Uses k-means clustering
  • Selects representative documents
  • Use with --n-subset-max 300 and --k 15
Be specific but not overly narrow.
✓ Good:
  • “medical research papers”
  • “corporate financial reports”
  • “legal contracts and agreements”
✗ Too broad:
  • “science”
  • “business”
✗ Too narrow:
  • “phase 3 clinical trials for oncology drugs”
Specify the primary language of your content:
graphrag prompt-tune --language "Spanish"
graphrag prompt-tune --language "French"
graphrag prompt-tune --language "Japanese"
For multilingual datasets, choose the dominant language.

Configuration optimization

Model selection

Choose models based on your requirements:
Goal: Fast iteration, low cost
llm:
  model: gpt-4o-mini
  temperature: 0.0

embedding:
  model: text-embedding-3-small
Cost: ~$0.05-0.15 per 1000 documents

Chunking configuration

Optimize chunking for your document structure:
settings.yaml
chunking:
  size: 300        # Tokens per chunk
  overlap: 100     # Overlap between chunks
  encoding_model: cl100k_base
Guidelines (chunk size and overlap in tokens):
  • Short articles: size 200, overlap 50 (preserves complete thoughts)
  • Long reports: size 300-400, overlap 100 (balances context and granularity)
  • Technical docs: size 400-500, overlap 100-150 (keeps technical concepts together)
  • Transcripts: size 300, overlap 100 (follows natural conversation flow)
  • Legal documents: size 500, overlap 150 (maintains clause integrity)
Larger chunks mean fewer LLM calls (lower cost) but may reduce extraction granularity. Start with 300 and adjust based on results.
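The trade-off between chunk size and chunk count can be estimated directly: because consecutive chunks share `overlap` tokens, each new chunk only advances by (size minus overlap) tokens. A rough estimator (a sketch; real chunkers also respect sentence and token boundaries):

```python
import math

def estimate_chunk_count(total_tokens: int, size: int = 300, overlap: int = 100) -> int:
    """Approximate sliding-window chunk count: the first chunk covers
    `size` tokens, each later chunk advances by (size - overlap)."""
    if total_tokens <= size:
        return 1
    step = size - overlap
    return 1 + math.ceil((total_tokens - size) / step)
```

For a 1,000-token document, size 300 with overlap 100 yields 5 chunks, while size 400 yields 3, which is why larger chunks mean fewer LLM calls.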

Entity extraction settings

settings.yaml
entity_extraction:
  max_gleanings: 1  # Additional extraction passes
  
  # Optional: specify entity types
  entity_types:
    - PERSON
    - ORGANIZATION
    - LOCATION
    - EVENT
    - TECHNOLOGY
max_gleanings trade-offs:
  • 0: Fastest, cheapest, lower recall
  • 1: Recommended balance (default)
  • 2+: Highest quality, expensive, diminishing returns
Each gleaning pass doubles the cost of entity extraction. Only increase for critical use cases.
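Since each gleaning is one more full pass over every chunk, extraction cost scales roughly linearly with the pass count. A back-of-envelope model (a sketch only; real costs also vary with prompt and output length):

```python
def extraction_cost(base_cost: float, max_gleanings: int) -> float:
    """Rough model: one initial extraction pass plus one full pass
    per gleaning, so cost grows with (1 + max_gleanings)."""
    return base_cost * (1 + max_gleanings)
```

With a $10 base extraction cost, `max_gleanings: 1` lands around $20, matching the "doubles the cost" rule of thumb above.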

Community detection

settings.yaml
community_reports:
  max_report_length: 1500  # Tokens per community report
Guidelines:
  • 500-1000: Brief summaries, lower cost
  • 1500: Recommended default, balanced detail
  • 2000-3000: Comprehensive reports, higher cost

Rate limiting

Set appropriate rate limits to avoid throttling:
settings.yaml
llm:
  requests_per_minute: 60
  tokens_per_minute: 80000

embedding:
  requests_per_minute: 60
  tokens_per_minute: 150000
Free tier:
  • 3 RPM, 40,000 TPM (GPT-4)
  • 5 RPM, 100,000 TPM (GPT-3.5)
Tier 1 ($5+ spent):
  • 500 RPM, 80,000 TPM (GPT-4o)
  • 3,500 RPM, 200,000 TPM (GPT-3.5)
Set to 90% of your limit to be safe.
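The 90% headroom rule can be applied mechanically before editing settings.yaml. A hypothetical helper (`safe_limits` is not part of GraphRAG):

```python
def safe_limits(provider_limits: dict, headroom: float = 0.9) -> dict:
    """Scale provider rate limits down (default 90%) to leave room
    for retries and other clients sharing the same quota."""
    return {name: int(limit * headroom) for name, limit in provider_limits.items()}
```

For example, Tier 1 GPT-4o limits (500 RPM, 80,000 TPM) become 450 RPM and 72,000 TPM.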

Cost management

Estimate costs before indexing

Run a test with a small sample and extrapolate:
# Index 10 documents
graphrag index --root ./test --verbose

# Check logs for token usage
grep "tokens" ./test/output/logs/app.log

# Extrapolate: (tokens_used / 10) * total_documents * model_price
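The extrapolation formula in the comment above can be written out. The price argument is an assumption to replace with your model's current per-million-token rate:

```python
def estimate_index_cost(sample_tokens: int, sample_docs: int,
                        total_docs: int, price_per_million: float) -> float:
    """Extrapolate indexing cost from a small sample run:
    (tokens per document) * (total documents) * (price per token)."""
    tokens_per_doc = sample_tokens / sample_docs
    total_tokens = tokens_per_doc * total_docs
    return total_tokens / 1_000_000 * price_per_million
```

If 10 sample documents consumed 50,000 tokens, indexing 1,000 documents at a hypothetical $0.60 per million tokens comes to about $3.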

Cost reduction strategies

Enable caching

Prevents redundant LLM calls during re-indexing
cache:
  type: file
  base_dir: ./cache

Larger chunks

Fewer chunks = fewer LLM calls
chunking:
  size: 400

Reduce gleanings

Each pass costs more
entity_extraction:
  max_gleanings: 0

Use cheaper models

For development and testing
llm:
  model: gpt-4o-mini

Cost tracking

Monitor spending:
  • OpenAI: Check usage at platform.openai.com/usage
  • Azure: Monitor costs in Azure Portal → Cost Management
  • Local logs: Track token counts in GraphRAG logs

Query optimization

Choose the right search method

GraphRAG supports global search for broad, dataset-wide questions and local search for questions about specific entities; match the method to the scope of your query.

Community level selection

graphrag query "your question" --community-level 2
Guidelines:
  • Level 0: Entire dataset (very broad, expensive)
  • Level 1: Major themes (broad summaries)
  • Level 2: Recommended default (balanced granularity)
  • Level 3+: Fine-grained details (more specific)
Start with level 2. Increase for more specific queries, decrease for very broad questions.

Response type optimization

Guide the format of responses:
# Concise answers
graphrag query "question" --response-type "Single Sentence"

# Structured output
graphrag query "question" --response-type "List of 3-7 Points"

# Detailed responses
graphrag query "question" --response-type "Multiple Paragraphs"

# Custom format
graphrag query "question" --response-type "Executive summary with key metrics"

Data preparation

Document formatting

GraphRAG supports:
  • Plain text (.txt)
  • Markdown (.md)
  • CSV (.csv)
  • Other formats via custom loaders
Recommendation: Convert documents to plain text or markdown for best results.
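For CSV sources, one simple conversion is to flatten each row into labeled lines of plain text, which tends to index better than raw comma-separated cells. A sketch (the column names are whatever your file uses):

```python
import csv
import io

def csv_to_text(csv_content: str) -> str:
    """Render each CSV row as 'header: value' lines, with a blank
    line between rows."""
    reader = csv.DictReader(io.StringIO(csv_content))
    blocks = []
    for row in reader:
        blocks.append("\n".join(f"{key}: {value}" for key, value in row.items()))
    return "\n\n".join(blocks)
```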
Well-structured documents yield better results.

Good structure:
# Document Title

## Section 1

Content with clear paragraphs...

## Section 2

More structured content...
Poor structure:
  • No headings or sections
  • Mixed formatting
  • Excessive special characters
  • Malformed text from PDF extraction
Include relevant metadata in documents:
---
title: Research Paper Title
author: John Smith
date: 2024-01-15
category: Medical Research
---

# Main content...
GraphRAG can extract entities from metadata.
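Adding front matter like the example above can be scripted when preparing documents. A minimal sketch (the metadata keys shown are examples, not required fields):

```python
def add_front_matter(body: str, metadata: dict) -> str:
    """Prepend a YAML-style front matter block to a document body."""
    lines = [f"{key}: {value}" for key, value in metadata.items()]
    return "---\n" + "\n".join(lines) + "\n---\n\n" + body
```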

Data cleaning

Clean your data before indexing:
import re

def clean_document(text: str) -> str:
    # Collapse runs of spaces and tabs, but keep paragraph breaks intact
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Remove page numbers
    text = re.sub(r'Page \d+', '', text)
    
    # Fix common OCR errors
    text = text.replace('l1', 'll')  # Example
    
    # Remove headers/footers
    # ... custom logic
    
    return text.strip()

Storage and scalability

Local vs. cloud storage

Local storage is best for:
  • Development
  • Small datasets (<10K documents)
  • Testing
storage:
  type: file
  base_dir: ./output

vector_store:
  type: lancedb
  db_uri: ./lancedb

Large dataset handling

For datasets with >10,000 documents:
1. Partition your data

Split into logical groups:
input/
  medical/
  legal/
  financial/
Index separately or together based on use case.
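Partitioning can be as simple as routing each file by keywords in its text. A hypothetical sketch; the categories and keywords are placeholders for your own rules:

```python
def route_document(filename: str, text: str, rules: dict, default: str = "other") -> str:
    """Return the first category whose keywords appear in the text."""
    lowered = text.lower()
    for category, keywords in rules.items():
        if any(kw in lowered for kw in keywords):
            return category
    return default

# Example routing rules (placeholders)
rules = {
    "medical": ["patient", "clinical"],
    "legal": ["contract", "clause"],
}
```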
2. Optimize chunking

Use larger chunks to reduce total chunk count:
chunking:
  size: 400
  overlap: 100
3. Use cloud storage

Azure Blob + Azure AI Search for scalability.
4. Implement incremental updates

Use graphrag update for new documents:
graphrag update --root ./project

Workflow best practices

Development workflow

1. Initial setup

graphrag init --root ./project
# Configure settings.yaml and .env
2. Small sample test

# 5-10 documents
graphrag index --root ./project --verbose
3. Prompt tuning

graphrag prompt-tune --root ./project --domain "your domain"
4. Validation

# Test queries
graphrag query "test question" --method global
graphrag query "test question" --method local
5. Iterate

  • Adjust configuration
  • Refine prompts
  • Test again
6. Scale up

# Full dataset
graphrag index --root ./project --verbose

Version control

Track your GraphRAG configuration:
# .gitignore
.env
cache/
output/
lancedb/
*.parquet
*.log

# Commit these
settings.yaml
prompts/
input/  # Or use DVC for large files

Monitoring and debugging

Enable verbose logging during development:
graphrag index --root ./project --verbose
Check logs for:
  • Token usage
  • API errors
  • Extraction quality
  • Processing time
tail -f ./project/output/logs/app.log
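Token totals can be tallied straight from the logs. A sketch, assuming log lines mention counts in a form like "tokens: 1234"; the real log format may differ, so adjust the regex to what you actually see:

```python
import re

def total_tokens(log_text: str) -> int:
    """Sum every 'tokens: N' occurrence in a log as a rough usage tally."""
    return sum(int(n) for n in re.findall(r"tokens[:=]\s*(\d+)", log_text, re.IGNORECASE))
```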

Common pitfalls

Problem: Generic prompts produce poor extractions.
Solution: Always run graphrag prompt-tune for domain-specific data.

Problem: Expensive mistakes on the full dataset.
Solution: Test with 5-10 documents first, validate results, then scale.

Problem: API throttling and failed indexing runs.
Solution: Configure rate limits to 90% of your quota.

Problem: Redundant API calls cost money.
Solution: Always enable caching for development.

Problem: Poor query results.
Solution: Match the search method to the query type (see query optimization).

Problem: Garbage in, garbage out.
Solution: Clean and structure documents before indexing.

Performance benchmarks

Typical indexing performance:
  • 100 documents (GPT-4o): ~2,000 chunks, 15-30 min, $1-3
  • 1,000 documents (GPT-4o): ~20,000 chunks, 2-4 hours, $10-30
  • 10,000 documents (GPT-4o): ~200,000 chunks, 20-40 hours, $100-300
These are rough estimates. Actual costs depend on document length, chunk size, gleanings, and model pricing.

Next steps

CLI usage

Master the command-line interface

Configuration

Deep dive into settings

Prompt tuning

Optimize prompts for your domain

Migration guide

Upgrade between versions
