Prerequisites
- Python 3.11-3.13
- OpenAI API key or Azure OpenAI credentials
Step 1: Install GraphRAG
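GraphRAG is distributed on PyPI; a standard install (ideally inside a virtual environment) looks like:

```shell
pip install graphrag
```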
Step 2: Initialize your workspace
Initialize GraphRAG to create the necessary configuration files. The default models are:
- Chat model: `gpt-4.1` (default) or `gpt-4-turbo`
- Embedding model: `text-embedding-3-large` (default)

Initialization creates:
- `settings.yaml` - Main configuration file
- `.env` - Environment variables (API keys)
- `input/` - Directory for source documents
- `prompts/` - Customizable extraction prompts
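The initialization step above is run with the `graphrag` CLI; the `./ragtest` workspace path here is just an example:

```shell
graphrag init --root ./ragtest
```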
Step 3: Configure your API key
Edit `.env` and set `GRAPHRAG_API_KEY` to your OpenAI API key (for Azure OpenAI, set your Azure credentials instead).

Step 4: Add your data
Download the sample dataset (A Christmas Carol) and place it in the `input/` directory. Supported formats:
- `.txt` files
- `.csv` files (with a text column)
- `.json` files (with a text field)
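One way to fetch the sample book is with `curl` (the Project Gutenberg URL and output path below are assumptions; adjust the path to your workspace):

```shell
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./ragtest/input/book.txt
```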
Step 5: Run the indexing pipeline
Build your knowledge graph index. The indexing pipeline:
- Chunks your documents into text units
- Extracts entities, relationships, and claims
- Builds a knowledge graph
- Detects hierarchical communities
- Generates community summaries
- Creates vector embeddings
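The pipeline described above is started with a single command (assuming the `./ragtest` workspace from the earlier steps):

```shell
graphrag index --root ./ragtest
```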
Indexing typically takes 5-15 minutes for the sample dataset with GPT-4. Cost: approximately $2.00, depending on the model and dataset size.
Results are written to `output/` as parquet files.
Step 6: Query your knowledge graph
Now you can ask questions about your data using different search methods:
- Global search
- Local search
- DRIFT search

Global search is best for questions about the entire dataset:
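A global query can be issued from the CLI; the question text is only an example:

```shell
graphrag query --root ./ragtest --method global --query "What are the top themes in this story?"
```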
Understanding the search methods
Global search
Best for: Holistic dataset understanding. Uses all community summaries in a map-reduce process. Ideal for “what are the main themes” type questions.
Local search
Best for: Entity-specific questions. Retrieves entities and their neighbors. Ideal for “who is X” or “what does Y do” questions.
DRIFT search
Best for: Multi-level reasoning. Combines local entity retrieval with community context. Ideal for complex questions requiring both depth and breadth.
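Assuming the same CLI shape as the global example, local and DRIFT searches only change the `--method` flag (the queries are illustrative):

```shell
graphrag query --root ./ragtest --method local --query "Who is Scrooge?"
graphrag query --root ./ragtest --method drift --query "How do Scrooge's relationships change over the story?"
```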
Next steps
Configuration
Customize indexing, chunking, and model settings
Prompt tuning
Generate domain-specific prompts for better extraction
Query engine
Learn about all four search methods in depth
Python API
Use GraphRAG programmatically in your applications
Troubleshooting
Rate limit errors
If you hit rate limits:
- Reduce parallelization in `settings.yaml`
- Add delays between requests
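A sketch of the relevant `settings.yaml` knobs; the exact key names vary across GraphRAG versions and the values here are illustrative, so verify against your generated config:

```yaml
models:
  default_chat_model:
    concurrent_requests: 4     # fewer parallel calls
    requests_per_minute: 60    # throttle request rate
    tokens_per_minute: 50000   # throttle token throughput
```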
High costs
To reduce indexing costs:
- Use `gpt-3.5-turbo` for initial testing
- Enable caching to avoid re-processing
- Use the “fast” indexing method
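For the “fast” indexing method, recent GraphRAG releases expose a CLI flag; this is an assumption to verify with `graphrag index --help` for your version:

```shell
graphrag index --root ./ragtest --method fast
```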
Out of memory errors
For large datasets:
- Increase chunk size to reduce the total number of chunks
- Reduce `max_cluster_size` in settings
- Process documents in batches
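The first two adjustments map to `settings.yaml` roughly as follows; the key names follow the default GraphRAG config and the values are illustrative:

```yaml
chunks:
  size: 1500        # larger chunks -> fewer total text units
  overlap: 100
cluster_graph:
  max_cluster_size: 10
```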
Poor extraction quality
Improve extraction with prompt tuning. This auto-generates prompts optimized for your domain and data.
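GraphRAG ships a `prompt-tune` command for this; the exact flags below are assumptions, so see the Prompt tuning page for your version:

```shell
graphrag prompt-tune --root ./ragtest --config ./ragtest/settings.yaml
```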