Pipeline overview
The indexing process transforms raw documents through six distinct phases.

Phase 1: Compose text units
The first phase transforms input documents into analyzable text chunks called text units.

Document loading
GraphRAG supports multiple input formats through configurable input readers.
Built-in readers:
- Text files (.txt): Individual documents
- CSV files: Rows as documents with configurable text columns
- JSON files: Structured document collections
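To make the CSV case concrete, here is a minimal sketch of a reader that maps rows to documents with a configurable text column. It uses only the standard library; the function name and the document dict shape are illustrative, not GraphRAG's actual reader API.

```python
import csv
import io

def read_csv_documents(csv_text, text_column, id_column=None):
    """Illustrative reader: each CSV row becomes one document dict.

    `text_column` selects which column holds the document text,
    mirroring the configurable text column described above.
    """
    documents = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        documents.append({
            "id": row[id_column] if id_column else str(i),
            "text": row[text_column],
            # Remaining columns are kept as document attributes.
            "attributes": {k: v for k, v in row.items() if k != text_column},
        })
    return documents
```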
Text chunking
Documents are split into text units with configurable parameters.

Chunk size

Default: 1200 tokens

The size of each text unit affects:
- Extraction quality: Smaller chunks provide more focused extraction but may miss broader context
- Processing speed: Larger chunks reduce the number of LLM calls but may be less precise
- Reference granularity: Smaller chunks give finer-grained source citations
Overlap strategy
Text units can overlap to preserve context across chunk boundaries:
- No overlap: Chunks are completely independent (faster, may miss boundary context)
- Moderate overlap (100-200 tokens): Balances context preservation with efficiency
- High overlap (300+ tokens): Maximum context but more redundancy and cost
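The chunk-size and overlap parameters above can be sketched as a simple sliding window over a pre-tokenized sequence. This is an illustration of the mechanics, not GraphRAG's implementation (which counts tokens with a real tokenizer); the defaults mirror the values discussed above.

```python
def chunk_tokens(tokens, chunk_size=1200, overlap=100):
    """Split a token sequence into overlapping chunks.

    Each chunk starts `chunk_size - overlap` tokens after the previous
    one, so consecutive chunks share `overlap` tokens of context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end
    return chunks
```

With no overlap, `step` equals `chunk_size` and chunks are fully independent; raising `overlap` duplicates more boundary context at the cost of more chunks (and thus more LLM calls downstream).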
Implementation

Chunking is performed in the create_base_text_units workflow. Each text unit receives:
- Unique ID
- Text content
- Token count
- Document ID reference
- Position in source document
Text units serve dual purposes: they are the analysis units for extraction AND the source references that enable provenance tracking.
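The fields listed above can be pictured as a small record. This dataclass is illustrative only (the field names are assumptions, not GraphRAG's actual output schema), but it shows how each text unit carries both its content and the provenance link back to its source document.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class TextUnit:
    """Illustrative text-unit record mirroring the fields listed above."""
    text: str
    n_tokens: int
    document_id: str   # reference back to the source document
    position: int      # chunk index within that document
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
```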
Phase 2: Document processing
This phase creates the final documents table by linking documents to their constituent text units.

Document enrichment
Original document metadata is preserved:
- Document ID
- Title or filename
- Timestamps (if available)
- Custom attributes
Phase 3: Graph extraction
This is the core knowledge extraction phase, where entities, relationships, and claims are extracted from text units.

Entity and relationship extraction

The extract_graph workflow processes each text unit: the LLM extracts entities and relationships from the chunk, and the results are merged across chunks.

Summarization

After merging, entities and relationships may have multiple descriptions that need consolidation. Summarization reduces token counts and creates a coherent, non-redundant description for each entity and relationship.
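The merge-then-summarize step can be sketched as follows. The tuple shape and function names are assumptions for illustration; `summarize` stands in for the LLM summarization call.

```python
from collections import defaultdict

def merge_entities(extractions):
    """Group per-chunk extractions by entity name, collecting every
    description (and source text-unit id) for later summarization."""
    merged = defaultdict(lambda: {"descriptions": [], "source_units": []})
    for unit_id, name, description in extractions:
        merged[name]["descriptions"].append(description)
        merged[name]["source_units"].append(unit_id)
    return dict(merged)

def summarize_descriptions(merged, summarize):
    """Collapse each entity's description list into one description.

    `summarize` stands in for the LLM summarization call."""
    return {name: summarize(data["descriptions"]) for name, data in merged.items()}
```

Note that the merged record keeps its source text-unit ids, which is what preserves provenance through this phase.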
Claim extraction
Optional workflow that extracts time-bound factual claims.

FastGraphRAG mode
GraphRAG supports a FastGraphRAG option that uses NLP instead of LLMs for entity/relationship extraction:
- Faster processing: No LLM calls for extraction
- Lower cost: Only LLM calls for summarization and community reports
- Lower quality: NLP extraction is less accurate than LLM extraction
- No claims: Claim extraction is always skipped in FastGraphRAG mode
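To give a feel for the speed/quality trade-off, here is a deliberately crude NLP-style extractor: capitalized phrases become entities, and entities co-occurring in a sentence become relationships. This toy is purely illustrative (real NLP extraction uses proper part-of-speech and noun-phrase analysis), but it shows why no LLM calls are needed and why precision suffers.

```python
import re
from itertools import combinations

def nlp_extract(text):
    """Toy NLP-style extraction: capitalized phrases become entities,
    and entities co-occurring in one sentence become relationships."""
    entities, relationships = set(), set()
    for sentence in re.split(r"[.!?]", text):
        found = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", sentence)
        entities.update(found)
        # Sentence co-occurrence is a cheap stand-in for a real relationship.
        for a, b in combinations(sorted(set(found)), 2):
            relationships.add((a, b))
    return entities, relationships
```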
Phase 4: Graph augmentation
This phase applies community detection to discover the organizational structure of the knowledge graph.

Community detection workflow

The create_communities workflow applies hierarchical Leiden clustering.
Hierarchical Leiden is a community detection algorithm that:
- Treats the graph as undirected
- Applies Leiden clustering recursively
- Creates hierarchy until max_cluster_size is reached
- Produces multiple levels of granularity
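The recursion behind those four points can be sketched with a pluggable partition function. GraphRAG relies on a library implementation of hierarchical Leiden; this sketch only shows the shape of the hierarchy-building loop, with `partition` standing in for one Leiden pass.

```python
def hierarchical_cluster(nodes, partition, max_cluster_size, level=0, levels=None):
    """Recursively partition clusters until each fits max_cluster_size.

    `partition` stands in for one clustering pass (Leiden, in GraphRAG's
    case): it takes a node list and returns a list of sub-clusters.
    Returns {level: [cluster, ...]} for every level of the hierarchy.
    """
    if levels is None:
        levels = {}
    clusters = partition(nodes)
    levels.setdefault(level, []).extend(clusters)
    if clusters == [nodes]:  # no further split possible; stop recursing
        return levels
    for cluster in clusters:
        if len(cluster) > max_cluster_size:
            hierarchical_cluster(cluster, partition, max_cluster_size, level + 1, levels)
    return levels
```

Each recursion depth becomes one level of granularity, which is exactly what the community summarization phase later walks bottom-up.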
Largest connected component (LCC)
Optional preprocessing step: when use_lcc: true, only the largest connected component of the graph is used for community detection. This filters out small disconnected clusters.

Phase 5: Community summarization
This phase generates human-readable summaries for each community.

Report generation

The create_community_reports workflow creates summaries:
Gather community data
For each community, collect:
- Entity descriptions
- Relationship descriptions
- Covariate/claim information (if available)
- Text unit context
LLM summarization
The LLM generates a structured report including:
- Executive summary
- Key entities and their roles
- Important relationships
- Main themes and topics
- Supporting claims
Bottom-up approach
Community summarization proceeds from leaf communities upward:
- Leaf level (level 0): Summarize individual entities and relationships
- Mid levels: Summarize child community reports
- Root level: Highest-level summary of entire dataset
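The bottom-up traversal above falls out naturally from a children-first recursion. This sketch assumes a simple community dict (with hypothetical `children` and `facts` keys) and uses `summarize` as a stand-in for the LLM report call.

```python
def summarize_community(community, reports, summarize):
    """Bottom-up report generation mirroring the levels above.

    Leaf communities summarize their own entity/relationship data;
    higher levels summarize their children's reports. `summarize`
    stands in for the LLM call; `reports` caches results by id.
    """
    if community["id"] in reports:
        return reports[community["id"]]
    if community.get("children"):
        # Children are summarized first, so reports build leaf-to-root.
        inputs = [summarize_community(c, reports, summarize) for c in community["children"]]
    else:
        inputs = community["facts"]  # entity/relationship descriptions
    report = summarize(inputs)
    reports[community["id"]] = report
    return report
```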
Configuration
Phase 6: Text embeddings
The final phase generates vector embeddings for semantic search.

Embedding workflows
Text unit embeddings
Embed the text content of each text unit for basic semantic search.
Entity embeddings
Embed entity descriptions for entity-based retrieval in local search.
Community report embeddings
Embed community summaries for global search retrieval.
Vector store integration
Embeddings are written to your configured vector store.
Built-in vector store implementations:
- LanceDB: Local vector database
- Azure AI Search: Cloud vector search service
- Azure Cosmos DB: NoSQL database with vector search
Embeddings enable the semantic similarity searches that serve as entry points into the knowledge graph during query time.
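A query-time entry point is essentially a nearest-neighbor lookup over stored vectors. In practice this runs inside the configured vector store; the stdlib sketch below shows only the underlying cosine-similarity ranking, with an illustrative in-memory list of (id, vector) pairs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=2):
    """Rank stored (id, vector) pairs by similarity to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]
```

Local search uses this kind of lookup against entity embeddings; global search uses it against community report embeddings.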
Pipeline architecture
The indexing engine is built on a flexible workflow system.

Key architectural concepts
Workflows

Workflows are named sequences of operations that can be:
- Standard: Built-in workflows like extract_graph and create_communities
- Custom: User-defined workflows registered via the factory

Each workflow:
- Operates on tables from previous workflows
- Produces output tables
- Can be run independently or as part of the pipeline
LLM caching
Critical for resilience and efficiency:
- Cache key: Prompt + parameters uniquely identify requests
- Cache hit: Returns stored result instead of API call
- Benefits:
- Resilience to network errors
- Idempotent pipeline execution
- Cost savings on reruns
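The cache-key and cache-hit behavior can be sketched in a few lines. The key derivation here (a hash of the serialized prompt plus parameters) is an assumption for illustration, not GraphRAG's exact scheme, but it captures why identical requests are idempotent across reruns.

```python
import hashlib
import json

def cache_key(prompt, params):
    """Derive a stable key from prompt + parameters, so identical
    requests map to the same cache entry on reruns."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(prompt, params, cache, call_llm):
    key = cache_key(prompt, params)
    if key in cache:                       # cache hit: no API call
        return cache[key]
    result = call_llm(prompt, params)      # cache miss: call and store
    cache[key] = result
    return result
```

Because the stored result is returned on a hit, rerunning a failed pipeline only pays for the requests that never completed.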
Factory pattern
GraphRAG uses factories for extensibility:
- Language models: Custom model providers
- Input readers: Custom document formats
- Cache: Custom cache storage
- Storage: Custom table storage
- Vector stores: Custom vector databases
- Workflows: Custom pipeline steps
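All six extension points follow the same registry-style shape, sketched below. The class and method names are illustrative, not GraphRAG's actual factory interface: components register a builder under a name, and configuration selects which one to construct.

```python
class Factory:
    """Minimal registry-style factory: components register a builder
    under a name, and configuration selects which one to build."""

    def __init__(self):
        self._registry = {}

    def register(self, name, builder):
        self._registry[name] = builder

    def create(self, name, **kwargs):
        if name not in self._registry:
            raise KeyError(f"no component registered as {name!r}")
        return self._registry[name](**kwargs)
```

A custom vector store or input reader plugs in by registering its builder; the rest of the pipeline only ever asks the factory for a name from the configuration.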
Running the indexing pipeline
The pipeline can be run either from the command line or through the Python API.
Best practices
Start small
Test with a small dataset first to understand costs, processing time, and output quality before scaling up.
Monitor extraction
Check entity and relationship counts after Phase 3. Low numbers indicate prompt tuning may be needed.
Tune prompts
Use the prompt tuning process to optimize extraction for your domain before processing large datasets.
Configure caching
Ensure LLM caching is enabled and properly configured to handle network issues and enable reruns.
Next steps
Community detection
Deep dive into hierarchical Leiden clustering
Retrieval methods
Learn how indexed data powers different search strategies