What is GraphRAG?
GraphRAG is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using the power of LLMs. It addresses critical limitations in baseline RAG systems:
Connect the dots
Baseline RAG struggles when answers require traversing disparate pieces of information through shared attributes. GraphRAG excels at synthesizing insights across connected data.
Holistic understanding
Traditional RAG performs poorly when summarizing semantic concepts over large collections. GraphRAG uses hierarchical community structures for dataset-wide comprehension.
Entity-centric reasoning
GraphRAG builds knowledge graphs where entities and relationships provide structured access to information, enabling more precise retrieval.
Multi-level insights
Community detection creates hierarchical summaries at multiple granularities, from high-level themes to detailed local clusters.
The GraphRAG process
The GraphRAG workflow consists of two major phases: indexing and querying.
Indexing phase
The indexing pipeline transforms raw documents into a structured knowledge model:
Slice documents into text units
Input documents are chunked into analyzable text units (default: 1200 tokens) that serve as the foundation for extraction and provide fine-grained source references.
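The chunking step can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: it assumes whitespace-separated words stand in for tokens (a real pipeline would count tokens with a model tokenizer), and `TextUnit` and `chunk_text` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TextUnit:
    """One chunk of a document, with enough metadata to cite it later."""
    document_id: str
    index: int
    text: str


def chunk_text(document_id: str, text: str, chunk_size: int = 1200) -> list[TextUnit]:
    """Split text into consecutive units of at most chunk_size tokens.

    Assumption: whitespace-separated words approximate tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    units = []
    for start in range(0, len(tokens), chunk_size):
        piece = " ".join(tokens[start:start + chunk_size])
        units.append(TextUnit(document_id, len(units), piece))
    return units
```

Because every unit carries its document ID and position, anything extracted from it later can point back to a specific span of the source.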
Extract knowledge graph
LLMs extract entities, relationships, and optional claims from each text unit. Entities represent people, places, organizations, or events. Relationships connect entities with descriptive context.
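Since the same entity can be extracted from many text units, the per-unit results need to be merged into one record per entity. The sketch below assumes the LLM output has already been parsed into tuples. The `Entity` class, the `merge_entities` helper, and the name-normalization rule are all illustrative assumptions, not GraphRAG's API.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    type: str
    descriptions: list[str] = field(default_factory=list)
    text_unit_ids: list[str] = field(default_factory=list)


def merge_entities(extractions):
    """Merge per-text-unit extractions into one record per entity name.

    extractions: iterable of (text_unit_id, name, type, description) tuples,
    standing in for parsed LLM output.
    """
    merged = {}
    for text_unit_id, name, type_, description in extractions:
        # Assumption: names are matched case-insensitively after trimming.
        key = name.strip().upper()
        entity = merged.setdefault(key, Entity(name=key, type=type_))
        entity.descriptions.append(description)
        entity.text_unit_ids.append(text_unit_id)
    return merged
```

Keeping the contributing text unit IDs on each entity is what later allows answers to cite their sources.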
Detect communities
The hierarchical Leiden algorithm clusters the entity graph into communities at multiple levels, revealing the organizational structure of your data.
Generate summaries
Each community receives an LLM-generated summary, built bottom-up, creating a hierarchical understanding of the dataset at varying levels of detail.
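Bottom-up summarization amounts to a post-order walk of the community hierarchy: children are summarized first, and each parent summary is built from its children's summaries. The sketch below illustrates that ordering only. `Community`, `summarize_bottom_up`, and the `summarize` callable (a stand-in for an LLM call) are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Community:
    id: str
    entities: list[str] = field(default_factory=list)   # entities assigned directly here
    children: list["Community"] = field(default_factory=list)
    summary: str = ""


def summarize_bottom_up(community, summarize):
    """Fill in summaries post-order: children first, then the parent.

    summarize is any callable that takes a list of strings and returns one
    string; in a real pipeline it would wrap an LLM call.
    """
    child_summaries = [summarize_bottom_up(c, summarize) for c in community.children]
    community.summary = summarize(community.entities + child_summaries)
    return community.summary
```

Passing a simple stub such as `lambda parts: "(" + ", ".join(parts) + ")"` as `summarize` makes the nesting order visible without calling a model.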
Query phase
At query time, GraphRAG provides multiple search strategies tailored to different question types:
- Global search
- Local search
- DRIFT search
- Basic search
Global search
Best for: Questions requiring holistic understanding of the entire dataset
Uses community summaries in a map-reduce fashion to answer questions like “What are the top themes in this data?” or “What are the most significant trends?”
Leverages the hierarchical community structure to provide comprehensive, dataset-wide insights.
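The map-reduce pattern over community summaries can be sketched as below. This is an illustration under stated assumptions rather than GraphRAG's implementation: `global_search`, `map_fn`, and `reduce_fn` are hypothetical names, both callables stand in for LLM calls, and the score-then-filter step is an assumption of this sketch.

```python
def global_search(question, community_summaries, map_fn, reduce_fn, top_k=3):
    """Map: get a scored partial answer per summary. Reduce: combine the best.

    map_fn(question, summary) -> (partial_answer, score)
    reduce_fn(question, partial_answers) -> final answer
    Both are hypothetical stand-ins for LLM calls.
    """
    partials = [map_fn(question, summary) for summary in community_summaries]
    # Drop partial answers scored as irrelevant, then rank the rest.
    ranked = sorted(
        (p for p in partials if p[1] > 0),
        key=lambda p: p[1],
        reverse=True,
    )
    return reduce_fn(question, [answer for answer, _ in ranked[:top_k]])
```

The map calls are independent of each other, so they could run in parallel. Only the reduce step needs to see all the surviving partial answers.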
Key advantages over baseline RAG
Structured knowledge representation
GraphRAG creates an explicit knowledge graph where entities and relationships are first-class objects. This structure enables:
- Traversal of multi-hop connections
- Understanding of entity importance through graph metrics
- Relationship-aware retrieval
- Community-based organization
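Two of the capabilities listed above, multi-hop traversal and a graph-metric notion of importance, can be illustrated with a plain adjacency map. This is a minimal sketch: the dict-of-sets graph and the function names are assumptions chosen for illustration, and node degree is used as one simple example of an importance metric.

```python
from collections import deque


def neighbors_within(graph, start, max_hops):
    """Breadth-first search returning {entity: hop_count} within max_hops of start.

    graph: dict mapping each entity to the set of entities it is related to.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


def degree(graph, entity):
    """Number of direct relationships; a simple importance metric."""
    return len(graph.get(entity, ()))
```

A two-hop query from one entity returns every entity reachable through at most one intermediate, which is the kind of connection that similarity search over isolated chunks tends to miss.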
Hierarchical summarization
Community detection and bottom-up summarization provide:
- Multi-level understanding from global themes to local details
- Pre-computed summaries that reduce query-time LLM costs
- Ability to reason about dataset structure
- Scalable comprehension of large document collections
Multiple retrieval strategies
Different query types require different approaches:
- Global search for comprehensive, dataset-wide questions
- Local search for entity-specific inquiries
- DRIFT search for balanced exploration
- Basic search for simple similarity matching
Provenance and citations
Every extracted fact maintains links to:
- Source text units
- Original documents
- Related entities and relationships
- Community memberships
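Following these links back to the source can be sketched as a pair of lookups. The mappings below are hypothetical in-memory stand-ins for the index tables, and `trace_sources` is an illustrative helper, not part of GraphRAG's API.

```python
def trace_sources(entity_name, entity_to_text_units, text_unit_to_document):
    """Return (text_unit_id, document_id) pairs backing an entity.

    entity_to_text_units: dict mapping entity name -> list of text unit IDs
    text_unit_to_document: dict mapping text unit ID -> document ID
    Unknown entities yield an empty list.
    """
    return [
        (text_unit_id, text_unit_to_document[text_unit_id])
        for text_unit_id in entity_to_text_units.get(entity_name, [])
    ]
```

An answer that mentions an entity can then cite the exact text units and documents the entity was extracted from.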
The GraphRAG knowledge model
The indexing process produces a structured knowledge model with these core entity types:
- Document: Input files (individual CSV rows or .txt files)
- TextUnit: Chunks of text for analysis and source references
- Entity: Extracted people, places, events, organizations with types and descriptions
- Relationship: Connections between entities with descriptive context
- Covariate: Optional time-bound claims and statements about entities
- Community: Hierarchical clusters of entities from community detection
- Community Report: LLM-generated summaries of each community’s contents
All outputs are stored as Parquet tables by default, with embeddings written to your configured vector store.
When to use GraphRAG
GraphRAG is particularly powerful for:
- Complex reasoning tasks that require connecting disparate pieces of information
- Large document collections where understanding overall themes and structure matters
- Private datasets where LLMs need to reason about previously unseen data
- Multi-hop questions that require traversing relationships between entities
- Exploratory analysis where you need both high-level summaries and detailed local information
Next steps
Knowledge graphs
Learn how GraphRAG extracts and structures entity-relationship graphs
Indexing pipeline
Explore the multi-phase indexing workflow in detail
Community detection
Understand hierarchical Leiden clustering and community summarization
Retrieval methods
Compare global, local, DRIFT, and basic search strategies