GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG) that fundamentally differs from traditional semantic search methods. Instead of relying on plain text snippets, GraphRAG extracts a knowledge graph from raw text, builds a community hierarchy, generates summaries, and leverages these structures for enhanced reasoning.

What is GraphRAG?

GraphRAG is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using the power of LLMs. It addresses critical limitations in baseline RAG systems:

Connect the dots

Baseline RAG struggles when answers require traversing disparate pieces of information through shared attributes. GraphRAG excels at synthesizing insights across connected data.

Holistic understanding

Traditional RAG performs poorly when summarizing semantic concepts over large collections. GraphRAG uses hierarchical community structures for dataset-wide comprehension.

Entity-centric reasoning

GraphRAG builds knowledge graphs where entities and relationships provide structured access to information, enabling more precise retrieval.

Multi-level insights

Community detection creates hierarchical summaries at multiple granularities, from high-level themes to detailed local clusters.

The GraphRAG process

The GraphRAG workflow consists of two major phases: indexing and querying.

Indexing phase

The indexing pipeline transforms raw documents into a structured knowledge model:
1. Slice documents into text units

Input documents are chunked into analyzable text units (default: 1200 tokens) that serve as the foundation for extraction and provide fine-grained source references.
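A minimal sketch of this chunking step, using whitespace tokens for illustration (GraphRAG itself counts model tokens with a tokenizer, and the 1200-token default is configurable):

```python
def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` tokens.

    Whitespace tokens stand in for model tokens here; the step size
    (chunk_size - overlap) lets adjacent chunks share context so that
    entities spanning a boundary are not lost.
    """
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Each chunk becomes one text unit, and every downstream fact keeps a reference back to the unit it came from.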
2. Extract knowledge graph

LLMs extract entities, relationships, and optional claims from each text unit. Entities represent people, places, organizations, or events. Relationships connect entities with descriptive context.
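The extraction itself is LLM-driven, but its output reduces to structured records. A hedged sketch of the data shapes involved, with a parser for a hypothetical pipe-delimited response format (GraphRAG's real extraction prompts define their own delimiters):

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    type: str          # e.g. person, place, organization, event
    description: str

@dataclass
class Relationship:
    source: str
    target: str
    description: str

def parse_extraction(raw: str) -> tuple[list[Entity], list[Relationship]]:
    """Parse a hypothetical LLM response, one record per line:
    entity|NAME|TYPE|DESCRIPTION  or  rel|SOURCE|TARGET|DESCRIPTION."""
    entities, rels = [], []
    for line in raw.strip().splitlines():
        kind, *fields = [f.strip() for f in line.split("|")]
        if kind == "entity":
            entities.append(Entity(*fields))
        elif kind == "rel":
            rels.append(Relationship(*fields))
    return entities, rels
```

Descriptions for the same entity extracted from different text units are later merged and summarized into a single record.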
3. Detect communities

The hierarchical Leiden algorithm clusters the entity graph into communities at multiple levels, revealing the organizational structure of your data.
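Leiden itself requires a dedicated graph library (GraphRAG uses graspologic); as a toy stand-in that shows the shape of the output, here is a grouping of an entity graph into communities via connected components with union-find. Leiden would additionally split dense components by modularity and emit multiple hierarchy levels:

```python
def find_communities(edges: list[tuple[str, str]]) -> dict[str, int]:
    """Assign each entity a community id (toy stand-in for Leiden)."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)  # union the two components

    roots: dict[str, int] = {}
    return {node: roots.setdefault(find(node), len(roots)) for node in parent}
```

The real pipeline records, for each entity, its community membership at every level of the hierarchy.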
4. Generate summaries

Each community receives an LLM-generated summary, built bottom-up from its child communities, creating a hierarchical understanding of the dataset at varying levels of detail.
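The bottom-up pass can be sketched as a post-order walk of the community hierarchy, where each parent's summary is produced from its children's summaries (the `summarize` callable is a stand-in for the LLM call):

```python
def summarize_hierarchy(children, leaf_summaries, summarize, root):
    """Return a summary per community, built bottom-up.

    children:       community id -> list of child community ids
    leaf_summaries: summaries already written for leaf communities
    summarize:      callable combining child summaries (stands in for an LLM)
    """
    summaries = dict(leaf_summaries)

    def visit(node):
        if node in summaries:          # leaf, or already summarized
            return summaries[node]
        child_texts = [visit(c) for c in children.get(node, [])]
        summaries[node] = summarize(node, child_texts)
        return summaries[node]

    visit(root)
    return summaries
```

Because parents only read their children's already-finished summaries, each level compresses the one below it, which is what makes dataset-wide questions answerable without re-reading every text unit.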
5. Create embeddings

Text embeddings are generated for entities, text units, and community reports to enable semantic search during retrieval.
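In production these embeddings come from an embedding model and land in a vector store; a self-contained sketch using a toy bag-of-words embedding and cosine similarity shows the retrieval mechanics:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. A real pipeline would call an
    embedding model and store dense vectors in a vector store."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_match(query: str, items: dict[str, str]) -> str:
    """Return the id of the item most similar to the query."""
    q = embed(query)
    return max(items, key=lambda i: cosine(q, embed(items[i])))
```

At query time the same mechanics run against entity descriptions, text units, or community reports, depending on the search strategy.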

Query phase

At query time, GraphRAG provides multiple search strategies tailored to different question types: global search for dataset-wide questions, local search for entity-specific inquiries, DRIFT search for a blend of global context and local detail, and basic search for plain similarity matching.
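Choosing among these strategies can be sketched as a small router; the keyword heuristics here are a hypothetical stand-in for whatever heuristic or model classifier picks the mode:

```python
def route_query(question: str) -> str:
    """Pick a search mode from simple keyword heuristics (hypothetical;
    a real router could use an LLM classifier instead)."""
    q = question.lower()
    if any(w in q for w in ("overall", "themes", "across the dataset")):
        return "global"    # community summaries, dataset-wide reasoning
    if any(w in q for w in ("who", "where", "which entity")):
        return "local"     # entity neighborhood in the knowledge graph
    if "explore" in q:
        return "drift"     # global context plus local follow-ups
    return "basic"         # plain vector similarity over text units
```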

Key advantages over baseline RAG

GraphRAG creates an explicit knowledge graph where entities and relationships are first-class objects. This structure enables:
  • Traversal of multi-hop connections
  • Understanding of entity importance through graph metrics
  • Relationship-aware retrieval
  • Community-based organization

Community detection and bottom-up summarization provide:
  • Multi-level understanding from global themes to local details
  • Pre-computed summaries that reduce query-time LLM costs
  • Ability to reason about dataset structure
  • Scalable comprehension of large document collections

Different query types require different approaches:
  • Global search for comprehensive, dataset-wide questions
  • Local search for entity-specific inquiries
  • DRIFT search for balanced exploration
  • Basic search for simple similarity matching

Every extracted fact maintains links to:
  • Source text units
  • Original documents
  • Related entities and relationships
  • Community memberships

This enables transparent, verifiable results with clear source attribution.
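Attribution falls out of keeping those links as ids that can be resolved against the output tables; a sketch, assuming simple dict-backed tables keyed by id:

```python
def trace_fact(fact, text_units, documents):
    """Resolve a fact's provenance links back to source text and documents.

    fact: dict with 'claim' and 'text_unit_ids'; the tables are dicts
    keyed by id (stand-ins for the pipeline's output tables).
    """
    units = [text_units[u] for u in fact["text_unit_ids"]]
    doc_ids = sorted({u["document_id"] for u in units})
    return {
        "claim": fact["claim"],
        "evidence": [u["text"] for u in units],
        "documents": [documents[d] for d in doc_ids],
    }
```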

The GraphRAG knowledge model

The indexing process produces a structured knowledge model with these core entity types:
  • Document: Input files (individual CSV rows or .txt files)
  • TextUnit: Chunks of text for analysis and source references
  • Entity: Extracted people, places, events, organizations with types and descriptions
  • Relationship: Connections between entities with descriptive context
  • Covariate: Optional time-bound claims and statements about entities
  • Community: Hierarchical clusters of entities from community detection
  • Community Report: LLM-generated summaries of each community’s contents
All outputs are stored as Parquet tables by default, with embeddings written to your configured vector store.

When to use GraphRAG

GraphRAG is particularly powerful for:
  • Complex reasoning tasks that require connecting disparate pieces of information
  • Large document collections where understanding overall themes and structure matters
  • Private datasets where LLMs need to reason about previously unseen data
  • Multi-hop questions that require traversing relationships between entities
  • Exploratory analysis where you need both high-level summaries and detailed local information
GraphRAG indexing can be expensive in terms of LLM API costs and processing time. Start with a small dataset to understand the process and costs before scaling up.

Next steps

Knowledge graphs

Learn how GraphRAG extracts and structures entity-relationship graphs

Indexing pipeline

Explore the multi-phase indexing workflow in detail

Community detection

Understand hierarchical Leiden clustering and community summarization

Retrieval methods

Compare global, local, DRIFT, and basic search strategies
