The GraphRAG indexing pipeline is a configurable data transformation suite that extracts meaningful, structured data from unstructured text using LLMs. The pipeline is composed of workflows, standard and custom steps, prompt templates, and input/output adapters.

Pipeline overview

The indexing process transforms raw documents through six distinct phases:

Phase 1: Compose text units

The first phase transforms input documents into analyzable text chunks called text units.

Document loading

GraphRAG supports multiple input formats through configurable input readers:
Built-in readers:
  • Text files (.txt): Individual documents
  • CSV files: Rows as documents with configurable text columns
  • JSON files: Structured document collections
Custom readers: You can implement custom input readers for other formats:
# Register a custom input reader
from graphrag.index.input.factory import register_input_reader

register_input_reader("my_format", MyCustomReader)

Text chunking

Documents are split into text units with configurable parameters:
Default: 1200 tokens.
The size of each text unit affects:
  • Extraction quality: Smaller chunks provide more focused extraction but may miss broader context
  • Processing speed: Larger chunks reduce the number of LLM calls but may be less precise
  • Reference granularity: Smaller chunks give finer-grained source citations
chunks:
  size: 1200  # tokens
  overlap: 100  # tokens
  group_by_columns: ["id"]  # optional: group chunks by document attributes
Text units can overlap to preserve context across chunk boundaries:
  • No overlap: Chunks are completely independent (faster, may miss boundary context)
  • Moderate overlap (100-200 tokens): Balances context preservation with efficiency
  • High overlap (300+ tokens): Maximum context but more redundancy and cost
The overlap ensures entities or relationships spanning chunk boundaries are captured.
Chunking is performed in the create_base_text_units workflow:
# Create text units from documents
text_units = documents.chunk(
    size=chunk_size,
    overlap=chunk_overlap,
    encoding=token_encoder
)
Each text unit receives:
  • Unique ID
  • Text content
  • Token count
  • Document ID reference
  • Position in source document
Text units serve dual purposes: they are the analysis units for extraction AND the source references that enable provenance tracking.
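The sliding-window chunking described above can be sketched without a tokenizer dependency. A real pipeline counts LLM tokens via an encoder (e.g. tiktoken); this illustration uses plain list slices, but the overlap mechanics are the same:

```python
def chunk_tokens(tokens, size, overlap):
    """Split a token list into windows; adjacent windows share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # final window reached the end of the document
        start += step
    return chunks

tokens = [f"t{i}" for i in range(30)]
units = chunk_tokens(tokens, size=12, overlap=4)
# windows start at 0, 8, 16, 24; each shares its last 4 tokens
# with the next window, so boundary-spanning entities appear in both
```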

Phase 2: Document processing

This phase creates the final documents table by linking documents to their constituent text units.

Document enrichment

Original document metadata is preserved:
  • Document ID
  • Title or filename
  • Timestamps (if available)
  • Custom attributes

Text unit linking

Each document is linked to all text units created from it:
document.text_unit_ids = ["unit_001", "unit_002", "unit_003"]

Table export

The documents table is exported as Parquet for downstream use and provenance.
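The linking and export steps can be sketched with pandas; the frame and column names below are illustrative, not the library's actual schema:

```python
import pandas as pd

# Hypothetical tables mirroring the pipeline's outputs.
text_units = pd.DataFrame({
    "id": ["unit_001", "unit_002", "unit_003"],
    "document_id": ["doc_1", "doc_1", "doc_2"],
})
documents = pd.DataFrame({
    "id": ["doc_1", "doc_2"],
    "title": ["a.txt", "b.txt"],
})

# Collect the text-unit ids belonging to each document.
links = (
    text_units.groupby("document_id")["id"]
    .apply(list)
    .rename("text_unit_ids")
)

# Attach the list column to the documents table.
documents = documents.merge(links, left_on="id", right_index=True, how="left")

# documents.to_parquet("documents.parquet")  # export for provenance
```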

Phase 3: Graph extraction

This is the core knowledge extraction phase where entities, relationships, and claims are extracted from text units.

Entity and relationship extraction

The extract_graph workflow processes each text unit:
async def run_workflow(
    config: GraphRagConfig,
    context: PipelineRunContext,
) -> WorkflowFunctionOutput:
    text_units = await reader.text_units()
    
    # Extract entities and relationships
    entities, relationships, raw_entities, raw_relationships = await extract_graph(
        text_units=text_units,
        extraction_model=extraction_model,
        extraction_prompt=extraction_prompts.extraction_prompt,
        entity_types=config.extract_graph.entity_types,
        max_gleanings=config.extract_graph.max_gleanings,
    )

Summarization

After merging, each entity and relationship may carry multiple descriptions that need consolidation. Summarization reduces token counts and produces a single coherent, non-redundant description for each entity and relationship.
async def get_summarized_entities_relationships(
    extracted_entities: pd.DataFrame,
    extracted_relationships: pd.DataFrame,
    model: LLMCompletion,
    max_summary_length: int,
    summarization_prompt: str,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Summarize the entities and relationships."""
    entity_summaries, relationship_summaries = await summarize_descriptions(
        entities_df=extracted_entities,
        relationships_df=extracted_relationships,
        model=model,
        max_summary_length=max_summary_length,
        prompt=summarization_prompt,
    )
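Before summarization, extractions of the same entity found in different text units are merged. A pandas sketch (with illustrative column names) of collecting the duplicate descriptions:

```python
import pandas as pd

# Hypothetical extracted-entity rows: the same entity can be found in
# several text units, each extraction contributing its own description.
extracted = pd.DataFrame({
    "title": ["ACME", "ACME", "Bob"],
    "type": ["organization", "organization", "person"],
    "description": [
        "A manufacturing company.",
        "A company founded in 1901.",
        "An engineer at ACME.",
    ],
})

# Merge per entity: gather every description for later LLM summarization.
merged = (
    extracted.groupby(["title", "type"], as_index=False)
    .agg(descriptions=("description", list))
)

# Entities with multiple descriptions need an LLM summarization call;
# single-description entities can keep their description as-is.
needs_summary = merged[merged["descriptions"].str.len() > 1]
```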

Claim extraction

Optional workflow that extracts time-bound factual claims:
Claim extraction is disabled by default and requires prompt tuning to be effective. Enable only when your use case specifically requires temporal claims.

FastGraphRAG mode

GraphRAG supports a FastGraphRAG option that uses NLP instead of LLMs for entity/relationship extraction:
  • Faster processing: No LLM calls for extraction
  • Lower cost: Only LLM calls for summarization and community reports
  • Lower quality: NLP extraction is less accurate than LLM extraction
  • No claims: Claim extraction is always skipped in FastGraphRAG mode
Use FastGraphRAG when cost and speed are prioritized over extraction quality.

Phase 4: Graph augmentation

This phase applies community detection to discover the organizational structure of the knowledge graph.

Community detection workflow

The create_communities workflow applies hierarchical Leiden clustering:
async def run_workflow(
    config: GraphRagConfig,
    context: PipelineRunContext,
) -> WorkflowFunctionOutput:
    relationships = await reader.relationships()
    
    clusters = cluster_graph(
        relationships,
        max_cluster_size=config.cluster_graph.max_cluster_size,
        use_lcc=config.cluster_graph.use_lcc,
        seed=config.cluster_graph.seed,
    )
Hierarchical Leiden is a community detection algorithm that:
  1. Treats the graph as undirected
  2. Applies Leiden clustering recursively
  3. Creates hierarchy until max_cluster_size is reached
  4. Produces multiple levels of granularity
def hierarchical_leiden(
    edges: list[tuple[str, str, float]],
    max_cluster_size: int = 10,
    random_seed: int | None = 0xDEADBEEF,
) -> list[HierarchicalCluster]:
    return gn.hierarchical_leiden(
        edges=edges,
        max_cluster_size=max_cluster_size,
        seed=random_seed,
        resolution=1.0,
        use_modularity=True,
    )

Largest connected component (LCC)

Optional preprocessing step:
When use_lcc: true, only the largest connected component of the graph is used for community detection. This filters out small disconnected clusters.
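The use_lcc filter can be sketched with a small union-find, assuming no graph library:

```python
def largest_connected_component(edges):
    """Return the node set of the largest connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the endpoints of every edge.
    for a, b in edges:
        parent[find(a)] = find(b)

    # Group nodes by their root and pick the biggest group.
    components = {}
    for node in parent:
        components.setdefault(find(node), set()).add(node)
    return max(components.values(), key=len)

edges = [("a", "b"), ("b", "c"), ("x", "y")]
lcc = largest_connected_component(edges)
# keep only edges whose endpoints both lie in the LCC
filtered = [(a, b) for a, b in edges if a in lcc and b in lcc]
```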

Phase 5: Community summarization

This phase generates human-readable summaries for each community.

Report generation

The create_community_reports workflow creates summaries:

Gather community data

For each community, collect:
  • Entity descriptions
  • Relationship descriptions
  • Covariate/claim information (if available)
  • Text unit context

LLM summarization

The LLM generates a structured report including:
  • Executive summary
  • Key entities and their roles
  • Important relationships
  • Main themes and topics
  • Supporting claims

Report storage

Reports are stored with:
  • Full summary text
  • Summary embeddings (for global search)
  • Community metadata
  • Hierarchy information
Community summarization proceeds from leaf communities upward:
  1. Leaf level (level 0): Summarize individual entities and relationships
  2. Mid levels: Summarize child community reports
  3. Root level: Highest-level summary of entire dataset
This creates coherent summaries at each granularity level.
community_reports:
  completion_model_id: "model-id"
  prompt: "community_report_prompt"
  max_length: 2000  # Max tokens for report
  max_input_tokens: 16000  # Context window for report generation
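The bottom-up summarization order described above can be sketched as a post-order traversal over a hypothetical child-to-parent mapping between community ids:

```python
# Hypothetical community hierarchy: child -> parent.
parent_of = {"c1": "root", "c2": "root", "c1a": "c1", "c1b": "c1"}

# Invert the mapping to parent -> children.
children = {}
for child, parent in parent_of.items():
    children.setdefault(parent, []).append(child)

def report_order(community):
    """Post-order traversal: children are summarized before their parent,
    so a parent's report can be built from its child reports."""
    order = []
    for child in children.get(community, []):
        order.extend(report_order(child))
    order.append(community)
    return order

order = report_order("root")
# leaf communities come first; the root-level summary is generated last
```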

Phase 6: Text embeddings

The final phase generates vector embeddings for semantic search.

Embedding workflows

Text unit embeddings

Embed the text content of each text unit for basic semantic search.

Entity embeddings

Embed entity descriptions for entity-based retrieval in local search.

Community report embeddings

Embed community summaries for global search retrieval.

Vector store integration

Embeddings are written to your configured vector store:
Built-in vector store implementations:
  • LanceDB: Local vector database
  • Azure AI Search: Cloud vector search service
  • Azure Cosmos DB: NoSQL database with vector search
Custom vector stores can be registered via the factory pattern.
Embeddings enable the semantic similarity searches that serve as entry points into the knowledge graph during query time.
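At query time these embeddings are compared by vector similarity. A dependency-free cosine-similarity sketch with made-up report vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical stored embeddings keyed by community report id.
index = {
    "report_1": [0.9, 0.1, 0.0],
    "report_2": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.0]  # embedding of the user's question

# The most similar report becomes the entry point into the graph.
best = max(index, key=lambda rid: cosine(query, index[rid]))
```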

Pipeline architecture

The indexing engine is built on a flexible workflow system:

Key architectural concepts

Workflows are named sequences of operations that can be:
  • Standard: Built-in workflows like extract_graph, create_communities
  • Custom: User-defined workflows registered via the factory
Each workflow:
  • Operates on tables from previous workflows
  • Produces output tables
  • Can be run independently or as part of the full pipeline
LLM response caching is critical for resilience and efficiency:
  • Cache key: Prompt + parameters uniquely identify requests
  • Cache hit: Returns stored result instead of API call
  • Benefits:
    • Resilience to network errors
    • Idempotent pipeline execution
    • Cost savings on reruns
# Cache layer wraps all LLM interactions
model = create_completion(
    model_config,
    cache=context.cache.child("extract_graph"),
    cache_key_creator=cache_key_creator,
)
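The cache-key idea can be sketched by hashing the prompt together with a canonical serialization of the parameters (a hand-rolled illustration, not the library's actual implementation):

```python
import hashlib
import json

def make_cache_key(prompt, parameters):
    """Prompt + parameters uniquely identify a request."""
    payload = json.dumps({"prompt": prompt, "params": parameters}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

cache = {}

def cached_complete(prompt, parameters, call_model):
    key = make_cache_key(prompt, parameters)
    if key not in cache:               # cache miss: pay for the model call
        cache[key] = call_model(prompt)
    return cache[key]                  # cache hit: free, deterministic rerun

calls = []
fake_model = lambda p: calls.append(p) or f"response to {p}"

cached_complete("extract entities", {"temperature": 0}, fake_model)
cached_complete("extract entities", {"temperature": 0}, fake_model)
# the second call is served from cache; the model is invoked only once
```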
GraphRAG uses factories for extensibility:
  • Language models: Custom model providers
  • Input readers: Custom document formats
  • Cache: Custom cache storage
  • Storage: Custom table storage
  • Vector stores: Custom vector databases
  • Workflows: Custom pipeline steps
Register custom implementations:
from graphrag.vector_stores.factory import register_vector_store

register_vector_store("my_store", MyVectorStore)

Running the indexing pipeline

# Run with default configuration
uv run poe index --root <data_root>

# With custom config
uv run poe index --root <data_root> --config custom_config.yaml

# Resume from cache
uv run poe index --root <data_root> --resume

Best practices

Start small

Test with a small dataset first to understand costs, processing time, and output quality before scaling up.

Monitor extraction

Check entity and relationship counts after Phase 3. Low numbers indicate prompt tuning may be needed.

Tune prompts

Use the prompt tuning process to optimize extraction for your domain before processing large datasets.

Configure caching

Ensure LLM caching is enabled and properly configured to handle network issues and enable reruns.

Next steps

Community detection

Deep dive into hierarchical Leiden clustering

Retrieval methods

Learn how indexed data powers different search strategies
