The GraphRAG indexing pipeline is a configurable data transformation suite that extracts meaningful, structured data from unstructured text using LLMs. The pipeline is composed of workflows, standard and custom steps, prompt templates, and input/output adapters.

Pipeline overview

The indexing process transforms raw documents through six distinct phases:

Phase 1: Compose text units

The first phase transforms input documents into analyzable text chunks called text units.

Document loading

GraphRAG supports multiple input formats through configurable input readers:
Built-in readers:
  • Text files (.txt): Individual documents
  • CSV files: Rows as documents with configurable text columns
  • JSON files: Structured document collections
Custom readers: You can implement custom input readers for other formats:
# Register a custom input reader
from graphrag.index.input.factory import register_input_reader

register_input_reader("my_format", MyCustomReader)

Text chunking

Documents are split into text units with configurable parameters:
Default: 1200 tokens.
The size of each text unit affects:
  • Extraction quality: Smaller chunks provide more focused extraction but may miss broader context
  • Processing speed: Larger chunks reduce the number of LLM calls but may be less precise
  • Reference granularity: Smaller chunks give finer-grained source citations
chunks:
  size: 1200  # tokens
  overlap: 100  # tokens
  group_by_columns: ["id"]  # optional: group chunks by document attributes
Text units can overlap to preserve context across chunk boundaries:
  • No overlap: Chunks are completely independent (faster, may miss boundary context)
  • Moderate overlap (100-200 tokens): Balances context preservation with efficiency
  • High overlap (300+ tokens): Maximum context but more redundancy and cost
The overlap ensures entities or relationships spanning chunk boundaries are captured.
Chunking is performed in the create_base_text_units workflow:
# Create text units from documents
text_units = documents.chunk(
    size=chunk_size,
    overlap=chunk_overlap,
    encoding=token_encoder
)
Each text unit receives:
  • Unique ID
  • Text content
  • Token count
  • Document ID reference
  • Position in source document
Text units serve dual purposes: they are the analysis units for extraction AND the source references that enable provenance tracking.
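The sliding-window chunking described above can be sketched without a tokenizer dependency. A real pipeline counts LLM tokens via an encoder (e.g. tiktoken); this illustration uses plain list slices, but the overlap mechanics are the same:

```python
def chunk_tokens(tokens, size, overlap):
    """Split a token list into windows; adjacent windows share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # final window reached the end of the document
        start += step
    return chunks

tokens = [f"t{i}" for i in range(30)]
units = chunk_tokens(tokens, size=12, overlap=4)
# windows start at 0, 8, 16, 24; each shares its last 4 tokens
# with the next window, so boundary-spanning entities appear in both
```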

Phase 2: Document processing

This phase creates the final documents table by linking documents to their constituent text units.

Document enrichment

Original document metadata is preserved:
  • Document ID
  • Title or filename
  • Timestamps (if available)
  • Custom attributes

Text unit linking

Each document is linked to all text units created from it:
document.text_unit_ids = ["unit_001", "unit_002", "unit_003"]

Table export

The documents table is exported as Parquet for downstream use and provenance.
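The linking and export steps can be sketched with pandas; the frame and column names below are illustrative, not the library's actual schema:

```python
import pandas as pd

# Hypothetical tables mirroring the pipeline's outputs.
text_units = pd.DataFrame({
    "id": ["unit_001", "unit_002", "unit_003"],
    "document_id": ["doc_1", "doc_1", "doc_2"],
})
documents = pd.DataFrame({
    "id": ["doc_1", "doc_2"],
    "title": ["a.txt", "b.txt"],
})

# Collect the text-unit ids belonging to each document.
links = (
    text_units.groupby("document_id")["id"]
    .apply(list)
    .rename("text_unit_ids")
)

# Attach the list column to the documents table.
documents = documents.merge(links, left_on="id", right_index=True, how="left")

# documents.to_parquet("documents.parquet")  # export for provenance
```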

Phase 3: Graph extraction

This is the core knowledge extraction phase where entities, relationships, and claims are extracted from text units.

Entity and relationship extraction

The extract_graph workflow processes each text unit:
async def run_workflow(
    config: GraphRagConfig,
    context: PipelineRunContext,
) -> WorkflowFunctionOutput:
    text_units = await reader.text_units()
    
    # Extract entities and relationships
    entities, relationships, raw_entities, raw_relationships = await extract_graph(
        text_units=text_units,
        extraction_model=extraction_model,
        extraction_prompt=extraction_prompts.extraction_prompt,
        entity_types=config.extract_graph.entity_types,
        max_gleanings=config.extract_graph.max_gleanings,
    )

Summarization

After merging, each entity and relationship may carry multiple descriptions that need consolidation. Summarization reduces token counts and produces a single coherent, non-redundant description for each entity and relationship.
async def get_summarized_entities_relationships(
    extracted_entities: pd.DataFrame,
    extracted_relationships: pd.DataFrame,
    model: LLMCompletion,
    max_summary_length: int,
    summarization_prompt: str,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Summarize the entities and relationships."""
    entity_summaries, relationship_summaries = await summarize_descriptions(
        entities_df=extracted_entities,
        relationships_df=extracted_relationships,
        model=model,
        max_summary_length=max_summary_length,
        prompt=summarization_prompt,
    )
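Before summarization, extractions of the same entity found in different text units are merged. A pandas sketch (with illustrative column names) of collecting the duplicate descriptions:

```python
import pandas as pd

# Hypothetical extracted-entity rows: the same entity can be found in
# several text units, each extraction contributing its own description.
extracted = pd.DataFrame({
    "title": ["ACME", "ACME", "Bob"],
    "type": ["organization", "organization", "person"],
    "description": [
        "A manufacturing company.",
        "A company founded in 1901.",
        "An engineer at ACME.",
    ],
})

# Merge per entity: gather every description for later LLM summarization.
merged = (
    extracted.groupby(["title", "type"], as_index=False)
    .agg(descriptions=("description", list))
)

# Entities with multiple descriptions need an LLM summarization call;
# single-description entities can keep their description as-is.
needs_summary = merged[merged["descriptions"].str.len() > 1]
```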

Claim extraction

Optional workflow that extracts time-bound factual claims:
Claim extraction is disabled by default and requires prompt tuning to be effective. Enable only when your use case specifically requires temporal claims.

FastGraphRAG mode

GraphRAG supports a FastGraphRAG option that uses NLP instead of LLMs for entity/relationship extraction:
  • Faster processing: No LLM calls for extraction
  • Lower cost: Only LLM calls for summarization and community reports
  • Lower quality: NLP extraction is less accurate than LLM extraction
  • No claims: Claim extraction is always skipped in FastGraphRAG mode
Use FastGraphRAG when cost and speed are prioritized over extraction quality.

Phase 4: Graph augmentation

This phase applies community detection to discover the organizational structure of the knowledge graph.

Community detection workflow

The create_communities workflow applies hierarchical Leiden clustering:
async def run_workflow(
    config: GraphRagConfig,
    context: PipelineRunContext,
) -> WorkflowFunctionOutput:
    relationships = await reader.relationships()
    
    clusters = cluster_graph(
        relationships,
        max_cluster_size=config.cluster_graph.max_cluster_size,
        use_lcc=config.cluster_graph.use_lcc,
        seed=config.cluster_graph.seed,
    )
Hierarchical Leiden is a community detection algorithm that:
  1. Treats the graph as undirected
  2. Applies Leiden clustering recursively
  3. Creates hierarchy until max_cluster_size is reached
  4. Produces multiple levels of granularity
def hierarchical_leiden(
    edges: list[tuple[str, str, float]],
    max_cluster_size: int = 10,
    random_seed: int | None = 0xDEADBEEF,
) -> list[HierarchicalCluster]:
    return gn.hierarchical_leiden(
        edges=edges,
        max_cluster_size=max_cluster_size,
        seed=random_seed,
        resolution=1.0,
        use_modularity=True,
    )

Largest connected component (LCC)

Optional preprocessing step:
When use_lcc: true, only the largest connected component of the graph is used for community detection. This filters out small disconnected clusters.
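The use_lcc filter can be sketched with a small union-find, assuming no graph library:

```python
def largest_connected_component(edges):
    """Return the node set of the largest connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the endpoints of every edge.
    for a, b in edges:
        parent[find(a)] = find(b)

    # Group nodes by their root and pick the biggest group.
    components = {}
    for node in parent:
        components.setdefault(find(node), set()).add(node)
    return max(components.values(), key=len)

edges = [("a", "b"), ("b", "c"), ("x", "y")]
lcc = largest_connected_component(edges)
# keep only edges whose endpoints both lie in the LCC
filtered = [(a, b) for a, b in edges if a in lcc and b in lcc]
```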

Phase 5: Community summarization

This phase generates human-readable summaries for each community.

Report generation

The create_community_reports workflow creates summaries:

Gather community data

For each community, collect:
  • Entity descriptions
  • Relationship descriptions
  • Covariate/claim information (if available)
  • Text unit context

LLM summarization

The LLM generates a structured report including:
  • Executive summary
  • Key entities and their roles
  • Important relationships
  • Main themes and topics
  • Supporting claims

Report storage

Reports are stored with:
  • Full summary text
  • Summary embeddings (for global search)
  • Community metadata
  • Hierarchy information
Community summarization proceeds from leaf communities upward:
  1. Leaf level (level 0): Summarize individual entities and relationships
  2. Mid levels: Summarize child community reports
  3. Root level: Highest-level summary of entire dataset
This creates coherent summaries at each granularity level.
community_reports:
  completion_model_id: "model-id"
  prompt: "community_report_prompt"
  max_length: 2000  # Max tokens for report
  max_input_tokens: 16000  # Context window for report generation
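The bottom-up summarization order described above can be sketched as a post-order traversal over a hypothetical child-to-parent mapping between community ids:

```python
# Hypothetical community hierarchy: child -> parent.
parent_of = {"c1": "root", "c2": "root", "c1a": "c1", "c1b": "c1"}

# Invert the mapping to parent -> children.
children = {}
for child, parent in parent_of.items():
    children.setdefault(parent, []).append(child)

def report_order(community):
    """Post-order traversal: children are summarized before their parent,
    so a parent's report can be built from its child reports."""
    order = []
    for child in children.get(community, []):
        order.extend(report_order(child))
    order.append(community)
    return order

order = report_order("root")
# leaf communities come first; the root-level summary is generated last
```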

Phase 6: Text embeddings

The final phase generates vector embeddings for semantic search.

Embedding workflows

Text unit embeddings

Embed the text content of each text unit for basic semantic search.

Entity embeddings

Embed entity descriptions for entity-based retrieval in local search.

Community report embeddings

Embed community summaries for global search retrieval.

Vector store integration

Embeddings are written to your configured vector store:
Built-in vector store implementations:
  • LanceDB: Local vector database
  • Azure AI Search: Cloud vector search service
  • Azure Cosmos DB: NoSQL database with vector search
Custom vector stores can be registered via the factory pattern.
Embeddings enable the semantic similarity searches that serve as entry points into the knowledge graph during query time.
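At query time these embeddings are compared by vector similarity. A dependency-free cosine-similarity sketch with made-up report vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical stored embeddings keyed by community report id.
index = {
    "report_1": [0.9, 0.1, 0.0],
    "report_2": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.0]  # embedding of the user's question

# The most similar report becomes the entry point into the graph.
best = max(index, key=lambda rid: cosine(query, index[rid]))
```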

Pipeline architecture

The indexing engine is built on a flexible workflow system:

Key architectural concepts

Workflows are named sequences of operations that can be:
  • Standard: Built-in workflows like extract_graph, create_communities
  • Custom: User-defined workflows registered via the factory
Each workflow:
  • Operates on tables from previous workflows
  • Produces output tables
  • Can be run independently or as part of the full pipeline
LLM response caching is critical for resilience and efficiency:
  • Cache key: Prompt + parameters uniquely identify requests
  • Cache hit: Returns stored result instead of API call
  • Benefits:
    • Resilience to network errors
    • Idempotent pipeline execution
    • Cost savings on reruns
# Cache layer wraps all LLM interactions
model = create_completion(
    model_config,
    cache=context.cache.child("extract_graph"),
    cache_key_creator=cache_key_creator,
)
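The cache-key idea can be sketched by hashing the prompt together with a canonical serialization of the parameters (a hand-rolled illustration, not the library's actual implementation):

```python
import hashlib
import json

def make_cache_key(prompt, parameters):
    """Prompt + parameters uniquely identify a request."""
    payload = json.dumps({"prompt": prompt, "params": parameters}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

cache = {}

def cached_complete(prompt, parameters, call_model):
    key = make_cache_key(prompt, parameters)
    if key not in cache:               # cache miss: pay for the model call
        cache[key] = call_model(prompt)
    return cache[key]                  # cache hit: free, deterministic rerun

calls = []
fake_model = lambda p: calls.append(p) or f"response to {p}"

cached_complete("extract entities", {"temperature": 0}, fake_model)
cached_complete("extract entities", {"temperature": 0}, fake_model)
# the second call is served from cache; the model is invoked only once
```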
GraphRAG uses factories for extensibility:
  • Language models: Custom model providers
  • Input readers: Custom document formats
  • Cache: Custom cache storage
  • Storage: Custom table storage
  • Vector stores: Custom vector databases
  • Workflows: Custom pipeline steps
Register custom implementations:
from graphrag.vector_stores.factory import register_vector_store

register_vector_store("my_store", MyVectorStore)

Running the indexing pipeline

# Run with default configuration
uv run poe index --root <data_root>

# With custom config
uv run poe index --root <data_root> --config custom_config.yaml

# Resume from cache
uv run poe index --root <data_root> --resume

Best practices

Start small

Test with a small dataset first to understand costs, processing time, and output quality before scaling up.

Monitor extraction

Check entity and relationship counts after Phase 3. Low numbers indicate prompt tuning may be needed.

Tune prompts

Use the prompt tuning process to optimize extraction for your domain before processing large datasets.

Configure caching

Ensure LLM caching is enabled and properly configured to handle network issues and enable reruns.

Next steps

Community detection

Deep dive into hierarchical Leiden clustering

Retrieval methods

Learn how indexed data powers different search strategies
