Dataflow overview
The indexing pipeline consists of six major phases.

Phase 1: Compose TextUnits
The first phase transforms input documents into TextUnits. A TextUnit is a chunk of text used for graph extraction techniques and source references.

The chunk size (counted in tokens) is user-configurable. By default it is set to 1200 tokens.
Chunking considerations
Larger chunks
- Lower-fidelity output
- Less meaningful references
- Much faster processing
Smaller chunks
- Higher-fidelity output
- More meaningful references
- Slower processing
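Token-counted chunking can be sketched as below. This toy version operates on an already-tokenized list rather than calling a real tokenizer, and the overlap value shown is illustrative; `chunk_size` mirrors the configurable setting described above.

```python
def chunk_tokens(tokens, chunk_size=1200, overlap=100):
    """Split a token sequence into overlapping chunks (TextUnits).

    A real pipeline would encode text with the configured model tokenizer
    first; here `tokens` is any pre-tokenized sequence.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Small example: 10 tokens, chunks of 4 with 1 token of overlap.
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Larger `chunk_size` values mean fewer chunks (and fewer LLM calls), which is the speed/fidelity trade-off listed above.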
Phase 2: Document processing
In this phase, the Documents table is created for the knowledge model. Documents are linked to their constituent text units for provenance tracking.

Link to TextUnits
This step links each document to the text units created in Phase 1, establishing bidirectional relationships between documents and their chunks.

Phase 3: Graph extraction
In this phase, each text unit is analyzed to extract graph primitives: Entities, Relationships, and Claims.

Entity and relationship extraction
The first step processes each text unit to extract entities and relationships using the LLM.

Extract from text units
Each text unit is processed to extract:
- Entities with a title, type, and description
- Relationships with a source, target, and description
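The extraction targets above can be pictured as simple records. These dataclasses are an illustrative sketch of the shapes, not the pipeline's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    title: str
    type: str
    description: str

@dataclass
class Relationship:
    source: str
    target: str
    description: str

# Example primitives as they might come back from one text unit
# (names and values are hypothetical):
acme = Entity(title="Acme", type="organization",
              description="A manufacturing company.")
supplies = Relationship(source="Acme", target="Widgets Inc",
                        description="Acme supplies parts to Widgets Inc.")
```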
Entity and relationship summarization
Once the graph is built, each entity and relationship has a list of descriptions that is summarized into a single concise description using the LLM. This gives every entity and relationship one description that captures all of its distinct information.
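A minimal sketch of this gather-then-summarize step follows. The `summarize` function is a placeholder standing in for the LLM call; its join-distinct logic is illustrative only.

```python
from collections import defaultdict

def collect_descriptions(mentions):
    """Group per-mention descriptions by entity title."""
    grouped = defaultdict(list)
    for title, description in mentions:
        grouped[title].append(description)
    return grouped

def summarize(descriptions):
    # Placeholder: the real pipeline sends the list to an LLM with a
    # summarization prompt; here we just join the distinct descriptions.
    return " ".join(dict.fromkeys(descriptions))

grouped = collect_descriptions([
    ("Acme", "A manufacturing company."),
    ("Acme", "Founded in 1947."),
    ("Acme", "A manufacturing company."),
])
print(summarize(grouped["Acme"]))
# -> A manufacturing company. Founded in 1947.
```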
Claim extraction (optional)
Claims are extracted as an independent workflow from the source TextUnits. These claims represent positive factual statements with an evaluated status and time bounds. The claims are exported as a primary artifact called Covariates.

Phase 4: Graph augmentation
Now that we have a usable graph of entities and relationships, the next step is to understand their community structure using hierarchical clustering.

Community detection
This step generates a hierarchy of entity communities using the Hierarchical Leiden algorithm. This method applies recursive community clustering to the graph until a community-size threshold is reached. The hierarchy provides a way to navigate and summarize the graph at different levels of granularity.
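The recursive "split until under the size threshold" control flow can be sketched as follows. Here `partition` is a stand-in for one level of Leiden clustering; this toy does not implement Leiden itself, and the threshold value is illustrative.

```python
def detect_communities(nodes, partition, max_size=10, level=0):
    """Recursively split `nodes` until every community is <= max_size.

    Returns (level, members) pairs, mimicking a community hierarchy.
    `partition` stands in for one level of Leiden clustering.
    """
    if len(nodes) <= max_size:
        return [(level, nodes)]
    communities = []
    for sub in partition(nodes):
        communities.extend(
            detect_communities(sub, partition, max_size, level + 1))
    return communities

# Toy partition that just splits a community in half.
halve = lambda ns: [ns[:len(ns) // 2], ns[len(ns) // 2:]]
result = detect_communities(list(range(20)), halve, max_size=5)
# 20 nodes split twice -> four level-2 communities of 5 nodes each.
```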
Graph tables
Once graph augmentation is complete, the final Entities, Relationships, and Communities tables are exported.

Phase 5: Community summarization
Community reports are generated for each community in the hierarchy, providing high-level understanding at various levels of granularity.

Generate community reports
A summary is generated for each community using the LLM. These reports contain:

- Executive overview: a high-level summary of the community’s content and significance
- Key entities: references to important entities within the community
- Relationships: important connections between entities in the community
- Claims: relevant claims extracted from the community (if enabled)
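Assembling the LLM input for one report might look like the sketch below. The field names and layout are hypothetical, not the actual prompt format used by the pipeline.

```python
def build_report_context(community):
    """Assemble the text sent to the LLM for one community report.

    `community` is a dict with 'entities', 'relationships', and optional
    'claims' lists (illustrative field names, not the real schema).
    """
    lines = ["Entities:"]
    lines += [f"- {e['title']}: {e['description']}"
              for e in community["entities"]]
    lines.append("Relationships:")
    lines += [f"- {r['source']} -> {r['target']}: {r['description']}"
              for r in community["relationships"]]
    if community.get("claims"):
        lines.append("Claims:")
        lines += [f"- {c}" for c in community["claims"]]
    return "\n".join(lines)

ctx = build_report_context({
    "entities": [{"title": "Acme", "description": "A company."}],
    "relationships": [{"source": "Acme", "target": "Bob",
                       "description": "employs"}],
    "claims": ["Acme employs Bob."],
})
```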
Summarize community reports
Each community report is then summarized via the LLM for shorthand use in queries.

Community reports table
At this point, bookkeeping work is performed and the Community Reports table is exported.

Phase 6: Text embedding
For all artifacts that require downstream vector search, text embeddings are generated as a final step. Embeddings are written directly to a configured vector store.
Default embedding targets
By default, the following are embedded:

- Entity descriptions - For entity-based vector search
- Text unit text - For chunk-based retrieval
- Community report text - For high-level semantic search
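The final embedding pass can be sketched as a loop over artifact rows. The `embed` callable and the store interface below are hypothetical stand-ins for the configured embedding model and vector store, not a real client API.

```python
def embed_artifacts(rows, embed, store):
    """Embed each artifact's text and write it to a vector store.

    `embed` maps text -> vector; `store.upsert` is a stand-in for a
    real vector-store client's write call.
    """
    for row in rows:
        vector = embed(row["text"])
        store.upsert(id=row["id"], vector=vector,
                     payload={"text": row["text"]})

class InMemoryStore:
    """Minimal stand-in for a configured vector store."""
    def __init__(self):
        self.records = {}
    def upsert(self, id, vector, payload):
        self.records[id] = (vector, payload)

store = InMemoryStore()
embed = lambda text: [float(len(text))]  # toy "embedding"
embed_artifacts([{"id": "e-1", "text": "Acme Corp"}], embed, store)
```

The same loop would run once per embedding target (entity descriptions, text unit text, community report text), with the rows drawn from the corresponding table.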
Next steps

- Outputs: learn about the Parquet output schemas
- Methods: compare Standard and FastGraphRAG indexing
- Configuration: configure chunk size, prompts, and more