Dataflow overview
The indexing pipeline consists of six major phases.

Phase 1: Compose TextUnits
The first phase transforms input documents into TextUnits. A TextUnit is a chunk of text used for graph extraction techniques and source references.

The chunk size (counted in tokens) is user-configurable. By default it is set to 1200 tokens.
Chunking considerations
Larger chunks
- Lower-fidelity output
- Less meaningful references
- Much faster processing
Smaller chunks
- Higher-fidelity output
- More meaningful references
- Slower processing
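Token-counted chunking can be sketched as below. This toy version operates on an already-tokenized list rather than calling a real tokenizer, and the overlap value shown is illustrative; `chunk_size` mirrors the configurable setting described above.

```python
def chunk_tokens(tokens, chunk_size=1200, overlap=100):
    """Split a token sequence into overlapping chunks (TextUnits).

    A real pipeline would encode text with the configured model tokenizer
    first; here `tokens` is any pre-tokenized sequence.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Small example: 10 tokens, chunks of 4 with 1 token of overlap.
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Larger `chunk_size` values mean fewer chunks (and fewer LLM calls), which is the speed/fidelity trade-off listed above.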
Phase 2: Document processing
In this phase, the Documents table is created for the knowledge model. Documents are linked to their constituent text units for provenance tracking.

Link to TextUnits
This step links each document to the text units created in Phase 1, establishing bidirectional relationships between documents and their chunks.

Phase 3: Graph extraction
In this phase, each text unit is analyzed to extract graph primitives: Entities, Relationships, and Claims.

Entity and relationship extraction
The first step processes each text unit to extract entities and relationships using the LLM.

Extract from text units
Each text unit is processed to extract:
- Entities with a title, type, and description
- Relationships with a source, target, and description
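The extraction targets above can be pictured as simple records. These dataclasses are an illustrative sketch of the shapes, not the pipeline's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    title: str
    type: str
    description: str

@dataclass
class Relationship:
    source: str
    target: str
    description: str

# Example primitives as they might come back from one text unit
# (names and values are hypothetical):
acme = Entity(title="Acme", type="organization",
              description="A manufacturing company.")
supplies = Relationship(source="Acme", target="Widgets Inc",
                        description="Acme supplies parts to Widgets Inc.")
```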
Entity and relationship summarization
Once the graph is built, each entity and relationship has a list of descriptions that is summarized into a single concise description using the LLM. This gives every entity and relationship one description that captures all of its distinct information.
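A minimal sketch of this gather-then-summarize step follows. The `summarize` function is a placeholder standing in for the LLM call; its join-distinct logic is illustrative only.

```python
from collections import defaultdict

def collect_descriptions(mentions):
    """Group per-mention descriptions by entity title."""
    grouped = defaultdict(list)
    for title, description in mentions:
        grouped[title].append(description)
    return grouped

def summarize(descriptions):
    # Placeholder: the real pipeline sends the list to an LLM with a
    # summarization prompt; here we just join the distinct descriptions.
    return " ".join(dict.fromkeys(descriptions))

grouped = collect_descriptions([
    ("Acme", "A manufacturing company."),
    ("Acme", "Founded in 1947."),
    ("Acme", "A manufacturing company."),
])
print(summarize(grouped["Acme"]))
# -> A manufacturing company. Founded in 1947.
```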
Claim extraction (optional)
Claims are extracted as an independent workflow from the source TextUnits. These claims represent positive factual statements with an evaluated status and time bounds. The claims are exported as a primary artifact called Covariates.

Phase 4: Graph augmentation
Now that we have a usable graph of entities and relationships, the next step is to understand their community structure using hierarchical clustering.

Community detection
This step generates a hierarchy of entity communities using the Hierarchical Leiden algorithm. This method applies recursive community clustering to the graph until a community-size threshold is reached. The hierarchy provides a way to navigate and summarize the graph at different levels of granularity.
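The recursive "split until under the size threshold" control flow can be sketched as follows. Here `partition` is a stand-in for one level of Leiden clustering; this toy does not implement Leiden itself, and the threshold value is illustrative.

```python
def detect_communities(nodes, partition, max_size=10, level=0):
    """Recursively split `nodes` until every community is <= max_size.

    Returns (level, members) pairs, mimicking a community hierarchy.
    `partition` stands in for one level of Leiden clustering.
    """
    if len(nodes) <= max_size:
        return [(level, nodes)]
    communities = []
    for sub in partition(nodes):
        communities.extend(
            detect_communities(sub, partition, max_size, level + 1))
    return communities

# Toy partition that just splits a community in half.
halve = lambda ns: [ns[:len(ns) // 2], ns[len(ns) // 2:]]
result = detect_communities(list(range(20)), halve, max_size=5)
# 20 nodes split twice -> four level-2 communities of 5 nodes each.
```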
Graph tables
Once graph augmentation is complete, the final Entities, Relationships, and Communities tables are exported.

Phase 5: Community summarization
Community reports are generated for each community in the hierarchy, providing high-level understanding at various levels of granularity.

Generate community reports
A summary is generated for each community using the LLM. These reports contain:

- Executive overview: a high-level summary of the community’s content and significance
- Key entities: references to important entities within the community
- Relationships: important connections between entities in the community
- Claims: relevant claims extracted from the community (if enabled)
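Assembling the LLM input for one report might look like the sketch below. The field names and layout are hypothetical, not the actual prompt format used by the pipeline.

```python
def build_report_context(community):
    """Assemble the text sent to the LLM for one community report.

    `community` is a dict with 'entities', 'relationships', and optional
    'claims' lists (illustrative field names, not the real schema).
    """
    lines = ["Entities:"]
    lines += [f"- {e['title']}: {e['description']}"
              for e in community["entities"]]
    lines.append("Relationships:")
    lines += [f"- {r['source']} -> {r['target']}: {r['description']}"
              for r in community["relationships"]]
    if community.get("claims"):
        lines.append("Claims:")
        lines += [f"- {c}" for c in community["claims"]]
    return "\n".join(lines)

ctx = build_report_context({
    "entities": [{"title": "Acme", "description": "A company."}],
    "relationships": [{"source": "Acme", "target": "Bob",
                       "description": "employs"}],
    "claims": ["Acme employs Bob."],
})
```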
Summarize community reports
Each community report is then summarized via the LLM for shorthand use in queries.

Community reports table
At this point, bookkeeping work is performed and the Community Reports table is exported.

Phase 6: Text embedding
For all artifacts that require downstream vector search, text embeddings are generated as a final step. Embeddings are written directly to a configured vector store.
Default embedding targets
By default, the following are embedded:

- Entity descriptions - For entity-based vector search
- Text unit text - For chunk-based retrieval
- Community report text - For high-level semantic search
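The final embedding pass can be sketched as a loop over artifact rows. The `embed` callable and the store interface below are hypothetical stand-ins for the configured embedding model and vector store, not a real client API.

```python
def embed_artifacts(rows, embed, store):
    """Embed each artifact's text and write it to a vector store.

    `embed` maps text -> vector; `store.upsert` is a stand-in for a
    real vector-store client's write call.
    """
    for row in rows:
        vector = embed(row["text"])
        store.upsert(id=row["id"], vector=vector,
                     payload={"text": row["text"]})

class InMemoryStore:
    """Minimal stand-in for a configured vector store."""
    def __init__(self):
        self.records = {}
    def upsert(self, id, vector, payload):
        self.records[id] = (vector, payload)

store = InMemoryStore()
embed = lambda text: [float(len(text))]  # toy "embedding"
embed_artifacts([{"id": "e-1", "text": "Acme Corp"}], embed, store)
```

The same loop would run once per embedding target (entity descriptions, text unit text, community report text), with the rows drawn from the corresponding table.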
Next steps

- Outputs: learn about the Parquet output schemas
- Methods: compare Standard and FastGraphRAG indexing
- Configuration: configure chunk size, prompts, and more