The default pipeline produces a series of output tables that align with the GraphRAG knowledge model. By default, these tables are written as Parquet files to disk.
Embeddings are not stored in these tables; they are written directly to your configured vector store for efficient downstream retrieval.

Shared fields

All tables have two identifier fields for global uniqueness and human readability:
  • id (str): Generated UUID, ensuring global uniqueness across all records.
  • human_readable_id (int): Incremented short ID created per run. Used in generated summaries with citations for easy visual cross-reference.
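As a quick illustration of the two identifier fields, here is a toy frame (synthetic data, not real pipeline output) showing that the UUID id is globally unique while human_readable_id is a simple per-run counter:

```python
import uuid
import pandas as pd

# Toy rows using the two shared identifier fields (synthetic data)
df = pd.DataFrame({
    "id": [str(uuid.uuid4()) for _ in range(3)],
    "human_readable_id": [0, 1, 2],
})

# `id` is globally unique; `human_readable_id` only within a single run
assert df["id"].is_unique
assert df["human_readable_id"].is_unique
```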

Communities

This table contains the final communities generated by the Leiden algorithm. Communities are strictly hierarchical, subdividing into children as cluster affinity is narrowed.
  • community (int): Leiden-generated cluster ID for the community. These increment with depth and are unique through all levels of the hierarchy. For this table, human_readable_id is a copy of the community ID.
  • parent (int): Parent community ID.
  • children (int[]): List of child community IDs.
  • level (int): Depth of the community in the hierarchy.
  • title (str): Friendly name of the community.
  • entity_ids (str[]): List of entities that are members of the community.
  • relationship_ids (str[]): List of relationships wholly within the community (source and target both in the community).
  • text_unit_ids (str[]): List of text units represented within the community.
  • period (str): Date of ingest in ISO 8601 format; used for incremental update merges.
  • size (int): Size of the community (entity count); used for incremental update merges.

import pandas as pd

communities = pd.read_parquet("output/communities.parquet")
print(communities.head())

# Sample output:
#   id                    community  parent  children      level  title               entity_ids          relationship_ids
#   abc123-def456-...     0          -1      [1, 2, 3]     0      Community 0         [ent1, ent2, ...]  [rel1, rel2, ...]
#   def456-ghi789-...     1          0       []            1      Community 1         [ent3, ent4, ...]  [rel3, rel4, ...]
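Because the hierarchy is encoded by the parent and children columns, you can walk it with plain pandas. A minimal sketch using a toy frame that mimics the schema above (synthetic data; root communities are identified by parent == -1, as in the sample output):

```python
import pandas as pd

# Toy frame mimicking the communities schema (synthetic data)
communities = pd.DataFrame({
    "community": [0, 1, 2, 3],
    "parent": [-1, 0, 0, 1],
    "children": [[1, 2], [3], [], []],
    "level": [0, 1, 1, 2],
})

def descendants(df: pd.DataFrame, community_id: int) -> list:
    """Collect all descendant community IDs by following the parent column."""
    kids = df.loc[df["parent"] == community_id, "community"].tolist()
    out = []
    for k in kids:
        out.append(k)
        out.extend(descendants(df, k))
    return out

roots = communities.loc[communities["parent"] == -1, "community"].tolist()
print(roots)                        # [0]
print(descendants(communities, 0))  # [1, 3, 2]
```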

Community reports

This table contains the summarized reports for each community, generated by the LLM.
  • community (int): Short ID of the community this report applies to.
  • parent (int): Parent community ID.
  • children (int[]): List of child community IDs.
  • level (int): Level of the community this report applies to.
  • title (str): LLM-generated title for the report.
  • summary (str): LLM-generated summary of the report.
  • full_content (str): LLM-generated full report.
  • rank (float): LLM-derived relevance ranking based on member entity salience.
  • rating_explanation (str): LLM-derived explanation of the rank.
  • findings (dict): LLM-derived list of the top 5-10 insights from the community. Each entry contains summary and explanation values.
  • full_content_json (json): Full JSON output as returned by the LLM. Most fields are extracted into columns, but this JSON is sent for query summarization so that prompt tuning can add fields or content.
  • period (str): Date of ingest in ISO 8601 format; used for incremental update merges.
  • size (int): Size of the community (entity count); used for incremental update merges.

import pandas as pd

reports = pd.read_parquet("output/community_reports.parquet")
print(reports[['community', 'title', 'summary']].head())

# Sample output:
#   community  title                           summary
#   0          Global Technology Ecosystem     This community represents major technology companies...
#   1          Social Media Platforms          A focused group of social networking services...
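The findings column nests one record per insight, so it is often useful to flatten it. A sketch with a toy row (synthetic data), assuming findings is stored as a list of dicts with summary and explanation keys as described above:

```python
import pandas as pd

# Toy report row mimicking the community_reports schema (synthetic data)
reports = pd.DataFrame({
    "community": [0],
    "title": ["Example Community"],
    "findings": [[
        {"summary": "Key insight A", "explanation": "Why A matters..."},
        {"summary": "Key insight B", "explanation": "Why B matters..."},
    ]],
})

# One finding per row, then pull out the summaries
flat = reports.explode("findings").reset_index(drop=True)
summaries = [f["summary"] for f in flat["findings"]]
print(summaries)  # ['Key insight A', 'Key insight B']
```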

Covariates

This optional table is generated when claim extraction is enabled. Claims typically identify malicious behavior such as fraud, so they are not useful for all datasets.
Claim extraction is off by default and requires configuration to enable.
  • covariate_type (str): Always “claim” with the default covariates.
  • type (str): Nature of the claim type.
  • description (str): LLM-generated description of the behavior.
  • subject_id (str): Name of the source entity (the one performing the claimed behavior).
  • object_id (str): Name of the target entity (the one the behavior is performed on).
  • status (str): LLM-derived assessment of correctness. One of TRUE, FALSE, or SUSPECTED.
  • start_date (str): LLM-derived start of the claimed activity (ISO 8601).
  • end_date (str): LLM-derived end of the claimed activity (ISO 8601).
  • source_text (str): Short span of text containing the claimed behavior.
  • text_unit_id (str): ID of the text unit the claim was extracted from.

import pandas as pd

covariates = pd.read_parquet("output/covariates.parquet")
print(covariates[['subject_id', 'type', 'status', 'description']].head())

# Sample output:
#   subject_id    type           status      description
#   Company A     ACQUISITION    TRUE        Company A acquired Company B for $10B
#   Person X      FRAUD          SUSPECTED   Person X allegedly misused funds

Documents

This table contains the documents and their content after import.
  • title (str): Filename, unless otherwise configured during CSV/JSON import.
  • text (str): Full text of the document.
  • text_unit_ids (str[]): List of text units (chunks) parsed from the document.
  • metadata (dict): If specified during CSV/JSON import, a dict of metadata for the document.

import pandas as pd

documents = pd.read_parquet("output/documents.parquet")
print(documents[['title', 'text_unit_ids']].head())

# Sample output:
#   title               text_unit_ids
#   article1.txt        [unit1, unit2, unit3]
#   article2.txt        [unit4, unit5]

Entities

This table contains all entities found in the data by the LLM.
  • title (str): Name of the entity.
  • type (str): Type of the entity. By default one of “organization”, “person”, “geo”, or “event” (unless configured differently or auto-tuning is used).
  • description (str): Textual description of the entity. Since an entity may be found in many text units, this is an LLM-derived summary of all its descriptions.
  • text_unit_ids (str[]): List of the text units containing the entity.
  • frequency (int): Count of text units the entity was found within.
  • degree (int): Node degree (connectedness) in the graph.

import pandas as pd

entities = pd.read_parquet("output/entities.parquet")
print(entities[['title', 'type', 'description', 'degree']].head())

# Sample output:
#   title              type          description                                      degree
#   Microsoft          organization  A multinational technology corporation...        42
#   Satya Nadella      person        CEO of Microsoft Corporation...                  18
#   Seattle            geo           City in Washington state, headquarters...        15

Relationships

This table contains all entity-to-entity relationships found in the data by the LLM. This is also the edge list for the graph.
  • source (str): Name of the source entity.
  • target (str): Name of the target entity.
  • description (str): LLM-derived description of the relationship. Like entity descriptions, this is summarized from multiple instances.
  • weight (float): Weight of the edge in the graph, summed from an LLM-derived “strength” measure for each relationship instance.
  • combined_degree (int): Sum of the source and target node degrees.
  • text_unit_ids (str[]): List of text units the relationship was found within.

import pandas as pd

relationships = pd.read_parquet("output/relationships.parquet")
print(relationships[['source', 'target', 'description', 'weight']].head())

# Sample output:
#   source          target           description                          weight
#   Microsoft       Azure            Microsoft develops and operates...   0.95
#   Satya Nadella   Microsoft        Satya Nadella serves as CEO of...   0.98
#   Microsoft       OpenAI           Microsoft has invested in and...    0.87
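Since this table is the graph's edge list, node degrees (and the combined_degree column) can be recomputed from it directly. A sketch with a toy edge list mimicking the schema (synthetic data):

```python
import pandas as pd

# Toy edge list mimicking the relationships schema (synthetic data)
relationships = pd.DataFrame({
    "source": ["Microsoft", "Satya Nadella", "Microsoft"],
    "target": ["Azure", "Microsoft", "OpenAI"],
    "weight": [0.95, 0.98, 0.87],
})

# Node degree: number of edges touching each entity name
degree = pd.concat([relationships["source"], relationships["target"]]).value_counts()
print(degree["Microsoft"])  # 3

# combined_degree for each edge is the sum of its endpoint degrees
relationships["combined_degree"] = (
    relationships["source"].map(degree) + relationships["target"].map(degree)
)
print(relationships["combined_degree"].tolist())  # [4, 4, 4]
```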

Text units

This table contains all text chunks parsed from the input documents.
  • text (str): Raw full text of the chunk.
  • n_tokens (int): Number of tokens in the chunk. Normally matches the chunk_size config parameter, except for the last chunk of a document, which is often shorter.
  • document_id (str): ID of the document the chunk came from.
  • entity_ids (str[]): List of entities found in the text unit.
  • relationship_ids (str[]): List of relationships found in the text unit.
  • covariate_ids (str[]): Optional list of covariates found in the text unit.

import pandas as pd

text_units = pd.read_parquet("output/text_units.parquet")
print(text_units[['text', 'n_tokens', 'entity_ids']].head())

# Sample output:
#   text                                          n_tokens  entity_ids
#   Microsoft Corporation is a technology...      1200      [Microsoft, Bill Gates, ...]
#   The company was founded in 1975...            1200      [Microsoft, Paul Allen, ...]
#   Azure is Microsoft's cloud computing...       850       [Azure, Microsoft, ...]
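The document_id column links each chunk back to its source document, which is useful for provenance. A sketch joining the two tables, using toy frames that mimic the schemas (synthetic data; the shared id field serves as the documents key):

```python
import pandas as pd

# Toy frames mimicking the text_units and documents schemas (synthetic data)
text_units = pd.DataFrame({
    "id": ["unit1", "unit2", "unit3"],
    "text": ["chunk one...", "chunk two...", "chunk three..."],
    "document_id": ["doc1", "doc1", "doc2"],
})
documents = pd.DataFrame({
    "id": ["doc1", "doc2"],
    "title": ["article1.txt", "article2.txt"],
})

# Trace each chunk back to its source document via document_id
joined = text_units.merge(
    documents, left_on="document_id", right_on="id", suffixes=("", "_doc")
)
print(joined[["text", "title"]])
```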

Working with Parquet files

import pandas as pd

# Read a single table
entities = pd.read_parquet("output/entities.parquet")
relationships = pd.read_parquet("output/relationships.parquet")

# Filter and analyze
high_degree_entities = entities[entities['degree'] > 10]
print(f"Found {len(high_degree_entities)} highly connected entities")
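Cross-table ID lists such as entity_ids can be resolved to names by exploding the list and joining on the target table's id field. A sketch with toy frames mimicking the schemas (synthetic data):

```python
import pandas as pd

# Toy frames: resolve a community's entity_ids to entity titles (synthetic data)
entities = pd.DataFrame({
    "id": ["ent1", "ent2"],
    "title": ["Microsoft", "Azure"],
})
communities = pd.DataFrame({
    "community": [0],
    "entity_ids": [["ent1", "ent2"]],
})

# One membership per row, then join on the entities table to get names
members = (
    communities.explode("entity_ids")
    .merge(entities, left_on="entity_ids", right_on="id")
)
print(members["title"].tolist())  # ['Microsoft', 'Azure']
```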

Storage locations

By default, Parquet files are written to the output directory specified in your configuration:
settings.yaml
storage:
  type: file
  base_dir: "output"
Files are written to:
  • output/entities.parquet
  • output/relationships.parquet
  • output/communities.parquet
  • etc.

Next steps

Custom graphs

Learn how to bring your own existing graph data

Querying

Use the output tables for GraphRAG queries

Configuration

Configure storage providers and output settings
