Knowledge graphs are the foundation of GraphRAG’s ability to reason about complex information. Unlike traditional RAG approaches that treat documents as unstructured text, GraphRAG uses LLMs to extract explicit entities and relationships, creating a structured graph representation of your data.

What is a knowledge graph?

A knowledge graph is a structured representation of information where:
  • Nodes represent entities (people, places, organizations, events, concepts)
  • Edges represent relationships between entities
  • Attributes provide descriptive information about nodes and edges
In GraphRAG, this structure enables sophisticated reasoning by making the connections between concepts explicit and traversable.
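The nodes/edges/attributes structure above can be sketched with plain Python containers. This is illustrative only; the field names here mirror the examples later on this page, not GraphRAG's actual storage schema (which uses parquet tables):

```python
# Minimal sketch of a knowledge graph: entities as nodes with attributes,
# relationships as edges with attributes. Illustrative, not GraphRAG's schema.
nodes = {
    "Microsoft Corporation": {"type": "ORGANIZATION"},
    "Satya Nadella": {"type": "PERSON"},
}
edges = [
    {
        "source": "Microsoft Corporation",
        "target": "Satya Nadella",
        "description": "Satya Nadella is the CEO of Microsoft Corporation",
        "weight": 8.5,
    }
]

def neighbors(entity):
    """All entities directly connected to `entity`, in either direction."""
    out = set()
    for edge in edges:
        if edge["source"] == entity:
            out.add(edge["target"])
        elif edge["target"] == entity:
            out.add(edge["source"])
    return out
```

Because every connection is an explicit edge, traversal is a lookup rather than a text search, which is what makes the graph "traversable" in the sense used above.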
[Figure: an LLM-generated knowledge graph showing entities (circles) sized by degree, with colors representing community membership]

Graph extraction process

GraphRAG builds knowledge graphs through a multi-step extraction and refinement pipeline.

Entity extraction

Entities are extracted from each text unit using LLM-based analysis. The extraction process identifies:
  • Title: The canonical name of the entity ("title": "Microsoft Corporation")
  • Type: The category of entity, configurable per dataset ("type": "ORGANIZATION")
  • Description: Contextual information about the entity ("description": "A multinational technology company...")
  • Text unit references: Links back to source text ("text_unit_ids": ["unit_001", "unit_042", "unit_127"])
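Assembled into one record, the fields above look like the following (illustrative; the values are the same examples shown for each field):

```python
# One extracted entity record, combining the fields described above
entity = {
    "title": "Microsoft Corporation",
    "type": "ORGANIZATION",
    "description": "A multinational technology company...",
    "text_unit_ids": ["unit_001", "unit_042", "unit_127"],
}
```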

Relationship extraction

Relationships connect entities and capture the semantic connections in your data. Each relationship contains:
  • Source entity: The starting point of the relationship ("source": "Microsoft Corporation")
  • Target entity: The ending point of the relationship ("target": "Satya Nadella")
  • Description: The nature and context of the relationship ("description": "Satya Nadella is the CEO of Microsoft Corporation")
  • Weight: Strength or importance of the relationship, derived from frequency and context ("weight": 8.5)
  • Text unit IDs: Source references for the relationship ("text_unit_ids": ["unit_001", "unit_042"])
When the same relationship appears in multiple text units:
  1. Collection: All descriptions are gathered into a list
  2. Deduplication: Identical descriptions are removed
  3. Summarization: The LLM creates a single concise description capturing all distinct information
  4. Weight calculation: Frequency and context determine relationship strength
This ensures each entity pair has a single, comprehensive relationship description.
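The four merge steps above can be sketched as follows. The `summarize` parameter stands in for the LLM summarization call, and the weight calculation here is simplified to a mention count; both are assumptions for illustration:

```python
# Sketch of merging duplicate relationships between the same entity pair.
# `summarize` is a placeholder for the LLM call; weight here is simply the
# number of supporting mentions, a simplification of the real calculation.
from collections import defaultdict

def merge_relationships(relationships, summarize):
    # 1. Collection: group all mentions of the same (source, target) pair
    grouped = defaultdict(list)
    for rel in relationships:
        grouped[(rel["source"], rel["target"])].append(rel)

    merged = []
    for (source, target), group in grouped.items():
        # 2. Deduplication: drop identical descriptions, preserving order
        descriptions = list(dict.fromkeys(r["description"] for r in group))
        merged.append({
            "source": source,
            "target": target,
            # 3. Summarization: collapse multiple descriptions into one
            "description": descriptions[0] if len(descriptions) == 1
                           else summarize(descriptions),
            # 4. Weight: frequency-based strength (simplified)
            "weight": float(len(group)),
        })
    return merged
```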
Relationship direction is handled differently at each stage of the pipeline:
  • Extraction: Relationships are extracted with explicit direction (source → target)
  • Community detection: The graph is treated as undirected for clustering
  • Querying: Both directions are considered when traversing relationships
# Edge normalization for community detection: order each pair
# lexicographically so (A, B) and (B, A) collapse to one edge
lo = edge_df[["source", "target"]].min(axis=1)
hi = edge_df[["source", "target"]].max(axis=1)
edge_df["source"] = lo
edge_df["target"] = hi
# drop_duplicates returns a new DataFrame; assign the result back
edge_df = edge_df.drop_duplicates(subset=["source", "target"], keep="last")
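A small worked example of this normalization (with the `drop_duplicates` result assigned back, which the snippet above omits), on a hypothetical two-edge frame:

```python
# Demonstration of edge normalization on a tiny DataFrame
import pandas as pd

edge_df = pd.DataFrame({
    "source": ["A", "C"],
    "target": ["B", "A"],
    "weight": [1.0, 2.0],
})

# Row-wise min/max orders each pair lexicographically
lo = edge_df[["source", "target"]].min(axis=1)
hi = edge_df[["source", "target"]].max(axis=1)
edge_df["source"] = lo
edge_df["target"] = hi
edge_df = edge_df.drop_duplicates(subset=["source", "target"], keep="last")

# Both edges are now stored with source <= target: (A, B) and (A, C),
# so the reversed edge (C, A) matches any (A, C) duplicate
```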

Entity and relationship summarization

After extraction, entities and relationships often have multiple descriptions from different text units. The summarization phase consolidates these:
1. Collect descriptions

Entities and relationships with the same identity gather all their descriptions from different text units into lists.
2. LLM summarization

The summarization model receives all descriptions and generates a single concise summary that captures all distinct information:
entity_summaries, relationship_summaries = await summarize_descriptions(
    entities_df=extracted_entities,
    relationships_df=extracted_relationships,
    model=summarization_model,
    max_summary_length=max_summary_length,
    prompt=summarization_prompt,
)
3. Replace descriptions

Original description lists are replaced with the summarized versions, creating clean, consistent descriptions across the knowledge graph.
Summarization is crucial for managing token counts in downstream queries and ensuring each entity/relationship has coherent, non-redundant descriptions.

Claim extraction (covariates)

Beyond entities and relationships, GraphRAG can extract claims—factual statements about entities that may be time-bound.
Claims (called “covariates” in the data model) are assertions about entities with specific properties:
  • Subject: The entity the claim is about
  • Object: What is being claimed
  • Type: Category of claim
  • Status: Validity or confidence level
  • Start/End date: Time bounds when applicable
  • Description: Full context of the claim
  • Source references: Links to supporting text units
Example:
{
  "subject": "Microsoft",
  "object": "acquired GitHub",
  "type": "ACQUISITION",
  "status": "CONFIRMED",
  "start_date": "2018-06-04",
  "description": "Microsoft acquired GitHub for $7.5 billion"
}

Graph properties and metrics

Once extracted, the knowledge graph has several important properties:

Entity ranking

Entities are ranked by importance using graph metrics:
  • Degree: Number of relationships connected to the entity
  • Centrality: Position in the network (highly connected entities have higher centrality)
  • Community membership: Which communities the entity belongs to at different hierarchy levels
# Entity model attributes
rank: int = 1  # Higher rank = more important entity
community_ids: list[str]  # Community memberships
Entity rank influences:
  • Prioritization in local search results
  • Size of nodes in graph visualizations
  • Context window allocation during retrieval
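Of the metrics above, degree is the simplest to compute: count each entity's incident relationships. A minimal sketch over a relationship list (illustrative; the real pipeline computes rank on its graph tables):

```python
# Sketch: ranking entities by degree (number of incident relationships)
from collections import Counter

def rank_by_degree(relationships):
    degree = Counter()
    for rel in relationships:
        # Each relationship contributes to both endpoints' degree
        degree[rel["source"]] += 1
        degree[rel["target"]] += 1
    # Higher degree first: more connected entities rank higher
    return sorted(degree.items(), key=lambda kv: kv[1], reverse=True)
```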

Graph structure

The complete graph structure includes:

Connectivity

Entities connected by relationships form a network where information can be traversed through multi-hop paths.

Clustering

Community detection reveals groups of densely connected entities, representing coherent topics or themes.

Hierarchy

Multiple levels of communities create a hierarchical organization from global themes to local clusters.

Provenance

Every entity and relationship maintains links to source text units and documents for verification.

From graph to retrieval

The knowledge graph enables sophisticated retrieval strategies:
  1. Entity-based entry points: Queries identify relevant entities through semantic similarity
  2. Graph traversal: Related entities and relationships are retrieved by following edges
  3. Community context: Entities’ community memberships provide broader thematic context
  4. Multi-hop reasoning: Connections can be followed multiple steps to gather comprehensive information
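Steps 2 and 4 above amount to a bounded breadth-first traversal from the entry-point entities. A minimal sketch, assuming relationships are given as a list of source/target dicts and treated as undirected (consistent with the querying behavior described earlier):

```python
# Sketch of multi-hop retrieval: BFS from seed entities out to `max_hops`
# over an undirected adjacency built from the relationship list.
from collections import defaultdict, deque

def multi_hop(relationships, seeds, max_hops=2):
    # Build undirected adjacency: both directions are traversable
    adj = defaultdict(set)
    for rel in relationships:
        adj[rel["source"]].add(rel["target"])
        adj[rel["target"]].add(rel["source"])

    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        entity, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in adj[entity]:
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return visited
```

The hop budget bounds how much of the graph enters the context window: each extra hop widens the retrieved neighborhood.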
The next concept page on community detection explores how hierarchical clustering organizes the knowledge graph into meaningful structures.

Best practices

Entity types:
  • Start with general types (PERSON, ORGANIZATION, LOCATION)
  • Use prompt tuning to identify domain-specific types
  • Keep types consistent and well-defined
  • Avoid too many types (5-10 is usually sufficient)
Extraction quality:
  • Use appropriate text unit sizes (1200 tokens default)
  • Configure max_gleanings (1-2) for iterative refinement
  • Tune prompts for your domain using the prompt tuning guide
  • Review sample extractions before processing large datasets
Validation:
  • Check entity and relationship counts after extraction
  • Review high-degree entities for quality
  • Validate that relationships make semantic sense
  • Ensure text unit references are preserved

Next steps

Indexing pipeline

See how graph extraction fits into the full indexing workflow

Community detection

Learn how hierarchical clustering organizes the graph
