Knowledge graphs are the foundation of GraphRAG’s ability to reason about complex information. Unlike traditional RAG approaches that treat documents as unstructured text, GraphRAG uses LLMs to extract explicit entities and relationships, creating a structured graph representation of your data.

What is a knowledge graph?

A knowledge graph is a structured representation of information where:
  • Nodes represent entities (people, places, organizations, events, concepts)
  • Edges represent relationships between entities
  • Attributes provide descriptive information about nodes and edges
In GraphRAG, this structure enables sophisticated reasoning by making the connections between concepts explicit and traversable.
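The nodes/edges/attributes structure above can be sketched with plain Python containers. This is illustrative only; the field names here mirror the examples later on this page, not GraphRAG's actual storage schema (which uses parquet tables):

```python
# Minimal sketch of a knowledge graph: entities as nodes with attributes,
# relationships as edges with attributes. Illustrative, not GraphRAG's schema.
nodes = {
    "Microsoft Corporation": {"type": "ORGANIZATION"},
    "Satya Nadella": {"type": "PERSON"},
}
edges = [
    {
        "source": "Microsoft Corporation",
        "target": "Satya Nadella",
        "description": "Satya Nadella is the CEO of Microsoft Corporation",
        "weight": 8.5,
    }
]

def neighbors(entity):
    """All entities directly connected to `entity`, in either direction."""
    out = set()
    for edge in edges:
        if edge["source"] == entity:
            out.add(edge["target"])
        elif edge["target"] == entity:
            out.add(edge["source"])
    return out
```

Because every connection is an explicit edge, traversal is a lookup rather than a text search, which is what makes the graph "traversable" in the sense used above.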
[Figure: an LLM-generated knowledge graph showing entities (circles) sized by degree, with colors representing community membership]

Graph extraction process

GraphRAG builds knowledge graphs through a multi-step extraction and refinement pipeline.

Entity extraction

Entities are extracted from each text unit using LLM-based analysis. The extraction process identifies:
  • Title: The canonical name of the entity ("title": "Microsoft Corporation")
  • Type: The category of entity, configurable per dataset ("type": "ORGANIZATION")
  • Description: Contextual information about the entity ("description": "A multinational technology company...")
  • Text unit references: Links back to source text ("text_unit_ids": ["unit_001", "unit_042", "unit_127"])
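Assembled into one record, the fields above look like the following (illustrative; the values are the same examples shown for each field):

```python
# One extracted entity record, combining the fields described above
entity = {
    "title": "Microsoft Corporation",
    "type": "ORGANIZATION",
    "description": "A multinational technology company...",
    "text_unit_ids": ["unit_001", "unit_042", "unit_127"],
}
```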

Relationship extraction

Relationships connect entities and capture the semantic connections in your data. Each relationship contains:
  • Source entity: The starting point of the relationship ("source": "Microsoft Corporation")
  • Target entity: The ending point of the relationship ("target": "Satya Nadella")
  • Description: The nature and context of the relationship ("description": "Satya Nadella is the CEO of Microsoft Corporation")
  • Weight: Strength or importance of the relationship, derived from frequency and context ("weight": 8.5)
  • Text unit IDs: Source references for the relationship ("text_unit_ids": ["unit_001", "unit_042"])
When the same relationship appears in multiple text units:
  1. Collection: All descriptions are gathered into a list
  2. Deduplication: Identical descriptions are removed
  3. Summarization: The LLM creates a single concise description capturing all distinct information
  4. Weight calculation: Frequency and context determine relationship strength
This ensures each entity pair has a single, comprehensive relationship description.
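The four merge steps above can be sketched as follows. The `summarize` parameter stands in for the LLM summarization call, and the weight calculation here is simplified to a mention count; both are assumptions for illustration:

```python
# Sketch of merging duplicate relationships between the same entity pair.
# `summarize` is a placeholder for the LLM call; weight here is simply the
# number of supporting mentions, a simplification of the real calculation.
from collections import defaultdict

def merge_relationships(relationships, summarize):
    # 1. Collection: group all mentions of the same (source, target) pair
    grouped = defaultdict(list)
    for rel in relationships:
        grouped[(rel["source"], rel["target"])].append(rel)

    merged = []
    for (source, target), group in grouped.items():
        # 2. Deduplication: drop identical descriptions, preserving order
        descriptions = list(dict.fromkeys(r["description"] for r in group))
        merged.append({
            "source": source,
            "target": target,
            # 3. Summarization: collapse multiple descriptions into one
            "description": descriptions[0] if len(descriptions) == 1
                           else summarize(descriptions),
            # 4. Weight: frequency-based strength (simplified)
            "weight": float(len(group)),
        })
    return merged
```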
Relationship direction is handled differently at each stage of the pipeline:
  • Extraction: Relationships are extracted with explicit direction (source → target)
  • Community detection: The graph is treated as undirected for clustering
  • Querying: Both directions are considered when traversing relationships
# Edge normalization for community detection: order each pair
# lexicographically so (A, B) and (B, A) collapse to one edge
lo = edge_df[["source", "target"]].min(axis=1)
hi = edge_df[["source", "target"]].max(axis=1)
edge_df["source"] = lo
edge_df["target"] = hi
# drop_duplicates returns a new DataFrame; assign the result back
edge_df = edge_df.drop_duplicates(subset=["source", "target"], keep="last")
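A small worked example of this normalization (with the `drop_duplicates` result assigned back, which the snippet above omits), on a hypothetical two-edge frame:

```python
# Demonstration of edge normalization on a tiny DataFrame
import pandas as pd

edge_df = pd.DataFrame({
    "source": ["A", "C"],
    "target": ["B", "A"],
    "weight": [1.0, 2.0],
})

# Row-wise min/max orders each pair lexicographically
lo = edge_df[["source", "target"]].min(axis=1)
hi = edge_df[["source", "target"]].max(axis=1)
edge_df["source"] = lo
edge_df["target"] = hi
edge_df = edge_df.drop_duplicates(subset=["source", "target"], keep="last")

# Both edges are now stored with source <= target: (A, B) and (A, C),
# so the reversed edge (C, A) matches any (A, C) duplicate
```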

Entity and relationship summarization

After extraction, entities and relationships often have multiple descriptions from different text units. The summarization phase consolidates these:
1. Collect descriptions

Entities and relationships with the same identity gather all their descriptions from different text units into lists.
2. LLM summarization

The summarization model receives all descriptions and generates a single concise summary that captures all distinct information:
entity_summaries, relationship_summaries = await summarize_descriptions(
    entities_df=extracted_entities,
    relationships_df=extracted_relationships,
    model=summarization_model,
    max_summary_length=max_summary_length,
    prompt=summarization_prompt,
)
3. Replace descriptions

Original description lists are replaced with the summarized versions, creating clean, consistent descriptions across the knowledge graph.
Summarization is crucial for managing token counts in downstream queries and ensuring each entity/relationship has coherent, non-redundant descriptions.

Claim extraction (covariates)

Beyond entities and relationships, GraphRAG can extract claims—factual statements about entities that may be time-bound.
Claims (called “covariates” in the data model) are assertions about entities with specific properties:
  • Subject: The entity the claim is about
  • Object: What is being claimed
  • Type: Category of claim
  • Status: Validity or confidence level
  • Start/End date: Time bounds when applicable
  • Description: Full context of the claim
  • Source references: Links to supporting text units
Example:
{
  "subject": "Microsoft",
  "object": "acquired GitHub",
  "type": "ACQUISITION",
  "status": "CONFIRMED",
  "start_date": "2018-06-04",
  "description": "Microsoft acquired GitHub for $7.5 billion"
}

Graph properties and metrics

Once extracted, the knowledge graph has several important properties:

Entity ranking

Entities are ranked by importance using graph metrics:
  • Degree: Number of relationships connected to the entity
  • Centrality: Position in the network (highly connected entities have higher centrality)
  • Community membership: Which communities the entity belongs to at different hierarchy levels
# Entity model attributes
rank: int = 1  # Higher rank = more important entity
community_ids: list[str]  # Community memberships
Entity rank influences:
  • Prioritization in local search results
  • Size of nodes in graph visualizations
  • Context window allocation during retrieval
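Of the metrics above, degree is the simplest to compute: count each entity's incident relationships. A minimal sketch over a relationship list (illustrative; the real pipeline computes rank on its graph tables):

```python
# Sketch: ranking entities by degree (number of incident relationships)
from collections import Counter

def rank_by_degree(relationships):
    degree = Counter()
    for rel in relationships:
        # Each relationship contributes to both endpoints' degree
        degree[rel["source"]] += 1
        degree[rel["target"]] += 1
    # Higher degree first: more connected entities rank higher
    return sorted(degree.items(), key=lambda kv: kv[1], reverse=True)
```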

Graph structure

The complete graph structure includes:

Connectivity

Entities connected by relationships form a network where information can be traversed through multi-hop paths.

Clustering

Community detection reveals groups of densely connected entities, representing coherent topics or themes.

Hierarchy

Multiple levels of communities create a hierarchical organization from global themes to local clusters.

Provenance

Every entity and relationship maintains links to source text units and documents for verification.

From graph to retrieval

The knowledge graph enables sophisticated retrieval strategies:
  1. Entity-based entry points: Queries identify relevant entities through semantic similarity
  2. Graph traversal: Related entities and relationships are retrieved by following edges
  3. Community context: Entities’ community memberships provide broader thematic context
  4. Multi-hop reasoning: Connections can be followed multiple steps to gather comprehensive information
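Steps 2 and 4 above amount to a bounded breadth-first traversal from the entry-point entities. A minimal sketch, assuming relationships are given as a list of source/target dicts and treated as undirected (consistent with the querying behavior described earlier):

```python
# Sketch of multi-hop retrieval: BFS from seed entities out to `max_hops`
# over an undirected adjacency built from the relationship list.
from collections import defaultdict, deque

def multi_hop(relationships, seeds, max_hops=2):
    # Build undirected adjacency: both directions are traversable
    adj = defaultdict(set)
    for rel in relationships:
        adj[rel["source"]].add(rel["target"])
        adj[rel["target"]].add(rel["source"])

    visited = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        entity, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in adj[entity]:
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, hops + 1))
    return visited
```

The hop budget bounds how much of the graph enters the context window: each extra hop widens the retrieved neighborhood.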
The next concept page on community detection explores how hierarchical clustering organizes the knowledge graph into meaningful structures.

Best practices

Entity types:
  • Start with general types (PERSON, ORGANIZATION, LOCATION)
  • Use prompt tuning to identify domain-specific types
  • Keep types consistent and well-defined
  • Avoid too many types (5-10 is usually sufficient)
Extraction quality:
  • Use appropriate text unit sizes (1200 tokens default)
  • Configure max_gleanings (1-2) for iterative refinement
  • Tune prompts for your domain using the prompt tuning guide
  • Review sample extractions before processing large datasets
Validation:
  • Check entity and relationship counts after extraction
  • Review high-degree entities for quality
  • Validate that relationships make semantic sense
  • Ensure text unit references are preserved

Next steps

Indexing pipeline

See how graph extraction fits into the full indexing workflow

Community detection

Learn how hierarchical clustering organizes the graph
