What is community detection?
Community detection identifies groups of entities that are densely connected to each other but sparsely connected to entities in other groups. In GraphRAG, this reveals:
- Thematic clusters: Groups of entities discussing related topics
- Organizational structure: How information in your dataset is naturally organized
- Multiple granularities: From broad themes to specific subtopics
- Navigation pathways: How to traverse from global to local information
In a knowledge graph visualization, each circle represents an entity sized by its degree (number of connections), and colors represent different community memberships.
The Leiden algorithm
GraphRAG uses the Leiden algorithm, a state-of-the-art community detection method that improves upon the Louvain algorithm.
Why Leiden?
Quality
Leiden guarantees well-connected communities, fixing a flaw in Louvain that can produce communities that are internally disconnected.
Scalability
Efficient on large graphs with thousands of entities and relationships.
Hierarchical
Naturally supports multi-level hierarchies through recursive application.
Deterministic
Produces reproducible results with a fixed random seed.
Algorithm overview
Graph preparation
The entity-relationship graph is converted to an undirected weighted graph:
- Nodes: Entities from the knowledge graph
- Edges: Relationships between entities
- Weights: Relationship strength (based on frequency and context)
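The conversion described above can be sketched in a few lines. This is a minimal illustration, not GraphRAG's own code: the function name `to_undirected_weighted`, the `(source, target, weight)` tuple shape, and the choices to sum weights in both directions and to drop self-loops are all assumptions made for the example.

```python
from collections import defaultdict


def to_undirected_weighted(relationships):
    """Collapse (source, target, weight) records into undirected weighted edges.

    Assumption for this sketch: weights of A->B and B->A are summed,
    and self-loops are dropped.
    """
    weights = defaultdict(float)
    for source, target, weight in relationships:
        if source == target:
            continue  # ignore self-loops in this sketch
        key = tuple(sorted((source, target)))  # (a, b) and (b, a) share one key
        weights[key] += weight
    return dict(weights)


relationships = [
    ("Microsoft", "Azure", 2.0),
    ("Azure", "Microsoft", 1.0),
    ("Azure", "Cloud Computing", 1.0),
]
edges = to_undirected_weighted(relationships)
```

Keying each edge by its sorted endpoint pair is what makes the graph undirected: both directions of a relationship land on the same entry.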
Optional LCC filtering
If configured, extract only the largest connected component. This focuses clustering on the main graph, filtering out small disconnected components.
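As a rough sketch of what largest-connected-component (LCC) filtering does, the example below finds the biggest component with a breadth-first search and keeps only its edges. The function names are hypothetical and the code uses only the standard library. Entities with no edges at all never appear in the edge list, so they are dropped implicitly.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def filter_to_lcc(edges):
    """Keep only edges whose endpoints both lie in the largest component."""
    keep = largest_connected_component(edges)
    return [(a, b) for a, b in edges if a in keep and b in keep]


example_edges = [("a", "b"), ("b", "c"), ("x", "y")]
lcc_edges = filter_to_lcc(example_edges)
```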
Initial clustering
The Leiden algorithm identifies communities by optimizing modularity, a measure of how well the graph is partitioned into communities.
Good communities are groups with:
- High edge density within the community
- Low edge density between communities
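To make the modularity objective concrete, here is a small sketch that scores a given partition using the standard weighted modularity formula: for each community, the fraction of edge weight inside it minus the square of its share of total degree. It only evaluates a partition. It is not the Leiden optimization itself, and the function name and input shapes are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Weighted modularity of a partition.

    edges: {(u, v): weight} for an undirected graph
    membership: {node: community_id}
    """
    total = sum(edges.values())
    if total == 0:
        return 0.0
    internal = defaultdict(float)  # edge weight fully inside each community
    degree = defaultdict(float)    # summed weighted degree of each community
    for (u, v), w in edges.items():
        cu, cv = membership[u], membership[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            internal[cu] += w
    return sum(
        internal[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree
    )


# Two triangles joined by a single bridge edge (c, d).
triangle_edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(triangle_edges, split)    # 5/14, about 0.357
q_merged = modularity(triangle_edges, merged)  # 0.0
```

Splitting the graph along the bridge scores higher than lumping everything together, which matches the intuition of dense edges inside communities and sparse edges between them.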
Hierarchical Leiden implementation
GraphRAG uses a custom hierarchical implementation built on the graspologic_native library.
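The exact call into the library is not reproduced here. As a hedged sketch of the post-processing step, assume the hierarchical clustering yields one record per node and level, carrying a cluster id and the id of its parent cluster. The `ClusterRecord` type, its field names, and `group_by_level` are hypothetical names for this illustration, not the library's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClusterRecord:
    """Hypothetical shape of one hierarchical clustering result row."""
    node: str
    cluster: int
    level: int
    parent_cluster: Optional[int]


def group_by_level(records):
    """Build {level: {node: cluster}} and {cluster: parent_cluster} maps."""
    node_to_cluster = {}
    parent_of = {}
    for record in records:
        node_to_cluster.setdefault(record.level, {})[record.node] = record.cluster
        parent_of[record.cluster] = record.parent_cluster
    return node_to_cluster, parent_of


records = [
    ClusterRecord("Python", cluster=1, level=0, parent_cluster=10),
    ClusterRecord("Pandas", cluster=1, level=0, parent_cluster=10),
    ClusterRecord("Azure", cluster=0, level=0, parent_cluster=10),
    ClusterRecord("Python", cluster=10, level=1, parent_cluster=None),
    ClusterRecord("Pandas", cluster=10, level=1, parent_cluster=None),
    ClusterRecord("Azure", cluster=10, level=1, parent_cluster=None),
]
levels, parent_of = group_by_level(records)
```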
Key parameters
The key parameters are max_cluster_size, use_lcc, and seed, plus a set of advanced parameters.
max_cluster_size
Default: 10 entities
Controls the maximum number of entities in leaf communities (level 0).
Effects:
- Smaller values (5-10): More hierarchy levels, finer granularity, more detailed reports
- Larger values (20-50): Fewer levels, broader communities, less detailed reports
Trade-offs:
- Smaller communities → more community reports → higher LLM costs
- Larger communities → broader summaries → may miss nuances
- Default of 10 balances detail with cost
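The cost effect of max_cluster_size can be estimated with simple arithmetic. Assuming each entity belongs to exactly one leaf community, a cap of max_cluster_size entities per community means there are at least ceil(entities / max_cluster_size) leaf communities, and each one needs its own report. The helper below is a hypothetical sketch of that lower bound. Real counts are usually higher because communities rarely fill up to the cap.

```python
import math


def min_leaf_communities(entity_count, max_cluster_size):
    """Lower bound on the number of leaf communities (and leaf reports)."""
    if max_cluster_size < 1:
        raise ValueError("max_cluster_size must be at least 1")
    return math.ceil(entity_count / max_cluster_size)


# Lower bounds for a graph of 1,000 entities at three cluster sizes.
bounds = {size: min_leaf_communities(1000, size) for size in (5, 10, 50)}
```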
Community hierarchy structure
The hierarchical clustering produces a tree structure of communities.
Hierarchy levels
Level 0: Leaf communities
The most granular level containing individual entity groups.
Characteristics:
- Maximum max_cluster_size entities per community
- Most detailed, specific topics
- Highest number of communities
- No children, only parents
Example:
- Community 0: [“Microsoft”, “Azure”, “Cloud Computing”]
- Community 1: [“Python”, “Pandas”, “NumPy”]
- Community 2: [“GraphRAG”, “Knowledge Graph”, “RAG”]
Level 1-N: Intermediate levels
Mid-level communities that aggregate leaf communities.
Characteristics:
- Aggregate multiple child communities
- Broader thematic groupings
- Both parent and children relationships
- Fewer communities than level 0
Example:
- Community 10: Aggregates communities 0, 1, 2
- Topic: “Technology and Software”
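The parent and child relationships in these examples can be modeled with two plain dictionaries. The sketch below reuses the example communities from this section. The helper names are hypothetical, and deriving a parent's entity set as the union of its children's entities is an assumption about how aggregation works.

```python
leaf_members = {
    0: ["Microsoft", "Azure", "Cloud Computing"],
    1: ["Python", "Pandas", "NumPy"],
    2: ["GraphRAG", "Knowledge Graph", "RAG"],
}
children = {10: [0, 1, 2]}  # community 10 aggregates communities 0, 1, 2


def parents_from_children(children):
    """Invert the parent -> children map into child -> parent."""
    return {child: parent for parent, kids in children.items() for child in kids}


def entities_of(community, children, leaf_members):
    """Collect the entities of a community by walking down to its leaves."""
    if community in leaf_members:
        return list(leaf_members[community])
    members = []
    for child in children.get(community, []):
        members.extend(entities_of(child, children, leaf_members))
    return members


parents = parents_from_children(children)
community_10_entities = entities_of(10, children, leaf_members)
```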
Top level: Root communities
Highest-level communities representing major dataset themes.
Characteristics:
- 1-5 communities typically
- Dataset-wide themes
- No parents, only children
- Most abstract summaries
Example:
- Community 100: All technology-related entities
- Community 101: All business-related entities
Community data model
Each community in the hierarchy records its level, its parent and child communities, and the entities and relationships it contains.
Relationship aggregation
Communities include only intra-community relationships, that is, edges where both source and target are in the same community.
Community summarization
After communities are detected, LLM-generated summaries make them human-readable and useful for retrieval.
Report generation process
Gather community context
For each community, collect all relevant information:
- Entity titles, types, and descriptions
- Relationship descriptions
- Covariate/claim information (if available)
- Text unit excerpts
- Child community summaries (for non-leaf communities)
Structure the prompt
The community report prompt includes:
- Community entities and their roles
- Key relationships and connections
- Important claims or facts
- Instructions for report structure
Generate report
The LLM creates a structured summary including:
- Title: Descriptive name for the community
- Summary: Executive overview of the community
- Key entities: Most important entities and their significance
- Findings: Main insights and themes
- Rating: Importance score (1-10)
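One way to hold the fields listed above is a small dataclass that also enforces the 1-10 rating range. The class name, field names, and types below are assumptions made for illustration and do not claim to match GraphRAG's actual report schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CommunityReport:
    """Hypothetical container for an LLM-generated community report."""
    title: str
    summary: str
    rating: float
    key_entities: List[str] = field(default_factory=list)
    findings: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject ratings outside the documented 1-10 importance scale.
        if not 1 <= self.rating <= 10:
            raise ValueError(f"rating must be between 1 and 10, got {self.rating}")


report = CommunityReport(
    title="Technology and Software",
    summary="Cloud platforms and the Python data stack.",
    rating=7.5,
    key_entities=["Microsoft", "Python"],
    findings=["Cloud and data tooling are closely linked in this dataset."],
)
```

Validating the rating when the object is built catches malformed LLM output before it reaches retrieval.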
Bottom-up summarization
Community reports are generated bottom-up: leaf communities first, then progressively higher levels using child summaries as context.
- Leaf communities: Summarize raw entities and relationships
- Parent communities: Summarize child community reports (already condensed)
- Coherence: Each level builds on previous summaries
- Efficiency: Parent reports don’t need to re-process all raw entities
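The bottom-up order amounts to a post-order traversal of the community tree: every child is summarized before its parent, and the parent receives the child reports as its context. The sketch below uses a stub in place of the LLM call. `summarize_bottom_up`, `stub_summarize`, and the input shapes are hypothetical.

```python
def summarize_bottom_up(roots, children, leaf_context, summarize):
    """Generate reports so that every child is summarized before its parent."""
    reports = {}

    def visit(community):
        kids = children.get(community, [])
        if kids:
            # Parent: context is the already condensed child reports.
            context = [visit(kid) for kid in kids]
        else:
            # Leaf: context is the raw entity and relationship text.
            context = leaf_context[community]
        reports[community] = summarize(community, context)
        return reports[community]

    for root in roots:
        visit(root)
    return reports


children = {10: [0, 1, 2]}
leaf_context = {
    0: ["Microsoft", "Azure"],
    1: ["Python", "Pandas"],
    2: ["GraphRAG", "RAG"],
}
order = []


def stub_summarize(community, context):
    """Stand-in for an LLM call; records the order of invocation."""
    order.append(community)
    return f"report {community} <- {len(context)} inputs"


reports = summarize_bottom_up([10], children, leaf_context, stub_summarize)
```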
Configuration
Using communities in retrieval
Communities enable different search strategies: global search, local search, and DRIFT search.
Global search
Uses community reports for dataset-wide reasoning:
- Select hierarchy level: Choose level based on desired granularity
- Retrieve reports: Get all community reports at that level
- Map step: Generate intermediate answers from each report
- Reduce step: Aggregate intermediate answers into final response
Level selection:
- Root level: Broad, high-level themes (faster, less detailed)
- Mid level: Balanced breadth and depth
- Leaf level: Most detailed (slower, more comprehensive)
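The four steps of global search follow a map-reduce shape, sketched below with simple keyword matching standing in for the LLM calls. Every name here (`global_search`, `keyword_map`, `top_k_reduce`, and the report dictionary keys) is an assumption for illustration. A real map step would ask the model for an intermediate answer per report.

```python
def global_search(query, reports, level, map_fn, reduce_fn):
    """Select reports at one level, map each to a partial answer, then reduce."""
    selected = [r for r in reports if r["level"] == level]
    partials = [map_fn(query, r) for r in selected]
    return reduce_fn(query, partials)


def keyword_map(query, report):
    """Stub map step: score a report by how many query words its summary has."""
    words = set(query.lower().split())
    score = sum(1 for w in report["summary"].lower().split() if w in words)
    return {"score": score, "answer": report["title"]}


def top_k_reduce(query, partials, k=2):
    """Stub reduce step: keep the k best-scoring non-zero partial answers."""
    ranked = sorted(partials, key=lambda p: p["score"], reverse=True)
    return [p["answer"] for p in ranked[:k] if p["score"] > 0]


example_reports = [
    {"level": 1, "title": "Technology and Software",
     "summary": "cloud software and python tooling"},
    {"level": 1, "title": "Business", "summary": "markets and revenue"},
    {"level": 0, "title": "Python stack", "summary": "python pandas numpy"},
]
answer = global_search("python software", example_reports, 1,
                       keyword_map, top_k_reduce)
```

Changing the `level` argument is the knob that trades breadth for detail: a level with fewer, broader reports means fewer map calls.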
Analyzing community structure
Metrics to monitor
Number of levels
How many hierarchy levels were created.
- Typical: 2-4 levels
- Too many (5+): max_cluster_size may be too small
- Too few (1): max_cluster_size may be too large
Communities per level
Distribution of communities across levels.
- Level 0: Most communities (hundreds to thousands)
- Mid levels: Fewer communities
- Root: 1-5 communities
Entity distribution
How entities are distributed across communities.
Check for:
- Communities with very few entities
- Unbalanced distribution
- Isolated entities
Coverage
Percentage of entities in communities.
- High coverage (>95%): Good graph connectivity
- Low coverage: Many disconnected entities (consider use_lcc setting)
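These metrics are straightforward to compute once the communities are loaded into memory. The sketch below assumes each community is a dictionary with a `level` and a list of `entities`. That shape and the function name `community_metrics` are hypothetical, so adapt them to however your index output is stored.

```python
from collections import Counter


def community_metrics(communities, total_entities):
    """Summarize hierarchy depth, per-level counts, sizes, and coverage."""
    per_level = Counter(c["level"] for c in communities)
    sizes = [len(c["entities"]) for c in communities]
    covered = set()
    for c in communities:
        covered.update(c["entities"])
    return {
        "num_levels": len(per_level),
        "communities_per_level": dict(per_level),
        "min_size": min(sizes, default=0),
        "max_size": max(sizes, default=0),
        # Share of all entities that appear in at least one community.
        "coverage": len(covered) / total_entities if total_entities else 0.0,
    }


example_communities = [
    {"level": 0, "entities": ["Microsoft", "Azure", "Cloud Computing"]},
    {"level": 0, "entities": ["Python", "Pandas", "NumPy"]},
    {"level": 0, "entities": ["GraphRAG", "Knowledge Graph", "RAG"]},
    {"level": 1, "entities": [
        "Microsoft", "Azure", "Cloud Computing",
        "Python", "Pandas", "NumPy",
        "GraphRAG", "Knowledge Graph", "RAG",
    ]},
]
metrics = community_metrics(example_communities, total_entities=10)
```

Here one of the ten entities belongs to no community, so coverage comes out at 0.9, below the 95% guideline.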
Quality indicators
Coherent communities
Good communities have:
- Semantically related entities
- Dense internal connections
- Clear thematic focus
- Meaningful summaries
Hierarchy coherence
Good hierarchies have:
- Progressive abstraction from leaf to root
- Parent summaries that meaningfully aggregate children
- Clear thematic progression
Graph connectivity
A well-connected graph produces better communities:
- Most entities in the largest connected component
- Few isolated nodes
- Balanced degree distribution
Best practices
Analyze results
Review:
- Number of communities at each level
- Sample community reports for coherence
- Entity distribution
- Hierarchy depth
Tune if needed
Adjust max_cluster_size based on:
- Too broad: Decrease max_cluster_size for finer granularity
- Too fragmented: Increase max_cluster_size for broader communities
- Cost concerns: Larger clusters = fewer reports = lower cost
Next steps
Retrieval methods
Learn how communities enable global, local, and DRIFT search
Indexing pipeline
See how community detection fits into the full workflow