Global search generates answers by analyzing all AI-generated community reports in a map-reduce fashion. This method is ideal for questions that require understanding of the dataset as a whole.
This page references the global_search.ipynb notebook from the GraphRAG repository.
Global search is best suited for:
  • High-level questions - “What are the main themes?”
  • Dataset-wide insights - “What are the most significant trends?”
  • Comparative analysis - “How do different communities relate?”
  • Summarization tasks - “What is this dataset about?”

How global search works

1. Community report retrieval: Global search loads all community reports generated during indexing. These reports summarize clusters of related entities and relationships.

2. Map phase: Each community report is sent to the LLM along with your question, generating intermediate answers from different parts of the knowledge graph.

3. Reduce phase: The intermediate answers are aggregated and synthesized into a final, comprehensive response.
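The three steps above can be sketched as a toy map-reduce loop. This is a minimal illustration only: the scoring function below is a stand-in for the LLM calls, not GraphRAG's actual prompt logic.

```python
def map_phase(question: str, reports: list[str]) -> list[tuple[int, str]]:
    """Produce one intermediate answer per community report.
    Here, keyword overlap stands in for an LLM relevance score."""
    q_terms = set(question.lower().split())
    answers = []
    for report in reports:
        overlap = q_terms & set(report.lower().split())
        if overlap:
            answers.append((len(overlap), report))
    return answers

def reduce_phase(intermediate: list[tuple[int, str]]) -> list[str]:
    """Aggregate intermediate answers into a single ranked response."""
    ranked = sorted(intermediate, key=lambda pair: pair[0], reverse=True)
    return [report for _, report in ranked]

# Toy community reports (hypothetical data)
reports = [
    "community A discusses supply chain themes",
    "community B covers unrelated topics",
    "community C discusses themes of governance",
]
final = reduce_phase(map_phase("what are the main themes", reports))
print(final[0])
```

The real pipeline runs the map step concurrently across all reports, then makes one final reduce call over the intermediate answers.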

Setting up the notebook

Import required libraries

import os
import pandas as pd
from graphrag.config.enums import ModelType
from graphrag.config.models.language_model_config import LanguageModelConfig
from graphrag.language_model.manager import ModelManager
from graphrag.query.indexer_adapters import (
    read_indexer_communities,
    read_indexer_entities,
    read_indexer_reports,
)
from graphrag.query.structured_search.global_search.community_context import (
    GlobalCommunityContext,
)
from graphrag.query.structured_search.global_search.search import GlobalSearch
from graphrag.tokenizer.get_tokenizer import get_tokenizer

Configure language models

api_key = os.environ["GRAPHRAG_API_KEY"]

config = LanguageModelConfig(
    api_key=api_key,
    type=ModelType.Chat,
    model_provider="openai",
    model="gpt-4.1",
    max_retries=20,
)
model = ModelManager().get_or_create_chat_model(
    name="global_search",
    model_type=ModelType.Chat,
    config=config,
)

tokenizer = get_tokenizer(config)

Load community reports

# Path to indexing outputs
INPUT_DIR = "./inputs/operation dulce"
COMMUNITY_TABLE = "communities"
COMMUNITY_REPORT_TABLE = "community_reports"
ENTITY_TABLE = "entities"

# Community level to use (higher = more fine-grained, more expensive)
COMMUNITY_LEVEL = 2

# Load parquet files
community_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_TABLE}.parquet")
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")

# Convert to GraphRAG data structures
communities = read_indexer_communities(community_df, report_df)
reports = read_indexer_reports(report_df, community_df, COMMUNITY_LEVEL)
entities = read_indexer_entities(entity_df, community_df, COMMUNITY_LEVEL)

print(f"Total report count: {len(report_df)}")
print(f"Report count at level {COMMUNITY_LEVEL}: {len(reports)}")
The COMMUNITY_LEVEL parameter controls granularity. Higher values use more detailed community reports but increase computational cost.
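Before committing to a level, it can help to see how reports are distributed across levels. A quick group-by works, assuming your report table has a `level` column as in standard GraphRAG output (the DataFrame below is a hypothetical stand-in for `report_df`):

```python
import pandas as pd

# Hypothetical stand-in for report_df; real data comes from the parquet file.
report_df = pd.DataFrame({
    "community": [0, 1, 2, 3, 4],
    "level": [0, 1, 1, 2, 2],
})

# Count reports at each community level.
level_counts = report_df.groupby("level").size()
print(level_counts)
```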

Build global context

context_builder = GlobalCommunityContext(
    community_reports=reports,
    communities=communities,
    entities=entities,  # Optional: enables community weight calculation
    tokenizer=tokenizer,
)

Configure search parameters

context_builder_params = {
    "use_community_summary": False,  # True = summaries, False = full reports
    "shuffle_data": True,  # Randomize report order
    "include_community_rank": True,  # Include rank for prioritization
    "min_community_rank": 0,  # Minimum rank threshold
    "community_rank_name": "rank",
    "include_community_weight": True,  # Weight by entity count
    "community_weight_name": "occurrence weight",
    "normalize_community_weight": True,
    "max_tokens": 12_000,  # Context window size
    "context_name": "Reports",
}

Create search engine

# LLM parameters for the map and reduce phases
map_llm_params = {
    "max_tokens": 1000,
    "temperature": 0.0,
    "response_format": {"type": "json_object"},
}
reduce_llm_params = {
    "max_tokens": 2000,
    "temperature": 0.0,
}

search_engine = GlobalSearch(
    model=model,
    context_builder=context_builder,
    tokenizer=tokenizer,
    max_data_tokens=12_000,
    map_llm_params=map_llm_params,
    reduce_llm_params=reduce_llm_params,
    allow_general_knowledge=False,  # Strictly use indexed data only
    json_mode=True,  # Requires model support
    context_builder_params=context_builder_params,
    concurrent_coroutines=32,  # Parallel map operations
    response_type="multiple paragraphs",  # Output format guidance
)
result = await search_engine.search("What is operation dulce?")

print(result.response)

Inspect context data

# View which reports were used
result.context_data["reports"]

Analyze token usage

print(f"LLM calls: {result.llm_calls}")
print(f"Prompt tokens: {result.prompt_tokens}")
print(f"Output tokens: {result.output_tokens}")
# Per-token rates below are illustrative; substitute your provider's pricing
print(f"Total cost estimate: ${(result.prompt_tokens * 0.00001 + result.output_tokens * 0.00003):.4f}")
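If you run many queries, a small helper keeps the pricing assumption in one place. The rates here are placeholders, not actual provider pricing:

```python
# Placeholder per-token rates (assumptions, not real pricing)
PROMPT_RATE = 0.00001   # dollars per prompt token
OUTPUT_RATE = 0.00003   # dollars per output token

def estimate_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Return an estimated dollar cost for one search call."""
    return prompt_tokens * PROMPT_RATE + output_tokens * OUTPUT_RATE

# Example: 120k prompt tokens across the map phase, 2k output tokens
print(f"${estimate_cost(120_000, 2_000):.4f}")
```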

Example queries

result = await search_engine.search(
    "What are the top 5 themes in this dataset?"
)
print(result.response)
result = await search_engine.search(
    "How do different organizations interact in this narrative?"
)
print(result.response)
result = await search_engine.search(
    "What are the most significant events and their outcomes?"
)
print(result.response)
# Change response type for different output formats
search_engine.response_type = "executive summary"

result = await search_engine.search(
    "Provide a comprehensive summary of this dataset"
)
print(result.response)

Tuning parameters

Community level selection

At the lowest community level (e.g. COMMUNITY_LEVEL = 0):

  • Fewest communities
  • Broadest summaries
  • Lowest cost
  • Best for very high-level questions

Higher levels invert these trade-offs: more communities, finer-grained reports, and higher cost.

Response type options

# Customize output format
response_types = [
    "single paragraph",
    "multiple paragraphs",
    "executive summary",
    "prioritized list",
    "bullet points",
    "detailed report",
]

search_engine.response_type = "bullet points"

Performance optimization

Parallel processing

Increase concurrent_coroutines for a faster map phase:
concurrent_coroutines=64

Token management

Adjust max_data_tokens to match your model's context limit:
max_data_tokens=8_000  # For 8k-context models

Community filtering

Raise min_community_rank to skip low-importance communities:
min_community_rank=5

Summary mode

Use summaries instead of full reports to reduce token usage:
use_community_summary=True
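The tuning options above can be combined by overriding the base parameter dict before constructing a new engine. A sketch, using the key names from context_builder_params shown earlier:

```python
# Base parameters (subset of the context_builder_params shown earlier)
context_builder_params = {
    "use_community_summary": False,
    "min_community_rank": 0,
    "max_tokens": 12_000,
}

# Apply the tuning suggestions: summaries, rank filtering, smaller context
tuned_params = {
    **context_builder_params,
    "use_community_summary": True,
    "min_community_rank": 5,
    "max_tokens": 8_000,
}
print(tuned_params)
```

Pass `tuned_params` as `context_builder_params` when building a new GlobalSearch instance.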
Global search vs. local search

| Aspect        | Global Search             | Local Search                  |
| ------------- | ------------------------- | ----------------------------- |
| Question type | High-level, broad         | Specific, detailed            |
| Data source   | Community reports         | Entities + text chunks        |
| Cost          | Higher (map-reduce)       | Lower (single query)          |
| Response time | Slower                    | Faster                        |
| Best for      | Themes, trends, summaries | Entity details, relationships |

Next steps

Local search

Learn about local search for specific queries

DRIFT search

Explore hybrid search methods

Search comparison

Compare all search methods

Query guide

Complete global search documentation
