The GraphRAG configuration schema defines all available settings for the indexing pipeline and search operations. Configuration can be provided via YAML or JSON files.

Root configuration

The root GraphRagConfig class contains all top-level configuration settings.
completion_models
dict[str, ModelConfig]
default:"{}"
Available completion model configurations. Maps model IDs to their respective configurations.
embedding_models
dict[str, ModelConfig]
default:"{}"
Available embedding model configurations. Maps model IDs to their respective configurations.
concurrent_requests
int
default:"25"
The default number of concurrent requests to make to language models.
async_mode
AsyncType
default:"threaded"
The default asynchronous mode to use for language model requests. See AsyncType enum.
input
InputConfig
default:"InputConfig()"
The input configuration for document sources.
input_storage
StorageConfig
default:"StorageConfig(base_dir='input')"
The input storage configuration. For file storage, base_dir defaults to input.
chunking
ChunkingConfig
The chunking configuration to use. See chunking configuration.
output_storage
StorageConfig
default:"StorageConfig(base_dir='output')"
The output storage configuration. For file storage, base_dir defaults to output.
update_output_storage
StorageConfig
default:"StorageConfig(base_dir='update_output')"
The output configuration for the updated index. For file storage, base_dir defaults to update_output.
table_provider
TableProviderConfig
default:"TableProviderConfig()"
The table provider configuration. By default, parquet files are read/written to disk. You can register custom output table storage.
cache
CacheConfig
The cache configuration for storing LLM responses and intermediate results.
reporting
ReportingConfig
default:"ReportingConfig()"
The reporting configuration. See reporting configuration.
vector_store
VectorStoreConfig
default:"VectorStoreConfig()"
The vector store configuration. Defaults to LanceDB with db_uri set to output/lancedb.
workflows
list[str] | None
default:"None"
List of workflows to run, in execution order. When set, this always overrides the built-in workflows.
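As an illustrative sketch, the top-level settings above can be expressed in YAML as follows. The values shown are the documented defaults; the ModelConfig bodies are placeholders, since their fields are documented separately.

```yaml
# Illustrative root configuration sketch; ModelConfig bodies are placeholders.
completion_models:
  default_completion_model:
    # ... ModelConfig fields (documented separately) ...
embedding_models:
  default_embedding_model:
    # ... ModelConfig fields (documented separately) ...
concurrent_requests: 25
async_mode: threaded
input_storage:
  base_dir: input
output_storage:
  base_dir: output
update_output_storage:
  base_dir: update_output
vector_store: {}  # defaults to LanceDB with db_uri set to output/lancedb
```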

Indexing configuration

Text embedding

embed_text
EmbedTextConfig
default:"EmbedTextConfig()"
Text embedding configuration. Fields:
  • embedding_model_id (str): The model ID to use for text embeddings. Default: "default_embedding_model"
  • model_instance_name (str): The model singleton instance name. Default: "text_embedding"
  • batch_size (int): The batch size to use. Default: 16
  • batch_max_tokens (int): The batch max tokens to use. Default: 8191
  • names (list[str]): The specific embeddings to perform.
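In YAML, the embed_text block above might look like the following sketch (values are the documented defaults; the names list is omitted because its contents depend on your pipeline):

```yaml
embed_text:
  embedding_model_id: default_embedding_model
  batch_size: 16
  batch_max_tokens: 8191
```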

Graph extraction

extract_graph
ExtractGraphConfig
default:"ExtractGraphConfig()"
The entity extraction configuration to use. Fields:
  • completion_model_id (str): The model ID to use. Default: "default_completion_model"
  • model_instance_name (str): The model singleton instance name. Default: "extract_graph"
  • prompt (str | None): The entity extraction prompt to use. Default: None
  • entity_types (list[str]): The entity extraction entity types to use. Default: ["organization", "person", "geo", "event"]
  • max_gleanings (int): The maximum number of entity gleanings to use. Default: 1
extract_graph_nlp
ExtractGraphNLPConfig
default:"ExtractGraphNLPConfig()"
The NLP-based graph extraction configuration to use. Used for fast indexing mode. Fields:
  • normalize_edge_weights (bool): Whether to normalize edge weights. Default: True
  • text_analyzer (TextAnalyzerDefaults): Text analyzer configuration
  • concurrent_requests (int): Number of concurrent requests. Default: 25
  • async_mode (AsyncType): Async mode to use. Default: AsyncType.Threaded
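A YAML sketch of the two graph extraction blocks above, using the documented defaults (the text_analyzer sub-block is omitted since its fields are documented separately):

```yaml
extract_graph:
  completion_model_id: default_completion_model
  entity_types: [organization, person, geo, event]
  max_gleanings: 1
extract_graph_nlp:   # used for fast indexing mode
  normalize_edge_weights: true
  concurrent_requests: 25
  async_mode: threaded
```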

Description summarization

summarize_descriptions
SummarizeDescriptionsConfig
default:"SummarizeDescriptionsConfig()"
The description summarization configuration to use. Fields:
  • prompt (str | None): The summarization prompt. Default: None
  • max_length (int): Maximum length in tokens. Default: 500
  • max_input_tokens (int): Maximum input tokens. Default: 4000
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • model_instance_name (str): Model instance name. Default: "summarize_descriptions"
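A YAML sketch of the summarization block with its documented defaults:

```yaml
summarize_descriptions:
  max_length: 500          # tokens
  max_input_tokens: 4000
  completion_model_id: default_completion_model
```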

Graph processing

prune_graph
PruneGraphConfig
default:"PruneGraphConfig()"
The graph pruning configuration to use. Fields:
  • min_node_freq (int): Minimum node frequency. Default: 2
  • max_node_freq_std (float | None): Maximum node frequency standard deviation. Default: None
  • min_node_degree (int): Minimum node degree. Default: 1
  • max_node_degree_std (float | None): Maximum node degree standard deviation. Default: None
  • min_edge_weight_pct (float): Minimum edge weight percentage. Default: 40.0
  • remove_ego_nodes (bool): Whether to remove ego nodes. Default: True
  • lcc_only (bool): Keep only largest connected component. Default: False
cluster_graph
ClusterGraphConfig
default:"ClusterGraphConfig()"
The cluster graph configuration to use. Fields:
  • max_cluster_size (int): The maximum cluster size to use. Default: 10
  • use_lcc (bool): Whether to use the largest connected component. Default: True
  • seed (int): The seed to use for the clustering. Default: 0xDEADBEEF
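A YAML sketch of the two graph processing blocks above, using the documented defaults:

```yaml
prune_graph:
  min_node_freq: 2
  min_node_degree: 1
  min_edge_weight_pct: 40.0
  remove_ego_nodes: true
  lcc_only: false
cluster_graph:
  max_cluster_size: 10
  use_lcc: true
  seed: 3735928559   # 0xDEADBEEF
```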

Claims extraction

extract_claims
ExtractClaimsConfig
default:"ExtractClaimsConfig(enabled=False)"
The claim extraction configuration to use. Fields:
  • enabled (bool): Whether claim extraction is enabled. Default: False
  • prompt (str | None): The extraction prompt. Default: None
  • description (str): Description of claims to extract. Default: "Any claims or facts that could be relevant to information discovery."
  • max_gleanings (int): Maximum number of gleanings. Default: 1
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • model_instance_name (str): Model instance name. Default: "extract_claims"
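Claim extraction is disabled by default; a YAML sketch that enables it with the documented defaults:

```yaml
extract_claims:
  enabled: true
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1
```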

Community reports

community_reports
CommunityReportsConfig
default:"CommunityReportsConfig()"
The community reports configuration to use. Fields:
  • completion_model_id (str): The model ID to use. Default: "default_completion_model"
  • model_instance_name (str): The model instance name. Default: "community_reporting"
  • graph_prompt (str | None): Prompt for graph-based summarization. Default: None
  • text_prompt (str | None): Prompt for text-based summarization. Default: None
  • max_length (int): Maximum length in tokens. Default: 2000
  • max_input_length (int): Maximum input length in tokens. Default: 8000
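A YAML sketch of the community reports block with its documented defaults:

```yaml
community_reports:
  completion_model_id: default_completion_model
  max_length: 2000        # tokens
  max_input_length: 8000  # tokens
```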

Snapshots

snapshots
SnapshotsConfig
default:"SnapshotsConfig()"
The snapshots configuration to use. Fields:
  • embeddings (bool): Whether to save embedding snapshots. Default: False
  • graphml (bool): Whether to save GraphML snapshots. Default: False
  • raw_graph (bool): Whether to save raw graph snapshots. Default: False
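All snapshots are off by default; a YAML sketch that turns on GraphML output for inspection:

```yaml
snapshots:
  graphml: true
  raw_graph: false
  embeddings: false
```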

Search configuration

Local search

The local search configuration. Fields:
  • prompt (str | None): The local search prompt to use. Default: None
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • embedding_model_id (str): Model ID for embeddings. Default: "default_embedding_model"
  • text_unit_prop (float): The text unit proportion. Default: 0.5
  • community_prop (float): The community proportion. Default: 0.15
  • conversation_history_max_turns (int): Maximum conversation turns. Default: 5
  • top_k_entities (int): Top k mapped entities. Default: 10
  • top_k_relationships (int): Top k mapped relationships. Default: 10
  • max_context_tokens (int): Maximum tokens. Default: 12000
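Assuming the local search block is keyed as local_search in YAML (the key name is not shown in this section), a sketch with the documented defaults:

```yaml
local_search:
  text_unit_prop: 0.5
  community_prop: 0.15
  conversation_history_max_turns: 5
  top_k_entities: 10
  top_k_relationships: 10
  max_context_tokens: 12000
```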

Global search

The global search configuration. Fields:
  • map_prompt (str | None): The global search mapper prompt. Default: None
  • reduce_prompt (str | None): The global search reducer prompt. Default: None
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • knowledge_prompt (str | None): The global search general prompt. Default: None
  • max_context_tokens (int): Maximum context size in tokens. Default: 12000
  • data_max_tokens (int): Data LLM maximum tokens. Default: 12000
  • map_max_length (int): Map LLM max response length in words. Default: 1000
  • reduce_max_length (int): Reduce LLM max response length in words. Default: 2000
  • dynamic_search_threshold (int): Rating threshold to include a community. Default: 1
  • dynamic_search_keep_parent (bool): Keep parent if child communities relevant. Default: False
  • dynamic_search_num_repeats (int): Number of times to rate same report. Default: 1
  • dynamic_search_use_summary (bool): Use community summary instead of full context. Default: False
  • dynamic_search_max_level (int): Maximum community hierarchy level. Default: 2
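Assuming the global search block is keyed as global_search in YAML (the key name is not shown in this section), a sketch with the documented defaults:

```yaml
global_search:
  max_context_tokens: 12000
  data_max_tokens: 12000
  map_max_length: 1000      # words
  reduce_max_length: 2000   # words
  dynamic_search_threshold: 1
  dynamic_search_max_level: 2
```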

DRIFT search

The DRIFT search configuration. Fields:
  • prompt (str | None): The DRIFT search prompt. Default: None
  • reduce_prompt (str | None): The reduce prompt. Default: None
  • data_max_tokens (int): Maximum data tokens. Default: 12000
  • reduce_max_tokens (int | None): Maximum reduce tokens. Default: None
  • reduce_temperature (float): Reduce temperature. Default: 0
  • reduce_max_completion_tokens (int | None): Max completion tokens. Default: None
  • concurrency (int): Concurrency level. Default: 32
  • drift_k_followups (int): Number of followup queries. Default: 20
  • primer_folds (int): Number of primer folds. Default: 5
  • primer_llm_max_tokens (int): Primer LLM max tokens. Default: 12000
  • n_depth (int): Search depth. Default: 3
  • Additional local search parameters for DRIFT's local search component (see the local search fields above)
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • embedding_model_id (str): Embedding model ID. Default: "default_embedding_model"

Basic search

The basic search configuration. Fields:
  • prompt (str | None): The basic search prompt. Default: None
  • k (int): Number of results to return. Default: 10
  • max_context_tokens (int): Maximum context tokens. Default: 12000
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • embedding_model_id (str): Embedding model ID. Default: "default_embedding_model"
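Assuming the DRIFT and basic search blocks are keyed as drift_search and basic_search in YAML (the key names are not shown in this section), a sketch with the documented defaults:

```yaml
drift_search:
  drift_k_followups: 20
  primer_folds: 5
  n_depth: 3
  concurrency: 32
basic_search:
  k: 10
  max_context_tokens: 12000
```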

Helper methods

The GraphRagConfig class provides helper methods to retrieve model configurations:

get_completion_model_config

def get_completion_model_config(self, model_id: str) -> ModelConfig
Get a completion model configuration by ID. Parameters:
  • model_id (str): The ID of the model to get. Should match a key in the completion_models dict.
Returns:
  • ModelConfig: The model configuration if found.
Raises:
  • ValueError: If the model ID is not found in the configuration.

get_embedding_model_config

def get_embedding_model_config(self, model_id: str) -> ModelConfig
Get an embedding model configuration by ID. Parameters:
  • model_id (str): The ID of the model to get. Should match a key in the embedding_models dict.
Returns:
  • ModelConfig: The model configuration if found.
Raises:
  • ValueError: If the model ID is not found in the configuration.
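The lookup contract of these helpers can be sketched with a minimal stand-in class. GraphRagConfigSketch and its ModelConfig stub below are hypothetical illustrations (as is the "gpt-4o" model name), not the real GraphRAG classes:

```python
class ModelConfig:
    """Hypothetical stub for the real ModelConfig."""
    def __init__(self, model_id: str):
        self.model_id = model_id

class GraphRagConfigSketch:
    """Hypothetical stand-in illustrating the helper-method contract."""
    def __init__(self, completion_models: dict, embedding_models: dict):
        self.completion_models = completion_models
        self.embedding_models = embedding_models

    def get_completion_model_config(self, model_id: str) -> ModelConfig:
        # Raises ValueError when the ID is not a key in completion_models.
        if model_id not in self.completion_models:
            raise ValueError(f"Model ID {model_id} not found in configuration")
        return self.completion_models[model_id]

config = GraphRagConfigSketch(
    completion_models={"default_completion_model": ModelConfig("gpt-4o")},
    embedding_models={},
)
print(config.get_completion_model_config("default_completion_model").model_id)
# prints: gpt-4o
```

An unknown ID raises ValueError rather than returning None, so callers can rely on always receiving a valid ModelConfig.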
