The GraphRAG configuration schema defines all available settings for the indexing pipeline and search operations. Configuration can be provided via YAML or JSON files.

Root configuration

The root GraphRagConfig class contains all top-level configuration settings.
completion_models
dict[str, ModelConfig]
default:"{}"
Available completion model configurations. Maps model IDs to their respective configurations.
embedding_models
dict[str, ModelConfig]
default:"{}"
Available embedding model configurations. Maps model IDs to their respective configurations.
concurrent_requests
int
default:"25"
The default number of concurrent requests to make to language models.
async_mode
AsyncType
default:"threaded"
The default asynchronous mode to use for language model requests. See AsyncType enum.
input
InputConfig
default:"InputConfig()"
The input configuration for document sources.
input_storage
StorageConfig
default:"StorageConfig(base_dir='input')"
The input storage configuration. For file storage, base_dir defaults to input.
chunking
ChunkingConfig
The chunking configuration to use. See chunking configuration.
output_storage
StorageConfig
default:"StorageConfig(base_dir='output')"
The output storage configuration. For file storage, base_dir defaults to output.
update_output_storage
StorageConfig
default:"StorageConfig(base_dir='update_output')"
The output configuration for the updated index. For file storage, base_dir defaults to update_output.
table_provider
TableProviderConfig
default:"TableProviderConfig()"
The table provider configuration. By default, parquet files are read/written to disk. You can register custom output table storage.
cache
CacheConfig
The cache configuration for storing LLM responses and intermediate results.
reporting
ReportingConfig
default:"ReportingConfig()"
The reporting configuration. See reporting configuration.
vector_store
VectorStoreConfig
default:"VectorStoreConfig()"
The vector store configuration. Defaults to LanceDB with db_uri set to output/lancedb.
workflows
list[str] | None
default:"None"
List of workflows to run, in execution order. When set, this always overrides the built-in workflows.
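As an illustrative sketch, the top-level settings above can be expressed in YAML as follows. The values shown are the documented defaults; the ModelConfig bodies are placeholders, since their fields are documented separately.

```yaml
# Illustrative root configuration sketch; ModelConfig bodies are placeholders.
completion_models:
  default_completion_model:
    # ... ModelConfig fields (documented separately) ...
embedding_models:
  default_embedding_model:
    # ... ModelConfig fields (documented separately) ...
concurrent_requests: 25
async_mode: threaded
input_storage:
  base_dir: input
output_storage:
  base_dir: output
update_output_storage:
  base_dir: update_output
vector_store: {}  # defaults to LanceDB with db_uri set to output/lancedb
```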

Indexing configuration

Text embedding

embed_text
EmbedTextConfig
default:"EmbedTextConfig()"
Text embedding configuration. Fields:
  • embedding_model_id (str): The model ID to use for text embeddings. Default: "default_embedding_model"
  • model_instance_name (str): The model singleton instance name. Default: "text_embedding"
  • batch_size (int): The batch size to use. Default: 16
  • batch_max_tokens (int): The batch max tokens to use. Default: 8191
  • names (list[str]): The specific embeddings to perform.
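In YAML, the embed_text block above might look like the following sketch (values are the documented defaults; the names list is omitted because its contents depend on your pipeline):

```yaml
embed_text:
  embedding_model_id: default_embedding_model
  batch_size: 16
  batch_max_tokens: 8191
```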

Graph extraction

extract_graph
ExtractGraphConfig
default:"ExtractGraphConfig()"
The entity extraction configuration to use. Fields:
  • completion_model_id (str): The model ID to use. Default: "default_completion_model"
  • model_instance_name (str): The model singleton instance name. Default: "extract_graph"
  • prompt (str | None): The entity extraction prompt to use. Default: None
  • entity_types (list[str]): The entity extraction entity types to use. Default: ["organization", "person", "geo", "event"]
  • max_gleanings (int): The maximum number of entity gleanings to use. Default: 1
extract_graph_nlp
ExtractGraphNLPConfig
default:"ExtractGraphNLPConfig()"
The NLP-based graph extraction configuration to use. Used for fast indexing mode. Fields:
  • normalize_edge_weights (bool): Whether to normalize edge weights. Default: True
  • text_analyzer (TextAnalyzerDefaults): Text analyzer configuration
  • concurrent_requests (int): Number of concurrent requests. Default: 25
  • async_mode (AsyncType): Async mode to use. Default: AsyncType.Threaded
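A YAML sketch of the two graph extraction blocks above, using the documented defaults (the text_analyzer sub-block is omitted since its fields are documented separately):

```yaml
extract_graph:
  completion_model_id: default_completion_model
  entity_types: [organization, person, geo, event]
  max_gleanings: 1
extract_graph_nlp:   # used for fast indexing mode
  normalize_edge_weights: true
  concurrent_requests: 25
  async_mode: threaded
```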

Description summarization

summarize_descriptions
SummarizeDescriptionsConfig
default:"SummarizeDescriptionsConfig()"
The description summarization configuration to use. Fields:
  • prompt (str | None): The summarization prompt. Default: None
  • max_length (int): Maximum length in tokens. Default: 500
  • max_input_tokens (int): Maximum input tokens. Default: 4000
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • model_instance_name (str): Model instance name. Default: "summarize_descriptions"
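A YAML sketch of the summarization block with its documented defaults:

```yaml
summarize_descriptions:
  max_length: 500          # tokens
  max_input_tokens: 4000
  completion_model_id: default_completion_model
```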

Graph processing

prune_graph
PruneGraphConfig
default:"PruneGraphConfig()"
The graph pruning configuration to use. Fields:
  • min_node_freq (int): Minimum node frequency. Default: 2
  • max_node_freq_std (float | None): Maximum node frequency standard deviation. Default: None
  • min_node_degree (int): Minimum node degree. Default: 1
  • max_node_degree_std (float | None): Maximum node degree standard deviation. Default: None
  • min_edge_weight_pct (float): Minimum edge weight percentage. Default: 40.0
  • remove_ego_nodes (bool): Whether to remove ego nodes. Default: True
  • lcc_only (bool): Keep only largest connected component. Default: False
cluster_graph
ClusterGraphConfig
default:"ClusterGraphConfig()"
The cluster graph configuration to use. Fields:
  • max_cluster_size (int): The maximum cluster size to use. Default: 10
  • use_lcc (bool): Whether to use the largest connected component. Default: True
  • seed (int): The seed to use for the clustering. Default: 0xDEADBEEF
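A YAML sketch of the two graph processing blocks above, using the documented defaults:

```yaml
prune_graph:
  min_node_freq: 2
  min_node_degree: 1
  min_edge_weight_pct: 40.0
  remove_ego_nodes: true
  lcc_only: false
cluster_graph:
  max_cluster_size: 10
  use_lcc: true
  seed: 3735928559   # 0xDEADBEEF
```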

Claims extraction

extract_claims
ExtractClaimsConfig
default:"ExtractClaimsConfig(enabled=False)"
The claim extraction configuration to use. Fields:
  • enabled (bool): Whether claim extraction is enabled. Default: False
  • prompt (str | None): The extraction prompt. Default: None
  • description (str): Description of claims to extract. Default: "Any claims or facts that could be relevant to information discovery."
  • max_gleanings (int): Maximum number of gleanings. Default: 1
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • model_instance_name (str): Model instance name. Default: "extract_claims"
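Claim extraction is disabled by default; a YAML sketch that enables it with the documented defaults:

```yaml
extract_claims:
  enabled: true
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1
```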

Community reports

community_reports
CommunityReportsConfig
default:"CommunityReportsConfig()"
The community reports configuration to use. Fields:
  • completion_model_id (str): The model ID to use. Default: "default_completion_model"
  • model_instance_name (str): The model instance name. Default: "community_reporting"
  • graph_prompt (str | None): Prompt for graph-based summarization. Default: None
  • text_prompt (str | None): Prompt for text-based summarization. Default: None
  • max_length (int): Maximum length in tokens. Default: 2000
  • max_input_length (int): Maximum input length in tokens. Default: 8000
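A YAML sketch of the community reports block with its documented defaults:

```yaml
community_reports:
  completion_model_id: default_completion_model
  max_length: 2000        # tokens
  max_input_length: 8000  # tokens
```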

Snapshots

snapshots
SnapshotsConfig
default:"SnapshotsConfig()"
The snapshots configuration to use. Fields:
  • embeddings (bool): Whether to save embedding snapshots. Default: False
  • graphml (bool): Whether to save GraphML snapshots. Default: False
  • raw_graph (bool): Whether to save raw graph snapshots. Default: False
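All snapshots are off by default; a YAML sketch that turns on GraphML output for inspection:

```yaml
snapshots:
  graphml: true
  raw_graph: false
  embeddings: false
```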

Search configuration

Local search

The local search configuration. Fields:
  • prompt (str | None): The local search prompt to use. Default: None
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • embedding_model_id (str): Model ID for embeddings. Default: "default_embedding_model"
  • text_unit_prop (float): The text unit proportion. Default: 0.5
  • community_prop (float): The community proportion. Default: 0.15
  • conversation_history_max_turns (int): Maximum conversation turns. Default: 5
  • top_k_entities (int): Top k mapped entities. Default: 10
  • top_k_relationships (int): Top k mapped relationships. Default: 10
  • max_context_tokens (int): Maximum tokens. Default: 12000
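Assuming the local search block is keyed as local_search in YAML (the key name is not shown in this section), a sketch with the documented defaults:

```yaml
local_search:
  text_unit_prop: 0.5
  community_prop: 0.15
  conversation_history_max_turns: 5
  top_k_entities: 10
  top_k_relationships: 10
  max_context_tokens: 12000
```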

Global search

The global search configuration. Fields:
  • map_prompt (str | None): The global search mapper prompt. Default: None
  • reduce_prompt (str | None): The global search reducer prompt. Default: None
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • knowledge_prompt (str | None): The global search general prompt. Default: None
  • max_context_tokens (int): Maximum context size in tokens. Default: 12000
  • data_max_tokens (int): Data LLM maximum tokens. Default: 12000
  • map_max_length (int): Map LLM max response length in words. Default: 1000
  • reduce_max_length (int): Reduce LLM max response length in words. Default: 2000
  • dynamic_search_threshold (int): Rating threshold to include a community. Default: 1
  • dynamic_search_keep_parent (bool): Keep parent if child communities relevant. Default: False
  • dynamic_search_num_repeats (int): Number of times to rate same report. Default: 1
  • dynamic_search_use_summary (bool): Use community summary instead of full context. Default: False
  • dynamic_search_max_level (int): Maximum community hierarchy level. Default: 2
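Assuming the global search block is keyed as global_search in YAML (the key name is not shown in this section), a sketch with the documented defaults:

```yaml
global_search:
  max_context_tokens: 12000
  data_max_tokens: 12000
  map_max_length: 1000      # words
  reduce_max_length: 2000   # words
  dynamic_search_threshold: 1
  dynamic_search_max_level: 2
```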

DRIFT search

The DRIFT search configuration. Fields:
  • prompt (str | None): The DRIFT search prompt. Default: None
  • reduce_prompt (str | None): The reduce prompt. Default: None
  • data_max_tokens (int): Maximum data tokens. Default: 12000
  • reduce_max_tokens (int | None): Maximum reduce tokens. Default: None
  • reduce_temperature (float): Reduce temperature. Default: 0
  • reduce_max_completion_tokens (int | None): Max completion tokens. Default: None
  • concurrency (int): Concurrency level. Default: 32
  • drift_k_followups (int): Number of followup queries. Default: 20
  • primer_folds (int): Number of primer folds. Default: 5
  • primer_llm_max_tokens (int): Primer LLM max tokens. Default: 12000
  • n_depth (int): Search depth. Default: 3
  • Additional local search parameters for DRIFT's local search component (see the local search fields above)
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • embedding_model_id (str): Embedding model ID. Default: "default_embedding_model"

Basic search

The basic search configuration. Fields:
  • prompt (str | None): The basic search prompt. Default: None
  • k (int): Number of results to return. Default: 10
  • max_context_tokens (int): Maximum context tokens. Default: 12000
  • completion_model_id (str): Model ID to use. Default: "default_completion_model"
  • embedding_model_id (str): Embedding model ID. Default: "default_embedding_model"
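Assuming the DRIFT and basic search blocks are keyed as drift_search and basic_search in YAML (the key names are not shown in this section), a sketch with the documented defaults:

```yaml
drift_search:
  drift_k_followups: 20
  primer_folds: 5
  n_depth: 3
  concurrency: 32
basic_search:
  k: 10
  max_context_tokens: 12000
```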

Helper methods

The GraphRagConfig class provides helper methods to retrieve model configurations:

get_completion_model_config

def get_completion_model_config(self, model_id: str) -> ModelConfig
Get a completion model configuration by ID. Parameters:
  • model_id (str): The ID of the model to get. Should match a key in the completion_models dict.
Returns:
  • ModelConfig: The model configuration if found.
Raises:
  • ValueError: If the model ID is not found in the configuration.

get_embedding_model_config

def get_embedding_model_config(self, model_id: str) -> ModelConfig
Get an embedding model configuration by ID. Parameters:
  • model_id (str): The ID of the model to get. Should match a key in the embedding_models dict.
Returns:
  • ModelConfig: The model configuration if found.
Raises:
  • ValueError: If the model ID is not found in the configuration.
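The lookup contract of these helpers can be sketched with a minimal stand-in class. GraphRagConfigSketch and its ModelConfig stub below are hypothetical illustrations (as is the "gpt-4o" model name), not the real GraphRAG classes:

```python
class ModelConfig:
    """Hypothetical stub for the real ModelConfig."""
    def __init__(self, model_id: str):
        self.model_id = model_id

class GraphRagConfigSketch:
    """Hypothetical stand-in illustrating the helper-method contract."""
    def __init__(self, completion_models: dict, embedding_models: dict):
        self.completion_models = completion_models
        self.embedding_models = embedding_models

    def get_completion_model_config(self, model_id: str) -> ModelConfig:
        # Raises ValueError when the ID is not a key in completion_models.
        if model_id not in self.completion_models:
            raise ValueError(f"Model ID {model_id} not found in configuration")
        return self.completion_models[model_id]

config = GraphRagConfigSketch(
    completion_models={"default_completion_model": ModelConfig("gpt-4o")},
    embedding_models={},
)
print(config.get_completion_model_config("default_completion_model").model_id)
# prints: gpt-4o
```

An unknown ID raises ValueError rather than returning None, so callers can rely on always receiving a valid ModelConfig.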
