GraphRAG uses language models for various tasks including entity extraction, summarization, and query responses. This page covers how to configure and customize your language models.
## Model support

GraphRAG uses LiteLLM to support 100+ language models from various providers, including:

- OpenAI (GPT-4, GPT-4 Turbo, GPT-4o, o1)
- Azure OpenAI
- Anthropic (Claude)
- Google (Gemini)
- AWS Bedrock
- Local models via Ollama or LiteLLM Proxy
GraphRAG has been most thoroughly tested with OpenAI’s GPT-4 series models. Other models are supported but may require additional prompt tuning.
## Basic configuration

### OpenAI models

The default configuration uses OpenAI models:
```yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4.1
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}

embedding_models:
  default_embedding_model:
    model_provider: openai
    model: text-embedding-3-large
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
```
### Azure OpenAI

For Azure-hosted models:
```yaml
completion_models:
  default_completion_model:
    model_provider: azure
    model: gpt-4
    auth_method: api_key
    api_key: ${AZURE_OPENAI_API_KEY}
    api_base: https://your-resource.openai.azure.com/
    api_version: "2024-02-15-preview"
    azure_deployment_name: gpt-4-deployment
```
If your Azure deployment name matches the model name, you can omit `azure_deployment_name`.
### Azure Managed Identity

For production environments using managed identity:
```yaml
completion_models:
  default_completion_model:
    model_provider: azure
    model: gpt-4
    auth_method: azure_managed_identity
    api_base: https://your-resource.openai.azure.com/
    api_version: "2024-02-15-preview"
```
### Other providers

Configure other providers using LiteLLM's format:
```yaml
completion_models:
  gemini_model:
    model_provider: gemini
    model: gemini-2.5-flash-lite
    auth_method: api_key
    api_key: ${GEMINI_API_KEY}

embedding_models:
  gemini_embedding:
    model_provider: gemini
    model: gemini-embedding-001
    auth_method: api_key
    api_key: ${GEMINI_API_KEY}
```
See LiteLLM's documentation for provider-specific configuration. In LiteLLM's `provider/model` string format, `model_provider` corresponds to the prefix before the `/` and `model` to the suffix after it.
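As an illustration, the LiteLLM model string `anthropic/claude-3-5-sonnet-20241022` would split into the two fields like this (the `claude_model` key and the `ANTHROPIC_API_KEY` variable name are illustrative, not prescribed):

```yaml
# LiteLLM model string: anthropic/claude-3-5-sonnet-20241022
completion_models:
  claude_model:
    model_provider: anthropic              # prefix before the /
    model: claude-3-5-sonnet-20241022      # suffix after the /
    auth_method: api_key
    api_key: ${ANTHROPIC_API_KEY}
```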
## Model configuration options

### Required parameters

- `model_provider`: The model provider (e.g., `openai`, `azure`, `anthropic`, `gemini`)
- `model`: The specific model name (e.g., `gpt-4.1`, `claude-3-5-sonnet-20241022`)
- `auth_method`: Authentication method: `api_key` or `azure_managed_identity`
### Optional parameters

- `type`: LLM provider type: `litellm` or `mock` (for testing)
- `api_key`: API key for authentication (use environment variables)
- `api_base`: Base URL for API requests (required for Azure and custom endpoints)
- `azure_deployment_name`: Azure deployment name, if different from the model name
- `call_args`: Default arguments sent with every request (e.g., `temperature`, `max_tokens`, `n`)
### Call arguments

Set default parameters for all model calls:
```yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4.1
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
    call_args:
      temperature: 0.7
      max_tokens: 2000
      top_p: 0.95
```
### Retry configuration

Configure automatic retry behavior for failed requests:
```yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4.1
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
    retry:
      type: exponential_backoff
      max_retries: 7
      base_delay: 2.0
      max_delay: 60.0
      jitter: true
```
- `retry.type` (string, default `"exponential_backoff"`): Retry strategy, either `exponential_backoff` or `immediate`
- `retry.max_retries`: Maximum number of retry attempts
- `retry.base_delay`: Base delay in seconds for exponential backoff
- `retry.max_delay`: Maximum delay in seconds between retries (no limit if not specified)
- `retry.jitter`: Add random jitter to retry delays
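GraphRAG's exact retry timing is internal to the library, but exponential backoff with these settings conventionally behaves like the following sketch (the function name and 0-indexed attempt counter are illustrative assumptions, not GraphRAG APIs):

```python
import random


def backoff_delay(attempt: int, base_delay: float = 2.0,
                  max_delay: float = 60.0, jitter: bool = True) -> float:
    """Seconds to wait before retry number `attempt` (0-indexed).

    The delay doubles on each attempt, starting from `base_delay`
    and capped at `max_delay`.
    """
    delay = min(base_delay * (2 ** attempt), max_delay)
    if jitter:
        # Randomize within [0, delay] so concurrent clients
        # don't all retry at the same instant.
        delay = random.uniform(0, delay)
    return delay
```

With `base_delay: 2.0` and `max_delay: 60.0`, the un-jittered delays would run 2, 4, 8, 16, 32, 60, 60 seconds across seven retries.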
### Rate limiting

Control request rate to avoid hitting API limits:
```yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4.1
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
    rate_limit:
      type: sliding_window
      period_in_seconds: 60
      requests_per_period: 100
      tokens_per_period: 150000
```
- `rate_limit.type` (string, default `"sliding_window"`): Rate limiting strategy (currently only `sliding_window` is supported)
- `rate_limit.period_in_seconds`: Time window for rate limiting, in seconds
- `rate_limit.requests_per_period`: Maximum requests per time window
- `rate_limit.tokens_per_period`: Maximum tokens per time window
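To make the sliding-window semantics concrete, here is a minimal request-count limiter sketch (the class and method names are illustrative, not part of GraphRAG; token accounting is omitted for brevity):

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `requests_per_period` calls in any
    `period_in_seconds`-long window, measured from each call's timestamp."""

    def __init__(self, requests_per_period: int, period_in_seconds: float):
        self.limit = requests_per_period
        self.period = period_in_seconds
        self.timestamps = deque()  # monotonic times of recent requests

    def acquire(self, now=None):
        """Block until a request slot is free, then record the request."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.period:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            # Window is full: wait until the oldest request exits it.
            time.sleep(self.period - (now - self.timestamps[0]))
            now = time.monotonic()
        self.timestamps.append(now)
```

Unlike a fixed-window counter that resets on period boundaries, a sliding window never permits more than the limit in *any* interval of the configured length, which matches how most provider-side limits are enforced.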
### Metrics configuration

Track model usage and performance:
```yaml
completion_models:
  default_completion_model:
    model_provider: openai
    model: gpt-4.1
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
    metrics:
      type: default
      store: memory
      writer: log        # or 'file'
      log_level: 20      # INFO
      base_dir: ./metrics  # for file writer
```
- `metrics.writer`: Where to write metrics: `log` (console) or `file`
- `metrics.base_dir`: Directory for metrics files (when using the `file` writer)
## Multiple model configuration

Define different models for different tasks:
```yaml
completion_models:
  extraction_model:
    model_provider: openai
    model: gpt-4o
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
  query_model:
    model_provider: openai
    model: o1
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
  budget_model:
    model_provider: openai
    model: gpt-4o-mini
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}

embedding_models:
  default_embedding_model:
    model_provider: openai
    model: text-embedding-3-large
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}

# Reference models in workflows
extract_graph:
  completion_model_id: extraction_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  completion_model_id: budget_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

community_reports:
  completion_model_id: extraction_model
  graph_prompt: "prompts/community_report_graph.txt"
  max_length: 2000

global_search:
  completion_model_id: query_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
```
Use different models to optimize the cost vs. quality tradeoff. For example:

- High-end models (GPT-4o, o1) for critical extraction and queries
- Mid-tier models (GPT-4o-mini) for summarization
- Budget models for less critical tasks
## Model selection considerations

### GPT-4 series (recommended)

GraphRAG has been thoroughly tested with:

- `gpt-4`: the original GPT-4 model
- `gpt-4-turbo`: faster GPT-4 with a larger context window
- `gpt-4o`: optimized multimodal model
- `gpt-4o-mini`: smaller, faster, and more affordable
### o-series models (reasoning)

The o-series models include built-in reasoning:

- `o1`: advanced reasoning model
- `o1-mini`: smaller reasoning model
o-series models have different parameters:

- Use `max_completion_tokens` instead of `max_tokens`
- Reasoning tokens count toward usage but are separate from visible output
- They are slower and more expensive than standard models
- They may require prompt adjustments (less explicit chain-of-thought)
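A configuration for an o-series model might therefore look like this sketch (the `reasoning_model` key and the token budget value are illustrative):

```yaml
completion_models:
  reasoning_model:
    model_provider: openai
    model: o1
    auth_method: api_key
    api_key: ${GRAPHRAG_API_KEY}
    call_args:
      # o-series models expect max_completion_tokens, not max_tokens
      max_completion_tokens: 4000
```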
### Structured output requirements
Your chosen model must support structured outputs with JSON schema validation. Most modern models support this, but verify before using custom models.
## Using custom models

### Via proxy servers

Use Ollama or LiteLLM Proxy to connect unsupported models:
```yaml
completion_models:
  local_model:
    model_provider: openai  # Use OpenAI-compatible endpoint
    model: llama3
    auth_method: api_key
    api_key: not-needed
    api_base: http://localhost:11434/v1  # Ollama endpoint
```
Custom models may produce malformed JSON responses. Your proxy may need to:

- Validate and fix JSON formatting
- Handle schema validation errors
- Ensure structured output compliance
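The kind of cleanup a proxy might apply can be sketched as follows (this is an illustrative helper, not part of GraphRAG or any proxy; it handles two common failure modes, markdown code fences and surrounding chatter):

```python
import json


def coerce_json(raw: str) -> dict:
    """Best-effort recovery of a JSON object from model output.

    Strips ``` fences and trims any text outside the outermost braces
    before parsing; raises if no object is present at all.
    """
    text = raw.strip()
    # Remove markdown fences if the model wrapped its answer in ```json ... ```
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    # Keep only the outermost {...} span and parse it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])
```

Schema validation errors (a parseable object with the wrong shape) need a different remedy, typically re-prompting the model with the validation failure included.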
### Via model protocol (library usage)

For programmatic use, implement the model protocol:
```python
from graphrag_llm.completion import LLMCompletion, register_completion


class MyCustomCompletionModel(LLMCompletion):
    # Implement required methods
    async def complete(self, prompt: str, **kwargs):
        # Your implementation
        pass


# Register your model
register_completion("my-custom-model", MyCustomCompletionModel)
```
Then reference it in config:

```yaml
completion_models:
  custom_model:
    type: my-custom-model
    # Additional config passed to your implementation
```
Custom model implementations are only supported when using GraphRAG as a Python library, not via the CLI.
## Best practices

- **Start with defaults**: Begin with GPT-4o or GPT-4o-mini for reliable results.
- **Configure retry and rate limiting**: Set appropriate retry logic and rate limits to handle API issues.
- **Use environment variables**: Never hardcode API keys; use `.env` files and `${VAR}` substitution.
- **Monitor costs**: Enable metrics to track token usage and optimize model selection.
- **Test before production**: Validate model performance on sample data before full indexing.
## Next steps

- **Settings reference**: complete configuration options
- **Storage**: configure storage and caching
- **Prompt tuning**: optimize prompts for your models
- **Start indexing**: begin processing documents