The prompt-tune command automatically generates domain-specific prompts by analyzing your input documents. This creates prompts that are better suited to your specific use case than the default generic prompts.

Usage

graphrag prompt-tune [OPTIONS]

Options

--root
string
default:"current directory"
The project root directory. Aliases: -r
--domain
string
The domain your input data is related to. For example: “space science”, “microbiology”, “environmental news”, “legal documents”. If not specified, the domain will be automatically inferred from the input data.
--output
string
default:"prompts"
The directory to save generated prompts to, relative to the project root directory. Aliases: -o
--selection-method
string
default:"random"
The text chunk selection method for prompt generation. Available methods:
  • random - Randomly select chunks (fastest)
  • top - Select top chunks by order
  • auto - Automatically select diverse chunks using embeddings (best quality)
--limit
integer
default:"15"
The number of documents to load when --selection-method is random or top.
--n-subset-max
integer
default:"300"
The number of text chunks to embed when --selection-method=auto. Higher values provide more diversity but take longer.
--k
integer
default:"15"
The maximum number of documents to select from each centroid when --selection-method=auto.
--max-tokens
integer
default:"2000"
The maximum token count for prompt generation examples.
--min-examples-required
integer
default:"2"
The minimum number of examples to generate and include in the entity extraction prompt.
--chunk-size
integer
default:"1200"
The size of each example text chunk in tokens. Overrides chunking.size in the configuration file.
--overlap
integer
default:"100"
The overlap size for chunking documents in tokens. Overrides chunking.overlap in the configuration file.
--language
string
The primary language used for inputs and outputs in GraphRAG prompts. Examples: “English”, “Spanish”, “French”, “Japanese”. If not specified, defaults to English.
--discover-entity-types
boolean
default:"true"
Discover and extract unspecified entity types from the data. Use --no-discover-entity-types to disable this and use only predefined entity types.
--verbose
boolean
default:"false"
Run the prompt tuning pipeline with verbose logging. Aliases: -v

Examples

Basic prompt tuning

Generate prompts with automatic domain inference:
graphrag prompt-tune

Specify domain

graphrag prompt-tune --domain "medical research"

Use auto selection for better quality

graphrag prompt-tune --selection-method auto
The auto selection method uses embeddings to select diverse, representative samples from your documents.

Customize output directory

graphrag prompt-tune --output custom_prompts

Multilingual prompt generation

graphrag prompt-tune --language "Spanish" --domain "noticias políticas"

Increase sample diversity

graphrag prompt-tune \
  --selection-method auto \
  --n-subset-max 500 \
  --k 20

Use predefined entity types only

graphrag prompt-tune --no-discover-entity-types

Custom chunking parameters

graphrag prompt-tune \
  --chunk-size 800 \
  --overlap 200
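
The effect of --chunk-size and --overlap can be sketched in Python. This is a simplified illustration, not GraphRAG's actual chunker; the function name is hypothetical:

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split a token sequence into chunks of up to `size` tokens, with
    consecutive chunks sharing `overlap` tokens of context
    (mirrors --chunk-size and --overlap)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# Small illustration: 10 tokens, chunks of 4, overlap of 1.
chunks = chunk_tokens(list(range(10)), size=4, overlap=1)
```

A larger overlap preserves more context across chunk boundaries at the cost of more (and more redundant) chunks.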

Verbose output

graphrag prompt-tune --verbose

Generated prompts

The prompt-tune command generates three customized prompt files:

extract_graph.txt

Prompt for extracting entities and relationships from text. This is customized with:
  • Domain-specific entity types discovered from your data
  • Example extractions from your actual documents
  • Domain context and terminology

summarize_descriptions.txt

Prompt for summarizing entity descriptions. Customized with:
  • Domain-specific summarization guidelines
  • Examples from your data

community_report.txt

Prompt for generating community reports. Customized with:
  • Domain-appropriate report structure
  • Relevant analysis dimensions for your domain

How it works

  1. Sample selection: Selects representative text chunks from your documents using the specified selection method
  2. Domain analysis: Analyzes the samples to understand domain characteristics and terminology
  3. Entity discovery: Identifies domain-specific entity types present in your data
  4. Example generation: Creates few-shot examples by extracting entities from sample chunks
  5. Prompt creation: Generates prompts incorporating discovered entity types and examples
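
These stages can be sketched as plain functions. The names and stub bodies below are hypothetical, not GraphRAG internals; the real pipeline delegates the analysis and generation steps to an LLM:

```python
def select_samples(chunks, limit=2):
    """1. Sample selection: pick representative chunks (here: first N)."""
    return chunks[:limit]

def infer_domain(samples):
    """2. Domain analysis: the real tool asks an LLM; stubbed here."""
    return "example domain"

def discover_entity_types(samples):
    """3. Entity discovery: stubbed list of domain-specific types."""
    return ["organization", "person"]

def generate_examples(samples, entity_types):
    """4. Example generation: pair each sample with its entity types."""
    return [(text, entity_types) for text in samples]

def build_prompts(domain, entity_types, examples):
    """5. Prompt creation: assemble the final extraction prompt text."""
    header = f"Domain: {domain}\nEntity types: {', '.join(entity_types)}"
    body = "\n".join(f"Example: {text}" for text, _ in examples)
    return header + "\n" + body

chunks = ["Acme Corp hired Jane Doe.", "Jane Doe leads the lab."]
samples = select_samples(chunks)
types = discover_entity_types(samples)
prompt = build_prompts(infer_domain(samples), types,
                       generate_examples(samples, types))
```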

Selection methods

Random selection

  • Fastest method
  • Randomly samples chunks from documents
  • Good for homogeneous datasets
  • Use --limit to control sample size

Top selection

  • Selects the first N chunks in document order
  • Fast and deterministic
  • Good when early content is representative

Auto selection

  • Uses embeddings to find diverse, representative samples
  • Creates embeddings for up to --n-subset-max chunks
  • Clusters chunks and selects up to --k from each cluster
  • Best quality but slower due to embedding generation
  • Recommended for heterogeneous datasets
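
The three methods can be sketched as follows. This is a simplified illustration using a naive k-means over precomputed embeddings, not GraphRAG's actual implementation:

```python
import random

def select_random(chunks, limit, seed=0):
    """random: uniformly sample `limit` chunks (fastest)."""
    return random.Random(seed).sample(chunks, min(limit, len(chunks)))

def select_top(chunks, limit):
    """top: take the first `limit` chunks in document order."""
    return chunks[:limit]

def select_auto(embeddings, k, n_clusters=2, iters=10):
    """auto (simplified): cluster chunk embeddings with a naive k-means,
    then keep up to `k` chunk indices per centroid."""
    centroids = [list(e) for e in embeddings[:n_clusters]]
    clusters = []
    for _ in range(iters):
        # Assign each chunk to its nearest centroid (squared distance).
        clusters = [[] for _ in range(n_clusters)]
        for idx, e in enumerate(embeddings):
            nearest = min(
                range(n_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(e, centroids[c])),
            )
            clusters[nearest].append(idx)
        # Recompute each centroid as the mean of its members.
        centroids = [
            [sum(vals) / len(members)
             for vals in zip(*(embeddings[i] for i in members))]
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return [i for members in clusters for i in members[:k]]
```

Sampling one chunk per cluster is what makes auto selection capture diverse regions of the corpus rather than whatever random or top happens to hit.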

Best practices

  1. Use auto selection: For best results, use --selection-method auto with adequate sample size
  2. Specify domain: Providing a domain helps generate more focused prompts
  3. Diverse samples: Ensure your input documents are representative of your full dataset
  4. Review outputs: Always review generated prompts before using them for indexing
  5. Iterate: Try different selection methods and parameters to find what works best

Customizing generated prompts

After generation, you can manually edit the prompts in the output directory:
  1. Review the generated entity types and examples
  2. Add or remove entity types as needed
  3. Adjust the prompt instructions for your specific needs
  4. Test with a small index before running on full dataset

Using tuned prompts

After generating prompts:
  1. Review the prompts in the prompts/ directory (or your custom output directory)
  2. Copy them to your project’s prompts/ directory if using a custom output location
  3. Run indexing - it will automatically use the new prompts:
    graphrag index
    

Performance considerations

  • Auto selection: Slower due to embedding generation but produces better quality
  • Sample size: Larger --n-subset-max values take longer but capture more diversity
  • Token limits: Higher --max-tokens allows for more examples but increases API costs

Error handling

Common issues:
  • No input documents: Ensure documents exist in your input/ directory
  • API errors: Check that your API key is configured in .env
  • Insufficient samples: Increase --limit or --n-subset-max if you get too few examples
  • Language mismatch: Ensure --language matches your document language
