The prompt-tune command automatically generates domain-specific prompts by analyzing your input documents. The resulting prompts are better suited to your use case than the default generic prompts.
Usage
Options
- Project root (alias: -r): The project root directory.
- Domain: The domain your input data is related to. For example: “space science”, “microbiology”, “environmental news”, “legal documents”. If not specified, the domain is inferred automatically from the input data.
- Output directory (alias: -o): The directory to save generated prompts to, relative to the project root directory.
- --selection-method: The text chunk selection method for prompt generation. Available methods:
  - random: Randomly select chunks (fastest)
  - top: Select top chunks by order
  - auto: Automatically select diverse chunks using embeddings (best quality)
- --limit: The number of documents to load when --selection-method is random or top.
- --n-subset-max: The number of text chunks to embed when --selection-method=auto. Higher values provide more diversity but take longer.
- --k: The maximum number of documents to select from each centroid when --selection-method=auto.
- --max-tokens: The maximum token count for prompt generation examples.
- Minimum examples: The minimum number of examples to generate/include in the entity extraction prompt.
- Chunk size: The size of each example text chunk in tokens. Overrides chunking.size in the configuration file.
- Chunk overlap: The overlap size for chunking documents in tokens. Overrides chunking.overlap in the configuration file.
- --language: The primary language used for inputs and outputs in GraphRAG prompts. Examples: “English”, “Spanish”, “French”, “Japanese”. If not specified, defaults to English.
- Entity type discovery: Discover and extract unspecified entity types from the data. Use --no-discover-entity-types to disable and only use predefined entity types.
- Verbose (alias: -v): Run the prompt tuning pipeline with verbose logging.
Examples
Basic prompt tuning
Generate prompts with automatic domain inference.
Specify domain
Use auto selection for better quality
The auto selection method uses embeddings to select diverse, representative samples from your documents.
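A run like this can also be assembled from a script. The following is a minimal sketch, not part of the tool itself: it builds an argument list using only flags named on this page. The executable name graphrag and the helper build_prompt_tune_args are assumptions for illustration, so confirm the exact spelling against your installed CLI's help output before running anything.

```python
def build_prompt_tune_args(
    selection_method=None,
    limit=None,
    n_subset_max=None,
    k=None,
    max_tokens=None,
    language=None,
    discover_entity_types=True,
    executable="graphrag",  # assumption: name of the CLI entry point
):
    """Return an argument list for a prompt-tune run (hypothetical helper)."""
    args = [executable, "prompt-tune"]
    valued_flags = [
        ("--selection-method", selection_method),
        ("--limit", limit),
        ("--n-subset-max", n_subset_max),
        ("--k", k),
        ("--max-tokens", max_tokens),
        ("--language", language),
    ]
    for flag, value in valued_flags:
        if value is not None:  # omitted options are left to the CLI defaults
            args += [flag, str(value)]
    if not discover_entity_types:
        args.append("--no-discover-entity-types")
    return args


# Auto selection with an explicit embedding pool size (the value is illustrative).
auto_args = build_prompt_tune_args(selection_method="auto", n_subset_max=500)
print(auto_args)
```

The returned list can be handed to subprocess.run(auto_args, check=True) once the flags have been verified for your installation.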
Customize output directory
Multilingual prompt generation
Increase sample diversity
Use predefined entity types only
Custom chunking parameters
Verbose output
Generated prompts
The prompt-tune command generates three customized prompt files:
extract_graph.txt
Prompt for extracting entities and relationships from text. This is customized with:
- Domain-specific entity types discovered from your data
- Example extractions from your actual documents
- Domain context and terminology
summarize_descriptions.txt
Prompt for summarizing entity descriptions. Customized with:
- Domain-specific summarization guidelines
- Examples from your data
community_report.txt
Prompt for generating community reports. Customized with:
- Domain-appropriate report structure
- Relevant analysis dimensions for your domain
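Before reviewing the prompts, it can help to confirm that all three files were actually written. The sketch below is one way to check, assuming the files land directly in the output directory under the names listed above; missing_prompt_files is a hypothetical helper, not part of the tool.

```python
import tempfile
from pathlib import Path

# File names as listed in this section.
EXPECTED_PROMPTS = (
    "extract_graph.txt",
    "summarize_descriptions.txt",
    "community_report.txt",
)


def missing_prompt_files(output_dir):
    """Return the expected prompt files that are absent or empty in output_dir."""
    directory = Path(output_dir)
    missing = []
    for name in EXPECTED_PROMPTS:
        path = directory / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# Demo against a throwaway directory that contains only one of the three files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "extract_graph.txt").write_text("placeholder prompt")
    demo_missing = missing_prompt_files(tmp)
print(demo_missing)
```

In a real project you would call missing_prompt_files on your prompts directory (or your custom output directory) and expect an empty list.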
How it works
- Sample selection: Selects representative text chunks from your documents using the specified selection method
- Domain analysis: Analyzes the samples to understand domain characteristics and terminology
- Entity discovery: Identifies domain-specific entity types present in your data
- Example generation: Creates few-shot examples by extracting entities from sample chunks
- Prompt creation: Generates prompts incorporating discovered entity types and examples
Selection methods
Random selection
- Fastest method
- Randomly samples chunks from documents
- Good for homogeneous datasets
- Use --limit to control sample size
Top selection
- Selects first N chunks in order
- Fast and deterministic
- Good when early content is representative
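The difference between random and top selection can be sketched in a few lines. This is an illustration of the two strategies as described above, not the tool's actual code; select_random, select_top, and the seed parameter are hypothetical names, and limit mirrors the role of --limit.

```python
import random


def select_top(chunks, limit):
    """Return the first `limit` chunks in their original order (deterministic)."""
    return chunks[:limit]


def select_random(chunks, limit, seed=None):
    """Return up to `limit` chunks sampled without replacement."""
    rng = random.Random(seed)  # passing a seed makes the sample reproducible
    return rng.sample(chunks, min(limit, len(chunks)))


chunks = [f"chunk-{i}" for i in range(10)]
top_sample = select_top(chunks, 3)
random_sample = select_random(chunks, 3, seed=0)
print(top_sample)
print(random_sample)
```

Top selection always returns the same leading chunks, which is why it only works well when early content is representative. Random selection spreads the sample across the whole collection at the cost of run-to-run variation unless a seed is fixed.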
Auto selection (recommended)
- Uses embeddings to find diverse, representative samples
- Creates embeddings for up to --n-subset-max chunks
- Clusters chunks and selects up to --k from each cluster
- Best quality but slower due to embedding generation
- Recommended for heterogeneous datasets
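The idea behind auto selection can be illustrated with a small, dependency-free sketch: cap the pool of chunks, cluster their embeddings, and keep the chunks closest to each cluster centroid. This is a simplified stand-in under stated assumptions, not the tool's implementation: the embeddings are toy 2D vectors, the clustering is a basic k-means with farthest-point initialization, and n_clusters is a hypothetical parameter with no corresponding flag on this page.

```python
import math


def _mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def _farthest_point_init(vectors, n_clusters):
    """Pick starting centroids greedily so they are spread far apart."""
    centroids = [list(vectors[0])]
    while len(centroids) < n_clusters:
        farthest = max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def select_auto(embeddings, n_clusters, k, n_subset_max=None, iterations=10):
    """Return sorted indices of up to `k` chunks nearest each cluster centroid."""
    subset = embeddings[:n_subset_max]  # cap how many chunks are considered
    if not subset:
        return []
    centroids = _farthest_point_init(subset, min(n_clusters, len(subset)))
    for _ in range(max(1, iterations)):
        # Assignment step: attach every chunk to its nearest centroid.
        clusters = [[] for _ in centroids]
        for index, vector in enumerate(subset):
            nearest = min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))
            clusters[nearest].append(index)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            _mean([subset[i] for i in members]) if members else centroid
            for members, centroid in zip(clusters, centroids)
        ]
    selected = []
    for members, centroid in zip(clusters, centroids):
        ranked = sorted(members, key=lambda i: math.dist(subset[i], centroid))
        selected.extend(ranked[:k])
    return sorted(selected)


# Two well-separated groups of toy "embeddings".
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.2, 5.0], [5.0, 5.1]]
picked = select_auto(toy_embeddings, n_clusters=2, k=1)
print(picked)
```

With one pick per cluster, the sample contains a chunk from each group rather than several near-duplicates from one, which is the property that makes this approach useful for heterogeneous datasets.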
Best practices
- Use auto selection: For best results, use --selection-method auto with an adequate sample size
- Specify domain: Providing a domain helps generate more focused prompts
- Diverse samples: Ensure your input documents are representative of your full dataset
- Review outputs: Always review generated prompts before using them for indexing
- Iterate: Try different selection methods and parameters to find what works best
Customizing generated prompts
After generation, you can manually edit the prompts in the output directory:
- Review the generated entity types and examples
- Add or remove entity types as needed
- Adjust the prompt instructions for your specific needs
- Test with a small index before running on the full dataset
Using tuned prompts
After generating prompts:
- Review the prompts in the prompts/ directory (or your custom output directory)
- Copy them to your project’s prompts/ directory if using a custom output location
- Run indexing; it will automatically use the new prompts
Performance considerations
- Auto selection: Slower due to embedding generation but produces better quality
- Sample size: Larger --n-subset-max values take longer but capture more diversity
- Token limits: Higher --max-tokens allows for more examples but increases API costs
Error handling
Common issues:
- No input documents: Ensure documents exist in your input/ directory
- API errors: Check that your API key is configured in .env
- Insufficient samples: Increase --limit or --n-subset-max if you get too few examples
- Language mismatch: Ensure --language matches your document language
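The first two issues can be caught before a run starts. The sketch below is a hypothetical pre-flight check, assuming the input/ directory and .env file sit directly under the project root; it only verifies that the files exist and are non-empty, and does not validate the API key itself.

```python
import tempfile
from pathlib import Path


def preflight_issues(project_root):
    """Return human-readable problems found before running prompt tuning."""
    root = Path(project_root)
    issues = []
    input_dir = root / "input"
    has_documents = input_dir.is_dir() and any(p.is_file() for p in input_dir.iterdir())
    if not has_documents:
        issues.append("no input documents found in input/")
    env_file = root / ".env"
    if not env_file.is_file() or not env_file.read_text().strip():
        issues.append(".env is missing or empty, so no API key can be loaded")
    return issues


# Demo: an empty project fails both checks, a populated one passes.
with tempfile.TemporaryDirectory() as tmp:
    issues_before = preflight_issues(tmp)
    Path(tmp, "input").mkdir()
    Path(tmp, "input", "doc.txt").write_text("sample document")
    # The variable name below is a placeholder, not the key name the tool expects.
    Path(tmp, ".env").write_text("PLACEHOLDER_API_KEY=changeme\n")
    issues_after = preflight_issues(tmp)
print(issues_before)
print(issues_after)
```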