Configuration is defined in a `settings.yaml` or `settings.json` file in your project root. This page documents all available configuration options.
## Environment variable substitution
Configuration values can reference environment variables using `${VAR_NAME}` syntax:
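For example, an API key can be pulled from the environment instead of being stored in the file. The section and key names below are illustrative, not prescriptive:

```yaml
models:
  default_chat_model:
    api_key: ${OPENAI_API_KEY}  # replaced with the env var's value at load time
```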
If a `.env` file is present in your project root, environment variables will be loaded automatically.

## Language model configuration
### Completion models
Define completion models for text generation tasks:
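A minimal sketch of a completion model entry. The field names and values here are assumptions about a typical setup and may differ in your version:

```yaml
models:
  default_chat_model:          # referenced elsewhere by this id
    type: openai_chat          # assumed provider type
    model: gpt-4o              # assumed model name
    api_key: ${OPENAI_API_KEY}
```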
### Embedding models

Define embedding models for vector generation:
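An embedding model entry follows the same shape as a completion model. Again, field names and values are illustrative assumptions:

```yaml
models:
  default_embedding_model:
    type: openai_embedding     # assumed provider type
    model: text-embedding-3-small
    api_key: ${OPENAI_API_KEY}
```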
## Input configuration

### Input settings
Configure document input format and location:

- Input data format: `text`, `csv`, `json`, or `jsonl`
- Character encoding for input files
- Regex pattern to match input files (defaults based on type)
- Column name for document IDs (CSV/JSON only)
- Column name for document titles (CSV/JSON only)
- Column name for document text content (CSV/JSON only)
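The options above might be expressed as follows for a CSV input. The field names are assumptions for illustration:

```yaml
input:
  file_type: csv             # text, csv, json, or jsonl
  encoding: utf-8            # character encoding for input files
  file_pattern: ".*\\.csv$"  # regex to match input files
  id_column: doc_id          # CSV/JSON only
  title_column: title        # CSV/JSON only
  text_column: body          # CSV/JSON only
```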
## Chunking configuration
Configure how documents are split into chunks:

- Chunking strategy: `tokens` or `sentence`
- Maximum chunk size in tokens
- Number of overlapping tokens between chunks
- Tokenizer model for splitting text
- Document metadata fields to prepend to each chunk
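A sketch of a chunking section mapping to the settings above (field names assumed, not authoritative):

```yaml
chunks:
  strategy: tokens             # tokens or sentence
  size: 1200                   # maximum chunk size in tokens
  overlap: 100                 # overlapping tokens between chunks
  encoding_model: cl100k_base  # tokenizer used for splitting
```

Overlap lets entities that straddle a chunk boundary appear whole in at least one chunk, at the cost of some duplicated tokens.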
## Storage configuration
See the Storage page for detailed storage configuration.

## Workflow configurations
### Text embedding
Configure text embedding generation:

- Reference to embedding model configuration
- Maximum number of texts to embed in one batch
- Maximum total tokens per batch
- Which embeddings to generate: `text_unit_text`, `entity_description`, `community_full_content`
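For illustration, these options could look like the following (field names are assumptions):

```yaml
embed_text:
  model_id: default_embedding_model   # reference to an embedding model config
  batch_size: 16                      # max texts per batch
  batch_max_tokens: 8191              # max total tokens per batch
  names: [text_unit_text, entity_description]
```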
### Graph extraction

Configure LLM-based entity and relationship extraction:

- Reference to completion model configuration
- Path to extraction prompt file
- List of entity types to extract
- Number of additional extraction passes for thoroughness
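A plausible shape for this section (field names and paths assumed for illustration):

```yaml
extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1   # additional extraction passes for thoroughness
```

Extra passes ("gleanings") re-prompt the model over the same chunk to catch entities missed on the first pass, trading token cost for recall.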
### NLP-based extraction
Configure NLP-based graph extraction (alternative to LLM):

### Description summarization

Configure entity and relationship description summarization:

- Maximum output tokens per summary
- Maximum input tokens to collect for summarization
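For example, summarization limits might be expressed as (field names assumed):

```yaml
summarize_descriptions:
  model_id: default_chat_model
  max_length: 500          # max output tokens per summary
  max_input_length: 4000   # max input tokens collected for summarization
```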
### Graph clustering
Configure Leiden hierarchical clustering:

- Maximum cluster size for export
- Whether to use only the largest connected component
- Random seed for consistent clustering results
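A sketch of these options (field names assumed):

```yaml
cluster_graph:
  max_cluster_size: 10   # maximum cluster size for export
  use_lcc: true          # cluster only the largest connected component
  seed: 42               # fixed seed for reproducible clustering
```

Pinning the seed matters because Leiden is stochastic; without it, repeated runs over the same graph can produce different community assignments.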
### Graph pruning
Configure optional graph pruning to optimize modularity:

- Minimum node frequency to retain
- Minimum node degree (connections) to retain
- Minimum edge weight percentile to retain
- Remove ego nodes (nodes connected to everything)
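One possible shape for a pruning section; the field names below are guesses for illustration only:

```yaml
prune_graph:
  min_node_freq: 2          # drop nodes seen fewer than this many times
  min_node_degree: 1        # drop nodes with fewer connections than this
  min_edge_weight_pct: 40   # drop edges below this weight percentile
  remove_ego_nodes: true    # drop hub nodes connected to everything
```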
### Community reports
Configure community report generation:

- Prompt for graph-based community summarization
- Prompt for text-based community summarization
- Maximum output tokens per report
- Maximum input tokens for report generation
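These options might look like the following (field names and prompt paths are illustrative assumptions):

```yaml
community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000         # max output tokens per report
  max_input_length: 8000   # max input tokens for report generation
```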
### Claim extraction
Configure optional claim extraction:
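A minimal sketch of enabling this optional step (field names and the prompt path are assumptions):

```yaml
extract_claims:
  enabled: true
  prompt: "prompts/extract_claims.txt"
  max_gleanings: 1
```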
### Snapshots

Configure optional data snapshots:

- Export embeddings to parquet files
- Export graph to GraphML format
- Export raw extracted graph before merging
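For illustration, the three toggles above might be set like this (field names assumed):

```yaml
snapshots:
  embeddings: true   # export embeddings to parquet files
  graphml: true      # export graph in GraphML format
  raw_graph: false   # export raw extracted graph before merging
```

GraphML snapshots are handy for inspecting the graph in external tools such as Gephi.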
## Query configurations
### Local search
Configure local search for targeted queries:
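A minimal sketch of a local search section; the field names and prompt path are illustrative guesses, not a definitive schema:

```yaml
local_search:
  prompt: "prompts/local_search_system_prompt.txt"
  top_k_entities: 10   # matched entities pulled into the query context
```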
### Global search

Configure global search for broad queries:
### DRIFT search

Configure DRIFT search for iterative exploration:
### Basic search

Configure basic vector search:

## Advanced settings
### Workflows
Override the default workflow execution order:

Most users don’t need to customize workflows. Only specify this if you want precise control over execution order.
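The override takes the form of an ordered list. The workflow names below are placeholders for illustration; check your version's defaults for the real names:

```yaml
workflows:
  - create_base_text_units
  - extract_graph
  - create_communities
```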
### Concurrency settings
Control global concurrency for async operations:
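For example, concurrency is often capped per model. The field names here are assumptions:

```yaml
models:
  default_chat_model:
    concurrent_requests: 25   # max simultaneous in-flight requests
```

Lowering this cap is a common fix for provider rate-limit errors during indexing.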
## Next steps

- **LLM models**: Detailed guide to configuring language models
- **Storage**: Configure storage backends and caching
- **Initialization**: Learn about the `init` command
- **Start indexing**: Begin processing your documents