Each domain has a config.yaml file that controls processing behavior, deduplication settings, performance tuning, and more. This page documents all available configuration options.
Configuration File Location
- configs/guantanamo/config.yaml
- configs/soviet_afghan_war/config.yaml
- configs/template/config.yaml.template
Basic Configuration
Domain Identification
Domain identifier. Must match the directory name in configs/.
Human-readable description of the research domain. Displayed in the web interface.
Data Sources
Path to the Parquet file containing your source articles. Must include columns:
title, content, url, published_date, source_type.
Output Configuration
Directory where extracted entity Parquet files will be saved. Files created:
- people.parquet
- organizations.parquet
- locations.parquet
- events.parquet
Deduplication Configuration
Similarity Thresholds
Default similarity threshold for all entity types (0.0-1.0). Higher values are
more strict, requiring closer matches for deduplication.
Similarity threshold for people. Recommended: 0.80-0.85 for strict name matching.
Similarity threshold for organizations. Lower than people to account for
acronyms and name variants.
Similarity threshold for locations.
Similarity threshold for events. Lower to account for date variations.
Threshold Guidelines:
- 0.90+: Very strict, may leave true duplicates unmerged
- 0.80-0.85: Recommended for people
- 0.75-0.80: Recommended for organizations/locations
- 0.70-0.75: Recommended for events
- Below 0.70: May merge unrelated entities
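The per-type thresholds and the default fallback can be pictured with a small sketch. The dictionary below is hypothetical (its keys and values are illustrative, chosen from the guideline ranges above, and are not the tool's actual config keys), and comparing with `>=` is an assumption about how the threshold is applied.

```python
from typing import Dict

# Hypothetical thresholds for illustration only; "locations" is omitted on
# purpose so that it falls back to the default value.
THRESHOLDS: Dict[str, float] = {
    "default": 0.75,
    "people": 0.82,
    "organizations": 0.78,
    "events": 0.72,
}


def threshold_for(entity_type: str, thresholds: Dict[str, float] = THRESHOLDS) -> float:
    """Return the type-specific threshold, falling back to the default."""
    return thresholds.get(entity_type, thresholds["default"])


def is_duplicate(similarity: float, entity_type: str) -> bool:
    """Assumed rule: a pair is a merge candidate when similarity meets the threshold."""
    return similarity >= threshold_for(entity_type)
```

With these illustrative values, a similarity of 0.80 would merge two organizations but not two people, which is the effect of using a stricter threshold for names.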
Lexical Blocking
Lexical blocking uses fast fuzzy string matching (RapidFuzz) to filter candidates before expensive embedding similarity checks.
Enable lexical blocking for faster deduplication. Highly recommended.
RapidFuzz similarity score cutoff (0-100). Only entities scoring above this
threshold proceed to embedding similarity checks.
Recommended values:
- 70+: Very strict, may miss variants
- 60-70: Recommended for most cases
- 50-60: Looser matching, more candidates
Maximum number of candidate entities to evaluate with embeddings. Limits
computational cost for large entity sets.
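A minimal sketch of the blocking idea: score every existing name lexically, keep those above the cutoff, and cap the number of candidates passed on to the embedding step. To stay dependency-free it uses `difflib.SequenceMatcher` from the standard library as a stand-in for RapidFuzz, so the scores will not match RapidFuzz exactly. The function and parameter names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def lexical_score(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in for RapidFuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_candidates(
    name: str,
    existing: List[str],
    cutoff: float = 60.0,
    max_candidates: int = 50,
) -> List[Tuple[str, float]]:
    """Return up to max_candidates (name, score) pairs scoring above the cutoff."""
    scored = [(other, lexical_score(name, other)) for other in existing]
    # Only names scoring above the cutoff proceed to the embedding check.
    kept = [pair for pair in scored if pair[1] > cutoff]
    # Best lexical matches first, then cap the list to bound embedding cost.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_candidates]
```

Raising the cutoff shrinks the candidate list and speeds up deduplication at the risk of missing variants; lowering it does the opposite.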
Name Variants
Define equivalence groups for names that should always be treated as the same entity.
List of name variant groups. Each group is an array of strings that should be
treated as the same entity. The first name in each group is used as the
canonical name.
Supported entity types:
- organizations
- locations
- people (use sparingly)
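The "first name is canonical" rule can be sketched as a lookup table built from the variant groups. The helper names are hypothetical, and matching case-insensitively is an assumption rather than documented behavior.

```python
from typing import Dict, List


def build_canonical_lookup(groups: List[List[str]]) -> Dict[str, str]:
    """Map every variant to the first name of its group (the canonical name)."""
    lookup: Dict[str, str] = {}
    for group in groups:
        if not group:
            continue  # ignore empty groups
        canonical = group[0]
        for name in group:
            # Assumption: variants are matched case-insensitively.
            lookup[name.casefold()] = canonical
    return lookup


def canonicalize(name: str, lookup: Dict[str, str]) -> str:
    """Return the canonical name for a known variant, else the name unchanged."""
    return lookup.get(name.casefold(), name)
```

For example, a group such as `["United Nations", "UN", "U.N."]` (illustrative data) would resolve both "UN" and "U.N." to "United Nations" before any similarity check runs.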
Performance Configuration
Concurrency Settings
Number of articles to process in parallel. More workers = faster processing
but higher memory usage and API costs.
Recommended values:
- 4-8: Good default for most systems
- 16+: Powerful systems with high API rate limits
- 1-2: Limited API rate limits or memory
Number of entity types to extract in parallel per article. With 4 entity types
(people, organizations, locations, events), setting this to 4 extracts all
types simultaneously.
Maximum concurrent cloud LLM API calls across all workers. Use this to respect
API rate limits.
Recommended values:
- 16: Default for most cloud APIs
- 32+: High rate limit plans
- 4-8: Free tier or low rate limits
Maximum concurrent Ollama calls when using local models (--local flag).
Limited by GPU memory and compute.
Maximum articles buffered between extraction and merge phases. Provides
backpressure to prevent memory overflow.
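A global cap on concurrent LLM calls is commonly implemented with a semaphore shared by all workers. The sketch below illustrates that pattern with `asyncio`; it is not the tool's actual implementation, and `run_limited` and `fake_call` are hypothetical names. The sleep stands in for a network call.

```python
import asyncio


async def run_limited(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks simulated API calls and return the peak number in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal in_flight, peak
        # At most max_concurrent coroutines can be inside this block at once.
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for a network round trip
            in_flight -= 1

    await asyncio.gather(*(fake_call() for _ in range(n_tasks)))
    return peak
```

The buffering limit between extraction and merge can be modeled in the same style with `asyncio.Queue(maxsize=N)`, whose `put()` waits while the queue is full and so slows producers down instead of letting memory grow.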
Batching Configuration
Number of texts to batch in a single embedding API call during merge phase.
Larger batches are more efficient but require more memory.
Recommended values:
- 32: Conservative, low memory
- 64: Default, good balance
- 100+: High-memory systems
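Batching amounts to slicing the list of texts into fixed-size chunks, one chunk per embedding call. A minimal sketch with a hypothetical helper name:

```python
from typing import Iterator, List, Sequence


def batched(texts: Sequence[str], batch_size: int = 64) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size texts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])
```

With 150 texts and a batch size of 64, this yields three batches (64, 64, and 22 texts), so three embedding calls instead of 150.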
Caching Configuration
Global Cache Settings
Master switch for all caching. Disable to force fresh processing (slower).
Embedding Cache
Maximum number of embedding vectors to cache in memory (LRU). Larger cache
reduces API calls but increases memory usage.
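An in-memory LRU cache evicts the least recently used entry once the size limit is reached. The class below is a generic sketch of that policy using `collections.OrderedDict`; the class name and interface are hypothetical and not taken from the tool.

```python
from collections import OrderedDict
from typing import List, Optional


class EmbeddingLRU:
    """Tiny LRU cache mapping text to its embedding vector."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._data: "OrderedDict[str, List[float]]" = OrderedDict()

    def get(self, text: str) -> Optional[List[float]]:
        if text not in self._data:
            return None
        self._data.move_to_end(text)  # mark as most recently used
        return self._data[text]

    def put(self, text: str, vector: List[float]) -> None:
        self._data[text] = vector
        self._data.move_to_end(text)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
```

A hit avoids an embedding API call entirely, which is why a larger cache trades memory for fewer calls.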
Extraction Cache
Enable persistent extraction result caching. Cached extractions are reused
when article content, model, prompt, and schema are unchanged.
Subdirectory under output.directory for extraction cache files.
Cache version number. Increment to invalidate all cached extractions after
changing prompts, entity definitions, or model settings.
Extraction cache is keyed on:
- Article content hash (SHA-256)
- Model name
- Prompt text
- Schema structure
- Temperature setting
- Cache version
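One way to combine the listed fields into a single cache key is to hash a canonical JSON encoding of them. This is a sketch under that assumption; the function name and the exact encoding are hypothetical, and the tool may compose its key differently.

```python
import hashlib
import json
from typing import Any, Dict


def extraction_cache_key(
    content: str,
    model: str,
    prompt: str,
    schema: Dict[str, Any],
    temperature: float,
    cache_version: int,
) -> str:
    """Derive a deterministic key from every input that affects extraction."""
    payload = {
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "model": model,
        "prompt": prompt,
        "schema": schema,
        "temperature": temperature,
        "cache_version": cache_version,
    }
    # sort_keys makes the encoding independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

Because the cache version is part of the key, incrementing it changes every key at once, which is how a version bump invalidates all earlier entries without deleting files.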
Match Check Cache
Enable caching of LLM match check results (used during entity deduplication).
Maximum number of match check results to cache per processing run (in-memory LRU).
Article Cache
Skip processing articles whose content hash matches the last processing run.
Disable to reprocess all articles regardless of changes.
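Skipping unchanged articles comes down to comparing each article's current content hash with the hash recorded in the previous run. In this sketch, keying articles by URL and the helper names are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the article content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def articles_to_process(
    articles: Dict[str, str],         # url -> current content
    previous_hashes: Dict[str, str],  # url -> hash stored by the last run
) -> List[str]:
    """Return URLs whose content is new or changed since the last run."""
    return [
        url
        for url, content in articles.items()
        if previous_hashes.get(url) != content_hash(content)
    ]
```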
Processing Configuration
Enable relevance filtering before extraction. Articles determined not relevant
to your research domain are skipped. Uses the prompts/relevance.md prompt.
Number of articles to process before writing intermediate results. Lower values
provide more frequent progress updates.
Embeddings Configuration
Embedding model selection strategy:
- cloud: Always use cloud API embeddings (Jina AI)
- local: Always use local embeddings (requires PyTorch)
- auto: Use cloud if available, fall back to local
- hybrid: Use both (advanced)
Cloud Embeddings
Cloud embedding model identifier (LiteLLM format). Jina AI recommended for
quality and speed.
Maximum texts per embedding API call.
Number of retries for failed embedding API calls.
API call timeout in seconds.
Local Embeddings
Local embedding model from Hugging Face sentence-transformers. Requires
uv sync --extra local-embeddings.
Batch size for local embedding generation. Limited by GPU memory.
Device for local embeddings:
- auto: Automatically detect best device (CUDA > MPS > CPU)
- cpu: Force CPU (slow but compatible)
- cuda: Force CUDA GPU (requires NVIDIA GPU)
- mps: Force Apple Silicon GPU (requires M1/M2/M3 Mac)
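The auto setting's preference order (CUDA > MPS > CPU) can be written as a small pure function. The function name is hypothetical, and availability is passed in as flags so the sketch runs without PyTorch; in a PyTorch environment they would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Resolve a device setting to a concrete device name."""
    if requested != "auto":
        if requested not in {"cpu", "cuda", "mps"}:
            raise ValueError(f"unknown device: {requested!r}")
        return requested  # an explicit choice is returned unchanged
    # auto: prefer CUDA, then Apple Silicon (MPS), then CPU.
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```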
Merge Evidence Configuration
Maximum total characters of evidence text to include when checking if two
entities should merge. Truncates long articles to save API costs.
Number of characters to include before and after each entity mention as
context evidence.
Maximum number of context windows to extract per article. Prevents extremely
long evidence for entities mentioned many times.
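The three limits interact as follows: each mention contributes a window of surrounding characters, the number of windows is capped, and the combined length is capped as well. The sketch below assumes exact substring matching and truncation of the last window at the character budget; the function and parameter names are hypothetical.

```python
from typing import List


def context_windows(
    text: str,
    mention: str,
    window_chars: int = 100,
    max_windows: int = 3,
    max_chars: int = 500,
) -> List[str]:
    """Collect context snippets around each mention, within both limits."""
    if not mention:
        return []
    windows: List[str] = []
    total = 0
    start = 0
    while len(windows) < max_windows:
        idx = text.find(mention, start)
        if idx == -1:
            break  # no further mentions
        remaining = max_chars - total
        if remaining <= 0:
            break  # overall character budget is used up
        lo = max(0, idx - window_chars)
        hi = min(len(text), idx + len(mention) + window_chars)
        snippet = text[lo:hi][:remaining]  # truncate to the remaining budget
        windows.append(snippet)
        total += len(snippet)
        start = idx + len(mention)
    return windows
```

Smaller windows and fewer of them reduce the text sent with each merge check, which lowers API cost at the expense of context.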
Legacy Settings
Legacy similarity threshold. Superseded by dedup.similarity_thresholds.default.
Kept for backward compatibility.
Complete Example
Here’s a complete config.yaml with all commonly used settings:
Configuration Tips
For Fast Processing
For Low API Costs
For High Accuracy
Next Steps
Creating Domains
Learn how to set up entity types and prompts
Processing Articles
Process your sources with optimized settings
Data Format
Prepare input data in Parquet format
Web Interface
Browse extracted entities