Each research domain has a config.yaml file that controls processing behavior, deduplication settings, performance tuning, and more. This page documents all available configuration options.

Configuration File Location

configs/your_domain/config.yaml
Example domains:
  • configs/guantanamo/config.yaml
  • configs/soviet_afghan_war/config.yaml
  • configs/template/config.yaml.template

Basic Configuration

Domain Identification

domain: "guantanamo"
description: "Guantánamo Bay detention and related issues"
domain
string
required
Domain identifier. Must match the directory name in configs/.
description
string
required
Human-readable description of the research domain. Displayed in the web interface.

Data Sources

data_sources:
  default_path: "data/guantanamo/raw_sources/miami_herald_articles.parquet"
data_sources.default_path
string
required
Path to the Parquet file containing your source articles. Must include columns: title, content, url, published_date, source_type.
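A quick pre-flight check of the schema can save a failed run. A minimal sketch of validating the required columns (with pandas you would pass pd.read_parquet(path).columns; the helper below is hypothetical, not part of the pipeline):

```python
# Required schema for the source Parquet file
REQUIRED_COLUMNS = {"title", "content", "url", "published_date", "source_type"}

def missing_columns(columns):
    """Return any required columns absent from an iterable of column names."""
    return sorted(REQUIRED_COLUMNS - set(columns))
```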

Output Configuration

output:
  directory: "data/guantanamo/entities"
output.directory
string
required
Directory where extracted entity Parquet files will be saved. Files created:
  • people.parquet
  • organizations.parquet
  • locations.parquet
  • events.parquet

Deduplication Configuration

Similarity Thresholds

dedup:
  similarity_thresholds:
    default: 0.75
    people: 0.82
    organizations: 0.78
    locations: 0.80
    events: 0.76
dedup.similarity_thresholds.default
float
default:"0.75"
Default similarity threshold for all entity types (0.0-1.0). Higher values are more strict, requiring closer matches for deduplication.
dedup.similarity_thresholds.people
float
default:"0.82"
Similarity threshold for people. Recommended: 0.80-0.85 for strict name matching.
dedup.similarity_thresholds.organizations
float
default:"0.78"
Similarity threshold for organizations. Lower than people to account for acronyms and name variants.
dedup.similarity_thresholds.locations
float
default:"0.80"
Similarity threshold for locations.
dedup.similarity_thresholds.events
float
default:"0.76"
Similarity threshold for events. Lower to account for date variations.
Threshold Guidelines:
  • 0.90+: Very strict; name variants may fail to merge, leaving duplicates in the output
  • 0.80-0.85: Recommended for people
  • 0.75-0.80: Recommended for organizations/locations
  • 0.70-0.75: Recommended for events
  • Below 0.70: May merge unrelated entities
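To make the thresholds concrete: a sketch of how a threshold would gate a merge decision, assuming the similarity measure is cosine similarity over embedding vectors (a typical choice, though this page does not specify the exact metric):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_duplicate(vec_a, vec_b, threshold=0.75):
    """Two entities are merge candidates when similarity meets the threshold."""
    return cosine_similarity(vec_a, vec_b) >= threshold
```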

Lexical Blocking

Lexical blocking uses fast fuzzy string matching (RapidFuzz) to filter candidates before expensive embedding similarity checks.
dedup:
  lexical_blocking:
    enabled: true
    threshold: 60        # RapidFuzz score cutoff (0-100)
    max_candidates: 50   # Max entities to check with embeddings
dedup.lexical_blocking.enabled
boolean
default:"true"
Enable lexical blocking for faster deduplication. Highly recommended.
dedup.lexical_blocking.threshold
integer
default:"60"
RapidFuzz similarity score cutoff (0-100). Only entities scoring above this threshold proceed to embedding similarity checks. Recommended values:
  • 70+: Very strict, may miss variants
  • 60-70: Recommended for most cases
  • 50-60: Looser matching, more candidates
dedup.lexical_blocking.max_candidates
integer
default:"50"
Maximum number of candidate entities to evaluate with embeddings. Limits computational cost for large entity sets.
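The shape of the blocking step looks roughly like this. The pipeline uses RapidFuzz (where you would call rapidfuzz.process.extract with score_cutoff and limit); the sketch below substitutes stdlib difflib so it stays dependency-free, and the lowercase comparison is an illustrative choice, not confirmed behavior:

```python
from difflib import SequenceMatcher

def block_candidates(name, existing_names, threshold=60, max_candidates=50):
    """Cheap lexical pass: score each existing name 0-100, keep only those
    above the cutoff, capped at max_candidates best-first. Survivors go on
    to the expensive embedding similarity check."""
    scored = []
    for candidate in existing_names:
        score = 100 * SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score >= threshold:
            scored.append((score, candidate))
    scored.sort(reverse=True)
    return [candidate for _, candidate in scored[:max_candidates]]
```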

Name Variants

Define equivalence groups for names that should always be treated as the same entity.
dedup:
  name_variants:
    organizations:
      equivalence_groups:
        - ["Department of Defense", "Defense Department", "DoD", "Pentagon"]
        - ["Central Intelligence Agency", "CIA"]
        - ["Federal Bureau of Investigation", "FBI"]
    locations:
      equivalence_groups:
        - ["Guantanamo Bay", "Guantanamo", "GTMO", "Guantanamo Bay Naval Base"]
        - ["United States", "U.S.", "US"]
dedup.name_variants.{entity_type}.equivalence_groups
array
List of name variant groups. Each group is an array of strings that should be treated as the same entity. The first name in each group is used as the canonical name. Supported entity types:
  • organizations
  • locations
  • people (use sparingly)
Name variant equivalence is applied before similarity checking. Only include variants you’re certain refer to the same entity. False equivalences cannot be corrected without reprocessing.
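Conceptually, the equivalence groups flatten into a lookup from variant to canonical name. A sketch of that mapping (the case-insensitive lookup is an assumption for illustration; the actual matching rules are internal to the pipeline):

```python
EQUIVALENCE_GROUPS = {
    "organizations": [
        ["Department of Defense", "Defense Department", "DoD", "Pentagon"],
        ["Central Intelligence Agency", "CIA"],
    ],
}

def build_canonical_map(groups_by_type):
    """Map every variant to the first (canonical) name in its group."""
    canonical = {}
    for groups in groups_by_type.values():
        for group in groups:
            for variant in group:
                canonical[variant.lower()] = group[0]
    return canonical

def canonicalize(name, canonical):
    """Resolve a name to its canonical form, or return it unchanged."""
    return canonical.get(name.lower(), name)
```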

Performance Configuration

Concurrency Settings

performance:
  concurrency:
    extract_workers: 8        # Parallel articles in extraction phase
    extract_per_article: 4    # Parallel entity types within article
    llm_in_flight: 16         # Max concurrent cloud LLM calls
    ollama_in_flight: 2       # Max concurrent Ollama calls (local mode)
  queue:
    max_buffered_articles: 32  # Backpressure limit for extraction -> merge
performance.concurrency.extract_workers
integer
default:"8"
Number of articles to process in parallel. More workers = faster processing but higher memory usage and API costs. Recommended values:
  • 4-8: Good default for most systems
  • 16+: Powerful systems with high API rate limits
  • 1-2: Limited API rate limits or memory
performance.concurrency.extract_per_article
integer
default:"4"
Number of entity types to extract in parallel per article. With 4 entity types (people, organizations, locations, events), setting this to 4 extracts all types simultaneously.
performance.concurrency.llm_in_flight
integer
default:"16"
Maximum concurrent cloud LLM API calls across all workers. Use this to respect API rate limits. Recommended values:
  • 16: Default for most cloud APIs
  • 32+: High rate limit plans
  • 4-8: Free tier or low rate limits
performance.concurrency.ollama_in_flight
integer
default:"2"
Maximum concurrent Ollama calls when using local models (--local flag). Limited by GPU memory and compute.
performance.queue.max_buffered_articles
integer
default:"32"
Maximum articles buffered between extraction and merge phases. Provides backpressure to prevent memory overflow.
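The backpressure behavior is what a bounded queue gives you for free: when the buffer is full, producers block instead of piling up results in memory. A minimal single-worker sketch of the extraction-to-merge handoff (function names are illustrative, not the pipeline's actual API):

```python
import queue
import threading

def extraction_pipeline(articles, process_article, merge, max_buffered=32):
    """Bounded queue between extraction and merge: put() blocks when the
    buffer is full, so extraction slows down instead of exhausting memory."""
    buf = queue.Queue(maxsize=max_buffered)
    done = object()  # sentinel marking end of stream

    def extractor():
        for article in articles:
            buf.put(process_article(article))  # blocks when buffer is full
        buf.put(done)

    worker = threading.Thread(target=extractor)
    worker.start()
    results = []
    while (item := buf.get()) is not done:
        results.append(merge(item))
    worker.join()
    return results
```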

Batching Configuration

batching:
  embed_batch_size: 64        # Texts per embedding API call
  embed_drain_timeout_ms: 100 # Reserved for future async drain behavior
batching.embed_batch_size
integer
default:"64"
Number of texts to batch in a single embedding API call during merge phase. Larger batches are more efficient but require more memory. Recommended values:
  • 32: Conservative, low memory
  • 64: Default, good balance
  • 100+: High-memory systems
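Batching itself is just slicing the pending texts into fixed-size chunks, one API call per chunk. A sketch (the helper name is illustrative):

```python
def batched(texts, batch_size=64):
    """Yield successive slices of texts, batch_size items at a time.
    Each slice would become one embedding API call."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]
```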

Caching Configuration

Global Cache Settings

cache:
  enabled: true
cache.enabled
boolean
default:"true"
Master switch for all caching. Disable to force fresh processing (slower).

Embedding Cache

cache:
  embeddings:
    lru_max_items: 4096         # In-memory LRU cache size
cache.embeddings.lru_max_items
integer
default:"4096"
Maximum number of embedding vectors to cache in memory (LRU). Larger cache reduces API calls but increases memory usage.
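The effect of the LRU cache is that repeated texts cost one API call instead of many. A stdlib sketch using functools.lru_cache as a stand-in for the internal cache (embed() and the call counter are stubs for illustration):

```python
from functools import lru_cache

calls = {"n": 0}

def embed(text):
    """Stand-in for a real embedding API call."""
    calls["n"] += 1
    return (float(len(text)),)  # dummy vector

@lru_cache(maxsize=4096)  # mirrors cache.embeddings.lru_max_items
def cached_embedding(text):
    """Repeated texts hit the in-memory cache; least-recently-used
    entries are evicted once maxsize is reached."""
    return embed(text)
```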

Extraction Cache

cache:
  extraction:
    enabled: true
    subdir: "cache/extractions"  # Persistent sidecar under output dir
    version: 1                   # Bump to invalidate all cached extractions
cache.extraction.enabled
boolean
default:"true"
Enable persistent extraction result caching. Cached extractions are reused when article content, model, prompt, and schema are unchanged.
cache.extraction.subdir
string
default:"cache/extractions"
Subdirectory under output.directory for extraction cache files.
cache.extraction.version
integer
default:"1"
Cache version number. Increment to invalidate all cached extractions after changing prompts, entity definitions, or model settings.
Extraction cache is keyed on:
  • Article content hash (SHA-256)
  • Model name
  • Prompt text
  • Schema structure
  • Temperature setting
  • Cache version
Changing any of these invalidates the cache for that article.
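A sketch of how a key over those fields could be derived; the actual key layout is internal to the pipeline and may differ, but the invalidation behavior follows from hashing all of them together:

```python
import hashlib
import json

def extraction_cache_key(content, model, prompt, schema, temperature, version=1):
    """Hash all inputs that affect extraction output; changing any one
    of them yields a different key, invalidating the cached result."""
    payload = json.dumps(
        {
            "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
            "model": model,
            "prompt": prompt,
            "schema": schema,
            "temperature": temperature,
            "version": version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```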

Match Check Cache

cache:
  match_check:
    enabled: true
    max_items: 8192              # Per-run LRU cache size
cache.match_check.enabled
boolean
default:"true"
Enable caching of LLM match check results (used during entity deduplication).
cache.match_check.max_items
integer
default:"8192"
Maximum number of match check results to cache per processing run (in-memory LRU).

Article Cache

cache:
  articles:
    skip_if_unchanged: true      # Skip articles whose content hash hasn't changed
cache.articles.skip_if_unchanged
boolean
default:"true"
Skip processing articles whose content hash matches the last processing run. Disable to reprocess all articles regardless of changes.

Processing Configuration

processing:
  relevance_check: true
  batch_size: 5
processing.relevance_check
boolean
default:"true"
Enable relevance filtering before extraction. Articles determined not relevant to your research domain are skipped. Uses the prompts/relevance.md prompt.
processing.batch_size
integer
default:"5"
Number of articles to process before writing intermediate results. Lower values provide more frequent progress updates.

Embeddings Configuration

embeddings:
  mode: cloud  # Options: auto, local, cloud, hybrid
  cloud:
    model: jina_ai/jina-embeddings-v3
    batch_size: 100
    max_retries: 3
    timeout: 30
  local:
    model: sentence-transformers/all-MiniLM-L6-v2
    batch_size: 32
    device: auto  # auto|cpu|cuda|mps
embeddings.mode
string
default:"cloud"
Embedding model selection strategy:
  • cloud: Always use cloud API embeddings (Jina AI)
  • local: Always use local embeddings (requires PyTorch)
  • auto: Use cloud if available, fall back to local
  • hybrid: Use both (advanced)

Cloud Embeddings

embeddings.cloud.model
string
default:"jina_ai/jina-embeddings-v3"
Cloud embedding model identifier (LiteLLM format). Jina AI recommended for quality and speed.
embeddings.cloud.batch_size
integer
default:"100"
Maximum texts per embedding API call.
embeddings.cloud.max_retries
integer
default:"3"
Number of retries for failed embedding API calls.
embeddings.cloud.timeout
integer
default:"30"
API call timeout in seconds.

Local Embeddings

embeddings.local.model
string
default:"sentence-transformers/all-MiniLM-L6-v2"
Local embedding model from Hugging Face sentence-transformers. Requires uv sync --extra local-embeddings.
embeddings.local.batch_size
integer
default:"32"
Batch size for local embedding generation. Limited by GPU memory.
embeddings.local.device
string
default:"auto"
Device for local embeddings:
  • auto: Automatically detect best device (CUDA > MPS > CPU)
  • cpu: Force CPU (slow but compatible)
  • cuda: Force CUDA GPU (requires NVIDIA GPU)
  • mps: Force Apple Silicon GPU (requires M1/M2/M3 Mac)
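The auto resolution order above (CUDA > MPS > CPU) can be sketched as pure selection logic; the availability flags stand in for torch.cuda.is_available() and torch.backends.mps.is_available(), and the function name is illustrative:

```python
def pick_device(requested="auto", cuda_available=False, mps_available=False):
    """Resolve embeddings.local.device: honor an explicit choice,
    otherwise prefer CUDA, then MPS, then fall back to CPU."""
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```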

Merge Evidence Configuration

merge_evidence:
  max_chars: 1500         # Truncate evidence text to this length
  window_chars: 240       # Characters per context window around entity
  max_windows: 3          # Max context windows to extract from article
merge_evidence.max_chars
integer
default:"1500"
Maximum total characters of evidence text to include when checking if two entities should merge. Truncates long articles to save API costs.
merge_evidence.window_chars
integer
default:"240"
Number of characters to include before and after each entity mention as context evidence.
merge_evidence.max_windows
integer
default:"3"
Maximum number of context windows to extract per article. Prevents extremely long evidence for entities mentioned many times.
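Putting the three settings together, evidence gathering amounts to clipping context windows around each mention and capping the total. A sketch of that behavior, not the actual implementation (the exact mention matching and window joining may differ):

```python
def evidence_windows(text, entity, window_chars=240, max_windows=3, max_chars=1500):
    """Collect up to max_windows snippets of +/- window_chars around each
    mention of entity, then truncate the joined evidence to max_chars."""
    windows = []
    start = 0
    while len(windows) < max_windows:
        idx = text.find(entity, start)
        if idx == -1:
            break
        lo = max(0, idx - window_chars)
        hi = min(len(text), idx + len(entity) + window_chars)
        windows.append(text[lo:hi])
        start = idx + len(entity)
    return " ... ".join(windows)[:max_chars]
```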

Legacy Settings

similarity_threshold: 0.75
similarity_threshold
float
default:"0.75"
Legacy similarity threshold. Superseded by dedup.similarity_thresholds.default. Kept for backward compatibility.

Complete Example

Here’s a complete config.yaml with all commonly used settings:
# Domain identification
domain: "guantanamo"
description: "Guantánamo Bay detention and related issues"

# Data sources
data_sources:
  default_path: "data/guantanamo/raw_sources/miami_herald_articles.parquet"

# Output configuration  
output:
  directory: "data/guantanamo/entities"

# Legacy threshold (superseded by dedup.similarity_thresholds)
similarity_threshold: 0.75

# Deduplication configuration
dedup:
  similarity_thresholds:
    default: 0.75
    people: 0.82
    organizations: 0.78
    locations: 0.80
    events: 0.76
  lexical_blocking:
    enabled: true
    threshold: 60
    max_candidates: 50
  name_variants:
    organizations:
      equivalence_groups:
        - ["Department of Defense", "Defense Department", "DoD", "Pentagon"]
        - ["Central Intelligence Agency", "CIA"]
    locations:
      equivalence_groups:
        - ["Guantanamo Bay", "Guantanamo", "GTMO"]

# Performance configuration
performance:
  concurrency:
    extract_workers: 8
    extract_per_article: 4
    llm_in_flight: 16
    ollama_in_flight: 2
  queue:
    max_buffered_articles: 32

# Batching configuration
batching:
  embed_batch_size: 64
  embed_drain_timeout_ms: 100

# Caching configuration
cache:
  enabled: true
  embeddings:
    lru_max_items: 4096
  extraction:
    enabled: true
    subdir: "cache/extractions"
    version: 1
  match_check:
    enabled: true
    max_items: 8192
  articles:
    skip_if_unchanged: true

# Merge evidence configuration
merge_evidence:
  max_chars: 1500
  window_chars: 240
  max_windows: 3

# Processing configuration
processing:
  relevance_check: true
  batch_size: 5

# Embeddings configuration
embeddings:
  mode: cloud
  cloud:
    model: jina_ai/jina-embeddings-v3
    batch_size: 100
    max_retries: 3
    timeout: 30
  local:
    model: sentence-transformers/all-MiniLM-L6-v2
    batch_size: 32
    device: auto

Configuration Tips

For Fast Processing

performance:
  concurrency:
    extract_workers: 16
    llm_in_flight: 32

cache:
  extraction:
    enabled: true

For Low API Costs

performance:
  concurrency:
    extract_workers: 4
    llm_in_flight: 8

processing:
  relevance_check: true  # Skip irrelevant articles

embeddings:
  mode: local  # Use local embeddings

For High Accuracy

dedup:
  similarity_thresholds:
    people: 0.85  # Stricter matching
    organizations: 0.82
  lexical_blocking:
    threshold: 70  # More strict filtering

processing:
  relevance_check: true

Next Steps

Creating Domains

Learn how to set up entity types and prompts

Processing Articles

Process your sources with optimized settings

Data Format

Prepare input data in Parquet format

Web Interface

Browse extracted entities
