Each research domain has a config.yaml file that controls processing behavior, deduplication settings, performance tuning, and more. This page documents all available configuration options.

Configuration File Location

configs/your_domain/config.yaml
Example domains:
  • configs/guantanamo/config.yaml
  • configs/soviet_afghan_war/config.yaml
  • configs/template/config.yaml.template

Basic Configuration

Domain Identification

domain: "guantanamo"
description: "Guantánamo Bay detention and related issues"
domain
string
required
Domain identifier. Must match the directory name in configs/.
description
string
required
Human-readable description of the research domain. Displayed in the web interface.

Data Sources

data_sources:
  default_path: "data/guantanamo/raw_sources/miami_herald_articles.parquet"
data_sources.default_path
string
required
Path to the Parquet file containing your source articles. Must include columns: title, content, url, published_date, source_type.
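A quick pre-flight check of the schema can save a failed run. A minimal sketch of validating the required columns (with pandas you would pass pd.read_parquet(path).columns; the helper below is hypothetical, not part of the pipeline):

```python
# Required schema for the source Parquet file
REQUIRED_COLUMNS = {"title", "content", "url", "published_date", "source_type"}

def missing_columns(columns):
    """Return any required columns absent from an iterable of column names."""
    return sorted(REQUIRED_COLUMNS - set(columns))
```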

Output Configuration

output:
  directory: "data/guantanamo/entities"
output.directory
string
required
Directory where extracted entity Parquet files will be saved. Files created:
  • people.parquet
  • organizations.parquet
  • locations.parquet
  • events.parquet

Deduplication Configuration

Similarity Thresholds

dedup:
  similarity_thresholds:
    default: 0.75
    people: 0.82
    organizations: 0.78
    locations: 0.80
    events: 0.76
dedup.similarity_thresholds.default
float
default:"0.75"
Default similarity threshold for all entity types (0.0-1.0). Higher values are more strict, requiring closer matches for deduplication.
dedup.similarity_thresholds.people
float
default:"0.82"
Similarity threshold for people. Recommended: 0.80-0.85 for strict name matching.
dedup.similarity_thresholds.organizations
float
default:"0.78"
Similarity threshold for organizations. Lower than people to account for acronyms and name variants.
dedup.similarity_thresholds.locations
float
default:"0.80"
Similarity threshold for locations.
dedup.similarity_thresholds.events
float
default:"0.76"
Similarity threshold for events. Lower to account for date variations.
Threshold Guidelines:
  • 0.90+: Very strict; name variants may fail to merge, leaving duplicates in the output
  • 0.80-0.85: Recommended for people
  • 0.75-0.80: Recommended for organizations/locations
  • 0.70-0.75: Recommended for events
  • Below 0.70: May merge unrelated entities
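To make the thresholds concrete: a sketch of how a threshold would gate a merge decision, assuming the similarity measure is cosine similarity over embedding vectors (a typical choice, though this page does not specify the exact metric):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_duplicate(vec_a, vec_b, threshold=0.75):
    """Two entities are merge candidates when similarity meets the threshold."""
    return cosine_similarity(vec_a, vec_b) >= threshold
```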

Lexical Blocking

Lexical blocking uses fast fuzzy string matching (RapidFuzz) to filter candidates before expensive embedding similarity checks.
dedup:
  lexical_blocking:
    enabled: true
    threshold: 60        # RapidFuzz score cutoff (0-100)
    max_candidates: 50   # Max entities to check with embeddings
dedup.lexical_blocking.enabled
boolean
default:"true"
Enable lexical blocking for faster deduplication. Highly recommended.
dedup.lexical_blocking.threshold
integer
default:"60"
RapidFuzz similarity score cutoff (0-100). Only entities scoring above this threshold proceed to embedding similarity checks. Recommended values:
  • 70+: Very strict, may miss variants
  • 60-70: Recommended for most cases
  • 50-60: Looser matching, more candidates
dedup.lexical_blocking.max_candidates
integer
default:"50"
Maximum number of candidate entities to evaluate with embeddings. Limits computational cost for large entity sets.
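The shape of the blocking step looks roughly like this. The pipeline uses RapidFuzz (where you would call rapidfuzz.process.extract with score_cutoff and limit); the sketch below substitutes stdlib difflib so it stays dependency-free, and the lowercase comparison is an illustrative choice, not confirmed behavior:

```python
from difflib import SequenceMatcher

def block_candidates(name, existing_names, threshold=60, max_candidates=50):
    """Cheap lexical pass: score each existing name 0-100, keep only those
    above the cutoff, capped at max_candidates best-first. Survivors go on
    to the expensive embedding similarity check."""
    scored = []
    for candidate in existing_names:
        score = 100 * SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score >= threshold:
            scored.append((score, candidate))
    scored.sort(reverse=True)
    return [candidate for _, candidate in scored[:max_candidates]]
```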

Name Variants

Define equivalence groups for names that should always be treated as the same entity.
dedup:
  name_variants:
    organizations:
      equivalence_groups:
        - ["Department of Defense", "Defense Department", "DoD", "Pentagon"]
        - ["Central Intelligence Agency", "CIA"]
        - ["Federal Bureau of Investigation", "FBI"]
    locations:
      equivalence_groups:
        - ["Guantanamo Bay", "Guantanamo", "GTMO", "Guantanamo Bay Naval Base"]
        - ["United States", "U.S.", "US"]
dedup.name_variants.{entity_type}.equivalence_groups
array
List of name variant groups. Each group is an array of strings that should be treated as the same entity. The first name in each group is used as the canonical name. Supported entity types:
  • organizations
  • locations
  • people (use sparingly)
Name variant equivalence is applied before similarity checking. Only include variants you’re certain refer to the same entity. False equivalences cannot be corrected without reprocessing.
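Conceptually, the equivalence groups flatten into a lookup from variant to canonical name. A sketch of that mapping (the case-insensitive lookup is an assumption for illustration; the actual matching rules are internal to the pipeline):

```python
EQUIVALENCE_GROUPS = {
    "organizations": [
        ["Department of Defense", "Defense Department", "DoD", "Pentagon"],
        ["Central Intelligence Agency", "CIA"],
    ],
}

def build_canonical_map(groups_by_type):
    """Map every variant to the first (canonical) name in its group."""
    canonical = {}
    for groups in groups_by_type.values():
        for group in groups:
            for variant in group:
                canonical[variant.lower()] = group[0]
    return canonical

def canonicalize(name, canonical):
    """Resolve a name to its canonical form, or return it unchanged."""
    return canonical.get(name.lower(), name)
```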

Performance Configuration

Concurrency Settings

performance:
  concurrency:
    extract_workers: 8        # Parallel articles in extraction phase
    extract_per_article: 4    # Parallel entity types within article
    llm_in_flight: 16         # Max concurrent cloud LLM calls
    ollama_in_flight: 2       # Max concurrent Ollama calls (local mode)
  queue:
    max_buffered_articles: 32  # Backpressure limit for extraction -> merge
performance.concurrency.extract_workers
integer
default:"8"
Number of articles to process in parallel. More workers = faster processing but higher memory usage and API costs. Recommended values:
  • 4-8: Good default for most systems
  • 16+: Powerful systems with high API rate limits
  • 1-2: Limited API rate limits or memory
performance.concurrency.extract_per_article
integer
default:"4"
Number of entity types to extract in parallel per article. With 4 entity types (people, organizations, locations, events), setting this to 4 extracts all types simultaneously.
performance.concurrency.llm_in_flight
integer
default:"16"
Maximum concurrent cloud LLM API calls across all workers. Use this to respect API rate limits. Recommended values:
  • 16: Default for most cloud APIs
  • 32+: High rate limit plans
  • 4-8: Free tier or low rate limits
performance.concurrency.ollama_in_flight
integer
default:"2"
Maximum concurrent Ollama calls when using local models (--local flag). Limited by GPU memory and compute.
performance.queue.max_buffered_articles
integer
default:"32"
Maximum articles buffered between extraction and merge phases. Provides backpressure to prevent memory overflow.
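The backpressure behavior is what a bounded queue gives you for free: when the buffer is full, producers block instead of piling up results in memory. A minimal single-worker sketch of the extraction-to-merge handoff (function names are illustrative, not the pipeline's actual API):

```python
import queue
import threading

def extraction_pipeline(articles, process_article, merge, max_buffered=32):
    """Bounded queue between extraction and merge: put() blocks when the
    buffer is full, so extraction slows down instead of exhausting memory."""
    buf = queue.Queue(maxsize=max_buffered)
    done = object()  # sentinel marking end of stream

    def extractor():
        for article in articles:
            buf.put(process_article(article))  # blocks when buffer is full
        buf.put(done)

    worker = threading.Thread(target=extractor)
    worker.start()
    results = []
    while (item := buf.get()) is not done:
        results.append(merge(item))
    worker.join()
    return results
```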

Batching Configuration

batching:
  embed_batch_size: 64        # Texts per embedding API call
  embed_drain_timeout_ms: 100 # Reserved for future async drain behavior
batching.embed_batch_size
integer
default:"64"
Number of texts to batch in a single embedding API call during merge phase. Larger batches are more efficient but require more memory. Recommended values:
  • 32: Conservative, low memory
  • 64: Default, good balance
  • 100+: High-memory systems
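Batching itself is just slicing the pending texts into fixed-size chunks, one API call per chunk. A sketch (the helper name is illustrative):

```python
def batched(texts, batch_size=64):
    """Yield successive slices of texts, batch_size items at a time.
    Each slice would become one embedding API call."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]
```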

Caching Configuration

Global Cache Settings

cache:
  enabled: true
cache.enabled
boolean
default:"true"
Master switch for all caching. Disable to force fresh processing (slower).

Embedding Cache

cache:
  embeddings:
    lru_max_items: 4096         # In-memory LRU cache size
cache.embeddings.lru_max_items
integer
default:"4096"
Maximum number of embedding vectors to cache in memory (LRU). Larger cache reduces API calls but increases memory usage.
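The effect of the LRU cache is that repeated texts cost one API call instead of many. A stdlib sketch using functools.lru_cache as a stand-in for the internal cache (embed() and the call counter are stubs for illustration):

```python
from functools import lru_cache

calls = {"n": 0}

def embed(text):
    """Stand-in for a real embedding API call."""
    calls["n"] += 1
    return (float(len(text)),)  # dummy vector

@lru_cache(maxsize=4096)  # mirrors cache.embeddings.lru_max_items
def cached_embedding(text):
    """Repeated texts hit the in-memory cache; least-recently-used
    entries are evicted once maxsize is reached."""
    return embed(text)
```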

Extraction Cache

cache:
  extraction:
    enabled: true
    subdir: "cache/extractions"  # Persistent sidecar under output dir
    version: 1                   # Bump to invalidate all cached extractions
cache.extraction.enabled
boolean
default:"true"
Enable persistent extraction result caching. Cached extractions are reused when article content, model, prompt, and schema are unchanged.
cache.extraction.subdir
string
default:"cache/extractions"
Subdirectory under output.directory for extraction cache files.
cache.extraction.version
integer
default:"1"
Cache version number. Increment to invalidate all cached extractions after changing prompts, entity definitions, or model settings.
Extraction cache is keyed on:
  • Article content hash (SHA-256)
  • Model name
  • Prompt text
  • Schema structure
  • Temperature setting
  • Cache version
Changing any of these invalidates the cache for that article.
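A sketch of how a key over those fields could be derived; the actual key layout is internal to the pipeline and may differ, but the invalidation behavior follows from hashing all of them together:

```python
import hashlib
import json

def extraction_cache_key(content, model, prompt, schema, temperature, version=1):
    """Hash all inputs that affect extraction output; changing any one
    of them yields a different key, invalidating the cached result."""
    payload = json.dumps(
        {
            "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
            "model": model,
            "prompt": prompt,
            "schema": schema,
            "temperature": temperature,
            "version": version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```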

Match Check Cache

cache:
  match_check:
    enabled: true
    max_items: 8192              # Per-run LRU cache size
cache.match_check.enabled
boolean
default:"true"
Enable caching of LLM match check results (used during entity deduplication).
cache.match_check.max_items
integer
default:"8192"
Maximum number of match check results to cache per processing run (in-memory LRU).

Article Cache

cache:
  articles:
    skip_if_unchanged: true      # Skip articles whose content hash hasn't changed
cache.articles.skip_if_unchanged
boolean
default:"true"
Skip processing articles whose content hash matches the last processing run. Disable to reprocess all articles regardless of changes.

Processing Configuration

processing:
  relevance_check: true
  batch_size: 5
processing.relevance_check
boolean
default:"true"
Enable relevance filtering before extraction. Articles determined not relevant to your research domain are skipped. Uses the prompts/relevance.md prompt.
processing.batch_size
integer
default:"5"
Number of articles to process before writing intermediate results. Lower values provide more frequent progress updates.

Embeddings Configuration

embeddings:
  mode: cloud  # Options: auto, local, cloud, hybrid
  cloud:
    model: jina_ai/jina-embeddings-v3
    batch_size: 100
    max_retries: 3
    timeout: 30
  local:
    model: sentence-transformers/all-MiniLM-L6-v2
    batch_size: 32
    device: auto  # auto|cpu|cuda|mps
embeddings.mode
string
default:"cloud"
Embedding model selection strategy:
  • cloud: Always use cloud API embeddings (Jina AI)
  • local: Always use local embeddings (requires PyTorch)
  • auto: Use cloud if available, fall back to local
  • hybrid: Use both (advanced)

Cloud Embeddings

embeddings.cloud.model
string
default:"jina_ai/jina-embeddings-v3"
Cloud embedding model identifier (LiteLLM format). Jina AI recommended for quality and speed.
embeddings.cloud.batch_size
integer
default:"100"
Maximum texts per embedding API call.
embeddings.cloud.max_retries
integer
default:"3"
Number of retries for failed embedding API calls.
embeddings.cloud.timeout
integer
default:"30"
API call timeout in seconds.

Local Embeddings

embeddings.local.model
string
default:"sentence-transformers/all-MiniLM-L6-v2"
Local embedding model from Hugging Face sentence-transformers. Requires uv sync --extra local-embeddings.
embeddings.local.batch_size
integer
default:"32"
Batch size for local embedding generation. Limited by GPU memory.
embeddings.local.device
string
default:"auto"
Device for local embeddings:
  • auto: Automatically detect best device (CUDA > MPS > CPU)
  • cpu: Force CPU (slow but compatible)
  • cuda: Force CUDA GPU (requires NVIDIA GPU)
  • mps: Force Apple Silicon GPU (requires M1/M2/M3 Mac)
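The auto resolution order above (CUDA > MPS > CPU) can be sketched as pure selection logic; the availability flags stand in for torch.cuda.is_available() and torch.backends.mps.is_available(), and the function name is illustrative:

```python
def pick_device(requested="auto", cuda_available=False, mps_available=False):
    """Resolve embeddings.local.device: honor an explicit choice,
    otherwise prefer CUDA, then MPS, then fall back to CPU."""
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```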

Merge Evidence Configuration

merge_evidence:
  max_chars: 1500         # Truncate evidence text to this length
  window_chars: 240       # Characters per context window around entity
  max_windows: 3          # Max context windows to extract from article
merge_evidence.max_chars
integer
default:"1500"
Maximum total characters of evidence text to include when checking if two entities should merge. Truncates long articles to save API costs.
merge_evidence.window_chars
integer
default:"240"
Number of characters to include before and after each entity mention as context evidence.
merge_evidence.max_windows
integer
default:"3"
Maximum number of context windows to extract per article. Prevents extremely long evidence for entities mentioned many times.
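Putting the three settings together, evidence gathering amounts to clipping context windows around each mention and capping the total. A sketch of that behavior, not the actual implementation (the exact mention matching and window joining may differ):

```python
def evidence_windows(text, entity, window_chars=240, max_windows=3, max_chars=1500):
    """Collect up to max_windows snippets of +/- window_chars around each
    mention of entity, then truncate the joined evidence to max_chars."""
    windows = []
    start = 0
    while len(windows) < max_windows:
        idx = text.find(entity, start)
        if idx == -1:
            break
        lo = max(0, idx - window_chars)
        hi = min(len(text), idx + len(entity) + window_chars)
        windows.append(text[lo:hi])
        start = idx + len(entity)
    return " ... ".join(windows)[:max_chars]
```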

Legacy Settings

similarity_threshold: 0.75
similarity_threshold
float
default:"0.75"
Legacy similarity threshold. Superseded by dedup.similarity_thresholds.default. Kept for backward compatibility.

Complete Example

Here’s a complete config.yaml with all commonly used settings:
# Domain identification
domain: "guantanamo"
description: "Guantánamo Bay detention and related issues"

# Data sources
data_sources:
  default_path: "data/guantanamo/raw_sources/miami_herald_articles.parquet"

# Output configuration  
output:
  directory: "data/guantanamo/entities"

# Legacy threshold (superseded by dedup.similarity_thresholds)
similarity_threshold: 0.75

# Deduplication configuration
dedup:
  similarity_thresholds:
    default: 0.75
    people: 0.82
    organizations: 0.78
    locations: 0.80
    events: 0.76
  lexical_blocking:
    enabled: true
    threshold: 60
    max_candidates: 50
  name_variants:
    organizations:
      equivalence_groups:
        - ["Department of Defense", "Defense Department", "DoD", "Pentagon"]
        - ["Central Intelligence Agency", "CIA"]
    locations:
      equivalence_groups:
        - ["Guantanamo Bay", "Guantanamo", "GTMO"]

# Performance configuration
performance:
  concurrency:
    extract_workers: 8
    extract_per_article: 4
    llm_in_flight: 16
    ollama_in_flight: 2
  queue:
    max_buffered_articles: 32

# Batching configuration
batching:
  embed_batch_size: 64
  embed_drain_timeout_ms: 100

# Caching configuration
cache:
  enabled: true
  embeddings:
    lru_max_items: 4096
  extraction:
    enabled: true
    subdir: "cache/extractions"
    version: 1
  match_check:
    enabled: true
    max_items: 8192
  articles:
    skip_if_unchanged: true

# Merge evidence configuration
merge_evidence:
  max_chars: 1500
  window_chars: 240
  max_windows: 3

# Processing configuration
processing:
  relevance_check: true
  batch_size: 5

# Embeddings configuration
embeddings:
  mode: cloud
  cloud:
    model: jina_ai/jina-embeddings-v3
    batch_size: 100
    max_retries: 3
    timeout: 30
  local:
    model: sentence-transformers/all-MiniLM-L6-v2
    batch_size: 32
    device: auto

Configuration Tips

For Fast Processing

performance:
  concurrency:
    extract_workers: 16
    llm_in_flight: 32

cache:
  extraction:
    enabled: true

For Low API Costs

performance:
  concurrency:
    extract_workers: 4
    llm_in_flight: 8

processing:
  relevance_check: true  # Skip irrelevant articles

embeddings:
  mode: local  # Use local embeddings

For High Accuracy

dedup:
  similarity_thresholds:
    people: 0.85  # Stricter matching
    organizations: 0.82
  lexical_blocking:
    threshold: 70  # More strict filtering

processing:
  relevance_check: true

Next Steps

Creating Domains

Learn how to set up entity types and prompts

Processing Articles

Process your sources with optimized settings

Data Format

Prepare input data in Parquet format

Web Interface

Browse extracted entities
