Hinbox uses domain-specific configurations to adapt the extraction pipeline to different historical research areas. Each domain defines its own entity types, prompts, and processing parameters without requiring code changes.

Domain Structure

A domain configuration consists of these files:
configs/<domain>/
├── config.yaml              # Main configuration
├── categories/              # Entity type definitions
│   ├── people.yaml
│   ├── organizations.yaml
│   ├── locations.yaml
│   └── events.yaml
└── prompts/                 # LLM extraction prompts
    ├── people.md
    ├── organizations.md
    ├── locations.md
    ├── events.md
    ├── relevance.md
    ├── profile_generation.md
    ├── profile_update.md
    └── profile_reflection.md

Creating a New Domain

1. Initialize the domain from the template:

   just init palestine_food_history

   This copies the template to configs/palestine_food_history/ with generic categories and prompts.

2. Configure domain settings: edit config.yaml to set the research focus, data paths, and thresholds.

3. Define entity types: edit the categories/*.yaml files to specify the entity types and tags relevant to your research domain.

4. Customize extraction prompts: edit the prompts/*.md files to provide domain-specific extraction instructions.

5. Test with sample articles:

   just process --domain palestine_food_history --limit 2 --verbose

Main Configuration (config.yaml)

The main configuration file controls data paths, thresholds, and performance settings.

Basic Settings

domain: "guantanamo"
description: "Guantánamo Bay detention and related issues"

# Data source configuration
data_sources:
  default_path: "data/guantanamo/raw_sources/miami_herald_articles.parquet"

# Output configuration  
output:
  directory: "data/guantanamo/entities"
domain (string, required): Unique identifier for this research domain (lowercase, underscores allowed)
description (string): Brief description of the research focus
data_sources.default_path (string, required): Path to the source articles Parquet file
output.directory (string, required): Directory for entity Parquet files and cache

Deduplication Configuration

Control how entities are deduplicated during the merge stage:
dedup:
  similarity_thresholds:
    default: 0.75
    people: 0.82
    organizations: 0.78
    locations: 0.80
    events: 0.76
  
  lexical_blocking:
    enabled: true
    threshold: 60        # RapidFuzz score cutoff (0-100)
    max_candidates: 50   # max entities to run cosine similarity against
  
  name_variants:
    organizations:
      equivalence_groups:
        - ["Department of Defense", "Defense Department", "DoD", "Pentagon"]
        - ["American Civil Liberties Union", "ACLU"]
    locations:
      equivalence_groups:
        - ["Guantanamo Bay", "Guantanamo", "GTMO"]
        - ["United States", "U.S.", "US"]
Similarity thresholds are cosine similarity values (0.0-1.0) for embedding-based merge decisions. Higher values require stronger similarity before merging.

Recommendations:
  • People: 0.80-0.85 (names vary widely, require high confidence)
  • Organizations: 0.75-0.80 (moderate variation)
  • Locations: 0.78-0.82 (geographic specificity important)
  • Events: 0.70-0.78 (temporal context helps disambiguation)
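The threshold lookup can be sketched as follows. This is an illustration only, not the pipeline's actual merge code: `cosine` and `should_merge` are hypothetical names, and the real scorer lives in the merge stage.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def should_merge(vec_a, vec_b, entity_type, thresholds):
    # Per-type threshold, falling back to "default" when none is set.
    threshold = thresholds.get(entity_type, thresholds["default"])
    return cosine(vec_a, vec_b) >= threshold

thresholds = {"default": 0.75, "people": 0.82}
should_merge([1.0, 0.0], [1.0, 0.0], "people", thresholds)  # identical vectors clear any threshold
```

Raising `people` to 0.82 means two person embeddings must be noticeably closer than the default 0.75 before a merge is proposed.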
Lexical blocking is fast fuzzy string matching that narrows merge candidates before the expensive embedding/LLM checks.
  • enabled: Enable/disable lexical blocking (default: true)
  • threshold: RapidFuzz score 0-100 (default: 60)
  • max_candidates: Maximum entities to pass to the embedding stage (default: 50)

Location: src/engine/mergers.py (uses the RapidFuzz library)
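The blocking step amounts to "score every known name, keep the top matches". A minimal sketch, using the standard library's difflib as a dependency-free stand-in for RapidFuzz (both yield a 0-100-style similarity; the real implementation in src/engine/mergers.py uses RapidFuzz scorers and may rank differently):

```python
import difflib

def block_candidates(new_name, existing_names, threshold=60, max_candidates=50):
    """Return the best fuzzy matches for new_name, capped at max_candidates."""
    scored = [
        (name, 100 * difflib.SequenceMatcher(None, new_name.lower(), name.lower()).ratio())
        for name in existing_names
    ]
    passing = sorted(
        ((name, score) for name, score in scored if score >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [name for name, _ in passing[:max_candidates]]

block_candidates("Dept. of Defense", ["Department of Defense", "Carol Rosenberg"])
```

Only the names that survive this filter are sent on for embedding similarity, which is what keeps `max_candidates` a hard cap on per-entity merge cost.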
Name variants define known aliases for instant blocking without LLM calls. When any name in a group is encountered, all variants are considered equivalent.

Use cases:
  • Government agencies: ["DoD", "Department of Defense", "Pentagon"]
  • Acronyms: ["ACLU", "American Civil Liberties Union"]
  • Geographic variants: ["U.S.", "United States", "USA"]

Location: src/utils/name_variants.py:names_likely_same
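Conceptually, an equivalence-group check is a case-insensitive set-membership test. The sketch below is illustrative only: `in_same_group` is a hypothetical helper, and the actual function is names_likely_same in src/utils/name_variants.py, whose signature is not shown here.

```python
# Groups mirror the dedup.name_variants config shown above.
EQUIVALENCE_GROUPS = [
    ["Department of Defense", "Defense Department", "DoD", "Pentagon"],
    ["American Civil Liberties Union", "ACLU"],
]

def in_same_group(name_a, name_b, groups=EQUIVALENCE_GROUPS):
    """True if both names appear in the same configured equivalence group."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    for group in groups:
        lowered = {name.lower() for name in group}
        if a in lowered and b in lowered:
            return True
    return False
```

Because this is a pure lookup, matches here can short-circuit both the embedding comparison and any LLM match check.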

Performance Configuration

performance:
  concurrency:
    extract_workers: 8        # parallel articles in extraction phase
    extract_per_article: 4    # parallel entity types within article
    llm_in_flight: 16         # max concurrent cloud LLM calls
    ollama_in_flight: 2       # max concurrent Ollama calls (local mode)
  queue:
    max_buffered_articles: 32  # backpressure limit for extraction → merge

# Batching configuration (for embedding calls during merge)
batching:
  embed_batch_size: 64        # texts per embedding API call
  embed_drain_timeout_ms: 100 # reserved for future async drain behaviour
Tuning guidelines:
  • extract_workers: Set to number of CPU cores for local mode, or 8-16 for cloud APIs
  • extract_per_article: Usually 4 (all entity types in parallel), set to 1 for debugging
  • llm_in_flight: Balance between throughput and API rate limits (Gemini: 16-32)
  • ollama_in_flight: Keep low (1-2) to avoid overwhelming local GPU
  • embed_batch_size: Larger batches reduce API calls but increase latency (32-128)
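One common way to enforce an in-flight cap like `llm_in_flight` is a semaphore wrapped around the API call while a thread pool fans out the work. This is a sketch of the pattern under assumed names, not the pipeline's actual implementation:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Mirrors performance.concurrency.llm_in_flight: at most 16 calls at once.
LLM_IN_FLIGHT = threading.Semaphore(16)

def call_llm(prompt):
    with LLM_IN_FLIGHT:  # blocks if 16 calls are already in flight
        return f"response:{prompt}"  # stand-in for the real API call

# extract_workers-style fan-out over articles; map preserves input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(call_llm, [f"article-{i}" for i in range(4)]))
```

Separating the two knobs lets you keep many extraction workers busy (CPU-side parsing, caching) while the semaphore alone bounds pressure on the LLM provider.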

Caching Configuration

cache:
  enabled: true
  embeddings:
    lru_max_items: 4096         # in-memory LRU for embedding vectors
  extraction:
    enabled: true
    subdir: "cache/extractions"  # persistent sidecar under output dir
    version: 1                   # bump to invalidate all cached extractions
  match_check:
    enabled: true
    max_items: 8192              # per-run LRU for match-checker results
  articles:
    skip_if_unchanged: true      # skip articles whose content hash hasn't changed
Cache invalidation: Bump cache.extraction.version when you:
  • Change extraction prompts
  • Modify entity type definitions
  • Upgrade to a new LLM model
  • Change temperature or other generation parameters
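Why does bumping the version invalidate everything? If the version number is folded into each cache key, every key changes at once and old entries simply stop matching. A sketch of such a version-aware key (an assumed scheme; the pipeline's real key format may differ):

```python
import hashlib
import json

def extraction_cache_key(article_text, entity_type, prompt, version=1):
    """Deterministic cache key covering the article, prompt, and config version."""
    payload = json.dumps(
        {
            "v": version,  # cache.extraction.version
            "type": entity_type,
            "prompt": hashlib.sha256(prompt.encode()).hexdigest(),
            "article": hashlib.sha256(article_text.encode()).hexdigest(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Bumping the version changes every key, so stale extractions are never reused.
old = extraction_cache_key("text", "people", "prompt", version=1)
new = extraction_cache_key("text", "people", "prompt", version=2)
```

The same logic explains `articles.skip_if_unchanged`: hashing the article content means an unchanged article produces the same key and can be skipped outright.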

Merge Evidence Configuration

Control how evidence text is built for embedding-based similarity:
merge_evidence:
  max_chars: 1500         # truncate evidence text to this length
  window_chars: 240       # characters per context window around mentions
  max_windows: 3          # max context windows to extract from article
Location: src/engine/mergers.py:_build_evidence_text

Evidence text format:
"John Doe, detainee, described in source as: 'John Doe was 
transferred to Guantanamo in 2002 after being captured in 
Afghanistan...' [240 chars] '...Doe's lawyer filed a habeas 
corpus petition...' [240 chars]"
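The windowing behind that format can be sketched as follows. This is a simplified illustration under the `merge_evidence` settings; the authoritative version is `_build_evidence_text` in src/engine/mergers.py, which also prepends the entity name and type.

```python
def context_windows(article, name, window_chars=240, max_windows=3, max_chars=1500):
    """Collect up to max_windows snippets around mentions of name, then truncate."""
    windows, start = [], 0
    while len(windows) < max_windows:
        idx = article.find(name, start)
        if idx == -1:
            break
        half = window_chars // 2
        # Take window_chars of context centred on the mention.
        windows.append(article[max(0, idx - half): idx + len(name) + half])
        start = idx + len(name)  # continue searching after this mention
    return " ... ".join(windows)[:max_chars]
```

Keeping the evidence short (`max_chars: 1500`) bounds embedding cost while still giving the similarity model the surrounding sentences that disambiguate the mention.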

Category Definitions (categories/*.yaml)

Each entity type has a YAML file defining valid types and tags.

Structure

person_types:
  detainee:
    description: "A person who is or was detained at Guantánamo Bay"
    examples: ["Mohamedou Ould Slahi", "David Hicks"]
  
  lawyer:
    description: "Attorneys, legal representatives, and legal professionals"
    examples: ["Clive Stafford Smith", "Gitanjali Gutierrez"]
  
  journalist:
    description: "Reporters, writers, and media professionals"
    examples: ["Carol Rosenberg", "Andy Worthington"]

person_tags:
  civil_rights:
    description: "People involved in civil rights advocacy"
    examples: ["ACLU lawyers", "Human rights activists"]
  
  defense:
    description: "People involved in defense or military defense roles"
    examples: ["Defense attorneys", "Military defense counsel"]
Naming:
  • File names: people.yaml, organizations.yaml, locations.yaml, events.yaml
  • Type keys: person_types, organization_types, location_types, event_types
  • Tag keys: person_tags, event_tags (organizations/locations don’t have tags)
Field requirements:
  • description: Clear, concise definition of the type/tag
  • examples: 2-3 real examples from your domain
Type naming:
  • Use lowercase with underscores: detention_facility, military_operation
  • Be specific to your domain: “detainee” vs. generic “prisoner”
  • Include an "other" type as a catch-all
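Category files also make extraction output checkable: an extracted entity can be validated against the YAML's type and tag keys. A sketch under assumed names (the dict mirrors what parsing people.yaml would yield; `validate_person` is hypothetical, not a pipeline function):

```python
# Shape mirrors the people.yaml structure shown above.
CATEGORIES = {
    "person_types": {"detainee": {}, "lawyer": {}, "journalist": {}, "other": {}},
    "person_tags": {"civil_rights": {}, "defense": {}},
}

def validate_person(person, categories=CATEGORIES):
    """Return a list of problems; empty list means the entity is valid."""
    errors = []
    if person.get("type") not in categories["person_types"]:
        errors.append(f"unknown type: {person.get('type')!r}")
    for tag in person.get("person_tags", []):
        if tag not in categories["person_tags"]:
            errors.append(f"unknown tag: {tag!r}")
    return errors

errs = validate_person(
    {"name": "Jane Doe", "type": "journalist", "person_tags": ["civil_rights"]}
)
```

A catch-all "other" type keeps this kind of check from rejecting entities the LLM cannot classify more precisely.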

Example: Guantanamo Domain Categories

Types: detainee, military, government, lawyer, journalist, other
Tags: civil_rights, immigration, defense, prosecution, policy, medical, intelligence, academic, religious, family, activist
Location: configs/guantanamo/categories/people.yaml

Extraction Prompts (prompts/*.md)

Prompts are written in Markdown and use placeholders for runtime interpolation.

Entity Extraction Prompt

# Extract People from Guantánamo Bay Articles

You are analyzing historical newspaper articles about Guantánamo Bay detention.

## Task
Extract all people mentioned in the article text. For each person, provide:

- **name**: Full name as it appears (e.g., "Carol Rosenberg", "Lt. Col. Morris Davis")
- **type**: One of: {person_types}
- **person_tags**: List of relevant tags from: {person_tags}
- **description**: Brief description of their role or relevance (1-2 sentences)
- **alternative_names**: List of alternative names, titles, or abbreviations

## Guidelines

1. **Use specific names**: Extract "Lt. Col. Morris Davis", not "military prosecutor"
2. **Include all mentions**: Even brief references count
3. **Preserve titles**: Keep military ranks, professional titles ("Dr.", "Judge")
4. **Tag accurately**: Use tags that reflect the person's role in this article
5. **No hallucinations**: Only extract people actually mentioned in the text

## Output Format

Return a JSON array of person objects. Example:

```json
[
  {
    "name": "Carol Rosenberg",
    "type": "journalist",
    "person_tags": [],
    "description": "Miami Herald reporter covering Guantanamo Bay detention issues",
    "alternative_names": []
  },
  {
    "name": "Mohamedou Ould Slahi",
    "type": "detainee",
    "person_tags": ["civil_rights"],
    "description": "Guantanamo detainee who wrote a memoir about his detention",
    "alternative_names": ["Slahi"]
  }
]
```

Configuration Loader API

These helpers load and query the domain configuration:

load_config() → Dict: Load the main config.yaml
load_categories(entity_type) → Dict: Load a category YAML (people.yaml, organizations.yaml, etc.)
load_prompt(entity_type) → str: Load an extraction prompt (Markdown)
load_profile_prompt(prompt_type) → str: Load a profile prompt ("generation", "update", "reflection")
get_similarity_threshold(entity_type?) → float: Get the similarity threshold, with fallback to the default
get_lexical_blocking_config(entity_type?) → Dict: Get lexical blocking settings with per-type overrides
get_concurrency_config() → Dict: Get worker/thread counts and queue limits
get_cache_config() → Dict: Get cache settings (extraction, embeddings, match_check, articles)
get_name_variants_config(entity_type) → Dict: Get equivalence groups and acronym stopwords
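Putting the loaders and prompt placeholders together: the `{person_types}` slot in a prompt is presumably filled from the category definitions at runtime. The sketch below assumes str.format-style interpolation and uses stand-in values for what load_prompt("people") and load_categories("people") might return; neither the interpolation scheme nor the exact return shapes are confirmed by this page.

```python
# Stand-in for load_categories("people"): only the keys matter here.
categories = {"person_types": {"detainee": {}, "journalist": {}, "lawyer": {}}}

# Stand-in for load_prompt("people"): a template containing a placeholder.
prompt_template = "type: One of: {person_types}"

# Fill the placeholder with the configured type names.
prompt = prompt_template.format(
    person_types=", ".join(sorted(categories["person_types"]))
)
# prompt is now "type: One of: detainee, journalist, lawyer"
```

This is why editing a categories/*.yaml file changes extraction behaviour without any code change: the allowed types flow straight into the prompt text.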

Example: Custom Domain Configuration

Here’s how you might configure a domain for Soviet-Afghan War research:
domain: "soviet_afghan_war"
description: "Soviet-Afghan War (1979-1989) military and political history"

data_sources:
  default_path: "data/soviet_afghan_war/raw_sources/articles.parquet"

output:
  directory: "data/soviet_afghan_war/entities"

dedup:
  similarity_thresholds:
    default: 0.75
    people: 0.80        # military names vary
    organizations: 0.78
    locations: 0.82     # geographic precision important
    events: 0.74        # battles have similar descriptions
  
  name_variants:
    organizations:
      equivalence_groups:
        - ["KGB", "Committee for State Security"]
        - ["Mujahideen", "Afghan Resistance"]
    locations:
      equivalence_groups:
        - ["Afghanistan", "Islamic Republic of Afghanistan"]
        - ["Kabul", "Kabul Province"]

performance:
  concurrency:
    extract_workers: 4  # conservative for local mode
    extract_per_article: 2
    llm_in_flight: 8

Best Practices

Start from the template. Always initialize new domains from the template:

just init my_research_domain

The template provides a working baseline with sensible defaults.

Test incrementally. Process a small batch (2-5 articles) after each configuration change:

just process --domain my_research_domain --limit 2 --verbose

Review extraction quality before processing the full corpus.
Write domain-specific prompts. Generic prompts produce generic results. Include:
  • Domain context: "You are analyzing articles about X…"
  • Real examples: Show the LLM what good output looks like
  • Edge cases: Address common errors ("Don't extract 'military bases' as a name…")
Tune similarity thresholds. The default similarity thresholds (0.75-0.82) work well for many domains, but:
  • Increase them for domains with high name ambiguity (common names, many variants)
  • Decrease them for domains with consistent naming (technical terms, proper nouns)
  • Monitor merge decisions in the logs to identify false positives/negatives
Version your caches. Bump cache.extraction.version when you change:
  • Extraction prompts (even minor wording changes)
  • Category definitions (types or tags)
  • The LLM model or temperature

This ensures re-extraction with the new configuration.

Next Steps

• Processing Pipeline: Learn how domain configurations drive the 5-stage pipeline
• Entity Types: Understand entity structure and required fields
• Quickstart: Process your first batch of articles with a custom domain
• System Architecture: Explore the producer-consumer model and evidence-first merge
