Hinbox uses domain-specific configurations to adapt the extraction pipeline to different historical research areas. Each domain defines its own entity types, prompts, and processing parameters without requiring code changes.

Domain Structure

A domain configuration consists of these files:
configs/<domain>/
├── config.yaml              # Main configuration
├── categories/              # Entity type definitions
│   ├── people.yaml
│   ├── organizations.yaml
│   ├── locations.yaml
│   └── events.yaml
└── prompts/                 # LLM extraction prompts
    ├── people.md
    ├── organizations.md
    ├── locations.md
    ├── events.md
    ├── relevance.md
    ├── profile_generation.md
    ├── profile_update.md
    └── profile_reflection.md

Creating a New Domain

1. Initialize the domain from the template:

   just init palestine_food_history

   This copies the template to configs/palestine_food_history/ with generic categories and prompts.

2. Configure domain settings: edit config.yaml to set the research focus, data paths, and thresholds.

3. Define entity types: edit the categories/*.yaml files to specify the entity types and tags relevant to your research domain.

4. Customize extraction prompts: edit the prompts/*.md files to provide domain-specific extraction instructions.

5. Test with sample articles:

   just process --domain palestine_food_history --limit 2 --verbose

Main Configuration (config.yaml)

The main configuration file controls data paths, thresholds, and performance settings.

Basic Settings

domain: "guantanamo"
description: "Guantánamo Bay detention and related issues"

# Data source configuration
data_sources:
  default_path: "data/guantanamo/raw_sources/miami_herald_articles.parquet"

# Output configuration  
output:
  directory: "data/guantanamo/entities"
domain (string, required): Unique identifier for this research domain (lowercase, underscores allowed)
description (string): Brief description of the research focus
data_sources.default_path (string, required): Path to the source articles Parquet file
output.directory (string, required): Directory for entity Parquet files and cache

Deduplication Configuration

Control how entities are deduplicated during the merge stage:
dedup:
  similarity_thresholds:
    default: 0.75
    people: 0.82
    organizations: 0.78
    locations: 0.80
    events: 0.76
  
  lexical_blocking:
    enabled: true
    threshold: 60        # RapidFuzz score cutoff (0-100)
    max_candidates: 50   # max entities to run cosine similarity against
  
  name_variants:
    organizations:
      equivalence_groups:
        - ["Department of Defense", "Defense Department", "DoD", "Pentagon"]
        - ["American Civil Liberties Union", "ACLU"]
    locations:
      equivalence_groups:
        - ["Guantanamo Bay", "Guantanamo", "GTMO"]
        - ["United States", "U.S.", "US"]
Similarity thresholds are cosine similarity values (0.0-1.0) for embedding-based merge decisions. Higher values require stronger similarity before merging.

Recommendations:
  • People: 0.80-0.85 (names vary widely, require high confidence)
  • Organizations: 0.75-0.80 (moderate variation)
  • Locations: 0.78-0.82 (geographic specificity important)
  • Events: 0.70-0.78 (temporal context helps disambiguation)
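The threshold lookup can be sketched as follows. This is an illustration only, not the pipeline's actual merge code: `cosine` and `should_merge` are hypothetical names, and the real scorer lives in the merge stage.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def should_merge(vec_a, vec_b, entity_type, thresholds):
    # Per-type threshold, falling back to "default" when none is set.
    threshold = thresholds.get(entity_type, thresholds["default"])
    return cosine(vec_a, vec_b) >= threshold

thresholds = {"default": 0.75, "people": 0.82}
should_merge([1.0, 0.0], [1.0, 0.0], "people", thresholds)  # identical vectors clear any threshold
```

Raising `people` to 0.82 means two person embeddings must be noticeably closer than the default 0.75 before a merge is proposed.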
Lexical blocking is fast fuzzy string matching that narrows merge candidates before the expensive embedding/LLM checks.
  • enabled: Enable/disable lexical blocking (default: true)
  • threshold: RapidFuzz score 0-100 (default: 60)
  • max_candidates: Maximum entities to pass to the embedding stage (default: 50)

Location: src/engine/mergers.py (uses the RapidFuzz library)
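The blocking step amounts to "score every known name, keep the top matches". A minimal sketch, using the standard library's difflib as a dependency-free stand-in for RapidFuzz (both yield a 0-100-style similarity; the real implementation in src/engine/mergers.py uses RapidFuzz scorers and may rank differently):

```python
import difflib

def block_candidates(new_name, existing_names, threshold=60, max_candidates=50):
    """Return the best fuzzy matches for new_name, capped at max_candidates."""
    scored = [
        (name, 100 * difflib.SequenceMatcher(None, new_name.lower(), name.lower()).ratio())
        for name in existing_names
    ]
    passing = sorted(
        ((name, score) for name, score in scored if score >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [name for name, _ in passing[:max_candidates]]

block_candidates("Dept. of Defense", ["Department of Defense", "Carol Rosenberg"])
```

Only the names that survive this filter are sent on for embedding similarity, which is what keeps `max_candidates` a hard cap on per-entity merge cost.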
Name variants define known aliases for instant blocking without LLM calls. When any name in a group is encountered, all variants are considered equivalent.

Use cases:
  • Government agencies: ["DoD", "Department of Defense", "Pentagon"]
  • Acronyms: ["ACLU", "American Civil Liberties Union"]
  • Geographic variants: ["U.S.", "United States", "USA"]

Location: src/utils/name_variants.py:names_likely_same
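Conceptually, an equivalence-group check is a case-insensitive set-membership test. The sketch below is illustrative only: `in_same_group` is a hypothetical helper, and the actual function is names_likely_same in src/utils/name_variants.py, whose signature is not shown here.

```python
# Groups mirror the dedup.name_variants config shown above.
EQUIVALENCE_GROUPS = [
    ["Department of Defense", "Defense Department", "DoD", "Pentagon"],
    ["American Civil Liberties Union", "ACLU"],
]

def in_same_group(name_a, name_b, groups=EQUIVALENCE_GROUPS):
    """True if both names appear in the same configured equivalence group."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    for group in groups:
        lowered = {name.lower() for name in group}
        if a in lowered and b in lowered:
            return True
    return False
```

Because this is a pure lookup, matches here can short-circuit both the embedding comparison and any LLM match check.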

Performance Configuration

performance:
  concurrency:
    extract_workers: 8        # parallel articles in extraction phase
    extract_per_article: 4    # parallel entity types within article
    llm_in_flight: 16         # max concurrent cloud LLM calls
    ollama_in_flight: 2       # max concurrent Ollama calls (local mode)
  queue:
    max_buffered_articles: 32  # backpressure limit for extraction → merge

# Batching configuration (for embedding calls during merge)
batching:
  embed_batch_size: 64        # texts per embedding API call
  embed_drain_timeout_ms: 100 # reserved for future async drain behaviour
Tuning guidelines:
  • extract_workers: Set to number of CPU cores for local mode, or 8-16 for cloud APIs
  • extract_per_article: Usually 4 (all entity types in parallel), set to 1 for debugging
  • llm_in_flight: Balance between throughput and API rate limits (Gemini: 16-32)
  • ollama_in_flight: Keep low (1-2) to avoid overwhelming local GPU
  • embed_batch_size: Larger batches reduce API calls but increase latency (32-128)
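One common way to enforce an in-flight cap like `llm_in_flight` is a semaphore wrapped around the API call while a thread pool fans out the work. This is a sketch of the pattern under assumed names, not the pipeline's actual implementation:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Mirrors performance.concurrency.llm_in_flight: at most 16 calls at once.
LLM_IN_FLIGHT = threading.Semaphore(16)

def call_llm(prompt):
    with LLM_IN_FLIGHT:  # blocks if 16 calls are already in flight
        return f"response:{prompt}"  # stand-in for the real API call

# extract_workers-style fan-out over articles; map preserves input order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(call_llm, [f"article-{i}" for i in range(4)]))
```

Separating the two knobs lets you keep many extraction workers busy (CPU-side parsing, caching) while the semaphore alone bounds pressure on the LLM provider.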

Caching Configuration

cache:
  enabled: true
  embeddings:
    lru_max_items: 4096         # in-memory LRU for embedding vectors
  extraction:
    enabled: true
    subdir: "cache/extractions"  # persistent sidecar under output dir
    version: 1                   # bump to invalidate all cached extractions
  match_check:
    enabled: true
    max_items: 8192              # per-run LRU for match-checker results
  articles:
    skip_if_unchanged: true      # skip articles whose content hash hasn't changed
Cache invalidation: Bump cache.extraction.version when you:
  • Change extraction prompts
  • Modify entity type definitions
  • Upgrade to a new LLM model
  • Change temperature or other generation parameters
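Why does bumping the version invalidate everything? If the version number is folded into each cache key, every key changes at once and old entries simply stop matching. A sketch of such a version-aware key (an assumed scheme; the pipeline's real key format may differ):

```python
import hashlib
import json

def extraction_cache_key(article_text, entity_type, prompt, version=1):
    """Deterministic cache key covering the article, prompt, and config version."""
    payload = json.dumps(
        {
            "v": version,  # cache.extraction.version
            "type": entity_type,
            "prompt": hashlib.sha256(prompt.encode()).hexdigest(),
            "article": hashlib.sha256(article_text.encode()).hexdigest(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Bumping the version changes every key, so stale extractions are never reused.
old = extraction_cache_key("text", "people", "prompt", version=1)
new = extraction_cache_key("text", "people", "prompt", version=2)
```

The same logic explains `articles.skip_if_unchanged`: hashing the article content means an unchanged article produces the same key and can be skipped outright.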

Merge Evidence Configuration

Control how evidence text is built for embedding-based similarity:
merge_evidence:
  max_chars: 1500         # truncate evidence text to this length
  window_chars: 240       # characters per context window around mentions
  max_windows: 3          # max context windows to extract from article
Location: src/engine/mergers.py:_build_evidence_text

Evidence text format:
"John Doe, detainee, described in source as: 'John Doe was 
transferred to Guantanamo in 2002 after being captured in 
Afghanistan...' [240 chars] '...Doe's lawyer filed a habeas 
corpus petition...' [240 chars]"
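The windowing behind that format can be sketched as follows. This is a simplified illustration under the `merge_evidence` settings; the authoritative version is `_build_evidence_text` in src/engine/mergers.py, which also prepends the entity name and type.

```python
def context_windows(article, name, window_chars=240, max_windows=3, max_chars=1500):
    """Collect up to max_windows snippets around mentions of name, then truncate."""
    windows, start = [], 0
    while len(windows) < max_windows:
        idx = article.find(name, start)
        if idx == -1:
            break
        half = window_chars // 2
        # Take window_chars of context centred on the mention.
        windows.append(article[max(0, idx - half): idx + len(name) + half])
        start = idx + len(name)  # continue searching after this mention
    return " ... ".join(windows)[:max_chars]
```

Keeping the evidence short (`max_chars: 1500`) bounds embedding cost while still giving the similarity model the surrounding sentences that disambiguate the mention.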

Category Definitions (categories/*.yaml)

Each entity type has a YAML file defining valid types and tags.

Structure

person_types:
  detainee:
    description: "A person who is or was detained at Guantánamo Bay"
    examples: ["Mohamedou Ould Slahi", "David Hicks"]
  
  lawyer:
    description: "Attorneys, legal representatives, and legal professionals"
    examples: ["Clive Stafford Smith", "Gitanjali Gutierrez"]
  
  journalist:
    description: "Reporters, writers, and media professionals"
    examples: ["Carol Rosenberg", "Andy Worthington"]

person_tags:
  civil_rights:
    description: "People involved in civil rights advocacy"
    examples: ["ACLU lawyers", "Human rights activists"]
  
  defense:
    description: "People involved in defense or military defense roles"
    examples: ["Defense attorneys", "Military defense counsel"]
Naming:
  • File names: people.yaml, organizations.yaml, locations.yaml, events.yaml
  • Type keys: person_types, organization_types, location_types, event_types
  • Tag keys: person_tags, event_tags (organizations/locations don’t have tags)
Field requirements:
  • description: Clear, concise definition of the type/tag
  • examples: 2-3 real examples from your domain
Type naming:
  • Use lowercase with underscores: detention_facility, military_operation
  • Be specific to your domain: “detainee” vs. generic “prisoner”
  • Include an "other" type as a catch-all
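Category files also make extraction output checkable: an extracted entity can be validated against the YAML's type and tag keys. A sketch under assumed names (the dict mirrors what parsing people.yaml would yield; `validate_person` is hypothetical, not a pipeline function):

```python
# Shape mirrors the people.yaml structure shown above.
CATEGORIES = {
    "person_types": {"detainee": {}, "lawyer": {}, "journalist": {}, "other": {}},
    "person_tags": {"civil_rights": {}, "defense": {}},
}

def validate_person(person, categories=CATEGORIES):
    """Return a list of problems; empty list means the entity is valid."""
    errors = []
    if person.get("type") not in categories["person_types"]:
        errors.append(f"unknown type: {person.get('type')!r}")
    for tag in person.get("person_tags", []):
        if tag not in categories["person_tags"]:
            errors.append(f"unknown tag: {tag!r}")
    return errors

errs = validate_person(
    {"name": "Jane Doe", "type": "journalist", "person_tags": ["civil_rights"]}
)
```

A catch-all "other" type keeps this kind of check from rejecting entities the LLM cannot classify more precisely.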

Example: Guantanamo Domain Categories

Types: detainee, military, government, lawyer, journalist, other
Tags: civil_rights, immigration, defense, prosecution, policy, medical, intelligence, academic, religious, family, activist
Location: configs/guantanamo/categories/people.yaml

Extraction Prompts (prompts/*.md)

Prompts are written in Markdown and use placeholders for runtime interpolation.

Entity Extraction Prompt

# Extract People from Guantánamo Bay Articles

You are analyzing historical newspaper articles about Guantánamo Bay detention.

## Task
Extract all people mentioned in the article text. For each person, provide:

- **name**: Full name as it appears (e.g., "Carol Rosenberg", "Lt. Col. Morris Davis")
- **type**: One of: {person_types}
- **person_tags**: List of relevant tags from: {person_tags}
- **description**: Brief description of their role or relevance (1-2 sentences)
- **alternative_names**: List of alternative names, titles, or abbreviations

## Guidelines

1. **Use specific names**: Extract "Lt. Col. Morris Davis", not "military prosecutor"
2. **Include all mentions**: Even brief references count
3. **Preserve titles**: Keep military ranks, professional titles ("Dr.", "Judge")
4. **Tag accurately**: Use tags that reflect the person's role in this article
5. **No hallucinations**: Only extract people actually mentioned in the text

## Output Format

Return a JSON array of person objects. Example:

```json
[
  {
    "name": "Carol Rosenberg",
    "type": "journalist",
    "person_tags": [],
    "description": "Miami Herald reporter covering Guantanamo Bay detention issues",
    "alternative_names": []
  },
  {
    "name": "Mohamedou Ould Slahi",
    "type": "detainee",
    "person_tags": ["civil_rights"],
    "description": "Guantanamo detainee who wrote a memoir about his detention",
    "alternative_names": ["Slahi"]
  }
]
```

Configuration Loader API

These helpers load and query the domain configuration:

load_config() → Dict: Load the main config.yaml
load_categories(entity_type) → Dict: Load a category YAML (people.yaml, organizations.yaml, etc.)
load_prompt(entity_type) → str: Load an extraction prompt (Markdown)
load_profile_prompt(prompt_type) → str: Load a profile prompt ("generation", "update", "reflection")
get_similarity_threshold(entity_type?) → float: Get the similarity threshold, with fallback to the default
get_lexical_blocking_config(entity_type?) → Dict: Get lexical blocking settings with per-type overrides
get_concurrency_config() → Dict: Get worker/thread counts and queue limits
get_cache_config() → Dict: Get cache settings (extraction, embeddings, match_check, articles)
get_name_variants_config(entity_type) → Dict: Get equivalence groups and acronym stopwords
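Putting the loaders and prompt placeholders together: the `{person_types}` slot in a prompt is presumably filled from the category definitions at runtime. The sketch below assumes str.format-style interpolation and uses stand-in values for what load_prompt("people") and load_categories("people") might return; neither the interpolation scheme nor the exact return shapes are confirmed by this page.

```python
# Stand-in for load_categories("people"): only the keys matter here.
categories = {"person_types": {"detainee": {}, "journalist": {}, "lawyer": {}}}

# Stand-in for load_prompt("people"): a template containing a placeholder.
prompt_template = "type: One of: {person_types}"

# Fill the placeholder with the configured type names.
prompt = prompt_template.format(
    person_types=", ".join(sorted(categories["person_types"]))
)
# prompt is now "type: One of: detainee, journalist, lawyer"
```

This is why editing a categories/*.yaml file changes extraction behaviour without any code change: the allowed types flow straight into the prompt text.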

Example: Custom Domain Configuration

Here’s how you might configure a domain for Soviet-Afghan War research:
domain: "soviet_afghan_war"
description: "Soviet-Afghan War (1979-1989) military and political history"

data_sources:
  default_path: "data/soviet_afghan_war/raw_sources/articles.parquet"

output:
  directory: "data/soviet_afghan_war/entities"

dedup:
  similarity_thresholds:
    default: 0.75
    people: 0.80        # military names vary
    organizations: 0.78
    locations: 0.82     # geographic precision important
    events: 0.74        # battles have similar descriptions
  
  name_variants:
    organizations:
      equivalence_groups:
        - ["KGB", "Committee for State Security"]
        - ["Mujahideen", "Afghan Resistance"]
    locations:
      equivalence_groups:
        - ["Afghanistan", "Islamic Republic of Afghanistan"]
        - ["Kabul", "Kabul Province"]

performance:
  concurrency:
    extract_workers: 4  # conservative for local mode
    extract_per_article: 2
    llm_in_flight: 8

Best Practices

Start from the template. Always initialize new domains from the template:

just init my_research_domain

The template provides a working baseline with sensible defaults.

Test incrementally. Process a small batch (2-5 articles) after each configuration change:

just process --domain my_research_domain --limit 2 --verbose

Review extraction quality before processing the full corpus.
Write domain-specific prompts. Generic prompts produce generic results. Include:
  • Domain context: "You are analyzing articles about X…"
  • Real examples: Show the LLM what good output looks like
  • Edge cases: Address common errors ("Don't extract 'military bases' as a name…")
Tune similarity thresholds. The default similarity thresholds (0.75-0.82) work well for many domains, but:
  • Increase them for domains with high name ambiguity (common names, many variants)
  • Decrease them for domains with consistent naming (technical terms, proper nouns)
  • Monitor merge decisions in the logs to identify false positives/negatives
Version your caches. Bump cache.extraction.version when you change:
  • Extraction prompts (even minor wording changes)
  • Category definitions (types or tags)
  • The LLM model or temperature

This ensures re-extraction with the new configuration.

Next Steps

• Processing Pipeline: Learn how domain configurations drive the 5-stage pipeline
• Entity Types: Understand entity structure and required fields
• Quickstart: Process your first batch of articles with a custom domain
• System Architecture: Explore the producer-consumer model and evidence-first merge
