The Hinbox pipeline transforms raw historical source documents into structured entity databases through five distinct stages. Each stage is designed to maximize accuracy while minimizing computational cost.

Pipeline Overview

1. Relevance Filtering: determines whether an article is relevant to the research domain before extraction
2. Entity Extraction: extracts people, organizations, locations, and events using LLMs with structured output
3. Quality Control: validates extraction results and triggers a retry with repair hints on severe issues
4. Entity Merging: deduplicates entities using evidence-first similarity and LLM verification
5. Profile Generation: creates and maintains versioned narrative profiles for each entity

Stage 1: Relevance Filtering

Before extracting entities, the pipeline optionally checks whether an article is relevant to the research domain. This avoids wasting resources on off-topic documents.

Location: src/engine/relevance.py, src/engine/article_processor.py:70

How It Works

The relevance checker uses a domain-specific prompt loaded from configs/<domain>/prompts/relevance.md to determine if the article content matches the research focus.
# From src/engine/article_processor.py:70
def check_relevance(
    self,
    article_content: str,
    article_id: str,
) -> PhaseOutcome:
    """Check whether an article is relevant to the configured domain."""
    try:
        if self.model_type == "ollama":
            result = ollama_check_relevance(
                text=article_content,
                model=self.specific_model,
                domain=self.domain,
            )
        else:
            result = gemini_check_relevance(
                text=article_content,
                model=self.specific_model,
                domain=self.domain,
            )
        
        # Normalize result to boolean
        is_relevant = bool(result.is_relevant)
        reason = getattr(result, "reason", "")
        
        return PhaseOutcome.ok(
            "relevance",
            value=is_relevant,
            meta={"reason": reason},
        )
    except Exception as e:
        # On error, fail open (assume relevant)
        return PhaseOutcome.fail(
            "relevance",
            error=e,
            fallback=True,
            context={"article_id": article_id},
        )
Relevance checking is optional and controlled by the --relevance-check CLI flag or domain config:
processing:
  relevance_check: true
When disabled, all articles proceed directly to extraction.

PhaseOutcome Pattern

All pipeline stages return a PhaseOutcome object that carries both the result value and observability metadata:
PhaseOutcome.ok(
    "relevance",
    value=True,
    meta={"reason": "Article discusses Guantanamo detention policies"},
)
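The PhaseOutcome class itself is not reproduced in this section. A minimal sketch of the pattern, with field names beyond ok/fail assumed rather than taken from the source, might look like:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class PhaseOutcome:
    """Result wrapper carrying a stage's value plus observability metadata."""

    phase: str
    success: bool
    value: Any = None
    meta: Dict[str, Any] = field(default_factory=dict)
    error: Optional[Exception] = None

    @classmethod
    def ok(cls, phase: str, value: Any = None,
           meta: Optional[Dict[str, Any]] = None) -> "PhaseOutcome":
        """Successful stage result."""
        return cls(phase=phase, success=True, value=value, meta=meta or {})

    @classmethod
    def fail(cls, phase: str, error: Exception, fallback: Any = None,
             context: Optional[Dict[str, Any]] = None) -> "PhaseOutcome":
        """Failed stage result carrying a fallback value and error context."""
        return cls(phase=phase, success=False, value=fallback,
                   meta=context or {}, error=error)
```

Because every stage returns the same shape, callers can log successes, fallback use, and error context uniformly.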

Stage 2: Entity Extraction

The extraction stage runs concurrently for all four entity types (people, organizations, locations, events), using Instructor with Pydantic models for structured output.

Location: src/engine/article_processor.py:282, src/engine/extractors.py

Concurrent Extraction

Within each article, entity types can be extracted in parallel using a ThreadPoolExecutor:
# From src/engine/article_processor.py:305
if max_workers > 1:
    from concurrent.futures import ThreadPoolExecutor
    
    with ThreadPoolExecutor(
        max_workers=min(max_workers, len(entity_types))
    ) as pool:
        futures = {
            et: pool.submit(
                self.extract_single_entity_type,
                et,
                article_content,
                article_id,
            )
            for et in entity_types
        }
        # Iterate in stable order (not completion order)
        for et in entity_types:
            outcome = futures[et].result()
            entities[et] = outcome.value or []
            extraction_outcomes[et] = outcome.to_metadata_dict()
else:
    # Serial extraction (max_workers=1)
    for et in entity_types:
        outcome = self.extract_single_entity_type(
            et, article_content, article_id
        )
        entities[et] = outcome.value or []
        extraction_outcomes[et] = outcome.to_metadata_dict()
The extract_per_article setting in the domain config controls how many entity types are extracted in parallel:
performance:
  concurrency:
    extract_per_article: 4  # all 4 types in parallel
Set to 1 for serial extraction (useful for debugging).

EntityExtractor

The EntityExtractor class provides a unified interface for both cloud (Gemini) and local (Ollama) extraction.

Location: src/engine/extractors.py
class EntityExtractor:
    def __init__(self, entity_type: str, domain: str):
        self.entity_type = entity_type
        self.domain = domain
        self.prompt = DomainConfig(domain).load_prompt(entity_type)
        self.model_class = ...  # dynamically loaded Pydantic model for this entity type
    
    def extract_cloud(
        self,
        text: str,
        model: str = CLOUD_MODEL,
        temperature: float = 0,
        repair_hint: Optional[str] = None,
    ) -> List[BaseModel]:
        """Extract entities using cloud LLM (Gemini)."""
    
    def extract_local(
        self,
        text: str,
        model: str = OLLAMA_MODEL,
        temperature: float = 0,
        repair_hint: Optional[str] = None,
    ) -> List[BaseModel]:
        """Extract entities using local LLM (Ollama)."""

Extraction Caching

Extraction results are cached in sidecar JSON files keyed by:
  • Content hash (SHA-256 of article text)
  • Model name and temperature
  • Prompt template hash
  • Pydantic schema hash
Location: src/utils/extraction_cache.py

Cache hits skip expensive LLM calls entirely, providing a 10-100x speedup on re-runs.
data/<domain>/entities/cache/extractions/
├── a3f2e1b4c9d8..._people.json
├── a3f2e1b4c9d8..._organizations.json
└── ...
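A sketch of how a cache key could be derived from the four listed components (function and field names are illustrative, not the project's actual API):

```python
import hashlib
import json


def extraction_cache_key(
    article_text: str,
    model: str,
    temperature: float,
    prompt_template: str,
    schema_json: str,
) -> str:
    """Combine every input that affects extraction output into one stable key."""
    parts = {
        "content": hashlib.sha256(article_text.encode("utf-8")).hexdigest(),
        "model": model,
        "temperature": temperature,
        "prompt": hashlib.sha256(prompt_template.encode("utf-8")).hexdigest(),
        "schema": hashlib.sha256(schema_json.encode("utf-8")).hexdigest(),
    }
    blob = json.dumps(parts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Any change to article text, model, temperature, prompt, or schema yields a different key, so stale cache entries are never reused.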

Stage 3: Quality Control & Retry

After extraction, each entity type goes through quality control checks. Severe issues trigger a single retry attempt with a repair hint.

Location: src/engine/article_processor.py:153, src/utils/quality_controls.py

QC Checks

Missing Required Fields

Drops entities lacking mandatory fields (name, type, etc.)

Within-Article Duplicates

Deduplicates entities with identical normalized names

Low-Quality Names

Flags generic plurals like “defense departments” or “military bases”

Zero Entities

Triggers a retry if extraction returned an empty list for a relevant article

Retry Logic

When severe QC flags are detected, the extractor retries once with a repair hint appended to the system prompt:
# From src/engine/article_processor.py:175
# Attempt 1
raw_dicts = self._run_extraction(extractor, article_content)
cleaned, qc_report = run_extraction_qc(
    entity_type=entity_type,
    entities=raw_dicts,
    domain=self.domain,
)

# Conditional retry
if _should_retry_extraction(qc_report.flags):
    logger.info(f"Retrying {entity_type} extraction (triggers: {qc_report.flags})")
    
    hint = _build_repair_hint(entity_type, qc_report.flags)
    raw_dicts_v2 = self._run_extraction(
        extractor, article_content, repair_hint=hint
    )
    cleaned_v2, qc_report_v2 = run_extraction_qc(
        entity_type=entity_type,
        entities=raw_dicts_v2,
        domain=self.domain,
    )
    
    # Pick the better result (higher output_count wins)
    use_v2 = qc_report_v2.output_count > qc_report.output_count
    if use_v2:
        cleaned, qc_report = cleaned_v2, qc_report_v2
Retry trigger flags:
  • zero_entities — extraction returned empty list
  • high_drop_rate — >30% of entities dropped due to missing fields
  • many_duplicates — >20% duplicate names
  • many_low_quality_names — >30% generic/plural names

Repair Hints

Repair hints describe what went wrong on the first attempt:
IMPORTANT — Previous extraction of people had quality issues 
(zero_entities). Please ensure all required fields are populated, 
avoid duplicate entries, and return every relevant entity found 
in the text as a complete JSON array.

Stage 4: Entity Merging

The merger deduplicates entities across articles using an evidence-first cascade that moves from cheap to expensive checks.

Location: src/engine/mergers.py:110

Merge Decision Cascade

1. Exact key match: if the entity key (e.g., "John Doe" or ("ACLU", "legal")) already exists, merge immediately without further checks.
2. Lexical blocking (RapidFuzz): fast fuzzy string matching narrows candidates to ~50 entities (threshold: 60/100). Location: src/engine/mergers.py, using the RapidFuzz library.
3. Evidence text embeddings: build evidence text from article mentions plus context windows, embed it, and compute cosine similarity against existing entities. Thresholds (from configs/guantanamo/config.yaml):
  • People: 0.82
  • Organizations: 0.78
  • Locations: 0.80
  • Events: 0.76
4. LLM match check (temperature=0): deterministic yes/no/uncertain decision using the article text plus the existing profile. Location: src/engine/match_checker.py
5. Dispute agent (gray band): when the match checker returns "uncertain" or confidence falls in the gray band (0.60-0.75), a second LLM arbitrates with chain-of-thought reasoning. Location: src/engine/merge_dispute_agent.py
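Leaving aside lexical blocking, which only narrows the candidate list, the remaining decisions might compose like this (a sketch using the thresholds quoted above, not the actual mergers.py logic):

```python
from typing import Callable, Tuple


def decide_merge(
    key_match: bool,
    similarity: float,
    llm_check: Callable[[], str],      # returns "yes" | "no" | "uncertain"
    dispute_agent: Callable[[], bool],
    threshold: float = 0.82,           # people threshold from the config
    gray_band: Tuple[float, float] = (0.60, 0.75),
) -> bool:
    """Walk the cascade from cheapest to most expensive check."""
    if key_match:
        # 1. Exact key match: merge immediately, no further checks.
        return True
    if similarity < threshold:
        # 3. Embedding gate failed; the gray band still goes to arbitration.
        if gray_band[0] <= similarity <= gray_band[1]:
            return dispute_agent()     # 5. Second LLM arbitrates.
        return False
    # 4. Deterministic LLM match check at temperature 0.
    verdict = llm_check()
    if verdict == "uncertain":
        return dispute_agent()         # 5. Arbitration on "uncertain".
    return verdict == "yes"
```

Short-circuiting this way keeps LLM calls rare: most pairs resolve at the key or embedding stage.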

Evidence-First Similarity

Instead of embedding LLM-generated profile narratives, Hinbox embeds evidence text built from actual article mentions:
# From src/engine/mergers.py (conceptual)
def _build_evidence_text(
    entity_name: str,
    article_text: str,
    max_chars: int = 1500,
    window_chars: int = 240,
    max_windows: int = 3,
) -> str:
    """
    Build evidence text from article mentions with context windows.
    
    Example output:
    "John Doe, detainee, described in source as: 'John Doe was 
    transferred to Guantanamo in 2002 after being captured in 
    Afghanistan...' [240 chars] '...Doe's lawyer filed a habeas 
    corpus petition...' [240 chars]"
    """
This provides an apples-to-apples comparison during similarity search: new-entity evidence vs. existing-entity evidence.
Evidence parameters are configurable per domain:
merge_evidence:
  max_chars: 1500      # total evidence text length
  window_chars: 240    # context window size around mentions
  max_windows: 3       # max context windows to extract
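A runnable approximation of the evidence-text builder using these parameters (the exact windowing and joining rules are assumptions):

```python
def build_evidence_text(
    entity_name: str,
    article_text: str,
    max_chars: int = 1500,
    window_chars: int = 240,
    max_windows: int = 3,
) -> str:
    """Collect context windows around each mention of the entity."""
    windows = []
    lower_text = article_text.lower()
    needle = entity_name.lower()
    start = 0
    while len(windows) < max_windows:
        idx = lower_text.find(needle, start)
        if idx == -1:
            break
        half = window_chars // 2
        lo = max(0, idx - half)
        hi = min(len(article_text), idx + len(needle) + half)
        windows.append(article_text[lo:hi])
        start = idx + len(needle)
    evidence = f"{entity_name}: " + " ... ".join(windows)
    return evidence[:max_chars]
```

The result is grounded in what the articles actually say about the entity, rather than in LLM-generated prose.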

Canonical Name Selection

When merging entities, Hinbox picks the best canonical name based on:
  • Specificity: Longer, more complete names (“Barack H. Obama” > “Barack Obama”)
  • Frequency: Names appearing in more articles
  • Officialness: Names matching known acronyms or formal titles
Location: src/utils/name_variants.py:score_canonical_name
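A sketch of how the three signals might combine into a single score (the weights are illustrative; the real scorer lives in src/utils/name_variants.py):

```python
def score_canonical_name(
    name: str,
    article_count: int,
    known_official_names: set,
) -> float:
    """Higher score = better canonical-name candidate."""
    specificity = len(name)           # longer, fuller names win ties
    frequency = article_count * 10    # names seen in more articles rank higher
    officialness = 100 if name in known_official_names else 0
    return specificity + frequency + officialness
```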

Name Variants & Equivalence

Domains can define equivalence groups for known aliases:
dedup:
  name_variants:
    organizations:
      equivalence_groups:
        - ["Department of Defense", "Defense Department", "DoD", "Pentagon"]
        - ["American Civil Liberties Union", "ACLU"]
        - ["Joint Task Force Guantanamo", "JTF-GTMO", "JTF GTMO"]
    locations:
      equivalence_groups:
        - ["Guantanamo Bay", "Guantanamo", "GTMO", "Naval Station Guantanamo Bay"]
        - ["United States", "U.S.", "US"]
Equivalence groups enable instant blocking without LLM calls.
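One way such groups enable instant blocking is to compile them into an alias-to-canonical lookup (choosing the first member of each group as canonical is an assumption):

```python
def build_alias_index(equivalence_groups: list) -> dict:
    """Map every alias (lowercased) to its group's first name."""
    index = {}
    for group in equivalence_groups:
        canonical = group[0]
        for alias in group:
            index[alias.lower()] = canonical
    return index


aliases = build_alias_index([
    ["Department of Defense", "Defense Department", "DoD", "Pentagon"],
    ["American Civil Liberties Union", "ACLU"],
])
```

A dictionary lookup resolves known aliases in O(1), with no fuzzy matching or LLM call needed.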

Stage 5: Profile Generation

Once entities are merged, Hinbox generates (or updates) a versioned narrative profile for each entity. Location: src/engine/profiles.py

VersionedProfile

Profiles maintain a complete version history, enabling temporal analysis:
class VersionedProfile(BaseModel):
    """Container for versioned profile history."""
    
    current_version: int = 1
    versions: List[ProfileVersion] = Field(default_factory=list)
    
    def add_version(
        self, 
        profile_data: Dict, 
        trigger_article_id: Optional[str] = None
    ) -> ProfileVersion:
        """Add a new version to the history."""
        new_version = ProfileVersion(
            version_number=len(self.versions) + 1,
            profile_data=copy.deepcopy(profile_data),
            trigger_article_id=trigger_article_id,
        )
        self.versions.append(new_version)
        self.current_version = new_version.version_number
        return new_version
Profile versioning is controlled by the ENABLE_PROFILE_VERSIONING constant in src/constants.py. When enabled, every profile update creates a new version with timestamp and triggering article ID.
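To show how the history accrues, here is the container rebuilt with dataclasses, including a minimal ProfileVersion stand-in, so the example is self-contained (the real model presumably also records a timestamp):

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProfileVersion:
    """Minimal stand-in for the real model."""
    version_number: int
    profile_data: Dict
    trigger_article_id: Optional[str] = None


@dataclass
class VersionedProfile:
    current_version: int = 1
    versions: List[ProfileVersion] = field(default_factory=list)

    def add_version(self, profile_data: Dict,
                    trigger_article_id: Optional[str] = None) -> ProfileVersion:
        """Append a new version and advance the current-version pointer."""
        new_version = ProfileVersion(
            version_number=len(self.versions) + 1,
            profile_data=copy.deepcopy(profile_data),
            trigger_article_id=trigger_article_id,
        )
        self.versions.append(new_version)
        self.current_version = new_version.version_number
        return new_version


profile = VersionedProfile()
profile.add_version({"summary": "Detainee at Guantanamo"}, trigger_article_id="art-001")
profile.add_version({"summary": "Detainee; habeas petition filed"}, trigger_article_id="art-002")
```

Each update records which article triggered it, so a profile's evolution can be traced back to specific sources.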

Profile QC & Grounding

After profile generation, Hinbox runs a grounding verification batch job to ensure that claims are supported by source articles.

Location: src/utils/quality_controls.py:verify_profile_grounding

The grounding report includes:
  • Citation extraction: Parse [article_id] citations from profile text
  • Claim verification: Check each claim against source article text
  • Grounding score: Percentage of verified citations (0.0-1.0)
{
  "total_citations": 5,
  "verified": 5,
  "unverified": 0,
  "grounding_score": 1.0,
  "profile_text_hash": "a3f2e1b4c9d8..."
}
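The citation-extraction step can be sketched with a regex over the [article_id] syntax shown above (the pattern and helper names are assumptions):

```python
import re
from typing import List


def extract_citations(profile_text: str) -> List[str]:
    """Pull [article_id] citations out of profile prose."""
    return re.findall(r"\[([^\]\s]+)\]", profile_text)


def grounding_score(total_citations: int, verified: int) -> float:
    """Fraction of citations verified against source text (0.0-1.0)."""
    return verified / total_citations if total_citations else 0.0
```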

Pipeline Metrics

At the end of processing, the pipeline logs comprehensive statistics:
Processing complete

Articles read: 100
Articles processed: 87
Articles skipped (relevance): 8
Articles skipped (already processed): 5

Final entity counts:
• People: 342
• Organizations: 156
• Locations: 89
• Events: 203

Reflection statistics:
• Total reflection attempts: 42
• Average reflection attempts per article: 0.48

Grounding complete: 234 verified, 301 unchanged, 256 no citations
Location: src/process_and_extract.py:366

Next Steps

Entity Types

Learn about entity structure, required fields, and tags

Domain Configuration

Configure prompts, thresholds, and categories for your research

System Architecture

Understand the producer-consumer model and sidecar files

Running the Pipeline

Process your first batch of historical sources
