Pipeline Overview
Entity Extraction
Extracts people, organizations, locations, and events using LLMs with structured output
Stage 1: Relevance Filtering
Before extracting entities, the pipeline optionally checks whether an article is relevant to the research domain. This prevents wasting resources on off-topic documents. Location: src/engine/relevance.py, src/engine/article_processor.py:70
How It Works
The relevance checker uses a domain-specific prompt loaded from configs/<domain>/prompts/relevance.md to determine if the article content matches the research focus.
Relevance checking is optional and controlled by the --relevance-check CLI flag or domain config. When disabled, all articles proceed directly to extraction.
PhaseOutcome Pattern
All pipeline stages return a PhaseOutcome object that carries both the result value and observability metadata:
- Success
- Failure
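The exact fields live in the pipeline source; a minimal sketch of the pattern (field and method names here are illustrative, not taken from the codebase) might look like:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class PhaseOutcome:
    """Result wrapper returned by every pipeline stage (illustrative sketch)."""

    ok: bool                                      # did the stage succeed?
    value: Any = None                             # stage result on success
    error: Optional[str] = None                   # error message on failure
    metadata: dict = field(default_factory=dict)  # timings, model name, etc.

    @classmethod
    def success(cls, value, **metadata):
        return cls(ok=True, value=value, metadata=metadata)

    @classmethod
    def failure(cls, error, **metadata):
        return cls(ok=False, error=error, metadata=metadata)


# A stage returns its payload plus observability data in one object.
outcome = PhaseOutcome.success([{"name": "John Doe"}], duration_s=1.2)
```

Callers can branch on `ok` without losing the metadata needed for logging either way.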
Stage 2: Entity Extraction
The extraction stage runs concurrently for all four entity types (people, organizations, locations, events) using Instructor with Pydantic models for structured output. Location: src/engine/article_processor.py:282, src/engine/extractors.py
Concurrent Extraction
Within each article, entity types can be extracted in parallel using a ThreadPoolExecutor:
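A minimal sketch of per-type fan-out (the `extract_fn` signature is hypothetical, standing in for the real extractor call):

```python
from concurrent.futures import ThreadPoolExecutor

ENTITY_TYPES = ["people", "organizations", "locations", "events"]


def extract_all(article_text: str, extract_fn, max_workers: int = 4) -> dict:
    """Run one extraction call per entity type concurrently.

    extract_fn(article_text, entity_type) -> list of entities (hypothetical).
    With max_workers=1 this degrades to serial extraction, matching
    extract_per_article: 1 in the domain config.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {t: pool.submit(extract_fn, article_text, t) for t in ENTITY_TYPES}
        return {t: f.result() for t, f in futures.items()}


# Stub extractor, used here only to show the call shape.
results = extract_all("Some article text", lambda text, t: [f"{t}-entity"])
```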
The extract_per_article setting in the domain config controls how many entity types run in parallel. Set it to 1 for serial extraction (useful for debugging).
EntityExtractor
The EntityExtractor class provides a unified interface for both cloud (Gemini) and local (Ollama) extraction:
Location: src/engine/extractors.py
Extractor Interface
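The real interface is defined in src/engine/extractors.py; a hedged sketch of its shape (method name and parameters are illustrative) could be:

```python
from typing import Protocol


class Extractor(Protocol):
    """Minimal sketch of a unified extractor interface.

    The real class dispatches to Gemini or Ollama, and `entity_type`
    would select a Pydantic schema for structured output.
    """

    def extract(self, text: str, entity_type: str) -> list: ...


class FakeExtractor:
    """Stand-in backend, used here only to show the call shape."""

    def extract(self, text: str, entity_type: str) -> list:
        return [{"name": "Example", "type": entity_type}]


def run(extractor: Extractor, text: str) -> list:
    # Callers depend only on the Protocol, not on a concrete backend.
    return extractor.extract(text, "people")
```

Because callers code against the Protocol, swapping cloud and local backends requires no changes at call sites.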
Extraction Caching
Extraction results are cached in sidecar JSON files keyed by:
- Content hash (SHA-256 of article text)
- Model name and temperature
- Prompt template hash
- Pydantic schema hash
Location: src/utils/extraction_cache.py
Cache hits skip expensive LLM calls entirely, providing 10-100x speedup on re-runs.
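A sketch of how such a key can be derived from the four factors above (the exact key format in src/utils/extraction_cache.py may differ):

```python
import hashlib


def cache_key(article_text: str, model: str, temperature: float,
              prompt_template: str, schema_json: str) -> str:
    """Deterministic cache key over everything that affects extraction output.

    Any change to content, model settings, prompt, or schema yields a new
    key, so stale cached results are never reused.
    """
    parts = [
        hashlib.sha256(article_text.encode()).hexdigest(),   # content hash
        model,                                               # model name
        str(temperature),                                    # temperature
        hashlib.sha256(prompt_template.encode()).hexdigest(),  # prompt hash
        hashlib.sha256(schema_json.encode()).hexdigest(),      # schema hash
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


k1 = cache_key("article text", "some-model", 0.0, "prompt", "{}")
k2 = cache_key("article text", "some-model", 0.0, "prompt", "{}")
```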
Stage 3: Quality Control & Retry
After extraction, each entity type goes through quality control checks. Severe issues trigger a single retry attempt with a repair hint. Location: src/engine/article_processor.py:153, src/utils/quality_controls.py
QC Checks
- Missing Required Fields: drops entities lacking mandatory fields (name, type, etc.)
- Within-Article Duplicates: deduplicates entities with identical normalized names
- Low-Quality Names: flags generic plurals like “defense departments” or “military bases”
- Zero Entities: triggers a retry if extraction returned an empty list for a relevant article
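The duplicate check boils down to keeping one entity per normalized name within an article. A sketch (the normalization rules here are illustrative):

```python
def normalize(name: str) -> str:
    """Illustrative normalization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def dedupe_entities(entities: list[dict]) -> list[dict]:
    """Keep the first entity for each normalized name within one article."""
    seen, kept = set(), []
    for entity in entities:
        key = normalize(entity["name"])
        if key not in seen:
            seen.add(key)
            kept.append(entity)
    return kept


ents = [{"name": "John Doe"}, {"name": "john  doe"}, {"name": "ACLU"}]
```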
Retry Logic
When severe QC flags are detected, the extractor retries once with a repair hint appended to the system prompt.
Retry trigger flags:
- zero_entities: extraction returned an empty list
- high_drop_rate: >30% of entities dropped due to missing fields
- many_duplicates: >20% duplicate names
- many_low_quality_names: >30% generic/plural names
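Applying the thresholds above to the QC counts can be sketched as (function name and count parameters are illustrative):

```python
def qc_flags(total: int, dropped: int, duplicates: int, low_quality: int) -> list[str]:
    """Derive retry-trigger flags from QC counts, using the thresholds above."""
    if total == 0:
        return ["zero_entities"]
    flags = []
    if dropped / total > 0.30:        # >30% dropped for missing fields
        flags.append("high_drop_rate")
    if duplicates / total > 0.20:     # >20% duplicate names
        flags.append("many_duplicates")
    if low_quality / total > 0.30:    # >30% generic/plural names
        flags.append("many_low_quality_names")
    return flags
```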
Repair Hints
Repair hints describe what went wrong on the first attempt:
- Zero Entities
- Low-Quality Names
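The retry-once-with-hint flow can be sketched as follows; the hint wording and the `extract_fn`/`qc_fn` signatures are hypothetical stand-ins for the real pipeline functions:

```python
REPAIR_HINTS = {  # hint wording is illustrative, not the project's actual prompts
    "zero_entities": "The previous attempt returned no entities; re-read the article and extract every named entity.",
    "many_low_quality_names": "Avoid generic plural names like 'defense departments'; prefer specific proper nouns.",
}


def extract_with_retry(extract_fn, qc_fn, system_prompt: str):
    """Retry at most once, with repair hints appended to the system prompt.

    extract_fn(prompt) -> entities and qc_fn(entities) -> flag list are
    hypothetical signatures standing in for the real pipeline functions.
    """
    entities = extract_fn(system_prompt)
    flags = [f for f in qc_fn(entities) if f in REPAIR_HINTS]
    if flags:
        hints = "\n".join(REPAIR_HINTS[f] for f in flags)
        entities = extract_fn(system_prompt + "\n\nRepair hints:\n" + hints)
    return entities


# Stub extractor: fails the first call, succeeds on the retry.
calls = []
def fake_extract(prompt):
    calls.append(prompt)
    return [] if len(calls) == 1 else [{"name": "John Doe"}]

result = extract_with_retry(
    fake_extract,
    lambda ents: [] if ents else ["zero_entities"],
    "You extract entities.",
)
```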
Stage 4: Entity Merging
The merger deduplicates entities across articles using an evidence-first cascade from cheap to expensive checks. Location: src/engine/mergers.py:110
Merge Decision Cascade
1. Exact key match
If the entity key (e.g., "John Doe" or ("ACLU", "legal")) already exists, merge immediately without further checks.
2. Lexical blocking (RapidFuzz)
Fast fuzzy string matching narrows candidates to ~50 entities (threshold: 60/100). Location: src/engine/mergers.py, using the RapidFuzz library.
3. Evidence text embeddings
Build evidence text from article mentions plus context windows, embed it, and compute cosine similarity against existing entities. Thresholds (from configs/guantanamo/config.yaml):
- People: 0.82
- Organizations: 0.78
- Locations: 0.80
- Events: 0.76
4. LLM match check (temperature=0)
Deterministic yes/no/uncertain decision using the article text plus the existing profile. Location: src/engine/match_checker.py
Evidence-First Similarity
Instead of embedding LLM-generated profile narratives, Hinbox embeds evidence text built from actual article mentions. Evidence parameters are configurable per domain.
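The embedding-similarity step reduces to a cosine comparison against the per-type thresholds listed above; a sketch (the embedding model itself is not shown, and the function names are illustrative):

```python
import math

# Per-type thresholds from configs/guantanamo/config.yaml, as listed above.
THRESHOLDS = {"people": 0.82, "organizations": 0.78,
              "locations": 0.80, "events": 0.76}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_embedding_match(vec_new: list[float], vec_existing: list[float],
                       entity_type: str) -> bool:
    """Accept a merge candidate when evidence-text similarity clears
    the per-type threshold."""
    return cosine(vec_new, vec_existing) >= THRESHOLDS[entity_type]
```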
Canonical Name Selection
When merging entities, Hinbox picks the best canonical name based on:
- Specificity: longer, more complete names (“Barack H. Obama” > “Barack Obama”)
- Frequency: Names appearing in more articles
- Officialness: Names matching known acronyms or formal titles
Location: src/utils/name_variants.py:score_canonical_name
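Combining the three signals above into one score might look like the following sketch; the weights and parameters are made up for illustration, and the real logic lives in src/utils/name_variants.py:score_canonical_name:

```python
def score_canonical_name(name: str, article_count: int,
                         known_official: set[str]) -> float:
    """Illustrative scoring: longer names, higher frequency, and known
    official names score higher (weights here are invented)."""
    specificity = len(name.split())                    # more tokens = more complete
    frequency = article_count                          # appearances across articles
    officialness = 10 if name in known_official else 0  # known acronym/formal title
    return specificity * 10 + frequency + officialness


candidates = {"Barack Obama": 12, "Barack H. Obama": 3}
best = max(candidates, key=lambda n: score_canonical_name(n, candidates[n], set()))
```

With these weights, the more specific variant wins even when the shorter name appears in more articles.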
Name Variants & Equivalence
Domains can define equivalence groups for known aliases.
Stage 5: Profile Generation
Once entities are merged, Hinbox generates (or updates) a versioned narrative profile for each entity. Location: src/engine/profiles.py
VersionedProfile
Profiles maintain a complete version history, enabling temporal analysis. Profile versioning is controlled by the ENABLE_PROFILE_VERSIONING constant in src/constants.py. When enabled, every profile update creates a new version with a timestamp and the triggering article ID.
Profile QC & Grounding
After profile generation, Hinbox runs a grounding verification batch job to ensure claims are supported by source articles. Location: src/utils/quality_controls.py:verify_profile_grounding
The grounding report includes:
- Citation extraction: parse [article_id] citations from profile text
- Claim verification: check each claim against source article text
- Grounding score: Percentage of verified citations (0.0-1.0)
- Well-Grounded Profile
- Poorly-Grounded Profile
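The citation-extraction and scoring steps can be sketched as below; real verification (src/utils/quality_controls.py) also checks each claim against the article text, which a regex alone cannot do:

```python
import re


def grounding_score(profile_text: str, articles: dict[str, str]) -> float:
    """Fraction of [article_id] citations that resolve to a known article.

    articles maps article_id -> article text (a simplification of the
    real verification, which also checks claim support).
    """
    citations = re.findall(r"\[([^\]]+)\]", profile_text)
    if not citations:
        return 0.0
    verified = sum(1 for c in citations if c in articles)
    return verified / len(citations)


profile = "Detained in 2002 [a1]. Released in 2008 [a9]."
score = grounding_score(profile, {"a1": "article text", "a2": "article text"})
```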
Pipeline Metrics
At the end of processing, the pipeline logs comprehensive statistics. Location: src/process_and_extract.py:366
Next Steps
Entity Types
Learn about entity structure, required fields, and tags
Domain Configuration
Configure prompts, thresholds, and categories for your research
System Architecture
Understand the producer-consumer model and sidecar files
Running the Pipeline
Process your first batch of historical sources