Core Design Principles
The system is built on three foundational principles:
Evidence-First
Cheap checks (lexical blocking, embeddings) run before expensive LLM calls
Producer-Consumer
Parallel extraction workers feed a single merge actor to avoid race conditions
Sidecar Files
Processing status and caches use separate JSON files instead of mutating source data
High-Level Architecture
Pipeline Components
1. Article Processor
The ArticleProcessor class orchestrates article-level operations across the pipeline.
Location: src/engine/article_processor.py:52
Key responsibilities:
- Domain-aware relevance checking
- Per-entity-type extraction orchestration
- Quality control and retry logic
- Progress metadata aggregation
2. Producer-Consumer Pipeline
The pipeline uses a ThreadPoolExecutor for parallel extraction while maintaining a single-threaded merge actor to eliminate race conditions.
Location: src/process_and_extract.py:667
Submit extraction work to thread pool
Multiple articles are processed in parallel (configured via extract_workers). Within each article, the 4 entity types can also extract concurrently (extract_per_article).
Consume results in article order
The main thread waits for each future in submission order, ensuring deterministic merge behavior.
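The submit-then-consume pattern described above can be sketched roughly as follows. This is a minimal illustration rather than Hinbox's actual code: `extract_article`, `merge_into`, and `run_pipeline` are hypothetical names, and only the `extract_workers` setting is taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_article(article: dict) -> dict:
    # Hypothetical stand-in for per-article extraction (LLM calls in practice).
    return {"id": article["id"], "entities": article["text"].split()}


def merge_into(store: list, result: dict) -> None:
    # Hypothetical merge actor; it is only ever called from the main thread,
    # so it needs no locking.
    store.append(result["id"])


def run_pipeline(articles: list, extract_workers: int = 4) -> list:
    merged: list = []
    with ThreadPoolExecutor(max_workers=extract_workers) as pool:
        # Producer side: submit every article to the pool up front.
        futures = [pool.submit(extract_article, a) for a in articles]
        # Consumer side: wait on futures in submission order, not completion
        # order, so merges happen in the same sequence on every run.
        for future in futures:
            merge_into(merged, future.result())
    return merged


articles = [{"id": i, "text": f"entity{i} other"} for i in range(5)]
merge_order = run_pipeline(articles)
```

Iterating over the list of futures, instead of using `as_completed`, trades a little throughput for a merge order that does not depend on which worker finishes first.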
Concurrency configuration is set per domain in configs/<domain>/config.yaml.
3. Entity Merger (Evidence-First)
The EntityMerger implements a cheap-to-expensive cascade to minimize LLM costs while maintaining high accuracy.
Location: src/engine/mergers.py:110
Merge Decision Cascade
Lexical blocking (RapidFuzz)
Fast string similarity (threshold: 60/100) narrows candidates to ~50 entities.
Embedding similarity
Cosine similarity on evidence text embeddings (threshold: 0.75-0.82 depending on entity type).
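The two cheap stages listed above could be chained along these lines. This is a sketch under stated assumptions, not the EntityMerger implementation: the standard library's `difflib` stands in for RapidFuzz so the example has no dependencies, the entity dictionaries and function names are hypothetical, and 0.75 is simply the low end of the threshold range quoted above.

```python
import math
from difflib import SequenceMatcher


def lexical_score(a: str, b: str) -> float:
    # Stand-in for a RapidFuzz-style score on a 0-100 scale (difflib, not RapidFuzz).
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine(u: list, v: list) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def candidate_matches(new: dict, existing: list, lexical_threshold: float = 60.0,
                      embedding_threshold: float = 0.75, max_candidates: int = 50) -> list:
    # Stage 1: lexical blocking keeps only the best-scoring names.
    scored = [(lexical_score(new["name"], e["name"]), e) for e in existing]
    blocked = sorted(
        (s for s in scored if s[0] >= lexical_threshold),
        key=lambda s: s[0],
        reverse=True,
    )[:max_candidates]
    # Stage 2: embedding similarity runs only on the blocked candidates.
    return [e for _, e in blocked
            if cosine(new["embedding"], e["embedding"]) >= embedding_threshold]


# Toy two-dimensional "embeddings" (hypothetical data).
existing = [
    {"name": "John Smith", "embedding": [1.0, 0.0]},
    {"name": "Jon Smith", "embedding": [0.0, 1.0]},
    {"name": "Acme Corporation", "embedding": [1.0, 0.0]},
]
new = {"name": "John Smith", "embedding": [0.9, 0.1]}
matches = candidate_matches(new, existing)
```

In this toy data, "Acme Corporation" is dropped by lexical blocking despite its matching embedding, and "Jon Smith" passes blocking but fails the embedding check, so only the survivors would reach any more expensive step.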
4. Sidecar Caching System
To avoid mutating source Parquet files repeatedly, Hinbox uses sidecar JSON files for processing state and extraction caches.
Processing Status
data/<domain>/entities/processing_status.json tracks which articles have been processed and their content hashes for skip-if-unchanged detection.
Extraction Cache
data/<domain>/entities/cache/extractions/<hash>.json stores extraction results keyed by content hash, model, prompt, schema, and temperature.
Location: src/utils/processing_status.py, src/utils/extraction_cache.py
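One way to derive such a cache key, and to perform the skip-if-unchanged check, is sketched below. The use of SHA-256, the JSON payload layout, the `content_hash` field name in the status entry, and all function names are assumptions made for illustration; the actual modules may differ.

```python
import hashlib
import json


def content_hash(text: str) -> str:
    # Assumed hashing scheme: SHA-256 over the UTF-8 article text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def extraction_cache_key(text: str, model: str, prompt: str,
                         schema: dict, temperature: float) -> str:
    # Every input that can change the extraction result is part of the key.
    payload = {
        "content_hash": content_hash(text),
        "model": model,
        "prompt": prompt,
        "schema": schema,
        "temperature": temperature,
    }
    # sort_keys gives a canonical serialization, so equal inputs give equal keys.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unchanged(status: dict, article_id: str, text: str) -> bool:
    # Skip-if-unchanged: the article was seen before and its hash still matches.
    entry = status.get(article_id)
    return entry is not None and entry.get("content_hash") == content_hash(text)


# Hypothetical sidecar contents and cache keys.
status = {"a1": {"content_hash": content_hash("original text")}}
key_a = extraction_cache_key("original text", "model-x", "prompt v1", {"type": "object"}, 0.0)
key_b = extraction_cache_key("original text", "model-x", "prompt v1", {"type": "object"}, 0.7)
```

Because the key covers the model, prompt, schema, and temperature as well as the content, changing any one of them produces a different `<hash>.json` filename rather than silently reusing a stale result.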
The extraction cache version is controlled in the domain config.
Module Organization
Hinbox's codebase is organized by responsibility.
Privacy Mode
When the --local flag is used, Hinbox enforces complete privacy.
Location: src/process_and_extract.py:786
Next Steps
Processing Pipeline
Learn about the 5 stages: relevance → extraction → QC → merging → profiles
Entity Types
Understand entity structure and how types are defined per domain
Domain Configuration
Configure thresholds, prompts, and entity categories for your research domain
API Reference
Explore the programmatic API for custom integrations