Hinbox is a flexible, domain-configurable entity extraction system designed for historians and researchers. It processes historical documents, academic papers, news articles, and book chapters to extract structured information about people, organizations, locations, and events. Originally designed for Guantánamo Bay media coverage analysis, Hinbox now supports any historical or research domain through a simple configuration system.

Key features

Research-focused

Built specifically for historians, academics, and researchers working with large document collections

Domain-agnostic

Configure for any historical period, region, or research topic through simple YAML and Markdown files

Multiple AI models

Supports both cloud models (Gemini by default, via LiteLLM) and local models (Ollama), with an optional privacy mode

Smart deduplication

RapidFuzz lexical blocking + embedding similarity with per-entity-type thresholds

Profile versioning

Track how entity profiles evolve as new sources are processed

Citation-backed claims

Profile grounding verification ensures all claims are supported by source articles

What you can extract

Hinbox extracts four core entity types from your historical sources:
  • People - Individuals mentioned in your sources with roles, affiliations, and biographical details
  • Organizations - Groups, institutions, companies, and agencies
  • Locations - Places, regions, facilities, and geographic entities
  • Events - Historical events, incidents, meetings, and significant occurrences
Each entity type is fully customizable through domain configuration files. You define the categories and tags that make sense for your research.

How it works

1. Configure your domain

Create a research domain with custom entity types, extraction prompts, and data paths. No Python coding required.

2. Process your sources

Feed in historical documents in Parquet format. Hinbox extracts entities using AI models with automatic quality controls.

3. Smart merging

Entities are deduplicated using lexical blocking and embedding similarity. A second-stage LLM arbitrates ambiguous matches.

4. Explore results

Browse extracted entities in the FastHTML web interface with confidence badges, aliases, and version history.

Advanced capabilities

Extraction quality controls

Deterministic QC validates extraction output with automatic retry when severe issues are detected:
  • Zero entities extracted
  • High drop rates during processing
  • Missing required fields
  • Invalid name normalization
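
The four checks above can be pictured as a small deterministic validator wrapped in a retry loop. The following is a minimal sketch, not Hinbox's actual implementation: the function names, the `name`/`type` required fields, the 50% drop-rate limit, and the whitespace-based normalization rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical settings, chosen for illustration only.
REQUIRED_FIELDS = ("name", "type")
MAX_DROP_RATE = 0.5


@dataclass
class QCReport:
    issues: list[str] = field(default_factory=list)

    @property
    def severe(self) -> bool:
        # In this sketch every detected issue counts as severe.
        return bool(self.issues)


def run_qc(raw_count: int, entities: list[dict]) -> QCReport:
    """Check one extraction result against the four listed conditions."""
    report = QCReport()
    if not entities:
        report.issues.append("zero_entities")
        return report
    # Drop rate: share of raw model outputs discarded during processing.
    if raw_count > 0 and (raw_count - len(entities)) / raw_count > MAX_DROP_RATE:
        report.issues.append("high_drop_rate")
    if any(not e.get(f) for e in entities for f in REQUIRED_FIELDS):
        report.issues.append("missing_required_field")
    # Assumed normalization rule: no leading, trailing, or repeated spaces.
    for entity in entities:
        name = entity.get("name") or ""
        if name != " ".join(name.split()):
            report.issues.append("invalid_name_normalization")
            break
    return report


def extract_with_retry(extract, max_attempts: int = 2):
    """Call `extract()` again while QC reports severe issues."""
    for attempt in range(1, max_attempts + 1):
        raw_count, entities = extract()
        report = run_qc(raw_count, entities)
        if not report.severe:
            break
    return entities, report, attempt
```

Here `extract` is any zero-argument callable returning `(raw_count, entities)`. Because the checks are pure functions of the output, the same extraction result always yields the same QC verdict.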

Merge dispute agent

Second-stage LLM arbitration for ambiguous entity matches near similarity thresholds. The dispute agent handles gray-band matches: candidate pairs whose similarity score sits too close to the threshold for a confident automatic decision.
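
To make the flow concrete, here is a rough sketch of how lexical blocking, embedding similarity, and a gray band could fit together. It is an illustration under stated assumptions rather than Hinbox's code: the standard library's `difflib.SequenceMatcher` stands in for RapidFuzz, embeddings are plain lists of floats, and the thresholds, band width, and function names are hypothetical.

```python
import math
from difflib import SequenceMatcher

# Hypothetical per-entity-type settings: (merge threshold, gray-band half-width).
THRESHOLDS = {"people": (0.85, 0.05), "organizations": (0.80, 0.05)}
LEXICAL_BLOCK_MIN = 0.5  # assumed cut-off for the cheap first pass


def lexical_score(a: str, b: str) -> float:
    """Cheap string similarity in [0, 1]; a stand-in for a RapidFuzz ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def decide(entity_type: str, name_a: str, vec_a: list[float],
           name_b: str, vec_b: list[float]) -> str:
    """Return 'skip', 'merge', 'keep_separate', or 'dispute' for a pair."""
    # Stage 1: lexical blocking discards pairs whose names barely overlap.
    if lexical_score(name_a, name_b) < LEXICAL_BLOCK_MIN:
        return "skip"
    # Stage 2: embedding similarity against the entity type's own threshold.
    threshold, band = THRESHOLDS[entity_type]
    similarity = cosine(vec_a, vec_b)
    # Scores inside the gray band go to the second-stage LLM arbiter.
    if abs(similarity - threshold) <= band:
        return "dispute"
    return "merge" if similarity > threshold else "keep_separate"
```

Only pairs returning `"dispute"` would need an LLM call, which keeps arbitration cost proportional to the number of genuinely ambiguous matches.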

5-layer canonical name selection

Deterministic scoring picks the best display name across aliases:
  • Penalizes acronyms and generic phrases
  • Prefers full, descriptive names
  • Handles merge scenarios intelligently
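
As a rough illustration of deterministic name scoring, the sketch below ranks aliases on three of the criteria mentioned above. It does not reproduce the five layers themselves, which are not described here; the generic-phrase list, the acronym pattern, and the function names are assumptions.

```python
import re

# Hypothetical generic phrases; a real domain would supply its own list.
GENERIC_PHRASES = {"the agency", "the department", "the committee", "officials"}


def name_score(name: str) -> tuple[int, int, int]:
    """Score an alias; higher tuples sort as better display names."""
    cleaned = " ".join(name.split())
    is_generic = cleaned.lower() in GENERIC_PHRASES
    # Assumed acronym shape: 2 to 10 capitals, digits, dots, '&', or hyphens.
    is_acronym = bool(re.fullmatch(r"[A-Z0-9.&-]{2,10}", cleaned))
    word_count = len(cleaned.split())
    # Compared in order: non-generic first, then non-acronym, then longer names.
    return (0 if is_generic else 1, 0 if is_acronym else 1, word_count)


def pick_canonical(aliases: list[str]) -> str:
    """Pick the best-scoring alias; ties go to the alphabetically first one."""
    return max(sorted(aliases), key=name_score)
```

For example, `pick_canonical(["ICRC", "the agency", "International Committee of the Red Cross"])` selects the full name, since it is neither generic nor an acronym and has the most words. Sorting before scoring makes the result independent of the order in which aliases were merged.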

Extraction caching

Persistent sidecar cache avoids redundant LLM calls:
  • Keyed on content hash, model, prompt, schema, and temperature
  • Skips re-processing unchanged articles
  • Configurable cache version for invalidation
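
One way to build such a key is to hash a canonical JSON payload of all the inputs that affect the model's output. This is a minimal sketch assuming SHA-256 and JSON serialization; the function name, parameter names, and `CACHE_VERSION` constant are hypothetical, not Hinbox's API.

```python
import hashlib
import json

CACHE_VERSION = 1  # assumed knob: bump it to invalidate every existing entry


def cache_key(article_text: str, model: str, prompt: str, schema: dict,
              temperature: float, cache_version: int = CACHE_VERSION) -> str:
    """Derive a stable key from everything that influences the extraction."""
    payload = {
        "content_hash": hashlib.sha256(article_text.encode("utf-8")).hexdigest(),
        "model": model,
        "prompt": prompt,
        "schema": schema,
        "temperature": temperature,
        "cache_version": cache_version,
    }
    # sort_keys gives a canonical serialization, so equal inputs hash equally.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

An unchanged article processed with the same model, prompt, schema, and temperature maps to the same key and can be served from the cache. Changing any one of those inputs, or the cache version, produces a different key and forces a fresh LLM call.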

Privacy mode

Use the --local flag to enforce local-only processing:
  • Local embeddings only
  • Disables all LLM telemetry callbacks
  • Perfect for sensitive historical research

Get started

Installation

Install Hinbox with uv and set up your environment

Quick start

Create your first domain and process historical sources

Configuration

Learn how to configure domains for your research

API Reference

Explore the processing pipeline and modules

Built with

Hinbox is built with modern Python tools and libraries:
  • Python 3.12+ - Core language
  • Pydantic - Schema validation and dynamic model generation
  • FastHTML - Web interface with “Archival Elegance” design
  • LiteLLM - Unified API for cloud and local models
  • Instructor - Structured LLM output
  • RapidFuzz - Fast lexical blocking
  • Jina Embeddings - Cloud embedding generation
  • Rich - Beautiful terminal logging

Example use cases

History of food in Palestine

Extract farmers, traders, cookbook authors, agricultural cooperatives, markets, harvest events, and recipe documentation from historical texts.

Soviet-Afghan War (1980s)

Identify military leaders, intelligence agencies, battles, refugee movements, and border crossings from news archives and diplomatic cables.

Medieval trade networks

Discover merchants, trading companies, trade routes, market fairs, and diplomatic missions from historical records.

Hinbox was originally developed for analyzing Guantánamo Bay media coverage but is now fully domain-agnostic. You can configure it for any historical period, region, or research topic.
