Hinbox extracts and manages four core entity types, each with its own schema, keying strategy, and merge behavior. Entity definitions are domain-specific and configured in configs/<domain>/categories/.

The Four Entity Types

People

Individuals mentioned in source documents (detainees, officials, lawyers, journalists, etc.)

Organizations

Institutions and groups (military units, NGOs, government agencies, media outlets)

Locations

Geographic places and facilities (detention centers, military bases, cities, countries)

Events

Temporal occurrences (legal proceedings, protests, transfers, policy changes)

Entity Keys & Deduplication

Each entity type uses a different key strategy for deduplication:
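| Entity Type   | Key Fields        | Key Type | Example                                    |
|---------------|-------------------|----------|--------------------------------------------|
| People        | name              | str      | "Carol Rosenberg"                          |
| Organizations | name, type        | tuple    | ("ACLU", "legal")                          |
| Locations     | name, type        | tuple    | ("Guantanamo Bay", "detention_facility")   |
| Events        | title, start_date | tuple    | ("Rasul v. Bush", "2004-06-28")            |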
Location: src/engine/mergers.py:114
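Tuple keys let the same base name represent different entities when their types differ, and they keep similarly named entities of different kinds apart. For example:
  • Same name, different type: ("Washington", "city") vs. ("Washington", "state")
  • Similar name, different entity: ("Red Cross", "humanitarian") vs. ("Red Cross Hospital", "location")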
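The key strategy in the table can be sketched as a small helper. This is a minimal illustration, not the code in src/engine/mergers.py: the `KEY_FIELDS` mapping and the `entity_key` function are hypothetical names, and the sketch assumes entities are plain dictionaries.

```python
from typing import Any, Mapping, Union

EntityKey = Union[str, tuple[str, str]]

# Hypothetical mapping from entity type to its key fields (taken from the table).
KEY_FIELDS: dict[str, tuple[str, ...]] = {
    "people": ("name",),
    "organizations": ("name", "type"),
    "locations": ("name", "type"),
    "events": ("title", "start_date"),
}


def entity_key(entity_type: str, entity: Mapping[str, Any]) -> EntityKey:
    """Build the deduplication key for one entity record."""
    values = tuple(entity[field] for field in KEY_FIELDS[entity_type])
    # Single-field keys stay plain strings; multi-field keys become tuples.
    return values[0] if len(values) == 1 else values


# Usage: the two "Washington" records get distinct keys, so both are kept.
index = {
    entity_key("locations", {"name": "Washington", "type": "city"}): "record A",
    entity_key("locations", {"name": "Washington", "type": "state"}): "record B",
}
```

Because tuples are hashable, the keys can be used directly as dictionary keys or set members when checking whether an entity already exists.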

Entity Schema Structure

All entities share a common base structure with type-specific extensions:

Common Fields (All Types)

  • name (string, required): Primary identifier for people, organizations, and locations (events use title instead)
  • type (string, required): Entity type from domain-specific category definitions
  • description (string): Brief description of the entity extracted from source text
  • alternative_names (array): List of alternative names, aliases, or abbreviations (e.g., ["DoD", "Pentagon"] for Department of Defense)
  • profile (object): Versioned narrative profile with sources and confidence
  • articles (array): List of article IDs mentioning this entity
  • article_count (integer): Total number of articles mentioning this entity
  • first_seen (string): ISO 8601 timestamp of first extraction
  • last_updated (string): ISO 8601 timestamp of most recent update

Embedding Fields (Search & Profiles)

  • search_embedding (array): Evidence-derived embedding vector for similarity search (preferred for merge decisions)
  • search_embedding_model (string): Model used for search embedding (e.g., "jina_ai/jina-embeddings-v3")
  • search_embedding_fingerprint (string): Hash fingerprint of embedding model + parameters for compatibility checks
  • profile_embedding (array): LLM-narrative-derived embedding (fallback for older entities)
Hinbox prefers search_embedding over profile_embedding for merge decisions because evidence text provides apples-to-apples comparison between new and existing entities. See Evidence-First Similarity for details.
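The preference order can be illustrated with a short sketch. It is an assumption-laden example rather than the Hinbox implementation: `pick_embedding` and `cosine_similarity` are hypothetical helpers, the fingerprint comparison is one plausible reading of "compatibility checks", and cosine similarity is assumed as the comparison metric.

```python
import math
from typing import Any, Mapping, Optional, Sequence


def pick_embedding(
    entity: Mapping[str, Any], expected_fingerprint: str
) -> Optional[Sequence[float]]:
    """Prefer the evidence-derived vector; fall back to the profile vector."""
    search = entity.get("search_embedding")
    # Assumption: a search embedding is only usable when its fingerprint
    # matches the embedding model and parameters currently in use.
    if search and entity.get("search_embedding_fingerprint") == expected_fingerprint:
        return search
    return entity.get("profile_embedding") or None


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

An older entity that carries only profile_embedding still yields a vector under this sketch, which matches the "fallback for older entities" role described above.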

Type-Specific Fields

People

  • person_tags (array): Domain-specific tags (e.g., ["civil_rights", "lawyer"] in Guantanamo domain)
  • roles (array): Historical roles extracted from articles (e.g., ["Defense Attorney", "ACLU Lawyer"])
{
  "name": "Clive Stafford Smith",
  "type": "lawyer",
  "person_tags": ["civil_rights", "defense"],
  "description": "British-American attorney representing Guantanamo detainees",
  "alternative_names": ["Stafford Smith"],
  "roles": ["Defense Attorney", "Reprieve Founder"],
  "articles": ["mh_2008_03_12_015", "mh_2010_07_22_031"],
  "article_count": 47,
  "profile": {
    "text": "Clive Stafford Smith is a British-American attorney...",
    "confidence": 0.95,
    "sources": ["mh_2008_03_12_015", "mh_2010_07_22_031"]
  }
}

Organizations

  • acronym (string): Common acronym (e.g., "ACLU", "DoD", "JTF-GTMO")
  • parent_organization (string): Name of parent organization if applicable
{
  "name": "American Civil Liberties Union",
  "type": "legal",
  "acronym": "ACLU",
  "description": "Civil rights advocacy organization",
  "alternative_names": ["ACLU"],
  "articles": ["mh_2007_05_10_008", "mh_2011_09_14_022"],
  "article_count": 128,
  "profile": {
    "text": "The American Civil Liberties Union (ACLU) is a nonprofit...",
    "confidence": 0.98,
    "sources": ["mh_2007_05_10_008", "mh_2011_09_14_022"]
  }
}

Locations

  • coordinates (object): Geographic coordinates if available
  • country (string): Country name for location disambiguation
{
  "name": "Guantanamo Bay",
  "type": "detention_facility",
  "description": "U.S. Naval base and detention facility in Cuba",
  "alternative_names": ["GTMO", "Naval Station Guantanamo Bay"],
  "country": "Cuba",
  "coordinates": {
    "latitude": 19.9073,
    "longitude": -75.0918
  },
  "articles": ["mh_2002_01_11_001", "mh_2023_12_15_891"],
  "article_count": 1247
}

Events

  • title (string, required): Event title (part of composite key with start_date)
  • start_date (string, required): ISO 8601 date when event started (part of composite key)
  • end_date (string): ISO 8601 date when event ended (for ongoing events, may be null)
  • event_tags (array): Domain-specific tags (e.g., ["legal_challenge", "habeas_corpus"])
  • participants (array): Names of people or organizations involved in the event
{
  "title": "Rasul v. Bush",
  "type": "legal",
  "event_tags": ["legal_challenge", "habeas_corpus"],
  "start_date": "2004-06-28",
  "end_date": "2004-06-28",
  "description": "Supreme Court ruling on Guantanamo detainee rights",
  "alternative_titles": ["Rasul v. Bush Supreme Court Decision"],
  "participants": ["Shafiq Rasul", "George W. Bush"],
  "articles": ["mh_2004_06_28_042", "mh_2004_06_29_051"],
  "article_count": 23
}

Dynamic Pydantic Models

Entity schemas are dynamically generated from domain configurations at runtime using src/dynamic_models.py. This allows each research domain to define custom types and tags.
Location: src/dynamic_models.py
from pydantic import BaseModel, Field
from src.config_loader import DomainConfig

def create_person_model(domain: str) -> type[BaseModel]:
    """Dynamically create Person model from domain config."""
    config = DomainConfig(domain)
    person_types = config.load_categories("people")

    # Extract valid person_types and person_tags from YAML
    valid_types = list(person_types["person_types"].keys())
    valid_tags = list(person_types.get("person_tags", {}).keys())

    class Person(BaseModel):
        name: str = Field(..., description="Full name of the person")
        type: str = Field(..., description=f"One of: {', '.join(valid_types)}")
        # Built-in generic list[str] (Python 3.9+) needs no typing.List import
        person_tags: list[str] = Field(default_factory=list)
        description: str = Field(default="")
        # ... other fields

    return Person
This design enables zero-code domain creation: researchers can define new entity types and tags purely through YAML configuration files.

Entity Storage (Parquet)

Entities are stored as one Parquet file per type in the domain output directory:
data/<domain>/entities/
├── people.parquet
├── organizations.parquet
├── locations.parquet
└── events.parquet
Parquet provides:
  • Fast columnar reads for analytics and filtering
  • Schema validation ensuring data consistency
  • Compression reducing disk usage by 10-50x vs. JSON
  • Append-only writes for incremental processing
Location: src/utils/file_ops.py:write_entities_table

Entity Lifecycle

  1. Extraction: LLM extracts entity from article text using domain-specific prompt
  2. Quality Control: Validates required fields, deduplicates within article, flags low-quality names
  3. Merge Check: Evidence-first cascade determines if entity matches existing record
  4. Profile Generation: Creates or updates versioned narrative profile with citations
  5. Profile Grounding: Batch verification ensures profile claims are supported by source articles
  6. Parquet Write: Atomic write of all entities to respective Parquet files
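The final lifecycle step calls for an atomic write. The source does not say how Hinbox achieves this, so the following is only a sketch of one common approach: write to a temporary file in the same directory, then rename it over the target. `atomic_write_bytes` is a hypothetical helper, and `payload` stands in for an already serialized Parquet table.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, payload: bytes) -> None:
    """Replace `path` with `payload` so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the target directory so the rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # swaps the file in as a single step
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

With this pattern, a crash in the middle of a write leaves the previous people.parquet (or other entity file) untouched instead of a truncated one.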

Name Handling & Canonicalization

Hinbox handles name variants and aliases through three mechanisms: configured equivalence groups, acronym detection, and canonical name scoring.

Equivalence Groups

Define known aliases in domain config:
dedup:
  name_variants:
    organizations:
      equivalence_groups:
        - ["Department of Defense", "Defense Department", "DoD", "Pentagon"]
        - ["Joint Task Force Guantanamo", "JTF-GTMO", "JTF GTMO"]
Location: configs/<domain>/config.yaml, src/utils/name_variants.py
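One way to use such groups at lookup time is to flatten them into an alias index. The sketch below is illustrative only: `build_alias_index` and `resolve_name` are hypothetical names, and both treating the first member of each group as the canonical form and matching case-insensitively are assumptions, not documented Hinbox behavior.

```python
from typing import Iterable


def build_alias_index(groups: Iterable[list[str]]) -> dict[str, str]:
    """Map every alias (case-insensitively) to the first name in its group."""
    index: dict[str, str] = {}
    for group in groups:
        if not group:
            continue
        canonical = group[0]  # assumption: first entry is the preferred form
        for alias in group:
            index[alias.strip().casefold()] = canonical
    return index


def resolve_name(name: str, index: dict[str, str]) -> str:
    """Return the group's canonical name, or the input if it is unknown."""
    return index.get(name.strip().casefold(), name)


# Usage with the groups from the config above.
alias_index = build_alias_index([
    ["Department of Defense", "Defense Department", "DoD", "Pentagon"],
    ["Joint Task Force Guantanamo", "JTF-GTMO", "JTF GTMO"],
])
```

Names that appear in no group pass through unchanged, so the lookup is safe to apply to every extracted entity.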

Acronym Detection

Automatically detects when a name is an acronym form:
from src.utils.name_variants import is_acronym_form, compute_acronym

is_acronym_form("ACLU", "American Civil Liberties Union")  # True
is_acronym_form("DoD", "Department of Defense")  # True

compute_acronym("Department of Defense")  # "DoD"
compute_acronym("Joint Task Force Guantanamo")  # "JTF-G"
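For readers who want a feel for how acronym matching can work, here is a simplified, self-contained sketch. It is not the logic in src/utils/name_variants.py: `looks_like_acronym` and the `STOPWORDS` set are hypothetical, and the rule (compare the candidate against word initials, with and without small connecting words) is an assumption.

```python
import re

# Hypothetical list of connecting words that acronyms often skip.
STOPWORDS = frozenset({"of", "the", "and", "for", "in", "on", "at", "to"})


def looks_like_acronym(short: str, full: str) -> bool:
    """Check whether `short` spells the word initials of `full`."""
    letters = re.sub(r"[^a-z0-9]", "", short.lower())
    words = re.findall(r"[A-Za-z0-9]+", full)
    if len(letters) < 2 or len(words) < 2:
        return False
    all_initials = "".join(w[0].lower() for w in words)
    content_initials = "".join(
        w[0].lower() for w in words if w.lower() not in STOPWORDS
    )
    # Accept either form: "DoD" keeps the "o" of "of", other acronyms drop it.
    return letters in (all_initials, content_initials)
```

This sketch does not handle partial abbreviations such as "JTF-GTMO", where one word contributes several letters; cases like that can be covered by an explicit equivalence group.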

Canonical Name Scoring

When multiple name variants exist, the best canonical name is selected based on:
  1. Specificity: Longer, more complete names
  2. Frequency: Number of article mentions
  3. Acronym avoidance: Prefer full names over acronyms for primary identifier
Location: src/utils/name_variants.py:score_canonical_name
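The three criteria can be combined into a single score. The following is a rough sketch, not the actual score_canonical_name function: the helper names, the weights, the acronym heuristic, and the use of a logarithm to dampen mention counts are all assumptions chosen for illustration.

```python
import math


def is_probable_acronym(name: str) -> bool:
    """Heuristic: one short token containing at least two capital letters."""
    token = name.strip().replace("-", "").replace(".", "")
    capitals = sum(1 for ch in token if ch.isupper())
    return " " not in name.strip() and len(token) <= 8 and capitals >= 2


def score_name(name: str, mention_count: int) -> float:
    """Higher is better: reward specificity and frequency, penalize acronyms."""
    specificity = len(name.split())               # more words = more complete
    frequency = math.log1p(max(mention_count, 0))  # dampen very common variants
    penalty = 5.0 if is_probable_acronym(name) else 0.0
    return 2.0 * specificity + frequency - penalty


def pick_canonical(mentions: dict[str, int]) -> str:
    """Choose the best-scoring variant; ties break alphabetically."""
    return max(mentions, key=lambda n: (score_name(n, mentions[n]), n))


# Usage: the full name wins even though "DoD" is mentioned most often.
best = pick_canonical({"Department of Defense": 40, "DoD": 120, "Pentagon": 60})
```

Damping the frequency term keeps a heavily used acronym from outranking the full name, which reflects the stated preference for full names as the primary identifier.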

Next Steps

Domain Configuration

Define custom entity types, tags, and prompts for your research domain

Processing Pipeline

Learn how entities flow through the 5-stage pipeline

System Architecture

Understand the producer-consumer model and merge cascade

API Reference

Programmatic access to entity data and pipeline stages
