Entity Types & Structure

Hinbox extracts and manages four core entity types, each with its own schema, keying strategy, and merge behavior. Entity definitions are domain-specific and configured in configs/<domain>/categories/.

The Four Entity Types

People

Individuals mentioned in source documents (detainees, officials, lawyers, journalists, etc.)

Organizations

Institutions and groups (military units, NGOs, government agencies, media outlets)

Locations

Geographic places and facilities (detention centers, military bases, cities, countries)

Events

Temporal occurrences (legal proceedings, protests, transfers, policy changes)

Entity Keys & Deduplication

Each entity type uses a different key strategy for deduplication:

Entity Type	Key Fields	Key Type	Example
People	`name`	`str`	`"Carol Rosenberg"`
Organizations	`name`, `type`	`tuple`	`("ACLU", "legal")`
Locations	`name`, `type`	`tuple`	`("Guantanamo Bay", "detention_facility")`
Events	`title`, `start_date`	`tuple`	`("Rasul v. Bush", "2004-06-28")`

Location: src/engine/mergers.py:114

Tuple keys let the same base name represent different entities when the types differ, and they keep similarly named entities of different types apart. For example:

("Washington", "city") vs. ("Washington", "state")
("Red Cross", "humanitarian") vs. ("Red Cross Hospital", "location")
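The key strategy in the table above can be sketched as follows. This is a minimal illustration, not the code at src/engine/mergers.py: the `KEY_FIELDS` mapping and the `entity_key` helper are hypothetical names introduced here, and entities are assumed to be plain dictionaries.

```python
from typing import Any, Dict, Tuple, Union

# Key fields per entity type, copied from the table above.
KEY_FIELDS: Dict[str, Tuple[str, ...]] = {
    "people": ("name",),
    "organizations": ("name", "type"),
    "locations": ("name", "type"),
    "events": ("title", "start_date"),
}

EntityKey = Union[str, Tuple[str, ...]]


def entity_key(entity_type: str, entity: Dict[str, Any]) -> EntityKey:
    """Build the deduplication key for one entity record (hypothetical helper)."""
    fields = KEY_FIELDS[entity_type]
    values = tuple(entity[field] for field in fields)
    # Single-field keys are plain strings; multi-field keys are tuples.
    return values[0] if len(values) == 1 else values


# Two locations that share a name but differ in type get distinct keys.
city = {"name": "Washington", "type": "city"}
state = {"name": "Washington", "type": "state"}
index = {entity_key("locations", e): e for e in (city, state)}
```

Because the type is part of the key, `index` holds both Washington records instead of one overwriting the other.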

Entity Schema Structure

All entities share a common base structure with type-specific extensions:

Common Fields (All Types)

`name` (string, required): Primary identifier for people, organizations, and locations; events are identified by `title` instead

`type` (string, required): Entity type from domain-specific category definitions

`description` (string): Brief description of the entity extracted from source text

`alternative_names` (array): List of alternative names, aliases, or abbreviations (e.g., ["DoD", "Pentagon"] for Department of Defense)

`profile` (object): Versioned narrative profile with sources and confidence

Profile structure:

{
  "text": "Comprehensive narrative about the entity with [article_id] citations...",
  "tags": ["civil_rights", "legal"],
  "confidence": 0.92,
  "sources": ["mh_2024_01_15_001", "mh_2024_01_16_042"]
}

`articles` (array): List of article IDs mentioning this entity

`article_count` (integer): Total number of articles mentioning this entity

`first_seen` (string): ISO 8601 timestamp of first extraction

`last_updated` (string): ISO 8601 timestamp of most recent update

Embedding Fields (Search & Profiles)

`search_embedding` (array): Evidence-derived embedding vector for similarity search (preferred for merge decisions)

`search_embedding_model` (string): Model used for search embedding (e.g., "jina_ai/jina-embeddings-v3")

`search_embedding_fingerprint` (string): Hash fingerprint of embedding model + parameters for compatibility checks

`profile_embedding` (array): LLM-narrative-derived embedding (fallback for older entities)

Hinbox prefers search_embedding over profile_embedding for merge decisions because both the new and the existing entity are then embedded from the same kind of input, source evidence text, so similarity scores compare like with like. See Evidence-First Similarity for details.
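The preference order can be sketched as below. This is an illustration under stated assumptions, not Hinbox's merge code: `pick_embedding` and `cosine_similarity` are hypothetical helpers, and falling back to `profile_embedding` when the fingerprint does not match is an assumed policy.

```python
import math
from typing import Any, Dict, List, Optional


def pick_embedding(entity: Dict[str, Any], fingerprint: str) -> Optional[List[float]]:
    """Prefer a compatible search_embedding, else fall back to profile_embedding."""
    search = entity.get("search_embedding")
    # Assumption: vectors are only comparable when the fingerprints match.
    if search and entity.get("search_embedding_fingerprint") == fingerprint:
        return search
    return entity.get("profile_embedding")  # None if the entity has neither


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


existing = {
    "search_embedding": [1.0, 0.0],
    "search_embedding_fingerprint": "fp-1",  # made-up fingerprint value
    "profile_embedding": [0.0, 1.0],
}
legacy = {"profile_embedding": [0.0, 1.0]}  # older entity without search_embedding
```

With these records, `pick_embedding(existing, "fp-1")` returns the search vector, while `pick_embedding(legacy, "fp-1")` returns the profile vector.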

Type-Specific Fields

People

`person_tags` (array): Domain-specific tags (e.g., ["civil_rights", "lawyer"] in Guantanamo domain)

`roles` (array): Historical roles extracted from articles (e.g., ["Defense Attorney", "ACLU Lawyer"])

Example: Person Entity

{
  "name": "Clive Stafford Smith",
  "type": "lawyer",
  "person_tags": ["civil_rights", "defense"],
  "description": "British-American attorney representing Guantanamo detainees",
  "alternative_names": ["Stafford Smith"],
  "roles": ["Defense Attorney", "Reprieve Founder"],
  "articles": ["mh_2008_03_12_015", "mh_2010_07_22_031"],
  "article_count": 47,
  "profile": {
    "text": "Clive Stafford Smith is a British-American attorney...",
    "confidence": 0.95,
    "sources": ["mh_2008_03_12_015", "mh_2010_07_22_031"]
  }
}

Category Definition (people.yaml)

person_types:
  lawyer:
    description: "Attorneys, legal representatives, and other legal professionals"
    examples: ["Clive Stafford Smith", "Gitanjali Gutierrez", "David Remes"]
  
  detainee:
    description: "A person who is or was detained at Guantánamo Bay"
    examples: ["Mohamedou Ould Slahi", "David Hicks"]

person_tags:
  civil_rights:
    description: "People involved in civil rights advocacy or organizations"
    examples: ["ACLU lawyers", "Human rights activists"]
  
  defense:
    description: "People involved in defense or military defense roles"
    examples: ["Defense attorneys", "Military defense counsel"]

Organizations

`acronym` (string): Common acronym (e.g., "ACLU", "DoD", "JTF-GTMO")

`parent_organization` (string): Name of parent organization if applicable

Example: Organization Entity

{
  "name": "American Civil Liberties Union",
  "type": "legal",
  "acronym": "ACLU",
  "description": "Civil rights advocacy organization",
  "alternative_names": ["ACLU"],
  "articles": ["mh_2007_05_10_008", "mh_2011_09_14_022"],
  "article_count": 128,
  "profile": {
    "text": "The American Civil Liberties Union (ACLU) is a nonprofit...",
    "confidence": 0.98,
    "sources": ["mh_2007_05_10_008", "mh_2011_09_14_022"]
  }
}

Category Definition (organizations.yaml)

organization_types:
  legal:
    description: "Legal organizations and law firms"
    examples: ["ACLU", "Center for Constitutional Rights", "Reprieve"]
  
  military:
    description: "Military organizations"
    examples: ["Joint Task Force Guantanamo", "US Army", "US Navy"]
  
  humanitarian:
    description: "Organizations focused on humanitarian aid and human rights"
    examples: ["International Red Cross", "Physicians for Human Rights"]

Locations

`coordinates` (object): Geographic coordinates if available

Coordinates structure:

{
  "latitude": 19.9073,
  "longitude": -75.0918
}

`country` (string): Country name for location disambiguation

Example: Location Entity

{
  "name": "Guantanamo Bay",
  "type": "detention_facility",
  "description": "U.S. Naval base and detention facility in Cuba",
  "alternative_names": ["GTMO", "Naval Station Guantanamo Bay"],
  "country": "Cuba",
  "coordinates": {
    "latitude": 19.9073,
    "longitude": -75.0918
  },
  "articles": ["mh_2002_01_11_001", "mh_2023_12_15_891"],
  "article_count": 1247
}

Category Definition (locations.yaml)

location_types:
  detention_facility:
    description: "Detention centers and prisons"
    examples: ["Guantanamo Bay", "Camp Delta", "Camp X-Ray"]
  
  military_base:
    description: "Military installations and bases"
    examples: ["Naval Station Guantanamo Bay", "Bagram Air Base"]
  
  city:
    description: "Cities and municipalities"
    examples: ["Washington D.C.", "Miami", "Havana"]

Events

`title` (string, required): Event title (part of composite key with start_date)

`start_date` (string, required): ISO 8601 date when event started (part of composite key)

`end_date` (string): ISO 8601 date when event ended (for ongoing events, may be null)

`event_tags` (array): Domain-specific tags (e.g., ["legal_challenge", "habeas_corpus"])

`participants` (array): Names of people or organizations involved in the event

Example: Event Entity

{
  "title": "Rasul v. Bush",
  "type": "legal",
  "event_tags": ["legal_challenge", "habeas_corpus"],
  "start_date": "2004-06-28",
  "end_date": "2004-06-28",
  "description": "Supreme Court ruling on Guantanamo detainee rights",
  "alternative_titles": ["Rasul v. Bush Supreme Court Decision"],
  "participants": ["Shafiq Rasul", "George W. Bush"],
  "articles": ["mh_2004_06_28_042", "mh_2004_06_29_051"],
  "article_count": 23
}

Category Definition (events.yaml)

event_types:
  legal:
    description: "Court cases, hearings, tribunals, legal proceedings"
    examples: ["Habeas Corpus Hearing", "Military Commission Trial"]
  
  detention:
    description: "Events related to capture, transfer, or detention"
    examples: ["Detainee Transfer", "Capture Operation"]
  
  policy_change:
    description: "Changes in government or military policy"
    examples: ["Executive Order", "Policy Directive"]

event_tags:
  legal_challenge:
    description: "Legal challenges to detention"
  
  habeas_corpus:
    description: "Habeas corpus proceedings"
  
  torture:
    description: "Torture allegations or incidents"

Dynamic Pydantic Models

Entity schemas are generated at runtime from domain configurations, which lets each research domain define its own types and tags.

Location: src/dynamic_models.py

from typing import List

from pydantic import BaseModel, Field
from src.config_loader import DomainConfig


def create_person_model(domain: str) -> type[BaseModel]:
    """Dynamically create a Person model from the domain config."""
    config = DomainConfig(domain)
    categories = config.load_categories("people")

    # Extract valid person_types and person_tags from the YAML categories
    valid_types = list(categories["person_types"].keys())
    valid_tags = list(categories.get("person_tags", {}).keys())

    class Person(BaseModel):
        name: str = Field(..., description="Full name of the person")
        # The allowed types are surfaced in the field description
        type: str = Field(..., description=f"One of: {', '.join(valid_types)}")
        person_tags: List[str] = Field(default_factory=list)
        description: str = Field(default="")
        # ... other fields

    return Person

This design enables zero-code domain creation: researchers can define new entity types and tags purely through YAML configuration files.
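To make the link between YAML categories and validation concrete, here is a dependency-free sketch that checks a record against categories shaped like the people.yaml example. It is not how src/dynamic_models.py works: the categories are inlined rather than loaded from YAML, `validate_person` is a hypothetical helper, and rejecting unknown types and tags is an assumed policy.

```python
from typing import Any, Dict, List

# Inlined stand-in for the parsed contents of people.yaml (descriptions omitted).
CATEGORIES: Dict[str, Dict[str, Any]] = {
    "person_types": {"lawyer": {}, "detainee": {}},
    "person_tags": {"civil_rights": {}, "defense": {}},
}


def validate_person(record: Dict[str, Any], categories: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors: List[str] = []
    valid_types = set(categories["person_types"])
    valid_tags = set(categories.get("person_tags", {}))

    if not record.get("name"):
        errors.append("name is required")
    if record.get("type") not in valid_types:
        errors.append(f"unknown type: {record.get('type')!r}")
    for tag in record.get("person_tags", []):
        if tag not in valid_tags:
            errors.append(f"unknown person_tag: {tag!r}")
    return errors


ok = validate_person(
    {"name": "Clive Stafford Smith", "type": "lawyer", "person_tags": ["civil_rights", "defense"]},
    CATEGORIES,
)
bad = validate_person({"name": "", "type": "judge"}, CATEGORIES)
```

Adding a new type such as `judge` then only requires a new entry under `person_types` in the YAML file; the validation code stays unchanged.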

Entity Storage (Parquet)

Entities are stored as one Parquet file per type in the domain output directory:

data/<domain>/entities/
├── people.parquet
├── organizations.parquet
├── locations.parquet
└── events.parquet

Parquet provides:

Fast columnar reads for analytics and filtering
Schema validation ensuring data consistency
Compression reducing disk usage by 10-50x vs. JSON
Append-only writes for incremental processing

Location: src/utils/file_ops.py:write_entities_table

Entity Lifecycle

Extraction

LLM extracts entity from article text using domain-specific prompt

Quality Control

Validates required fields, deduplicates within article, flags low-quality names

Merge Check

Evidence-first cascade determines if entity matches existing record

Profile Generation

Creates or updates versioned narrative profile with citations

Profile Grounding

Batch verification ensures profile claims are supported by source articles

Parquet Write

Atomic write of all entities to respective Parquet files
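The final step's atomicity is commonly achieved by writing to a temporary file and renaming it over the target. The sketch below shows that pattern with raw bytes and the standard library only; it is not the code in src/utils/file_ops.py, and `atomic_write_bytes` is a hypothetical helper.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, payload: bytes) -> None:
    """Write payload to path so readers never observe a partially written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_name, path)  # replaces any existing file; atomic on POSIX
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader opening people.parquet during a write then sees either the previous complete file or the new complete file, never a mix of the two.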

Name Handling & Canonicalization

Hinbox handles name variants and aliases through equivalence groups, acronym detection, and canonical name scoring:

Equivalence Groups

Define known aliases in domain config:

dedup:
  name_variants:
    organizations:
      equivalence_groups:
        - ["Department of Defense", "Defense Department", "DoD", "Pentagon"]
        - ["Joint Task Force Guantanamo", "JTF-GTMO", "JTF GTMO"]

Location: configs/<domain>/config.yaml, src/utils/name_variants.py
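One way to use such groups is to index every variant under a shared representative. The sketch below assumes case-insensitive matching and treats the first name in each group as the representative; both are illustrative choices, and `build_variant_index` and `same_entity` are hypothetical helpers rather than functions from src/utils/name_variants.py.

```python
from typing import Dict, List

# The equivalence groups from the config above, as parsed Python lists.
GROUPS: List[List[str]] = [
    ["Department of Defense", "Defense Department", "DoD", "Pentagon"],
    ["Joint Task Force Guantanamo", "JTF-GTMO", "JTF GTMO"],
]


def normalize(name: str) -> str:
    """Case-fold and trim a name so lookups ignore case and outer whitespace."""
    return name.strip().casefold()


def build_variant_index(groups: List[List[str]]) -> Dict[str, str]:
    """Map each normalized variant to the first name of its group."""
    index: Dict[str, str] = {}
    for group in groups:
        representative = group[0]
        for variant in group:
            index[normalize(variant)] = representative
    return index


def same_entity(a: str, b: str, index: Dict[str, str]) -> bool:
    """True when both names resolve to the same representative."""
    # Names outside every group resolve to their own normalized form.
    return index.get(normalize(a), normalize(a)) == index.get(normalize(b), normalize(b))


variant_index = build_variant_index(GROUPS)
```

With this index, "DoD" and "Pentagon" resolve to the same organization, while "DoD" and "JTF-GTMO" do not.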

Acronym Detection

Automatically detects when a name is an acronym form:

from src.utils.name_variants import is_acronym_form, compute_acronym

is_acronym_form("ACLU", "American Civil Liberties Union")  # True
is_acronym_form("DoD", "Department of Defense")  # True

compute_acronym("Department of Defense")  # "DoD"
compute_acronym("Joint Task Force Guantanamo")  # "JTF-G"
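For intuition, a naive version of acronym matching takes the first character of each word and compares case-insensitively. This is a simplified sketch, not the implementation of `is_acronym_form` or `compute_acronym`: the helpers `naive_acronym` and `looks_like_acronym` are hypothetical, and they do not reproduce the hyphenated "JTF-G" form shown above.

```python
def naive_acronym(name: str) -> str:
    """Join the first character of each whitespace-separated word, keeping its case."""
    return "".join(word[0] for word in name.split())


def looks_like_acronym(short: str, full: str) -> bool:
    """True when short equals the naive acronym of full, ignoring case."""
    return short.casefold() == naive_acronym(full).casefold()
```

Keeping the original case of each initial is what yields "DoD" for "Department of Defense", since "of" contributes a lowercase letter.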

Canonical Name Scoring

When multiple name variants exist, the best canonical name is selected based on:

Specificity: Longer, more complete names
Frequency: Number of article mentions
Acronym avoidance: Prefer full names over acronyms for primary identifier

Location: src/utils/name_variants.py:score_canonical_name
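The three criteria could be combined along these lines. The weights, the mention cap, and the acronym heuristic are illustrative assumptions, and `score_name` and `pick_canonical` are hypothetical helpers, not the signature or logic of `score_canonical_name`.

```python
from typing import Dict


def score_name(name: str, mentions: int) -> float:
    """Score a candidate name; higher is a better canonical name (illustrative weights)."""
    words = name.split()
    specificity = float(len(words))        # more words ~ a more complete name
    frequency = min(mentions, 100) / 100   # mention count, capped so it cannot dominate
    # Heuristic: a single token with two or more capitals is treated as an acronym.
    is_acronym = len(words) == 1 and sum(c.isupper() for c in name) >= 2
    penalty = 2.0 if is_acronym else 0.0
    return specificity + frequency - penalty


def pick_canonical(variants: Dict[str, int]) -> str:
    """Return the best-scoring name from a mapping of variant -> mention count."""
    return max(variants, key=lambda name: score_name(name, variants[name]))


best = pick_canonical({"ACLU": 90, "American Civil Liberties Union": 38})
```

Here the full name is chosen even though the acronym is mentioned more often, because specificity and the acronym penalty outweigh the capped frequency term.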

Next Steps

Domain Configuration

Define custom entity types, tags, and prompts for your research domain

Processing Pipeline

Learn how entities flow through the 5-stage pipeline

System Architecture

Understand the producer-consumer model and merge cascade

API Reference

Programmatic access to entity data and pipeline stages
