configs/<domain>/categories/.
The Four Entity Types
People
Individuals mentioned in source documents (detainees, officials, lawyers, journalists, etc.)
Organizations
Institutions and groups (military units, NGOs, government agencies, media outlets)
Locations
Geographic places and facilities (detention centers, military bases, cities, countries)
Events
Temporal occurrences (legal proceedings, protests, transfers, policy changes)
Entity Keys & Deduplication
Each entity type uses a different key strategy for deduplication:

| Entity Type | Key Fields | Key Type | Example |
|---|---|---|---|
| People | name | str | "Carol Rosenberg" |
| Organizations | name, type | tuple | ("ACLU", "legal") |
| Locations | name, type | tuple | ("Guantanamo Bay", "detention_facility") |
| Events | title, start_date | tuple | ("Rasul v. Bush", "2004-06-28") |
src/engine/mergers.py:114
Tuple keys enable the same base name to represent different entities when types differ. For example:
- ("Washington", "city") vs. ("Washington", "state")
- ("Red Cross", "humanitarian") vs. ("Red Cross Hospital", "location")
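The key strategy in the table can be sketched in a few lines of Python. This is an illustration only, not the code in src/engine/mergers.py: the `KEY_FIELDS` mapping, `entity_key`, and `deduplicate` are hypothetical names, and entities are assumed to be plain dicts.

```python
from typing import Any, Dict, Hashable, List

# Key fields per entity type, copied from the table above.
# The mapping itself is a hypothetical helper, not part of Hinbox.
KEY_FIELDS = {
    "people": ("name",),
    "organizations": ("name", "type"),
    "locations": ("name", "type"),
    "events": ("title", "start_date"),
}


def entity_key(entity_type: str, entity: Dict[str, Any]) -> Hashable:
    """Return a str key for single-field types and a tuple key otherwise."""
    values = tuple(entity[f] for f in KEY_FIELDS[entity_type])
    return values[0] if len(values) == 1 else values


def deduplicate(entity_type: str, entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first entity seen for each key, preserving input order."""
    seen: Dict[Hashable, Dict[str, Any]] = {}
    for entity in entities:
        seen.setdefault(entity_key(entity_type, entity), entity)
    return list(seen.values())


locations = [
    {"name": "Washington", "type": "city"},
    {"name": "Washington", "type": "state"},
    {"name": "Washington", "type": "city"},  # duplicate of the first entry
]
unique_locations = deduplicate("locations", locations)
```

Because the type is part of the key, the city and the state survive as separate entities while the repeated city entry is dropped.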
Entity Schema Structure
All entities share a common base structure with type-specific extensions.
Common Fields (All Types)
Primary identifier for people, organizations, and locations
Entity type from domain-specific category definitions
Brief description of the entity extracted from source text
List of alternative names, aliases, or abbreviations (e.g., ["DoD", "Pentagon"] for Department of Defense)
Versioned narrative profile with sources and confidence
List of article IDs mentioning this entity
Total number of articles mentioning this entity
ISO 8601 timestamp of first extraction
ISO 8601 timestamp of most recent update
Embedding Fields (Search & Profiles)
Evidence-derived embedding vector for similarity search (preferred for merge decisions)
Model used for search embedding (e.g., "jina_ai/jina-embeddings-v3")
Hash fingerprint of embedding model + parameters for compatibility checks
LLM-narrative-derived embedding (fallback for older entities)
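As a rough illustration of how a merge step might choose between the two vectors, the sketch below compares `search_embedding` values when both entities have one and only then falls back to `profile_embedding`. The function names and the dict-based entity shape are assumptions for this example, and cosine similarity is used as a common choice rather than a confirmed detail of Hinbox.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_similarity(new: Dict, existing: Dict) -> Optional[Tuple[str, float]]:
    """Compare like with like: evidence vectors first, profile vectors second.

    Returns (field_used, similarity), or None when the two entities share
    no embedding of the same kind.
    """
    for field_name in ("search_embedding", "profile_embedding"):
        a, b = new.get(field_name), existing.get(field_name)
        if a and b:
            return field_name, cosine_similarity(a, b)
    return None


new_entity = {"search_embedding": [1.0, 0.0]}
current = {"search_embedding": [1.0, 0.0], "profile_embedding": [0.0, 1.0]}
legacy = {"profile_embedding": [0.0, 1.0]}  # older entity without evidence vector

result = merge_similarity(new_entity, current)
```

Mixing an evidence vector with a profile vector is deliberately not attempted here, since the two are derived from different kinds of text.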
Hinbox prefers search_embedding over profile_embedding for merge decisions because evidence text provides an apples-to-apples comparison between new and existing entities. See Evidence-First Similarity for details.
Type-Specific Fields
People
Domain-specific tags (e.g., ["civil_rights", "lawyer"] in Guantanamo domain)
Historical roles extracted from articles (e.g., ["Defense Attorney", "ACLU Lawyer"])
- Example: Person Entity
- Category Definition (people.yaml)
Organizations
Common acronym (e.g., "ACLU", "DoD", "JTF-GTMO")
Name of parent organization if applicable
- Example: Organization Entity
- Category Definition (organizations.yaml)
Locations
Geographic coordinates if available
Country name for location disambiguation
- Example: Location Entity
- Category Definition (locations.yaml)
Events
Event title (part of composite key with start_date)
ISO 8601 date when event started (part of composite key)
ISO 8601 date when event ended (may be null for ongoing events)
Domain-specific tags (e.g., ["legal_challenge", "habeas_corpus"])
Names of people or organizations involved in the event
- Example: Event Entity
- Category Definition (events.yaml)
Dynamic Pydantic Models
Entity schemas are generated dynamically from domain configurations at runtime using src/dynamic_models.py. This allows each research domain to define custom types and tags.
Location: src/dynamic_models.py
This design enables zero-code domain creation: researchers can define new entity types and tags purely through YAML configuration files.
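The idea of building a schema from configuration at runtime can be sketched without any third-party dependency. The example below is not the code in src/dynamic_models.py: it swaps in the standard library's `dataclasses.make_dataclass` with manual validation as a stand-in for a generated Pydantic model, and `build_entity_model`, the field names, and the `people_config` dict are all hypothetical.

```python
from dataclasses import field, make_dataclass
from typing import Iterable, List


def build_entity_model(model_name: str, allowed_types: Iterable[str],
                       allowed_tags: Iterable[str]) -> type:
    """Create an entity class whose type and tags are restricted by config."""
    types_ok = frozenset(allowed_types)
    tags_ok = frozenset(allowed_tags)

    def __post_init__(self) -> None:
        # Reject values the domain configuration does not define.
        if self.type not in types_ok:
            raise ValueError(f"unknown type {self.type!r} for {model_name}")
        unknown = set(self.tags) - tags_ok
        if unknown:
            raise ValueError(f"unknown tags {sorted(unknown)} for {model_name}")

    return make_dataclass(
        model_name,
        [("name", str), ("type", str), ("tags", List[str], field(default_factory=list))],
        namespace={"__post_init__": __post_init__},
    )


# Hypothetical category data, standing in for a parsed YAML category file.
people_config = {"types": ["lawyer", "journalist"], "tags": ["civil_rights"]}

Person = build_entity_model("Person", people_config["types"], people_config["tags"])
reporter = Person(name="Carol Rosenberg", type="journalist")
```

Adding a new type or tag then only requires changing the configuration data, since the class is rebuilt from it each time the program starts.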
Entity Storage (Parquet)
Entities are stored as one Parquet file per type in the domain output directory. The format provides:
- Fast columnar reads for analytics and filtering
- Schema validation ensuring data consistency
- Compression reducing disk usage by 10-50x vs. JSON
- Append-only writes for incremental processing
src/utils/file_ops.py:write_entities_table
Entity Lifecycle
Name Handling & Canonicalization
Hinbox handles name variants and aliases through three mechanisms.
Equivalence Groups
Define known aliases in the domain config.
configs/<domain>/config.yaml, src/utils/name_variants.py
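A minimal sketch of how equivalence groups could be used at lookup time is shown below. It assumes each group is simply a list of interchangeable names; the real config layout in config.yaml is not shown here, and `build_group_index` and `are_equivalent` are hypothetical helpers rather than functions from name_variants.py.

```python
from typing import Dict, List


def _normalize(name: str) -> str:
    """Case-insensitive, whitespace-trimmed form used for comparison."""
    return name.strip().casefold()


def build_group_index(groups: List[List[str]]) -> Dict[str, int]:
    """Map every normalized alias to the index of the group it belongs to."""
    index: Dict[str, int] = {}
    for group_id, group in enumerate(groups):
        for alias in group:
            index[_normalize(alias)] = group_id
    return index


def are_equivalent(a: str, b: str, index: Dict[str, int]) -> bool:
    """True if both names are identical or listed in the same group."""
    na, nb = _normalize(a), _normalize(b)
    if na == nb:
        return True
    group_a = index.get(na)
    return group_a is not None and group_a == index.get(nb)


# Hypothetical groups, standing in for values read from the domain config.
groups = [
    ["Department of Defense", "DoD", "Pentagon"],
    ["American Civil Liberties Union", "ACLU"],
]
index = build_group_index(groups)
```

With this index, `are_equivalent("dod", "Pentagon", index)` holds because both names fall in the first group, while names from different groups do not match.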
Acronym Detection
Hinbox automatically detects when one name is the acronym form of another.
Canonical Name Scoring
When multiple name variants exist, the best canonical name is selected based on:
- Specificity: Longer, more complete names
- Frequency: Number of article mentions
- Acronym avoidance: Prefer full names over acronyms for the primary identifier
src/utils/name_variants.py:score_canonical_name
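Acronym detection and the three scoring criteria can be sketched together as follows. This is an illustrative approximation, not the logic of score_canonical_name: the helper names, the stopword list, and the numeric weights are all assumptions chosen to make the example readable.

```python
from typing import Dict, List

# Small words often skipped when forming acronyms (assumed list).
STOPWORDS = {"of", "the", "and", "for"}


def is_acronym_of(short: str, full: str) -> bool:
    """True if `short` matches the initials of `full`, with or without stopwords."""
    letters = "".join(ch for ch in short if ch.isalnum()).upper()
    words = full.split()
    all_initials = "".join(w[0] for w in words).upper()
    main_initials = "".join(w[0] for w in words if w.lower() not in STOPWORDS).upper()
    return len(letters) >= 2 and letters in (all_initials, main_initials)


def score_name(name: str, mentions: int, variants: List[str]) -> float:
    """Combine specificity, frequency, and acronym avoidance (weights are assumed)."""
    score = float(len(name))   # specificity: longer names score higher
    score += 0.5 * mentions    # frequency: more article mentions score higher
    if any(is_acronym_of(name, other) for other in variants if other != name):
        score -= 50.0          # acronym avoidance: penalize acronym forms
    return score


def pick_canonical(mention_counts: Dict[str, int]) -> str:
    """Return the highest-scoring variant as the canonical name."""
    variants = list(mention_counts)
    return max(variants, key=lambda n: score_name(n, mention_counts[n], variants))


counts = {"ACLU": 12, "American Civil Liberties Union": 3}
canonical = pick_canonical(counts)
```

Here the full name wins even though the acronym is mentioned more often, because the acronym penalty outweighs its frequency advantage under these assumed weights.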
Next Steps
Domain Configuration
Define custom entity types, tags, and prompts for your research domain
Processing Pipeline
Learn how entities flow through the 5-stage pipeline
System Architecture
Understand the producer-consumer model and merge cascade
API Reference
Programmatic access to entity data and pipeline stages