Hinbox processes historical sources stored in Apache Parquet format and outputs extracted entities as Parquet files. This page documents the required schemas and data format requirements.
Source articles must be provided as a Parquet file with the following schema:
Required Columns
title
Article title or headline.
Examples:
- “Guantanamo detainee released after 14 years”
- “Soviet forces withdraw from Afghanistan”
- “Traditional Palestinian food practices documented”
content
Full text content of the article. This is the main text from which entities are extracted. It should be clean, readable text without excessive HTML/markup.
Recommended length: 500-5000 words. Shorter articles may not yield many entities; longer articles are automatically chunked.
url
Source URL or identifier. Can be a web URL, DOI, archive identifier, or local file path. Used for citations and source tracking.
Examples:
- https://example.com/articles/2024-01-15-story
- doi:10.1234/journal.2024.001
- archive://folder/document_123.pdf
published_date
Publication or creation date. ISO 8601 format recommended: YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS.
Examples:
- 2024-03-15
- 2024-03-15T14:30:00
- 1989-02-15 (historical date)
source_type
Type of source document. Used for relevance checking and extraction prompt context.
Common values:
- news_article
- journal_article
- book_chapter
- archival_document
- thesis
- report
- interview_transcript
Optional Columns
Additional columns are preserved but not used by Hinbox:
author - Article author(s)
source_name - Publication name (e.g., “Miami Herald”)
language - Language code (e.g., “en”, “ar”)
keywords - Article keywords/tags
These columns pass through to output files unchanged.
Example of creating a valid input file with PyArrow:

import pyarrow as pa
import pyarrow.parquet as pq

# Define schema
schema = pa.schema([
    ('title', pa.string()),
    ('content', pa.string()),
    ('url', pa.string()),
    ('published_date', pa.string()),
    ('source_type', pa.string()),
    # Optional fields
    ('author', pa.string()),
    ('source_name', pa.string()),
])

# Create sample data
data = [
    {
        'title': 'Guantanamo detainee released after 14 years',
        'content': 'A detainee held at Guantánamo Bay for 14 years...',
        'url': 'https://example.com/article/2024-03-15',
        'published_date': '2024-03-15',
        'source_type': 'news_article',
        'author': 'Carol Rosenberg',
        'source_name': 'Miami Herald',
    }
]

# Write to Parquet
table = pa.Table.from_pylist(data, schema=schema)
pq.write_table(table, 'articles.parquet')
Hinbox outputs four Parquet files per domain, one for each entity type:
data/your_domain/entities/
├── people.parquet
├── organizations.parquet
├── locations.parquet
└── events.parquet
People Schema
schema = pa.schema([
    ('name', pa.string()),                  # Person's name
    ('type', pa.string()),                  # Person type (e.g., 'detainee', 'lawyer')
    ('profile', pa.struct([                 # Profile information
        ('text', pa.string()),              # Narrative profile text
        ('tags', pa.list_(pa.string())),    # Profile tags
        ('confidence', pa.float64()),       # Profile confidence score
    ])),
    ('aliases', pa.list_(pa.string())),     # Alternative names
    ('articles', pa.list_(pa.struct([       # Source articles
        ('article_id', pa.string()),
        ('title', pa.string()),
        ('url', pa.string()),
        ('published_date', pa.string()),
        ('extraction_timestamp', pa.string()),
    ]))),
    ('confidence', pa.float64()),           # Overall confidence
    ('created_at', pa.string()),            # Creation timestamp
    ('last_updated', pa.string()),          # Last update timestamp
    ('version', pa.int64()),                # Profile version number (if versioning enabled)
])
Example record:
{
  "name": "Carol Rosenberg",
  "type": "journalist",
  "profile": {
    "text": "Carol Rosenberg is a journalist who has covered Guantánamo Bay...",
    "tags": [],
    "confidence": 0.92
  },
  "aliases": ["C. Rosenberg"],
  "articles": [
    {
      "article_id": "article_123",
      "title": "Guantanamo detainee released",
      "url": "https://example.com/article",
      "published_date": "2024-03-15",
      "extraction_timestamp": "2024-03-15T10:30:00"
    }
  ],
  "confidence": 0.89,
  "created_at": "2024-03-15T10:30:00",
  "last_updated": "2024-03-15T10:30:00",
  "version": 1
}
Organizations Schema
schema = pa.schema([
    ('name', pa.string()),                  # Organization name
    ('type', pa.string()),                  # Organization type
    ('profile', pa.struct([                 # Profile information
        ('text', pa.string()),
        ('tags', pa.list_(pa.string())),
        ('confidence', pa.float64()),
    ])),
    ('aliases', pa.list_(pa.string())),     # Acronyms and alternative names
    ('articles', pa.list_(pa.struct([       # Source articles
        ('article_id', pa.string()),
        ('title', pa.string()),
        ('url', pa.string()),
        ('published_date', pa.string()),
        ('extraction_timestamp', pa.string()),
    ]))),
    ('confidence', pa.float64()),
    ('created_at', pa.string()),
    ('last_updated', pa.string()),
])
Example record:
{
  "name": "Department of Defense",
  "type": "military",
  "profile": {
    "text": "The Department of Defense (DoD) is responsible for...",
    "tags": [],
    "confidence": 0.95
  },
  "aliases": ["Defense Department", "DoD", "Pentagon"],
  "articles": [...],
  "confidence": 0.91,
  "created_at": "2024-03-15T10:30:00",
  "last_updated": "2024-03-15T10:35:00"
}
Locations Schema
schema = pa.schema([
    ('name', pa.string()),                  # Location name
    ('type', pa.string()),                  # Location type
    ('profile', pa.struct([                 # Profile information
        ('text', pa.string()),
        ('tags', pa.list_(pa.string())),
        ('confidence', pa.float64()),
    ])),
    ('articles', pa.list_(pa.struct([       # Source articles
        ('article_id', pa.string()),
        ('title', pa.string()),
        ('url', pa.string()),
        ('published_date', pa.string()),
        ('extraction_timestamp', pa.string()),
    ]))),
    ('confidence', pa.float64()),
    ('created_at', pa.string()),
    ('last_updated', pa.string()),
])
Example record:
{
  "name": "Guantanamo Bay",
  "type": "detention_facility",
  "profile": {
    "text": "Guantánamo Bay detention facility, officially known as...",
    "tags": [],
    "confidence": 0.94
  },
  "articles": [...],
  "confidence": 0.93,
  "created_at": "2024-03-15T10:30:00",
  "last_updated": "2024-03-15T10:40:00"
}
Events Schema
schema = pa.schema([
    ('title', pa.string()),                 # Event title
    ('type', pa.string()),                  # Event type
    ('start_date', pa.string()),            # Event date
    ('profile', pa.struct([                 # Profile information
        ('text', pa.string()),
        ('tags', pa.list_(pa.string())),
        ('confidence', pa.float64()),
    ])),
    ('articles', pa.list_(pa.struct([       # Source articles
        ('article_id', pa.string()),
        ('title', pa.string()),
        ('url', pa.string()),
        ('published_date', pa.string()),
        ('extraction_timestamp', pa.string()),
    ]))),
    ('confidence', pa.float64()),
    ('created_at', pa.string()),
    ('last_updated', pa.string()),
])
Example record:
{
  "title": "Detainee release",
  "type": "detention",
  "start_date": "2024-03-15",
  "profile": {
    "text": "A detainee was released from Guantánamo Bay on March 15...",
    "tags": ["legal"],
    "confidence": 0.88
  },
  "articles": [...],
  "confidence": 0.85,
  "created_at": "2024-03-15T10:30:00",
  "last_updated": "2024-03-15T10:30:00"
}
From CSV
Convert CSV to Parquet:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read CSV
df = pd.read_csv('articles.csv')

# Ensure required columns exist
required = ['title', 'content', 'url', 'published_date', 'source_type']
for col in required:
    if col not in df.columns:
        raise ValueError(f"Missing required column: {col}")

# Convert to Parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, 'articles.parquet')
From JSON
Convert JSON Lines (JSONL) to Parquet:
import pyarrow as pa
import pyarrow.parquet as pq
import json

# Read JSONL file
data = []
with open('articles.jsonl', 'r') as f:
    for line in f:
        data.append(json.loads(line))

# Convert to Parquet
table = pa.Table.from_pylist(data)
pq.write_table(table, 'articles.parquet')
From Web Scraping
Example scraping script:
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

# Scrape articles (pseudo-code: article_urls and scrape_article are placeholders)
articles = []
for url in article_urls:
    article = scrape_article(url)
    articles.append({
        'title': article.title,
        'content': article.text,
        'url': url,
        'published_date': article.date.isoformat(),
        'source_type': 'news_article',
        'author': article.author,
        'source_name': article.publication,
    })

# Save to Parquet
table = pa.Table.from_pylist(articles)
pq.write_table(table, 'articles.parquet')
Reading Output Data
Python
import pyarrow.parquet as pq
import pandas as pd

# Read people entities
people_table = pq.read_table('data/guantanamo/entities/people.parquet')
people_df = people_table.to_pandas()

# Access fields
for person in people_df.itertuples():
    print(f"{person.name} ({person.type})")
    print(f"Profile: {person.profile['text'][:100]}...")
    print(f"Articles: {len(person.articles)}")
DuckDB
-- Query Parquet files directly
SELECT name, type, confidence
FROM 'data/guantanamo/entities/people.parquet'
WHERE type = 'journalist'
ORDER BY confidence DESC;
Pandas
import pandas as pd
# Read all entity types
people = pd.read_parquet('data/guantanamo/entities/people.parquet')
orgs = pd.read_parquet('data/guantanamo/entities/organizations.parquet')
locs = pd.read_parquet('data/guantanamo/entities/locations.parquet')
events = pd.read_parquet('data/guantanamo/entities/events.parquet')
# Filter and analyze
journalists = people[people['type'] == 'journalist']
print(f"Found {len(journalists)} journalists")
Data Quality Guidelines
Clean text content
- Remove excessive HTML/markup
- Fix encoding issues (UTF-8 recommended)
- Remove boilerplate (headers, footers, ads)
- Preserve paragraph structure
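How much cleanup is needed depends on the source, but a minimal stdlib-only pass might look like the sketch below. The clean_content helper and its regexes are illustrative, not part of Hinbox; real HTML is better handled with a proper parser.

```python
import html
import re

def clean_content(raw: str) -> str:
    """Illustrative cleanup: strip tags, decode entities, normalize whitespace."""
    text = re.sub(r'<[^>]+>', ' ', raw)      # drop HTML tags
    text = html.unescape(text)               # decode entities like &amp;
    text = re.sub(r'[ \t]+', ' ', text)      # collapse runs of spaces/tabs
    text = re.sub(r'\n{3,}', '\n\n', text)   # keep paragraph breaks, trim extras
    return text.strip()

print(clean_content('<p>Hello &amp; world</p>'))  # → Hello & world
```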
Consistent dates
- Use ISO 8601 format:
YYYY-MM-DD
- Be consistent across all articles
- Include timezone if available
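Date consistency can be checked before processing with the standard library, assuming dates are stored as strings; the is_iso8601 helper below is illustrative, not a Hinbox function.

```python
from datetime import datetime

def is_iso8601(value: str) -> bool:
    """Return True if the string parses as ISO 8601 (date or timestamp)."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False

for d in ['2024-03-15', '2024-03-15T14:30:00', '15/03/2024']:
    print(d, is_iso8601(d))
```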
Unique URLs
- Each article should have a unique URL/identifier
- Used for deduplication and citation tracking
- Can be any stable identifier
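One way to spot duplicate identifiers before handing data to Hinbox is a simple count over the url column; find_duplicate_urls below is a sketch, not part of the library.

```python
from collections import Counter

def find_duplicate_urls(articles):
    """Return URLs that appear more than once in a list of article dicts."""
    counts = Counter(a['url'] for a in articles)
    return [url for url, n in counts.items() if n > 1]

articles = [
    {'url': 'https://example.com/a'},
    {'url': 'https://example.com/b'},
    {'url': 'https://example.com/a'},
]
print(find_duplicate_urls(articles))  # → ['https://example.com/a']
```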
Accurate source types
- Use consistent source_type values
- Affects relevance checking and extraction
- Consider your domain’s source mix
Output Data Quality
Hinbox maintains data quality through:
- Deduplication: Entities are merged based on similarity
- Confidence scores: Low-confidence extractions can be filtered
- Source tracking: Every entity links back to source articles
- Versioning: Profile changes are tracked over time (optional)
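As an example of working with confidence scores, low-confidence extractions can be dropped after loading; the helper name and the 0.8 threshold below are illustrative choices, not Hinbox defaults.

```python
def filter_by_confidence(entities, threshold=0.8):
    """Keep only entities whose overall confidence meets the threshold."""
    return [e for e in entities if e.get('confidence', 0.0) >= threshold]

entities = [
    {'name': 'Carol Rosenberg', 'confidence': 0.89},
    {'name': 'Unknown Person', 'confidence': 0.42},
]
print([e['name'] for e in filter_by_confidence(entities)])  # → ['Carol Rosenberg']
```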
Storage Recommendations
File Organization
data/
├── domain1/
│ ├── raw_sources/
│ │ └── articles.parquet
│ └── entities/
│ ├── people.parquet
│ ├── organizations.parquet
│ ├── locations.parquet
│ └── events.parquet
├── domain2/
│ ├── raw_sources/
│ └── entities/
└── shared_sources/
└── common_articles.parquet
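A layout like the tree above can be scaffolded with pathlib; the scaffold_domain helper below is a sketch that follows the directory names shown, not a Hinbox command.

```python
from pathlib import Path

def scaffold_domain(base: str, domain: str) -> None:
    """Create the raw_sources/ and entities/ directories for one domain."""
    root = Path(base) / domain
    (root / 'raw_sources').mkdir(parents=True, exist_ok=True)
    (root / 'entities').mkdir(parents=True, exist_ok=True)

scaffold_domain('data', 'domain1')
print(sorted(p.name for p in (Path('data') / 'domain1').iterdir()))  # → ['entities', 'raw_sources']
```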
Backup Strategy
- Version control configs: Git tracks domain configurations
- Backup source data: Raw article Parquet files are source of truth
- Backup entities: Entity Parquet files can be regenerated but backup recommended
- Export options: Convert to CSV/JSON for archival
Performance
- Parquet compression: Use Snappy or ZSTD compression
- Partitioning: Partition large datasets by date or source
- Column pruning: Only read columns you need
- Predicate pushdown: Filter in Parquet reader when possible
Example with compression:
import pyarrow.parquet as pq

pq.write_table(
    table,
    'articles.parquet',
    compression='snappy',  # or 'zstd' for better compression
)
Troubleshooting
Missing Required Columns
Error: KeyError: 'content'
Solution: Ensure all required columns exist in input Parquet:
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table('articles.parquet')
print("Columns:", table.column_names)

# Add a missing column with a default value
if 'source_type' not in table.column_names:
    table = table.append_column(
        'source_type',
        pa.array(['news_article'] * len(table))
    )
Encoding Issues
Problem: Non-ASCII characters corrupted
Solution: Ensure UTF-8 encoding:
import pandas as pd
df = pd.read_csv('articles.csv', encoding='utf-8')
df.to_parquet('articles.parquet')
Large Files
Problem: Parquet file too large to load
Solution: Use partitioning or read in batches:
import pyarrow.parquet as pq

# Read in batches
parquet_file = pq.ParquetFile('large_articles.parquet')
for batch in parquet_file.iter_batches(batch_size=1000):
    # Process each batch (process_batch is a placeholder)
    process_batch(batch.to_pandas())
Next Steps
- Processing Articles - Process your prepared data
- Creating Domains - Set up domain configurations
- Configuration - Configure processing settings
- Web Interface - Browse extracted entities