Once you’ve configured your research domain, you can process your historical sources to extract entities. Hinbox uses AI models to identify people, organizations, locations, and events, then merges and deduplicates them into a structured knowledge base.

Basic Processing

Process all articles in your domain:
just process-domain guantanamo
This command:
  1. Loads articles from your configured data source
  2. Checks relevance to your research domain
  3. Extracts entities using AI models
  4. Merges and deduplicates entities
  5. Saves results to Parquet files

Processing Options

Limit Number of Articles

Process only a specific number of articles:
# Process just 5 articles
just process-domain guantanamo --limit 5
Use --limit when testing your configuration or exploring a new dataset. Start small (2-5 articles) to verify extraction quality before processing thousands of documents.

Verbose Output

See detailed extraction information:
just process-domain guantanamo --verbose
Verbose mode shows:
  • Relevance check decisions
  • Extracted entity counts per article
  • Merge decisions (new vs. existing entities)
  • Processing times per stage

Force Reprocessing

Reprocess articles even if already processed:
just process-domain guantanamo --force
By default, Hinbox skips articles that have already been processed. Use --force to reprocess everything, which is useful after updating prompts or entity definitions.

Combined Options

# Test configuration: process 2 articles with verbose output
just process-domain guantanamo --limit 2 --verbose --force

Processing Pipeline

The processing pipeline consists of several stages:
1. Load Articles

Hinbox reads articles from your configured Parquet file:
# In config.yaml
data_sources:
  default_path: "data/guantanamo/raw_sources/miami_herald_articles.parquet"
Required columns: title, content, url, published_date, source_type
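If you are preparing your own source file, a quick pandas sketch can validate the required columns before you point config.yaml at it. The sample row and output path below are illustrative, not part of Hinbox:

```python
import pandas as pd

# The five columns Hinbox expects in the source Parquet file.
REQUIRED = ["title", "content", "url", "published_date", "source_type"]

# Illustrative single-row source table.
articles = pd.DataFrame(
    {
        "title": ["Guantanamo detainee released after 14 years"],
        "content": ["Full article text goes here..."],
        "url": ["https://example.com/article-1"],
        "published_date": ["2021-07-19"],
        "source_type": ["news_article"],
    }
)

# Fail early if any required column is missing.
missing = [c for c in REQUIRED if c not in articles.columns]
assert not missing, f"missing required columns: {missing}"

# Then write it where config.yaml's default_path points, e.g.:
# articles.to_parquet("data/guantanamo/raw_sources/miami_herald_articles.parquet", index=False)
```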
2. Check Relevance

The relevance checker filters out articles not relevant to your research domain:
# In logs:
 Article is relevant to guantanamo research (confidence: 0.95)
 Article skipped (not relevant to domain)
Configure in config.yaml:
processing:
  relevance_check: true  # Set to false to skip this step
3. Extract Entities

For each relevant article, Hinbox extracts all four entity types in parallel:
  • People: Individuals mentioned in the text
  • Organizations: Groups, agencies, companies, institutions
  • Locations: Places, facilities, geographic regions
  • Events: Significant occurrences with dates
Extraction uses prompts from configs/your_domain/prompts/
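Hinbox's internal extraction schema isn't shown here, but the per-article result can be pictured as typed records grouped by entity type. The field and class names below are illustrative only:

```python
from dataclasses import dataclass, field

@dataclass
class ExtractedPerson:
    # Illustrative record, not Hinbox's actual schema.
    name: str
    type: str  # e.g. "detainee", "lawyer", "journalist"
    aliases: list = field(default_factory=list)

@dataclass
class ExtractionResult:
    # One bucket per entity type, filled in parallel per article.
    people: list = field(default_factory=list)
    organizations: list = field(default_factory=list)
    locations: list = field(default_factory=list)
    events: list = field(default_factory=list)

result = ExtractionResult(
    people=[ExtractedPerson(name="Carol Rosenberg", type="journalist")]
)
print(len(result.people), len(result.events))
```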
4. Merge & Deduplicate

Extracted entities are merged with existing entities using:
  • Lexical blocking: Fast fuzzy string matching (RapidFuzz)
  • Embedding similarity: Semantic similarity using embedding models
  • LLM match checking: AI verification for ambiguous cases
Example merge decisions:
NEW person: Carol Rosenberg (journalist)
MERGE person: C. Rosenberg → Carol Rosenberg (similarity: 0.89)
SKIP person: Carol Rosenberg (exact duplicate)
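The lexical-blocking stage can be approximated with fuzzy string scoring. The sketch below substitutes the standard library's difflib for RapidFuzz and omits the embedding and LLM stages entirely, so the threshold and scores are illustrative:

```python
from difflib import SequenceMatcher

def merge_decision(candidate, existing, threshold=0.8):
    # Two-stage check: exact duplicate first, then fuzzy similarity.
    if candidate in existing:
        return ("SKIP", candidate, 1.0)
    best, score = None, 0.0
    for name in existing:
        s = SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
        if s > score:
            best, score = name, s
    if score >= threshold:
        return ("MERGE", best, round(score, 2))
    return ("NEW", candidate, round(score, 2))

known = ["Carol Rosenberg"]
print(merge_decision("C. Rosenberg", known))   # MERGE with Carol Rosenberg
print(merge_decision("Abdul Rahman", known))   # NEW
```

In Hinbox, candidates that pass lexical blocking are further checked with embedding similarity, and ambiguous pairs go to an LLM for verification.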
5. Save Results

Updated entities are written to Parquet files:
data/guantanamo/entities/
├── people.parquet
├── organizations.parquet
├── locations.parquet
└── events.parquet
Processing status is tracked in a sidecar file to avoid reprocessing.

Understanding the Output

Console Output

During processing, you’ll see structured log output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Processing Article 1/100
Title: "Guantanamo detainee released after 14 years"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 Article is relevant (confidence: 0.92)

Extracting entities...
 Extracted 5 people
 Extracted 3 organizations
 Extracted 2 locations
 Extracted 1 event

Merging entities...
 NEW person: Abdul Rahman (detainee)
 MERGE organization: Defense Department → Department of Defense
 SKIP person: Carol Rosenberg (duplicate)

 Processing complete (4.2s)

Entity Files

Each entity type is saved to a separate Parquet file with the following columns.
People (people.parquet):
  • name: Person’s name
  • type: Person type (detainee, lawyer, journalist, etc.)
  • profile: Generated profile with tags and narrative text
  • aliases: Alternative names
  • confidence: Extraction confidence score
  • articles: List of source articles
  • last_updated: Last modification timestamp
Organizations (organizations.parquet):
  • name: Organization name
  • type: Organization type
  • profile: Description and context
  • aliases: Alternative names and acronyms
  • articles: Source articles
Locations (locations.parquet):
  • name: Location name
  • type: Location type
  • profile: Geographic and contextual information
  • articles: Source articles
Events (events.parquet):
  • title: Event name
  • type: Event type
  • start_date: Event date
  • profile: Event description and context
  • articles: Source articles

Advanced Processing

Local Model Processing

Use local Ollama models instead of cloud APIs:
just process-domain guantanamo --local
Requires Ollama installed with a compatible model (e.g., llama3.1:8b).

Concurrency Settings

Configure parallel processing in config.yaml:
performance:
  concurrency:
    extract_workers: 8        # Parallel articles
    extract_per_article: 4    # Parallel entity types per article
    llm_in_flight: 16         # Max concurrent API calls
Higher concurrency speeds up processing but increases API costs and memory usage. Start with defaults and adjust based on your needs.
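The llm_in_flight cap behaves like a semaphore around LLM calls: no matter how many articles are in flight, only that many API requests run at once. A minimal sketch of the idea (not Hinbox's internals), with asyncio and a placeholder call:

```python
import asyncio

LLM_IN_FLIGHT = 4  # stands in for performance.concurrency.llm_in_flight

async def fake_llm_call(article_id: int) -> str:
    # Placeholder for a real extraction API call.
    await asyncio.sleep(0.01)
    return f"entities for article {article_id}"

async def main() -> list:
    sem = asyncio.Semaphore(LLM_IN_FLIGHT)

    async def bounded(article_id: int) -> str:
        async with sem:  # at most LLM_IN_FLIGHT calls run concurrently
            return await fake_llm_call(article_id)

    return await asyncio.gather(*(bounded(i) for i in range(10)))

results = asyncio.run(main())
print(len(results))
```

Raising the semaphore limit increases throughput until you hit API rate limits or memory pressure, which is why starting with the defaults is recommended.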

Extraction Caching

Hinbox caches extraction results to avoid reprocessing unchanged articles:
cache:
  enabled: true
  extraction:
    enabled: true
    subdir: "cache/extractions"
    version: 1  # Bump to invalidate cache
  articles:
    skip_if_unchanged: true  # Skip if content hash matches
Cache is keyed on:
  • Article content hash
  • Model name
  • Prompt text
  • Schema structure
  • Temperature setting
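Conceptually, a cache key of this kind is a hash over every input that affects the extraction result, so changing any one of them invalidates the entry. An illustrative sketch (not Hinbox's actual key function):

```python
import hashlib
import json

def extraction_cache_key(content, model, prompt, schema, temperature, version=1):
    # Any change to these inputs yields a new key, which mirrors the
    # cache-key fields listed above (plus the config's version bump).
    payload = json.dumps(
        {
            "content_sha256": hashlib.sha256(content.encode()).hexdigest(),
            "model": model,
            "prompt": prompt,
            "schema": schema,
            "temperature": temperature,
            "version": version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = extraction_cache_key("article text", "model-a", "Extract people...", ["name", "type"], 0.0)
k2 = extraction_cache_key("article text", "model-a", "Extract people...", ["name", "type"], 0.2)
print(k1 == k2)  # False: changing the temperature invalidates the entry
```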

Batch Processing

Process articles in batches:
processing:
  batch_size: 5  # Process 5 articles before merging

Monitoring Progress

Check processing statistics:
just check
Output:
Article Database Statistics
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total articles: 1,245
Processed: 856 (68.7%)
Skipped (not relevant): 123 (9.9%)
Pending: 266 (21.4%)

Troubleshooting

Extraction Quality Issues

Problem: Entities not extracted correctly
Solutions:
  1. Review extraction prompts in configs/your_domain/prompts/
  2. Add more specific examples to your entity type definitions
  3. Test with --verbose to see extraction decisions
  4. Adjust prompts based on your source types (books vs. articles)

Deduplication Problems

Problem: Same entity appearing multiple times
Solutions:
  1. Adjust similarity thresholds in config.yaml:
    dedup:
      similarity_thresholds:
        people: 0.82  # Higher = more strict
    
  2. Add name variant equivalence groups:
    dedup:
      name_variants:
        organizations:
          equivalence_groups:
            - ["Department of Defense", "DoD", "Pentagon"]
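Conceptually, an equivalence group maps every listed variant to one canonical name before any similarity comparison runs. A sketch of that normalization (not Hinbox's implementation):

```python
# Mirrors the config example above; groups are illustrative.
EQUIVALENCE_GROUPS = [
    ["Department of Defense", "DoD", "Pentagon"],
]

# Map every variant (case-insensitively) to the group's first name.
CANONICAL = {
    variant.lower(): group[0]
    for group in EQUIVALENCE_GROUPS
    for variant in group
}

def canonical_name(name: str) -> str:
    return CANONICAL.get(name.lower(), name)

print(canonical_name("DoD"))       # Department of Defense
print(canonical_name("Pentagon"))  # Department of Defense
```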
    

Performance Issues

Problem: Processing too slow
Solutions:
  1. Increase concurrency settings (see above)
  2. Enable extraction caching
  3. Use local models for faster processing
  4. Process in smaller batches with --limit

API Rate Limits

Problem: Cloud API rate limit errors
Solutions:
  1. Reduce llm_in_flight concurrency
  2. Switch to local model processing with --local
  3. Process in smaller batches

Processing Workflow

Recommended workflow for a new domain:
1. Test with small sample

just process-domain your_domain --limit 2 --verbose
2. Review extraction quality

Check output files and adjust prompts/categories as needed
3. Process larger sample

just process-domain your_domain --limit 50
4. Check deduplication

Browse entities in web interface, adjust thresholds if needed
5. Process full dataset

just process-domain your_domain

Next Steps

Web Interface

Browse and explore your extracted entities

Configuration

Fine-tune deduplication and performance settings

Data Format

Understand the output Parquet schema

Creating Domains

Refine your domain configuration
