Basic Processing
Process all articles in your domain. A full processing run:
- Loads articles from your configured data source
- Checks relevance to your research domain
- Extracts entities using AI models
- Merges and deduplicates entities
- Saves results to Parquet files
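The simplest invocation processes everything in one go. For example (the ./run process entry point shown here is an assumption about your installation; substitute your actual command):

```bash
# Process all articles in the configured domain
./run process
```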
Processing Options
Limit Number of Articles
Process only a specific number of articles with the --limit flag. Use
--limit when testing your configuration or exploring a new dataset.
Start small (2-5 articles) to verify extraction quality before processing
thousands of documents.
Verbose Output
See detailed extraction information:
- Relevance check decisions
- Extracted entity counts per article
- Merge decisions (new vs. existing entities)
- Processing times per stage
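For example (the ./run process entry point is an assumption; the --verbose flag is named in the troubleshooting notes below):

```bash
# Show per-article relevance and extraction decisions
./run process --verbose
```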
Force Reprocessing
Reprocess articles even if already processed.
Combined Options
The options above can be combined in a single run.
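For example (the ./run process entry point and the --force flag name are assumptions based on the section titles on this page):

```bash
# Re-run five articles with detailed logging, ignoring prior results
./run process --limit 5 --verbose --force
```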
Processing Pipeline
The processing pipeline consists of several stages:
Load Articles
Hinbox reads articles from your configured Parquet file. Required columns:
title, content, url, published_date, source_type
Check Relevance
The relevance checker filters out articles not relevant to your research domain. Configure the relevance criteria in config.yaml.
Extract Entities
For each relevant article, Hinbox extracts all four entity types in parallel:
- People: Individuals mentioned in the text
- Organizations: Groups, agencies, companies, institutions
- Locations: Places, facilities, geographic regions
- Events: Significant occurrences with dates
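The parallel fan-out over the four entity types can be sketched as follows (a minimal illustration using stdlib threads; the stub extractors stand in for Hinbox's real LLM-backed ones, and the function names here are illustrative, not Hinbox's API):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub extractors standing in for the real LLM-backed ones.
def extract_people(text): return [{"name": "Jane Doe", "type": "journalist"}]
def extract_organizations(text): return [{"name": "Example Agency"}]
def extract_locations(text): return [{"name": "Example City"}]
def extract_events(text): return []

EXTRACTORS = {
    "people": extract_people,
    "organizations": extract_organizations,
    "locations": extract_locations,
    "events": extract_events,
}

def extract_all(article_text):
    """Run all four entity extractors on one article in parallel."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {kind: pool.submit(fn, article_text)
                   for kind, fn in EXTRACTORS.items()}
        return {kind: f.result() for kind, f in futures.items()}

results = extract_all("…article content…")
```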
Extraction prompts for each entity type live in configs/your_domain/prompts/.
Merge & Deduplicate
Extracted entities are merged with existing entities using:
- Lexical blocking: Fast fuzzy string matching (RapidFuzz)
- Embedding similarity: Semantic similarity using embedding models
- LLM match checking: AI verification for ambiguous cases
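The three stages can be illustrated with a simplified sketch (stdlib difflib stands in for RapidFuzz, character trigrams stand in for a real embedding model, the LLM check is stubbed out, and the thresholds are illustrative, not Hinbox's defaults):

```python
from difflib import SequenceMatcher

def lexical_score(a, b):
    """Cheap fuzzy string match (RapidFuzz plays this role in Hinbox)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def embedding_similarity(a, b):
    """Stand-in for cosine similarity between embedding vectors."""
    grams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def llm_says_same(a, b):
    """Stub for the LLM verification call used on ambiguous pairs."""
    return True  # pretend the model confirmed the match

def is_duplicate(candidate, existing):
    # Stage 1: lexical blocking — discard clearly different names fast.
    if lexical_score(candidate, existing) < 0.5:
        return False
    # Stage 2: embedding similarity — accept near-certain matches outright.
    if embedding_similarity(candidate, existing) > 0.9:
        return True
    # Stage 3: LLM check for the ambiguous middle ground.
    return llm_says_same(candidate, existing)

print(is_duplicate("Intl. Red Cross", "International Red Cross"))
```

The point of the staged design is cost: cheap string matching prunes most candidate pairs before any model is invoked, and the expensive LLM call only runs on the ambiguous remainder.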
Understanding the Output
Console Output
During processing, you’ll see structured log output.
Entity Files
Each entity type is saved to a separate Parquet file with these columns:
People (people.parquet):
- name: Person’s name
- type: Person type (detainee, lawyer, journalist, etc.)
- profile: Generated profile with tags and narrative text
- aliases: Alternative names
- confidence: Extraction confidence score
- articles: List of source articles
- last_updated: Last modification timestamp
Organizations (organizations.parquet):
- name: Organization name
- type: Organization type
- profile: Description and context
- aliases: Alternative names and acronyms
- articles: Source articles
Locations (locations.parquet):
- name: Location name
- type: Location type
- profile: Geographic and contextual information
- articles: Source articles
Events (events.parquet):
- title: Event name
- type: Event type
- start_date: Event date
- profile: Event description and context
- articles: Source articles
Advanced Processing
Local Model Processing
Use local Ollama models instead of cloud APIs. Make sure the configured model (e.g. llama3.1:8b) is available in your local Ollama installation.
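For example (ollama pull is the standard Ollama command for downloading a model; the ./run process entry point and --local flag usage here are assumptions based on this page's troubleshooting notes):

```bash
# Download the model once, then run processing against it
ollama pull llama3.1:8b
./run process --local
```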
Concurrency Settings
Configure parallel processing in config.yaml:
Higher concurrency speeds up processing but increases API costs and memory
usage. Start with defaults and adjust based on your needs.
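A sketch of what this might look like (the llm_in_flight key appears in the troubleshooting notes below; the surrounding structure and values are illustrative, not Hinbox's actual schema):

```yaml
processing:
  llm_in_flight: 4       # concurrent LLM extraction calls
  articles_in_flight: 2  # articles processed simultaneously (key name illustrative)
```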
Extraction Caching
Hinbox caches extraction results to avoid reprocessing unchanged articles. The cache key incorporates:
- Article content hash
- Model name
- Prompt text
- Schema structure
- Temperature setting
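A simplified sketch of how such a cache key can be derived from those inputs (illustrative only; Hinbox's actual key construction may differ):

```python
import hashlib
import json

def cache_key(article_content, model_name, prompt_text, schema, temperature):
    """Combine every input that affects extraction into one stable hash."""
    payload = json.dumps(
        {
            "content_hash": hashlib.sha256(article_content.encode()).hexdigest(),
            "model": model_name,
            "prompt": prompt_text,
            "schema": schema,          # e.g. the entity fields being extracted
            "temperature": temperature,
        },
        sort_keys=True,  # stable ordering so identical inputs hash identically
    )
    return hashlib.sha256(payload.encode()).hexdigest()

key = cache_key("Some article text", "gpt-4o-mini",
                "Extract people from the text.", ["name", "type"], 0.0)
```

Because the model name, prompt, schema, and temperature are all part of the key, changing any of them invalidates the cache for every article, while unchanged articles are skipped on re-runs.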
Batch Processing
Process articles in batches when working with large datasets.
Monitoring Progress
Check processing statistics as articles are processed.
Troubleshooting
Extraction Quality Issues
Problem: Entities not extracted correctly
Solutions:
- Review extraction prompts in configs/your_domain/prompts/
- Add more specific examples to your entity type definitions
- Test with --verbose to see extraction decisions
- Adjust prompts based on your source types (books vs. articles)
Deduplication Problems
Problem: Same entity appearing multiple times
Solutions:
- Adjust similarity thresholds in config.yaml
- Add name variant equivalence groups
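For example (key names and values here are illustrative sketches of the idea, not Hinbox's exact configuration schema):

```yaml
deduplication:
  lexical_threshold: 85      # fuzzy-match score below which pairs are skipped
  embedding_threshold: 0.80  # semantic similarity needed to auto-merge
  equivalence_groups:
    - ["ICRC", "International Committee of the Red Cross"]
```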
Performance Issues
Problem: Processing too slow
Solutions:
- Increase concurrency settings (see above)
- Enable extraction caching
- Use local models for faster processing
- Process in smaller batches with --limit
API Rate Limits
Problem: Cloud API rate limit errors
Solutions:
- Reduce llm_in_flight concurrency
- Switch to local model processing with --local
- Process in smaller batches
Processing Workflow
Recommended workflow for a new domain: start with a small --limit run and --verbose output, review the extracted entities, refine your prompts and similarity thresholds, then process the full dataset.
Next Steps
Web Interface
Browse and explore your extracted entities
Configuration
Fine-tune deduplication and performance settings
Data Format
Understand the output Parquet schema
Creating Domains
Refine your domain configuration