Overview

Data management commands help you inspect articles, reset processing state, and launch the web interface for browsing extracted entities.

Article Statistics

just check

Check article database statistics and processing status.
just check [--sample]
Alias: just stats
Source: scripts/check_articles_parquet.py

--sample (flag, default: false)
Display details of a sample article from the dataset, including a content preview and processing metadata.

Usage Examples

just check

Output Format

Basic Statistics

$ just check

Checking articles in parquet file

Total articles in parquet file: 1,247

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric              ┃ Count  ┃ Percentage ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Articles      │ 1,247  │ 100%       │
│ Processed           │ 342    │ 27.4%      │
│ Unprocessed         │ 905    │ 72.6%      │
│ Relevance Checked   │ 425    │ 34.1%      │
└─────────────────────┴────────┴────────────┘

Date Range:
  Oldest: First Reports Surface of Detention Facility Opening (2002-01-15)
  Newest: Supreme Court Hears Guantanamo Bay Case Arguments (2024-03-12)

Check completed

Statistics Breakdown

Metric             Description
Total Articles     All articles in the Parquet file
Processed          Articles that have completed entity extraction
Unprocessed        Articles awaiting processing
Relevance Checked  Articles that have been evaluated for domain relevance
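
The counts and percentages in the table can be derived from per-article flags. The following is a minimal sketch of that arithmetic, not the code in scripts/check_articles_parquet.py. The record keys `processed` and `relevance_checked` are assumed names for the flags described above, and the one-decimal formatting is a choice made here for illustration.

```python
def summarize(articles):
    """Return (metric, count, percentage) rows for a list of article dicts.

    Assumes each record may carry boolean `processed` and
    `relevance_checked` keys (hypothetical field names); a missing key
    is treated as False.
    """
    total = len(articles)
    processed = sum(1 for a in articles if a.get("processed"))
    checked = sum(1 for a in articles if a.get("relevance_checked"))

    def pct(count):
        # Avoid dividing by zero when there are no articles at all.
        return f"{100 * count / total:.1f}%" if total else "n/a"

    return [
        ("Total Articles", total, pct(total)),
        ("Processed", processed, pct(processed)),
        ("Unprocessed", total - processed, pct(total - processed)),
        ("Relevance Checked", checked, pct(checked)),
    ]
```

Note that Unprocessed is computed as the complement of Processed, so those two rows always sum to the total, while Relevance Checked is counted independently.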

Sample Article Output

$ just check --sample

# ... statistics table ...

Sample Article:
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Field            ┃ Value                                              ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Title            │ Pentagon Releases Names of Guantanamo Detainees    │
│ URL              │ https://www.miamiherald.com/news/article123...     │
│ Published        │ 2006-04-20                                         │
│ Author           │ Carol Rosenberg                                    │
│ Content Preview  │ The Pentagon on Thursday released the names of...  │
│ Processed        │ True                                               │
│ Processing Date  │ 2024-03-15T14:23:45                                │
└──────────────────┴────────────────────────────────────────────────────┘

Processing Status Reset

just reset

Reset processing status of all articles, allowing them to be reprocessed.
just reset
Source: scripts/reset_processing_status.py
This command modifies the articles Parquet file in place. It does NOT delete extracted entities; it only clears the “processed” flag on articles.

Interactive Confirmation

The command prompts for confirmation before resetting:
$ just reset

This will reset ALL articles. Are you sure? (y/N): y
Reset processing status for 1,247 articles
To cancel:
$ just reset

This will reset ALL articles. Are you sure? (y/N): n
Reset cancelled.
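
The (y/N) prompt defaults to "no": anything other than an explicit yes cancels the reset. The sketch below illustrates that pattern under stated assumptions. It is not the code in scripts/reset_processing_status.py; the function names, the `processed` key, and the choice to also accept "yes" are all hypothetical.

```python
def is_confirmed(answer):
    """Interpret a reply to a "(y/N)" prompt; anything but y/yes means no."""
    return answer.strip().lower() in {"y", "yes"}


def reset_all(articles, answer):
    """Clear the processed flag on every article if the user confirmed.

    Returns the number of articles reset (0 when the reset is cancelled).
    The `processed` key is an assumed field name.
    """
    if not is_confirmed(answer):
        return 0
    for article in articles:
        article["processed"] = False
    return len(articles)
```

Treating an empty reply as "no" matches the capital N in the prompt, so pressing Enter by accident leaves the data untouched.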

Use Cases

  • When you’ve modified extraction prompts or entity schemas and want to reprocess all articles with the new configuration.
  • When you’ve upgraded your LLM or improved extraction logic and want to regenerate all entities.
  • During development, when you need to test the full pipeline repeatedly.
  • If a previous processing run was interrupted and you want to start fresh.
The reset command only affects article metadata. Your extracted entities in data/<domain>/output/*.parquet are not deleted. To start completely fresh, manually delete the entity Parquet files as well.
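
For the manual cleanup described above, a small helper can list the entity files first and delete them only on request. This is a hedged sketch: the function names are hypothetical, the `data_root` default simply mirrors the data/<domain>/output/ layout mentioned above, and it deliberately matches only *.parquet so that other files in the output directory are left alone.

```python
from pathlib import Path


def entity_files(domain, data_root="data"):
    """List entity Parquet files under <data_root>/<domain>/output/."""
    return sorted((Path(data_root) / domain / "output").glob("*.parquet"))


def delete_entity_files(domain, data_root="data", dry_run=True):
    """Delete entity Parquet files; with dry_run=True only report them.

    Returns the list of matched paths in either case.
    """
    files = entity_files(domain, data_root)
    if not dry_run:
        for path in files:
            path.unlink()
    return files
```

Running with the default `dry_run=True` first shows exactly which files would be removed.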

Web Interface

just frontend

Start the web interface for browsing extracted entities.
just frontend
Aliases: just web, just ui
Source: src/frontend/
Port: 5001

Accessing the Interface

After starting the server:
$ just frontend

Open http://localhost:5001 in your browser
 * Serving FastHTML app
 * Running on http://localhost:5001

Web Interface Features

Home Page

  • Entity counts by type (people, events, locations, organizations)
  • Recent entities grid with confidence indicators
  • Domain switcher to view different projects

Entity Browse Pages

Browse all extracted people with:
  • Name and aliases
  • Roles and affiliations
  • Tag filters (military, political, legal, etc.)
  • Confidence scores
  • Related article counts

Entity Detail Pages

Each entity has a detail page showing:
  • Profile text: AI-generated summary from source articles
  • Profile versions: Historical versions with timestamps
  • Confidence score: Extraction quality indicator
  • Aliases: Alternative names found in sources
  • Tags: Categorization labels
  • Related articles: Source articles with citations
  • Grounding report: Citation verification scores

Design System

The interface uses the “Archival Elegance” design system:
  • Fonts: Crimson Pro (headings), IBM Plex Sans (body)
  • Colors: Warm teal-slate primary, amber accents
  • Layout: Sidebar filters + main content area
  • Style: Minimalist, research-focused aesthetic
The frontend is built with FastHTML for fast, server-rendered HTML with minimal JavaScript.

Data Inspection Workflow

  1. Check Article Status
     Run just check to see how many articles are processed vs. unprocessed.

  2. Inspect Sample
     Run just check --sample to see an example article and verify data quality.

  3. Process Articles
     Run just process --limit N to extract entities from N articles.

  4. Launch Web Interface
     Run just frontend to browse extracted entities in your browser.

  5. Review Results
     Use the web interface to review entity quality, profiles, and citations.

  6. Iterate if Needed
     If quality is low, adjust prompts/config, run just reset, and reprocess.

File Locations

Data management commands operate on these files:
File                                         Purpose             Modified By
data/<domain>/raw_sources/articles.parquet   Source articles     just reset
data/<domain>/output/processing_status.json  Processing sidecar  just process
data/<domain>/output/*.parquet               Extracted entities  just process
Always back up your data before running just reset, especially if you have made manual corrections to your articles.
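
One way to take that backup is a timestamped copy next to the original file. The helper below is an illustrative sketch, not part of the project: `backup_file` and its naming scheme are assumptions, and the `now` parameter exists only so the timestamp can be fixed in tests.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_file(path, now=None):
    """Copy `path` to a timestamped sibling and return the backup path."""
    source = Path(path)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    # e.g. articles.parquet -> articles.20240315-142345.bak.parquet
    target = source.with_name(f"{source.stem}.{stamp}.bak{source.suffix}")
    shutil.copy2(source, target)  # copy2 also preserves file metadata
    return target
```

For example, `backup_file("data/<domain>/raw_sources/articles.parquet")` with your own domain name substituted would leave the original in place and write the copy alongside it.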

Additional Data Commands

Miami Herald Specific (Legacy)

These commands are specific to the Guantanamo Bay / Miami Herald dataset:
just fetch-miami
Fetch Miami Herald articles from the source API.
Source: scripts/get_miami_herald_articles.py
This is a legacy command for the original Guantanamo domain. For custom domains, you’ll provide your own article sources.

just import-miami
Import Miami Herald articles from JSONL format to Parquet.
Source: scripts/import_miami_herald_articles.py
Converts JSONL article exports into the Parquet format used by the pipeline.

Monitoring Processing Progress

Combine commands to monitor progress:
# Initial state
just check

# Process 50 articles
just process --limit 50

# Check progress
just check

# Continue processing
just process --limit 50

# View results in browser
just frontend
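
The same check/process/check loop can be scripted when you want to run several batches unattended. The following Python sketch only assembles and runs the commands shown above (`just check` and `just process --limit N`). The function names and the injectable `runner` parameter are hypothetical conveniences, and it assumes `just` is on your PATH.

```python
import subprocess


def batch_commands(batch_size, batches):
    """Build the sequence: check, then (process, check) for each batch."""
    commands = [["just", "check"]]
    for _ in range(batches):
        commands.append(["just", "process", "--limit", str(batch_size)])
        commands.append(["just", "check"])
    return commands


def run_batches(batch_size=50, batches=2, runner=subprocess.run):
    """Run the sequence, stopping at the first command that fails."""
    for command in batch_commands(batch_size, batches):
        # With subprocess.run, check=True raises CalledProcessError
        # on a non-zero exit status, which aborts the remaining batches.
        runner(command, check=True)
```

Calling `run_batches(50, 2)` reproduces the manual sequence above, minus the final `just frontend`, which starts a long-running server and is better launched separately.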

Command Reference Summary

Command              Purpose                      Interactive
just check           Show article statistics      No
just check --sample  Show stats + sample article  No
just stats           Alias for just check         No
just reset           Reset processing status      Yes (confirms)
just frontend        Start web interface          Yes (server)
just web             Alias for just frontend      Yes (server)
just ui              Alias for just frontend      Yes (server)

Troubleshooting

Error: ERROR: Articles file not found at data/.../articles.parquet
Solution: Ensure your articles Parquet file exists at the path specified in your domain config, or provide a custom path with --articles-path.

Error: OSError: [Errno 48] Address already in use
Solution: Another process is using port 5001. Either:
  • Stop the other process
  • Kill the existing frontend: pkill -f "python -m src.frontend"
  • Change the port in src/frontend/app_config.py
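
Before starting the frontend, you can check whether the port is already taken. This is a minimal sketch using only the standard library. `port_in_use` is a hypothetical helper, and it probes 127.0.0.1 by default, which is an assumption about the address the server binds to.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding host:port fails, i.e. the port is unavailable.

    Any OSError counts as unavailable, so a permission error on a
    privileged port is reported the same way as "address already in use".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
    return False
```

If `port_in_use(5001)` returns True, apply one of the options listed above before running just frontend again.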

Problem: The web interface shows no entities.
Cause: No articles have been processed yet, or entity Parquet files are missing.
Solution:
  1. Run just check to verify processing status
  2. Run just process --limit 10 to process some articles
  3. Refresh the frontend browser page

Problem: A reset was cancelled by mistake.
Solution: Run just reset again and enter y when prompted.
