Overview
The just process command runs the core article processing pipeline, which extracts entities (people, events, locations, organizations) from articles using either cloud-based (Gemini) or local (Ollama) language models.
Basic Usage
Main Command
just process
Process articles and extract entities with full control over processing options.
Source: src/process_and_extract.py
Command-Line Arguments
Model Selection
--local
Use the local approach (Ollama/spaCy) rather than cloud models (Gemini). Enforces privacy mode:
- Forces local embeddings
- Disables all LLM telemetry callbacks
- No data leaves your machine
Default: false (uses Gemini)

Processing Control
--limit
Maximum number of articles to process in this run.
Default: 5
Examples:
- --limit 1 — Process one article (useful for testing)
- --limit 100 — Process 100 articles
- --limit 5 — Default behavior
--force-reprocess
Process articles even if they've been processed before. By default, the pipeline skips articles that have already been extracted.
Use cases:
- Re-running after config changes
- Testing extraction improvements
- Regenerating profiles with new prompts
Default: false

--relevance-check
Perform a relevance check on each article before extraction. Articles deemed irrelevant to the domain are skipped, saving processing time and API costs.
Default: false

Domain Configuration
Domain configuration to use for processing. Determines:
- Entity types to extract
- Extraction prompts
- Output directory
- Merge thresholds
Run just domains to see all configured domains.

--articles-path
Path to the raw articles Parquet file. If not specified, uses the path from the domain configuration.
Example: --articles-path data/custom/articles.parquet

Output & Debugging
-v
Enable verbose logging for debugging. Shows:
- Reflection mechanism details
- Entity merge decisions
- Quality control checks
- LLM interactions
Default: false

Display full Rich profile panels during merge. By default, compact output is used.
Default: false (compact output)

Domain-Specific Commands
just process-domain
Convenience command for processing a specific domain.
Domain name to process (e.g., guantanamo)

just test-run
Quick test: process one article with verbose output and force reprocessing. Useful for:
- Testing extraction configuration
- Debugging entity extraction issues
- Verifying domain setup
Processing Pipeline
Load Configuration
Loads the domain config from configs/<domain>/, including entity types, prompts, and thresholds.

Process Articles
For each article:
- Check if already processed (skip unless --force-reprocess)
- Optionally check relevance (if --relevance-check)
- Extract entities (4 types in parallel)
- Merge with existing entities
- Update processing status
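The per-article flow above can be sketched roughly as follows. This is an illustrative sketch only: the function names, the article/entity data shapes, and the status field are hypothetical, not the actual API of src/process_and_extract.py.

```python
from concurrent.futures import ThreadPoolExecutor

ENTITY_TYPES = ["people", "events", "locations", "organizations"]

# Illustrative stand-ins for the real pipeline stages.
def is_processed(article):
    return article.get("done", False)

def is_relevant(article):
    return "keyword" in article["text"]

def extract(kind, article):
    return [f"{kind}:{article['id']}"]

def process_article(article, entities, *, force_reprocess=False, relevance_check=False):
    if is_processed(article) and not force_reprocess:
        return "skipped"                                 # already extracted
    if relevance_check and not is_relevant(article):
        return "irrelevant"                              # skip, saving LLM calls
    with ThreadPoolExecutor(max_workers=4) as pool:
        # the 4 entity types are extracted in parallel
        results = list(pool.map(lambda kind: extract(kind, article), ENTITY_TYPES))
    for kind, found in zip(ENTITY_TYPES, results):
        entities.setdefault(kind, []).extend(found)      # merge step, simplified
    article["done"] = True                               # update processing status
    return "processed"
```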
Output Files
Processing creates/updates files in data/<domain>/output/:
| File | Description |
|---|---|
| people.parquet | Extracted people entities |
| events.parquet | Extracted event entities |
| locations.parquet | Extracted location entities |
| organizations.parquet | Extracted organization entities |
| processing_status.json | Article processing status sidecar |
| extraction_cache/ | Cached extraction results (content-hash keyed) |
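The extraction cache is described as content-hash keyed: the cache key is derived from the article text itself, so editing an article invalidates its cached result. A minimal sketch of that idea (the real key scheme and file layout may differ):

```python
import hashlib
import json
from pathlib import Path

def cache_key(article_text: str) -> str:
    """Key cache entries by article content, so edited articles miss the cache."""
    return hashlib.sha256(article_text.encode("utf-8")).hexdigest()

def cached_extract(text: str, cache_dir: Path, extract_fn):
    path = cache_dir / f"{cache_key(text)}.json"
    if path.exists():                        # cache hit: reuse the prior result
        return json.loads(path.read_text())
    result = extract_fn(text)                # cache miss: run extraction
    path.write_text(json.dumps(result))
    return result
```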
Performance & Concurrency
The pipeline uses concurrent processing configured in your domain config (configs/<domain>/config.yaml).
The pipeline uses a producer-consumer pattern:
- Multiple extraction workers process articles in parallel
- A single merge actor (main thread) consumes results in order
- No locking needed since only one thread writes to entities
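The pattern above can be sketched with the standard library: multiple workers push extraction results onto a queue, and a single consumer drains it and merges. This is an illustration of the pattern, not the pipeline's actual code.

```python
import queue
import threading

results: "queue.Queue" = queue.Queue()
merged = []  # only the consumer (main thread) touches this, so no lock is needed

def worker(article_id: int):
    # simulate extraction, then hand the result to the single merger
    results.put({"id": article_id, "entities": [f"e{article_id}"]})

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

while not results.empty():       # single merge consumer drains the queue
    merged.append(results.get())
```

Because only the main thread appends to `merged`, the entity store has exactly one writer, which is the locking-free property the pipeline relies on.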
Terminal Output Examples
Standard Processing
Verbose Mode Output
Error Handling
Common Issues
Articles file not found
Error: ERROR: Articles file not found at data/guantanamo/raw_sources/miami_herald_articles.parquet
Solution: Ensure your articles Parquet file exists, or specify the path with --articles-path.

No API key configured
Error: GEMINI_API_KEY environment variable not set
Solution: Set your API key in .env or use --local for local-only processing.
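A fail-fast check for this condition can be written against the documented environment variable. The function name here is illustrative, not the pipeline's actual helper:

```python
import os

def require_gemini_key() -> str:
    """Fail fast with a clear message when GEMINI_API_KEY is missing."""
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY environment variable not set; "
            "set it in .env or pass --local for local-only processing"
        )
    return key
```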
Domain not found
Error: Domain 'myproject' not found
Solution: Initialize the domain first with just init myproject, or use an existing domain (see just domains).

Privacy Mode
When using --local, the system enforces complete privacy:
- ✅ All entity extraction uses Ollama (local LLM)
- ✅ All embeddings computed locally (no OpenAI API)
- ✅ LLM telemetry callbacks disabled
- ✅ No data sent to external services
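A minimal sketch of the kind of enforcement --local implies. The settings object and its field names are hypothetical, invented for illustration; the actual enforcement lives in src/process_and_extract.py.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineSettings:
    # Hypothetical settings object for illustration only.
    llm_backend: str = "gemini"
    embeddings: str = "openai"
    telemetry_callbacks: list = field(default_factory=lambda: ["llm_logger"])

def enforce_privacy(settings: PipelineSettings) -> PipelineSettings:
    """Apply the --local guarantees: local models only, no telemetry."""
    settings.llm_backend = "ollama"        # all extraction via local LLM
    settings.embeddings = "local"          # embeddings computed locally
    settings.telemetry_callbacks.clear()   # no callbacks, no data egress
    return settings
```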
See Also
- Domain Management - Initialize and configure domains
- Data Management - Check articles and reset status
- Configuration Reference - Domain config options