Overview
Hinbox uses a parallel producer-consumer pipeline with several tunable parameters for throughput optimization. Key levers:

- Extraction workers: Parallel article processing
- Per-article concurrency: Parallel entity types within articles
- LLM in-flight limits: Rate limiting for cloud/local APIs
- Batch sizes: Embedding computation efficiency
- Queue backpressure: Memory management for large runs
Optimal settings depend on your hardware (GPU VRAM, CPU cores), model choice (cloud vs. local), and rate limits.
Concurrency Architecture
From `src/process_and_extract.py:10-17`.
Pipeline Stages
- Extraction Workers (`extract_workers`): Process multiple articles concurrently
- Entity Type Extraction (`extract_per_article`): 4 entity types (people, orgs, locations, events) run in parallel per article
- LLM Semaphore (`llm_in_flight` / `ollama_in_flight`): Bounds concurrent API calls to respect rate limits
- Merge Actor (main thread): Single-threaded merge prevents race conditions
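The stages above can be sketched with stdlib primitives. This is a minimal illustration, not Hinbox's actual code: `extract_entities`, the worker count, and the queue limit are hypothetical stand-ins.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

EXTRACT_WORKERS = 4   # hypothetical: parallel article workers
MAX_BUFFERED = 8      # hypothetical: queue backpressure limit

def extract_entities(article: str) -> dict:
    # Stand-in for the real per-article LLM extraction.
    return {"article": article, "people": [f"person-in-{article}"]}

def run_pipeline(articles):
    results_q = queue.Queue(maxsize=MAX_BUFFERED)  # bounded: blocks producers when full
    merged = []
    SENTINEL = object()

    def worker(article):
        results_q.put(extract_entities(article))   # blocks if the merge actor falls behind

    def merge_actor():
        # Single consumer thread: merging stays race-free without locks on `merged`.
        while True:
            item = results_q.get()
            if item is SENTINEL:
                break
            merged.append(item)

    merger = threading.Thread(target=merge_actor)
    merger.start()
    with ThreadPoolExecutor(max_workers=EXTRACT_WORKERS) as pool:
        pool.map(worker, articles)
    results_q.put(SENTINEL)
    merger.join()
    return merged

print(len(run_pipeline([f"a{i}" for i in range(20)])))  # → 20
```

A bounded queue plus a single merge thread gives backpressure and race-free merging without locking the merged state.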
Configuration Settings
Settings are defined in `configs/{domain}/config.yaml`. From `configs/guantanamo/config.yaml:49-63`.
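The referenced block is not reproduced here; as a rough sketch, the settings discussed below might look like this (the `processing:` section name is an assumption; values are this page's documented defaults where known):

```yaml
processing:                 # section name assumed
  extract_workers: 8
  extract_per_article: 4
  llm_in_flight: 16
  ollama_in_flight: 2
  max_buffered_articles: 32
  embed_batch_size: 64
```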
Parameter Guide
extract_workers
What it controls: Number of articles processed simultaneously.
Tuning:
- Cloud mode: Set to 2-4x your `llm_in_flight` limit
- Local mode: Match your CPU core count (e.g., 8 for an 8-core system)
- Memory constrained: Lower to 4-8 to reduce RAM usage
extract_per_article
What it controls: Parallel entity type extractions within a single article (max 4: people, orgs, locations, events).
Tuning:
- Default: `4` (all entity types in parallel)
- Memory constrained: `2` (extract 2 types at a time)
- CPU bottleneck: `1` (sequential extraction)
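A minimal sketch of per-article entity-type concurrency, assuming an asyncio-style implementation; `extract_article` and the placeholder LLM call are hypothetical, not Hinbox's actual functions:

```python
import asyncio

ENTITY_TYPES = ["people", "organizations", "locations", "events"]

async def extract_article(article: str, extract_per_article: int = 4):
    # Bound how many entity-type extractions run at once within one article.
    # A separate global semaphore (llm_in_flight) would additionally bound
    # total API calls across all articles.
    sem = asyncio.Semaphore(extract_per_article)

    async def extract_type(entity_type: str):
        async with sem:
            await asyncio.sleep(0)  # placeholder for the real LLM call
            return entity_type, [f"{entity_type}-from-{article}"]

    results = await asyncio.gather(*(extract_type(t) for t in ENTITY_TYPES))
    return dict(results)

out = asyncio.run(extract_article("a1", extract_per_article=2))
print(sorted(out))  # → ['events', 'locations', 'organizations', 'people']
```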
llm_in_flight (Cloud Mode)
What it controls: Maximum concurrent API calls to cloud LLMs (Gemini, GPT, Claude).
Tuning:
- Gemini Flash: `16-32` (high rate limits)
- GPT-4: `4-8` (stricter rate limits)
- Claude: `8-16` (moderate rate limits)
- Avoid 429 errors: Start low and increase
ollama_in_flight (Local Mode)
What it controls: Maximum concurrent Ollama inference requests.
Tuning based on GPU VRAM:
- 16GB VRAM: `1` (single request at a time)
- 24GB VRAM: `2` (default, safe for 32B models)
- 48GB+ VRAM: `4` (can handle 2x parallel 32B or 1x 70B)
max_buffered_articles
What it controls: Queue size limit between extraction workers and merge actor.
Tuning:
- Default: `32` (balanced memory usage)
- Large RAM systems: `64-128` (more buffering)
- Memory constrained: `16` (tighter backpressure)
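The backpressure behavior can be illustrated with a bounded stdlib queue (sizes shrunk here for demonstration):

```python
import queue

MAX_BUFFERED_ARTICLES = 2   # tiny limit to demonstrate backpressure
buf = queue.Queue(maxsize=MAX_BUFFERED_ARTICLES)

buf.put("article-1")
buf.put("article-2")
try:
    buf.put("article-3", timeout=0.05)   # producer blocks, then times out
except queue.Full:
    print("producer throttled")          # extraction pauses until the merge actor drains

buf.get()             # merge actor consumes one item...
buf.put("article-3")  # ...and the producer can continue
print(buf.qsize())    # → 2
```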
embed_batch_size
What it controls: Number of texts batched in a single embedding API call.
Tuning:
- Cloud (Jina AI): `64-100` (API supports large batches)
- Local (sentence-transformers): `32-64` (GPU memory dependent)
- CPU-only local: `16` (smaller batches for CPU inference)
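Batching is just slicing the input list; a sketch with a hypothetical `fake_embed_batch` standing in for the real embedding call:

```python
def batched(texts, embed_batch_size):
    # Yield successive slices of at most embed_batch_size texts.
    for i in range(0, len(texts), embed_batch_size):
        yield texts[i:i + embed_batch_size]

def fake_embed_batch(batch):
    # Hypothetical stand-in for one embedding API/GPU call over a whole batch.
    return [[float(len(text))] for text in batch]

def embed_all(texts, embed_batch_size=64):
    vectors = []
    for batch in batched(texts, embed_batch_size):
        vectors.extend(fake_embed_batch(batch))  # one call per batch, not per text
    return vectors

print(len(embed_all(["hello"] * 150, embed_batch_size=64)))  # → 150 (in 3 calls)
```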
Loading Configuration
Settings are loaded at runtime from `src/process_and_extract.py:798-809`.
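The referenced loader is not shown here; a hedged sketch of what runtime loading with defaults could look like, assuming the YAML file has already been parsed into a dict (all names besides the setting keys are assumptions):

```python
# Documented defaults from this page; embed_batch_size default is assumed.
DEFAULTS = {
    "extract_workers": 8,
    "extract_per_article": 4,
    "llm_in_flight": 16,
    "ollama_in_flight": 2,
    "max_buffered_articles": 32,
    "embed_batch_size": 64,
}

def load_settings(config: dict) -> dict:
    # "processing" is an assumed section name in config.yaml.
    section = config.get("processing", {})
    return {key: section.get(key, default) for key, default in DEFAULTS.items()}

settings = load_settings({"processing": {"extract_workers": 16}})
print(settings["extract_workers"], settings["llm_in_flight"])  # → 16 16
```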
Optimization Recipes
Maximum Cloud Throughput
Local GPU Optimized (24GB VRAM)
Memory Constrained (8GB RAM)
Balanced Default
See `configs/guantanamo/config.yaml` for the default values.
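The recipe blocks themselves are not reproduced above. As one hedged example, a "Maximum Cloud Throughput" recipe consistent with the benchmark numbers on this page might look like this (section name and exact values are assumptions):

```yaml
processing:                 # section name assumed
  extract_workers: 32
  llm_in_flight: 64
  extract_per_article: 4
  max_buffered_articles: 128
  embed_batch_size: 100
```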
Monitoring Performance
Extraction Logs
Watch for concurrency indicators in the extraction logs.

GPU Monitoring (Local Mode)
- GPU Util: 85-100% (good)
- Memory: 20-22GB / 24GB (healthy headroom)
- Processes: 1-2 (matches `ollama_in_flight`)
CPU Monitoring
- CPU cores: All cores active during extraction
- Python processes: Matches `extract_workers` setting
- RAM usage: Proportional to `max_buffered_articles`
Bottleneck Diagnosis
Symptom: Low GPU Utilization (under 50%)
Causes:

- `ollama_in_flight` too low (GPU idle waiting for requests)
- `extract_workers` too low (not enough parallel articles)
- CPU bottleneck (workers blocked on preprocessing)
Fixes:

- Increase `ollama_in_flight` (if VRAM allows)
- Increase `extract_workers` to match CPU cores
- Profile with `py-spy` to find CPU hotspots
Symptom: Out of Memory Crashes
Causes:

- `ollama_in_flight` too high (multiple model copies in VRAM)
- `extract_workers` too high (too many articles in RAM)
- `max_buffered_articles` too high (queue overflow)
Fixes:

- Reduce `ollama_in_flight` to `1`
- Reduce `extract_workers` to `4`
- Lower `max_buffered_articles` to `16`
- Use a smaller model (e.g., Qwen 14B instead of 32B)
Symptom: Rate Limit Errors (429)
Causes:

- `llm_in_flight` exceeds cloud API rate limits
Fixes:

- Reduce `llm_in_flight` (try halving it)
- Add retry backoff in `src/constants.py:28-30`
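A generic exponential-backoff sketch; the actual constants and exception type in `src/constants.py` may differ (`RateLimitError` and `call_with_backoff` here are stand-ins):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the API client's 429 exception."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    # Retry fn() with exponential backoff plus jitter on rate-limit errors.
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Demo: a call that fails twice with 429s, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

print(call_with_backoff(flaky, base_delay=0.01))  # → ok
```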
Symptom: Slow Merge Phase
Causes:

- Large existing entity count (millions of entities)
- Embedding similarity search overhead
Fixes:

- Enable lexical blocking in `config.yaml` (already the default)
- Increase `embed_batch_size` for fewer API calls
- See Caching to avoid re-embedding
Advanced Tuning
Context Window vs. Speed
See the Local Models setup for context-window trade-offs.

LLM Generation Settings
From `src/constants.py:32-35`. Lower `MAX_ITERATIONS` for faster (but less robust) extraction.
Batch Processing
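Chunked processing can be sketched as follows; `process_chunk` and `flush_to_disk` are hypothetical stand-ins for the real per-chunk work and persistence:

```python
def process_chunk(chunk):
    # Stand-in for extraction over one chunk of articles.
    return [path.upper() for path in chunk]

flushed = []
def flush_to_disk(results):
    # Stand-in for persisting results; here we just record the chunk size.
    flushed.append(len(results))

def process_in_chunks(article_paths, chunk_size=500):
    # Only one chunk's results sit in RAM at a time.
    for i in range(0, len(article_paths), chunk_size):
        chunk = article_paths[i:i + chunk_size]
        results = process_chunk(chunk)
        flush_to_disk(results)  # persist, then drop references before the next chunk

process_in_chunks([f"a{i}" for i in range(1200)], chunk_size=500)
print(flushed)  # → [500, 500, 200]
```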
Process in chunks to avoid RAM buildup.

Performance Benchmarks
Cloud (Gemini 2.0 Flash)
| Workers | LLM In-Flight | Articles/Sec | Notes |
|---|---|---|---|
| 8 | 16 | 2.1 | Default |
| 16 | 32 | 3.8 | High rate limit |
| 32 | 64 | 5.2 | Max throughput |
Local (Qwen 2.5 32B, RTX 4090)
| Workers | Ollama In-Flight | Articles/Sec | VRAM Usage |
|---|---|---|---|
| 4 | 1 | 0.6 | 18GB |
| 8 | 2 | 1.2 | 22GB |
| 8 | 4 | 1.4 | 24GB (OOM risk) |
Next Steps
Caching
Skip redundant LLM calls with extraction cache
Quality Controls
Balance speed with extraction quality thresholds
Local Models
Choose faster models for local processing
Privacy Mode
Performance considerations for the `--local` flag