
Overview

Hinbox uses a parallel producer-consumer pipeline with multiple tunable parameters for throughput optimization. Key levers:
  • Extraction workers: Parallel article processing
  • Per-article concurrency: Parallel entity types within articles
  • LLM in-flight limits: Rate limiting for cloud/local APIs
  • Batch sizes: Embedding computation efficiency
  • Queue backpressure: Memory management for large runs
Optimal settings depend on your hardware (GPU VRAM, CPU cores), model choice (cloud vs. local), and rate limits.

Concurrency Architecture

From src/process_and_extract.py:10-17:
"""
Concurrency model (Phase 2 speed audit):
  - Multiple extraction workers process articles in parallel via ThreadPoolExecutor.
  - Within each article, the 4 entity-type extractions also run concurrently.
  - A shared LLM semaphore bounds cloud API concurrency.
  - A single merge actor (the main thread) consumes extraction results in article
    order and is the *only* writer to the shared entities dict and
    ProcessingStatus sidecar, so no locking is needed.
"""

Pipeline Stages

  1. Extraction Workers (extract_workers): Process multiple articles concurrently
  2. Entity Type Extraction (extract_per_article): 4 entity types (people, orgs, locations, events) run in parallel per article
  3. LLM Semaphore (llm_in_flight / ollama_in_flight): Bounds concurrent API calls to respect rate limits
  4. Merge Actor (main thread): Single-threaded merge prevents race conditions
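The four stages above can be sketched as a minimal producer-consumer loop. This is illustrative only: `extract_article`, `call_llm`, and the in-memory merge are hypothetical stand-ins, not Hinbox's actual API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

EXTRACT_WORKERS = 8       # parallel articles (extract_workers)
LLM_IN_FLIGHT = 16        # max concurrent LLM calls (llm_in_flight)
llm_semaphore = threading.Semaphore(LLM_IN_FLIGHT)

ENTITY_TYPES = ["people", "organizations", "locations", "events"]

def call_llm(article, entity_type):
    # Placeholder for a real LLM call; the shared semaphore bounds concurrency.
    with llm_semaphore:
        return f"{entity_type} extracted from {article}"

def extract_article(article):
    # Within one article, the 4 entity-type extractions run concurrently.
    with ThreadPoolExecutor(max_workers=len(ENTITY_TYPES)) as pool:
        results = list(pool.map(lambda t: call_llm(article, t), ENTITY_TYPES))
    return article, results

def run_pipeline(articles):
    merged = {}  # only this thread writes here, so no locking is needed
    with ThreadPoolExecutor(max_workers=EXTRACT_WORKERS) as pool:
        # pool.map preserves article order, matching the merge actor's contract
        for article, results in pool.map(extract_article, articles):
            merged[article] = results
    return merged

merged = run_pipeline(["a1", "a2", "a3"])
```

Note how the merge stays single-threaded even though extraction fans out: ordered `map` results make the main thread the sole writer.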

Configuration Settings

Settings are defined in configs/{domain}/config.yaml. From configs/guantanamo/config.yaml:49-63:
# Performance / concurrency configuration
performance:
  concurrency:
    extract_workers: 8        # parallel articles in extraction phase
    extract_per_article: 4    # parallel entity types within article
    llm_in_flight: 16         # max concurrent cloud LLM calls
    ollama_in_flight: 2       # max concurrent Ollama calls (local mode)
  queue:
    max_buffered_articles: 32  # backpressure limit for extraction -> merge

# Batching configuration (for embedding calls during merge)
batching:
  embed_batch_size: 64        # texts per embedding API call
  embed_drain_timeout_ms: 100 # reserved for future async drain behaviour
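Missing keys should fall back to sane defaults so a partial `performance:` block still works. A minimal sketch of that pattern (the loader body is an assumption for illustration, not Hinbox's actual implementation; `raw` stands in for the parsed YAML):

```python
# `raw` stands in for yaml.safe_load(open("configs/guantanamo/config.yaml"))
raw = {
    "performance": {
        "concurrency": {"extract_workers": 8, "llm_in_flight": 16},
        "queue": {"max_buffered_articles": 32},
    },
}

DEFAULTS = {
    "extract_workers": 8,
    "extract_per_article": 4,
    "llm_in_flight": 16,
    "ollama_in_flight": 2,
}

def get_concurrency_config(cfg):
    # Overlay user-provided keys on defaults so partial configs still work.
    user = cfg.get("performance", {}).get("concurrency", {})
    return {**DEFAULTS, **user}

cc = get_concurrency_config(raw)
```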

Parameter Guide

extract_workers

What it controls: Number of articles processed simultaneously. Tuning:
  • Cloud mode: Size so that extract_workers × extract_per_article comfortably covers llm_in_flight; the semaphore, not the worker count, is what throttles API calls
  • Local mode: Match your CPU core count (e.g., 8 for an 8-core system)
  • Memory constrained: Lower to 4 to reduce RAM usage
Example:
extract_workers: 16  # Process 16 articles at once (cloud mode)

extract_per_article

What it controls: Parallel entity type extractions within a single article (max 4: people, orgs, locations, events). Tuning:
  • Default: 4 (all entity types in parallel)
  • Memory constrained: 2 (extract 2 types at a time)
  • CPU bottleneck: 1 (sequential extraction)
Example:
extract_per_article: 4  # Extract all 4 entity types concurrently

llm_in_flight (Cloud Mode)

What it controls: Maximum concurrent API calls to cloud LLMs (Gemini, GPT, Claude). Tuning:
  • Gemini Flash: 16-32 (high rate limits)
  • GPT-4: 4-8 (stricter rate limits)
  • Claude: 8-16 (moderate rate limits)
  • Avoid 429 errors: Start low and increase
Example:
llm_in_flight: 24  # Up to 24 concurrent Gemini API calls
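The semaphore guarantees that no more than llm_in_flight calls are ever active at once, no matter how many workers are running. A toy demonstration (all names here are illustrative, not Hinbox code):

```python
import threading
import time

LLM_IN_FLIGHT = 4          # permit count, as llm_in_flight would set it
sem = threading.Semaphore(LLM_IN_FLIGHT)
lock = threading.Lock()
in_flight = 0
peak = 0

def fake_llm_call():
    # Stand-in for a real API call behind the shared semaphore.
    global in_flight, peak
    with sem:              # blocks once LLM_IN_FLIGHT calls are active
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)   # simulate network latency
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=fake_llm_call) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak concurrency never exceeds LLM_IN_FLIGHT
```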

ollama_in_flight (Local Mode)

What it controls: Maximum concurrent Ollama inference requests. Tuning based on GPU VRAM:
  • 16GB VRAM: 1 (single request at a time)
  • 24GB VRAM: 2 (default, safe for 32B models)
  • 48GB+ VRAM: 4 (can handle 2x parallel 32B or 1x 70B)
Example:
ollama_in_flight: 2  # Safe for 24GB GPU with Qwen 2.5 32B
Setting ollama_in_flight too high causes out-of-memory (OOM) crashes: each additional concurrent request consumes extra VRAM on top of the loaded model weights.

max_buffered_articles

What it controls: Queue size limit between extraction workers and merge actor. Tuning:
  • Default: 32 (balanced memory usage)
  • Large RAM systems: 64-128 (more buffering)
  • Memory constrained: 16 (tighter backpressure)
Example:
max_buffered_articles: 64  # Allow more in-flight work
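The backpressure behaviour is what a bounded queue gives you for free: producers block on `put()` once the buffer is full. A minimal sketch of the extraction-to-merge handoff (a toy, not Hinbox's actual queue code):

```python
import queue
import threading

MAX_BUFFERED_ARTICLES = 4   # small limit so blocking kicks in quickly

buf = queue.Queue(maxsize=MAX_BUFFERED_ARTICLES)
merged = []

def merge_actor():
    # Single consumer draining results in order, like the merge thread.
    while True:
        item = buf.get()
        if item is None:    # sentinel: producers are done
            break
        merged.append(item)

consumer = threading.Thread(target=merge_actor)
consumer.start()

for article_id in range(20):
    buf.put(article_id)     # blocks whenever the buffer is full
buf.put(None)
consumer.join()
```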

embed_batch_size

What it controls: Number of texts batched in a single embedding API call. Tuning:
  • Cloud (Jina AI): 64-100 (API supports large batches)
  • Local (sentence-transformers): 32-64 (GPU memory dependent)
  • CPU-only local: 16 (smaller batches for CPU inference)
Example:
embed_batch_size: 100  # Max batch for Jina AI cloud embeddings
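The batching itself is simple chunking: group texts so each embedding call carries a full batch instead of a single text. A sketch (the helper name is illustrative):

```python
def batch(texts, batch_size):
    # Yield successive slices of at most batch_size texts.
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

texts = [f"entity profile {i}" for i in range(150)]
batches = list(batch(texts, 64))
# 150 texts with embed_batch_size 64 -> batches of 64, 64, 22
```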

Loading Configuration

Settings are loaded at runtime from src/process_and_extract.py:798-809:
# Configure LLM concurrency limiter
cc = config.get_concurrency_config()
configure_llm_concurrency(
    cloud_in_flight=cc["llm_in_flight"] if model_type == "gemini" else None,
    local_in_flight=cc["ollama_in_flight"] if model_type == "ollama" else None,
)
log(
    f"Concurrency: {cc['extract_workers']} workers, "
    f"{cc['extract_per_article']} types/article, "
    f"{cc['llm_in_flight']} LLM in-flight",
    level="info",
)

Optimization Recipes

Maximum Cloud Throughput

performance:
  concurrency:
    extract_workers: 32       # High parallelism
    extract_per_article: 4    # All types parallel
    llm_in_flight: 64         # Aggressive cloud API usage
    ollama_in_flight: 2
  queue:
    max_buffered_articles: 128

batching:
  embed_batch_size: 100       # Max Jina batching
Best for: Cloud mode with high rate limits, 16+ CPU cores, 32GB+ RAM.

Local GPU Optimized (24GB VRAM)

performance:
  concurrency:
    extract_workers: 8        # Match CPU cores
    extract_per_article: 4
    llm_in_flight: 16
    ollama_in_flight: 2       # Safe for 32B model
  queue:
    max_buffered_articles: 32

batching:
  embed_batch_size: 64        # Local embedding batching
Best for: Ollama with Qwen 2.5 32B, RTX 3090/4090, 8-core CPU.

Memory Constrained (8GB RAM)

performance:
  concurrency:
    extract_workers: 4        # Low parallelism
    extract_per_article: 2    # Two entity types at a time
    llm_in_flight: 8
    ollama_in_flight: 1
  queue:
    max_buffered_articles: 16 # Tight backpressure

batching:
  embed_batch_size: 32
Best for: Laptops, cloud VMs with limited RAM, CPU-only inference.

Balanced Default

performance:
  concurrency:
    extract_workers: 8
    extract_per_article: 4
    llm_in_flight: 16
    ollama_in_flight: 2
  queue:
    max_buffered_articles: 32

batching:
  embed_batch_size: 64
Best for: Most use cases, ships as default in configs/guantanamo/config.yaml.

Monitoring Performance

Extraction Logs

Watch for concurrency indicators:
[INFO] Concurrency: 8 workers, 4 types/article, 16 LLM in-flight
[INFO] Processing 100 articles...
[SUCCESS] Extracted 4 entity types in 2.34s (people=12, orgs=8, locs=5, events=3)

GPU Monitoring (Local Mode)

# Watch GPU usage in real-time
watch -n 1 nvidia-smi
Expected during extraction:
  • GPU Util: 85-100% (good)
  • Memory: 20-22GB / 24GB (healthy headroom)
  • Processes: 1-2 (matches ollama_in_flight)

CPU Monitoring

htop  # Interactive process viewer
Expected:
  • CPU cores: All cores active during extraction
  • Python processes: Matches extract_workers setting
  • RAM usage: Proportional to max_buffered_articles

Bottleneck Diagnosis

Symptom: Low GPU Utilization (under 50%)

Causes:
  • ollama_in_flight too low (GPU idle waiting for requests)
  • extract_workers too low (not enough parallel articles)
  • CPU bottleneck (workers blocked on preprocessing)
Solutions:
  1. Increase ollama_in_flight (if VRAM allows)
  2. Increase extract_workers to match CPU cores
  3. Profile with py-spy to find CPU hotspots

Symptom: Out of Memory Crashes

Causes:
  • ollama_in_flight too high (multiple model copies in VRAM)
  • extract_workers too high (too many articles in RAM)
  • max_buffered_articles too high (queue overflow)
Solutions:
  1. Reduce ollama_in_flight to 1
  2. Reduce extract_workers to 4
  3. Lower max_buffered_articles to 16
  4. Use smaller model (e.g., Qwen 14B instead of 32B)

Symptom: Rate Limit Errors (429)

Causes:
  • llm_in_flight exceeds cloud API rate limits
Solutions:
  1. Reduce llm_in_flight (try halving it)
  2. Add retry backoff in src/constants.py:28-30:
    MAX_RETRIES = 3
    BASE_DELAY = 2.0  # Increase to 5.0 for aggressive backoff
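If your client stack has no retry layer, a minimal exponential-backoff wrapper can sit around individual calls. This is a sketch, not Hinbox's actual retry logic; only the two constants mirror src/constants.py:

```python
import time

MAX_RETRIES = 3    # mirrors src/constants.py
BASE_DELAY = 2.0

def call_with_backoff(fn, max_retries=MAX_RETRIES, base_delay=BASE_DELAY):
    # Retry with exponential backoff (2s, 4s, ...); re-raise after the
    # last attempt so persistent failures still surface.
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```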
    

Symptom: Slow Merge Phase

Causes:
  • Large existing entity count (millions of entities)
  • Embedding similarity search overhead
Solutions:
  1. Enable lexical blocking in config.yaml (already default):
    dedup:
      lexical_blocking:
        enabled: true
        threshold: 60
        max_candidates: 50
    
  2. Increase embed_batch_size for fewer API calls
  3. See Caching to avoid re-embedding

Advanced Tuning

Context Window vs. Speed

From Local Models setup:
export OLLAMA_CONTEXT_LENGTH=32768  # Default, balanced
export OLLAMA_CONTEXT_LENGTH=8192   # Faster, less context
export OLLAMA_CONTEXT_LENGTH=65536  # Slower, more context
Trade-off: Larger context windows use more VRAM and slow inference but improve extraction quality for long documents.

LLM Generation Settings

From src/constants.py:32-35:
# LLM generation defaults
DEFAULT_MAX_TOKENS = 2048
DEFAULT_TEMPERATURE = 0
MAX_ITERATIONS = 3  # Instructor retry attempts
Reduce MAX_ITERATIONS for faster (but less robust) extraction:
MAX_ITERATIONS = 1  # Skip retries, accept first response

Batch Processing

Process in chunks to avoid RAM buildup:
# Process 1000 articles in batches of 100
for i in {0..9}; do
  just process --domain guantanamo --limit 100 --skip $((i * 100))
done

Performance Benchmarks

Cloud (Gemini 2.0 Flash)

| Workers | LLM In-Flight | Articles/Sec | Notes           |
|---------|---------------|--------------|-----------------|
| 8       | 16            | 2.1          | Default         |
| 16      | 32            | 3.8          | High rate limit |
| 32      | 64            | 5.2          | Max throughput  |

Local (Qwen 2.5 32B, RTX 4090)

| Workers | Ollama In-Flight | Articles/Sec | VRAM Usage       |
|---------|------------------|--------------|------------------|
| 4       | 1                | 0.6          | 18GB             |
| 8       | 2                | 1.2          | 22GB             |
| 8       | 4                | 1.4          | 24GB (OOM risk)  |
Returns diminish beyond ollama_in_flight: 2 due to GPU context-switching overhead.

Next Steps

Caching

Skip redundant LLM calls with extraction cache

Quality Controls

Balance speed with extraction quality thresholds

Local Models

Choose faster models for local processing

Privacy Mode

Performance considerations for the --local flag
