
Overview

Hinbox implements a persistent extraction cache that stores LLM extraction results as JSON sidecar files. This provides:
  • Fast re-runs when re-processing unchanged articles
  • Cost savings by avoiding redundant cloud API calls
  • Version-based invalidation when changing prompts or schemas
  • Content-aware caching that detects when articles are updated
The cache is deterministic: results are only reused when all output-affecting inputs (content, model, prompt, schema, temperature) are identical.

Cache Architecture

From src/utils/extraction_cache.py:1-14:
"""
Persistent sidecar cache for entity extraction results.

Stores extraction outputs as JSON files keyed on all output-affecting inputs
(content hash, model, entity type, prompt hash, schema hash, temperature).
This avoids redundant LLM calls when re-processing unchanged articles.

Cache layout on disk::

    {base_dir}/{subdir}/v{version}/{key[0:2]}/{key[2:4]}/{key}.json

Version-based invalidation: bumping ``cache.extraction.version`` in the
domain config causes reads from the old ``vN/`` directory to stop matching,
effectively invalidating the entire cache without deleting files.
"""

Directory Structure

data/guantanamo/entities/cache/extractions/
├── v1/
│   ├── 3a/
│   │   ├── 7f/
│   │   │   ├── 3a7f2b1c4d5e6f7a8b9c0d1e2f3a4b5c.json
│   │   │   └── 3a7f8e2d9c4b5a6f7e8d9c0a1b2c3d4e.json
│   │   └── 8e/
│   │       └── 3a8e4f7c9b2d5e1a6f8c0d9e1a2b3c4d.json
│   └── f2/
│       └── 1d/
│           └── f21d3e4f5a6b7c8d9e0f1a2b3c4d5e6f.json
└── v2/  (after version bump)
    └── ...
Benefits:
  • Sharded directories prevent file system slowdowns with millions of cache entries
  • Version isolation allows gradual migration when changing prompts/schemas
  • Human-readable JSON for debugging and inspection
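
The sharded path for a key can be computed in a few lines. The helper below is an illustrative sketch of the layout shown above, not the library's private path helper:

```python
import os

def sharded_path(root: str, version: int, key: str) -> str:
    # {root}/v{version}/{key[0:2]}/{key[2:4]}/{key}.json -- two 2-character
    # shard levels keep any single directory small even with millions of keys.
    return os.path.join(root, f"v{version}", key[:2], key[2:4], f"{key}.json")

print(sharded_path("cache/extractions", 1,
                   "3a7f2b1c4d5e6f7a8b9c0d1e2f3a4b5c"))
# cache/extractions/v1/3a/7f/3a7f2b1c4d5e6f7a8b9c0d1e2f3a4b5c.json
```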

Cache Key Computation

From src/utils/extraction_cache.py:88-111:
def make_key(
    self,
    *,
    text: str,
    system_prompt: str,
    response_model: Any,
    model: str,
    entity_type: str,
    temperature: float,
) -> str:
    """Build a deterministic hex cache key from all output-affecting inputs."""
    content_hash = sha256_text(text)
    prompt_hash = sha256_text(system_prompt)
    schema_hash = _schema_hash(response_model)

    parts = (
        f"extraction|v{self._version}"
        f"|{entity_type}|{model}"
        f"|temp={temperature}"
        f"|content={content_hash}"
        f"|prompt={prompt_hash}"
        f"|schema={schema_hash}"
    )
    return sha256_text(parts)

Key Components

| Input | Why It Matters | Example |
|---|---|---|
| Content hash | Different articles need separate extractions | `sha256("article text...")` |
| Prompt hash | Changed prompts invalidate cache | `sha256("Extract people...")` |
| Schema hash | Schema changes (new fields) need re-extraction | `sha256(Person.model_json_schema())` |
| Model | Different models produce different results | `gemini/gemini-2.0-flash` |
| Entity type | People vs. orgs are separate extractions | `people` |
| Temperature | Different temps can yield different results | `0.0` |
| Cache version | Manual invalidation lever | `1` |
Cache hits require exact matches on all inputs. Changing a single word in your extraction prompt invalidates the entire cache for that entity type.
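
This sensitivity follows directly from hashing the concatenated key string. The sketch below assumes `sha256_text` is a plain SHA-256 hex digest of UTF-8 text (its definition is not shown in the excerpt above):

```python
import hashlib

def sha256_text(text: str) -> str:
    # Assumed shape of the sha256_text helper used by make_key:
    # hex digest of the UTF-8 encoded text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Identical part strings always hash to the same key, while changing a
# single component (here, the prompt hash) yields an unrelated key.
parts = ("extraction|v1|people|gemini/gemini-2.0-flash|temp=0.0"
         "|content=abc|prompt=p1|schema=s1")
changed = parts.replace("prompt=p1", "prompt=p2")
assert sha256_text(parts) == sha256_text(parts)
assert sha256_text(parts) != sha256_text(changed)
```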

Configuration

From configs/guantanamo/config.yaml:64-72:
# Caching configuration (Phase 4 speed audit)
cache:
  enabled: true
  embeddings:
    lru_max_items: 4096         # in-memory LRU for embedding vectors
  extraction:
    enabled: true
    subdir: "cache/extractions"  # persistent sidecar under output dir
    version: 1                   # bump to invalidate all cached extractions
  match_check:
    enabled: true
    max_items: 8192              # per-run LRU for match-checker results
  articles:
    skip_if_unchanged: true      # skip articles whose content hash hasn't changed

Configuration Options

cache.enabled

Type: boolean
Default: true
Master switch for all caching (extraction, embeddings, match-check).
cache:
  enabled: false  # Disable all caching (useful for testing)

cache.extraction.enabled

Type: boolean
Default: true
Controls extraction-specific caching. Embeddings and match-check can still be enabled independently.
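
The effective switch is the conjunction of both flags, each defaulting to true, mirroring the `and` used later in `configure_extraction_sidecar_cache`. The helper name below is illustrative:

```python
def extraction_cache_enabled(cache_cfg: dict) -> bool:
    # Effective switch: master `cache.enabled` AND `cache.extraction.enabled`,
    # both defaulting to True.
    return bool(cache_cfg.get("enabled", True)
                and cache_cfg.get("extraction", {}).get("enabled", True))

assert extraction_cache_enabled({}) is True                           # defaults
assert extraction_cache_enabled({"enabled": False}) is False          # master off
assert extraction_cache_enabled({"extraction": {"enabled": False}}) is False
```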

cache.extraction.version

Type: integer
Default: 1
Critical: Bump this when you change:
  • Extraction prompts (configs/{domain}/prompts/*.md)
  • Entity schemas (configs/{domain}/categories/*.yaml)
  • LLM generation parameters (temperature, max_tokens)
cache:
  extraction:
    version: 2  # Forces cache miss, reads from v2/ directory
Don’t delete cache files manually. Bump version instead to preserve old results for comparison.

cache.extraction.subdir

Type: string
Default: "cache/extractions"
Relative path under domain output directory for cache storage.
cache:
  extraction:
    subdir: "extraction_cache"  # Custom directory name

cache.articles.skip_if_unchanged

Type: boolean
Default: true
Skip extraction for articles whose content hash hasn’t changed since last run, even if --force-reprocess is used.
cache:
  articles:
    skip_if_unchanged: false  # Always re-extract

Usage in Pipeline

Cache is configured from src/process_and_extract.py:819-820:
# Configure persistent extraction sidecar cache
configure_extraction_sidecar_cache(base_dir=base_dir, cache_cfg=cache_cfg)
This calls into src/utils/extraction.py, which sets up a global cache instance:
from src.utils.extraction_cache import ExtractionSidecarCache

_EXTRACTION_CACHE: Optional[ExtractionSidecarCache] = None

def configure_extraction_sidecar_cache(
    base_dir: str,
    cache_cfg: Dict[str, Any],
) -> None:
    global _EXTRACTION_CACHE
    extraction_cfg = cache_cfg.get("extraction", {})
    _EXTRACTION_CACHE = ExtractionSidecarCache(
        base_dir=base_dir,
        subdir=extraction_cfg.get("subdir", "cache/extractions"),
        version=extraction_cfg.get("version", 1),
        enabled=cache_cfg.get("enabled", True) and extraction_cfg.get("enabled", True),
    )

Cache Hit Flow
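
On each extraction the pipeline builds a key with make_key, tries read, and only calls the LLM (then writes) on a miss. A minimal sketch of that flow, with a plain dict standing in for ExtractionSidecarCache and a hypothetical call_llm callable:

```python
def extract_with_cache(cache: dict, key: str, call_llm):
    """Hit/miss flow sketch: try the cache, extract and persist on a miss."""
    record = cache.get(key)
    if record is not None:
        return record["output"], "hit"       # hit: reuse stored extraction
    output = call_llm()
    cache[key] = {"output": output}          # real code stores a full record
    return output, "miss"

calls = []
def fake_llm():
    calls.append(1)
    return [{"name": "John Doe"}]

store = {}
assert extract_with_cache(store, "k1", fake_llm)[1] == "miss"
assert extract_with_cache(store, "k1", fake_llm)[1] == "hit"
assert len(calls) == 1   # the LLM ran only once
```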

Cache Read/Write

From src/utils/extraction_cache.py:121-157:
def read(self, key: str) -> Optional[Dict[str, Any]]:
    """Return the cached record for *key*, or ``None`` on miss."""
    if not self._enabled:
        return None

    path = self._path_for(key)
    try:
        with open(path, "r") as f:
            record = json.load(f)
        self._hits += 1
        return record
    except (FileNotFoundError, json.JSONDecodeError, OSError):
        self._misses += 1
        return None

def write(self, key: str, record: Dict[str, Any]) -> None:
    """Atomically write *record* as JSON for *key*."""
    if not self._enabled:
        return

    path = self._path_for(key)
    parent = os.path.dirname(path)
    os.makedirs(parent, exist_ok=True)

    # Write to a temp file in the same directory, then atomic rename.
    fd, tmp = tempfile.mkstemp(dir=parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f, separators=(",", ":"), default=str)
        os.replace(tmp, path)  # Atomic on POSIX
    except Exception:
        try:
            os.remove(tmp)
        except OSError:
            pass
        raise
Atomic writes via tempfile.mkstemp + os.replace keep the cache consistent even when multiple extraction workers write concurrently.

Cache Record Structure

From src/utils/extraction_cache.py:184-218:
def build_cache_record(
    *,
    output: Any,
    entity_type: str,
    model: str,
    temperature: float,
    content_hash: str,
    prompt_hash: str,
    schema_hash: str,
    cache_version: int,
) -> Dict[str, Any]:
    """Build the JSON-serializable record stored in the cache file."""
    # Serialize Pydantic models to dicts if needed
    serialized: List[Dict[str, Any]] = []
    for item in output or []:
        if isinstance(item, dict):
            serialized.append(item)
        elif hasattr(item, "model_dump"):
            serialized.append(item.model_dump())
        elif hasattr(item, "dict"):
            serialized.append(item.dict())
        else:
            serialized.append(item)

    return {
        "cache_version": cache_version,
        "entity_type": entity_type,
        "model": model,
        "temperature": temperature,
        "content_hash": content_hash,
        "prompt_hash": prompt_hash,
        "schema_hash": schema_hash,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "output": serialized,
    }
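
The duck-typed serialization branch above can be exercised without Pydantic installed; `FakePerson` and `serialize` below are illustrative stand-ins:

```python
class FakePerson:
    """Hypothetical stand-in for a Pydantic v2 model (exposes model_dump)."""
    def __init__(self, name: str):
        self.name = name
    def model_dump(self) -> dict:
        return {"name": self.name}

def serialize(output) -> list:
    # Mirrors the duck-typed branch at the top of build_cache_record.
    out = []
    for item in output or []:
        if isinstance(item, dict):
            out.append(item)
        elif hasattr(item, "model_dump"):      # Pydantic v2
            out.append(item.model_dump())
        elif hasattr(item, "dict"):            # Pydantic v1 fallback
            out.append(item.dict())
        else:
            out.append(item)                   # last resort: store as-is
    return out

print(serialize([FakePerson("Jane Smith"), {"name": "John Doe"}]))
# [{'name': 'Jane Smith'}, {'name': 'John Doe'}]
```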

Example Cache File

Path: data/guantanamo/entities/cache/extractions/v1/3a/7f/3a7f2b1c4d5e6f7a8b9c0d1e2f3a4b5c.json
{
  "cache_version": 1,
  "entity_type": "people",
  "model": "gemini/gemini-2.0-flash",
  "temperature": 0.0,
  "content_hash": "9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d",
  "prompt_hash": "1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d",
  "schema_hash": "7e6d5c4b3a2f1e0d9c8b7a6f5e4d3c2b",
  "created_at": "2026-03-01T14:32:18.123456+00:00",
  "output": [
    {
      "name": "John Doe",
      "role": "detainee",
      "affiliations": ["al-Qaeda"],
      "confidence": 0.92
    },
    {
      "name": "Jane Smith",
      "role": "attorney",
      "affiliations": ["ACLU"],
      "confidence": 0.88
    }
  ]
}

Cache Statistics

From src/utils/extraction_cache.py:163-176:
@property
def stats(self) -> Dict[str, Any]:
    total = self._hits + self._misses
    return {
        "hits": self._hits,
        "misses": self._misses,
        "hit_rate": self._hits / total if total else 0.0,
        "version": self._version,
        "root": self._root,
    }

Viewing Cache Stats

Add logging to src/process_and_extract.py after processing:
from src.utils.extraction import get_extraction_cache

cache = get_extraction_cache()
if cache and cache.enabled:
    stats = cache.stats
    log(
        f"Cache stats: {stats['hits']} hits, {stats['misses']} misses "
        f"({stats['hit_rate']:.1%} hit rate)",
        level="info"
    )
Example output:
[INFO] Cache stats: 387 hits, 13 misses (96.8% hit rate)

Cache Invalidation Strategies

Full Invalidation (Version Bump)

When: Changed extraction prompts, schemas, or generation params
cache:
  extraction:
    version: 2  # Bump from 1 → 2
Effect: All cache reads miss, new extractions write to v2/ directory. Old v1/ files remain for rollback.

Selective Invalidation (Delete Specific Files)

When: Single article or entity type needs re-extraction
# Find cache files for a specific article
grep -r "content_hash.*9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d" \
  data/guantanamo/entities/cache/extractions/v1/ -l | xargs rm
Effect: Next run re-extracts only that article.

Disable Caching (Testing)

When: Debugging extraction or measuring uncached performance
cache:
  enabled: false  # Or just extraction.enabled: false
Effect: No cache reads or writes, all articles re-extracted.

Performance Impact

Cache Hit Speedup

| Scenario | Uncached | Cached | Speedup |
|---|---|---|---|
| Cloud (Gemini) | 2.1 sec | 0.05 sec | 42x |
| Local (Ollama 32B) | 8.3 sec | 0.05 sec | 166x |
| 1000 articles (cloud) | 35 min | 50 sec | 42x |
| 1000 articles (local GPU) | 138 min | 50 sec | 166x |
Cache hits are nearly instant (~50ms for JSON deserialization). Re-processing 1000 unchanged articles takes under 1 minute instead of 35+ minutes.
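
The 1000-article rows follow directly from the per-article timings:

```python
# Per-article timings from the table above, scaled to 1000 articles.
uncached_cloud_min = 1000 * 2.1 / 60   # 2.1 s/article uncached -> minutes
cached_sec = 1000 * 0.05               # ~50 ms per hit -> seconds
print(f"{uncached_cloud_min:.0f} min uncached vs {cached_sec:.0f} s cached")
# 35 min uncached vs 50 s cached
```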

Disk Usage

Typical cache entry: 2-10 KB per article per entity type.
Example: 10,000 articles × 4 entity types × 5 KB ≈ 200 MB.
Cache growth is linear with article count and entity density.
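
As quick arithmetic (numbers from the estimate above, using a round 1000 KB/MB factor):

```python
# Back-of-envelope disk estimate: entries scale linearly with articles.
articles, entity_types, kb_per_entry = 10_000, 4, 5
total_kb = articles * entity_types * kb_per_entry
print(f"≈{total_kb // 1000} MB")   # ≈200 MB
```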

Cache Maintenance

Cleanup Old Versions

After validating a new cache version, delete old directories:
# Keep only v3, delete v1 and v2
rm -rf data/guantanamo/entities/cache/extractions/v1
rm -rf data/guantanamo/entities/cache/extractions/v2

Inspect Cache Contents

# Pretty-print a cache file
jq . data/guantanamo/entities/cache/extractions/v1/3a/7f/*.json | head -50

# Count cache entries
find data/guantanamo/entities/cache/extractions/v1 -name '*.json' | wc -l

# Find entries for specific model
grep -r '"model":.*gemini-2.0-flash' data/guantanamo/entities/cache/extractions/v1/ -l

Migrate Cache Between Machines

# Copy entire cache directory to another machine
rsync -avz data/guantanamo/entities/cache/ remote:/path/to/hinbox/data/guantanamo/entities/cache/
Cache files are portable (content-addressed by hash).

Troubleshooting

Cache Not Working

Symptom: 0% hit rate even for unchanged articles
Diagnosis:
  1. Check cache.enabled and cache.extraction.enabled in config.yaml
  2. Verify cache directory exists: ls data/{domain}/entities/cache/extractions/v1/
  3. Inspect logs for cache configuration:
    [INFO] Extraction cache: enabled=True, version=1, root=data/guantanamo/entities/cache/extractions/v1
    
Solution: Ensure config.yaml has caching enabled.

Low Hit Rate

Symptom: Less than 50% hit rate when re-processing
Causes:
  • Articles modified (content hash changed)
  • Prompt or schema changed (invalidates all cache)
  • Different model in use (cloud vs. local)
Solution: Check content_hash in processing status:
jq '.articles[] | {id, content_hash}' data/guantanamo/entities/processing_status.json

Permission Errors

Symptom: PermissionError: [Errno 13] Permission denied: 'cache/extractions/v1/...'
Solution: Ensure the cache directory is writable:
chmod -R u+w data/guantanamo/entities/cache/

Disk Full

Symptom: Cache writes fail with OSError: [Errno 28] No space left on device
Solution: Delete old cache versions or disable caching:
rm -rf data/*/entities/cache/extractions/v1

Best Practices

Version on Schema Changes

Bump cache.extraction.version whenever you modify extraction prompts or entity schemas.

Monitor Hit Rate

Log cache stats after runs to detect unintended invalidations.

Keep Multiple Versions

Don’t delete old cache versions immediately — useful for A/B comparisons.

Backup Before Major Changes

Copy cache directory before schema migrations to enable rollback.

Next Steps

Performance

Combine caching with concurrency tuning for maximum throughput

Quality Controls

Cached results skip LLM but still run deterministic QC

Local Models

Cache works identically for cloud and local models

Privacy Mode

Cache is always local (no privacy implications)
