Overview
Extract entities and relations from documents using LLM-based extraction. The command processes the documents in a directory, splits them into chunks, and uses the configured LLM to identify entities and relationships.
Usage
Arguments
Directory containing documents to process. Must be a valid directory path.
Options
Model & Domain
LLM model to use for extraction (e.g., openai/gpt-4o-mini, anthropic/claude-3-5-sonnet-20241022). Overrides the default from config.
Path to a custom domain YAML file defining entity and relation types.
Bundled domain name to use (e.g., general, osint, academic). Use -d as shorthand.
Performance & Cost
Characters per chunk. Larger chunks mean fewer API calls and lower cost, but may reduce extraction quality for very large documents.
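To illustrate the trade-off, here is a minimal sketch of fixed-size character chunking. It is not the tool's actual chunker: the function name chunk_text is hypothetical, and the sketch assumes one LLM call per chunk and no overlap between chunks.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Stand-in for a 50,000-character document.
document = "x" * 50_000

for size in (5_000, 10_000, 20_000):
    chunks = chunk_text(document, size)
    # Assumption of this sketch: one LLM call per chunk.
    print(f"chunk size {size}: {len(chunks)} chunks")
```

Under these assumptions, doubling the chunk size roughly halves the number of calls, at the price of asking the model to cover more text per call.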
Concurrent LLM calls per document. Use -c as shorthand. Higher values speed up processing but may hit rate limits.
Maximum requests per minute. Throttles API calls so that requests are not wasted on rate-limit errors.
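A requests-per-minute cap can be pictured as a sliding 60-second window. The sketch below is an illustration only, not the tool's implementation: the class name RequestsPerMinuteLimiter and its parameters are hypothetical, and it is not thread-safe, so concurrent callers would need a lock around acquire().

```python
import time
from collections import deque


class RequestsPerMinuteLimiter:
    """Sliding-window limiter: at most max_rpm calls in any 60-second window."""

    def __init__(self, max_rpm, clock=time.monotonic, sleep=time.sleep):
        if max_rpm <= 0:
            raise ValueError("max_rpm must be positive")
        self.max_rpm = max_rpm
        self.clock = clock      # injectable for testing
        self.sleep = sleep      # injectable for testing
        self.sent = deque()     # timestamps of requests still in the window

    def acquire(self):
        """Block until one more request fits in the window, then record it."""
        now = self.clock()
        # Forget requests that are at least 60 seconds old.
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()
        if len(self.sent) >= self.max_rpm:
            # Window is full: wait until the oldest request leaves it.
            self.sleep(60.0 - (now - self.sent[0]))
            now = self.clock()
            self.sent.popleft()
        self.sent.append(now)


limiter = RequestsPerMinuteLimiter(max_rpm=60)
for _ in range(3):
    limiter.acquire()  # three calls fit within the cap, so nothing sleeps
    # ... an LLM request would be sent here ...
```

Waiting locally before sending is usually cheaper than sending a request that the provider rejects and then retrying it.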
Maximum cost budget in USD. Extraction stops when this limit is reached.
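As a rough sketch of how a cost budget can stop a run, the loop below checks accumulated spend before each call. The function run_with_budget and the per-call costs are hypothetical; the source does not say whether the real tool checks the budget before or after a call, so the overshoot behavior shown here is an assumption.

```python
def run_with_budget(call_costs, max_cost_usd):
    """Process calls in order; stop once accumulated cost reaches the budget."""
    spent = 0.0
    completed = 0
    for cost in call_costs:
        if spent >= max_cost_usd:
            break  # budget reached: skip the remaining calls
        spent += cost
        completed += 1
    return completed, spent


# Hypothetical per-call costs in USD.
costs = [0.02] * 10
completed, spent = run_with_budget(costs, max_cost_usd=0.05)
print(f"completed {completed} of {len(costs)} calls, spent ${spent:.2f}")
```

Because the check runs before each call, the final total in this sketch can exceed the limit by up to the cost of one call.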
Extraction Settings
Re-extract all documents, ignoring cached results. Use -f as shorthand.
Extraction backend to use. kreuzberg supports 75+ formats; pdfplumber is PDF-only.
Enable OCR for scanned documents and images.
OCR backend to use. gcv = Google Cloud Vision (requires API key).
OCR language code in ISO 639-3 format (e.g., eng for English, fra for French, deu for German).
Output
Output directory for extraction results. Use -o as shorthand. Defaults to the value in config.
Enable verbose logging with debug-level output. Use -v as shorthand.
Output
Extraction results are saved to {output_dir}/extractions/ with one JSON file per document.
In schema-free mode, discovered entity and relation types are saved to {output_dir}/discovered_domain.yaml.
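The per-document JSON files can be loaded with the Python standard library. The sketch below assumes only the directory layout described above; the helper name load_extractions and the directory name "output" are hypothetical, and nothing is assumed about the JSON schema inside each file.

```python
import json
from pathlib import Path


def load_extractions(output_dir):
    """Return {file stem: parsed JSON} for each file in output_dir/extractions."""
    extractions_dir = Path(output_dir) / "extractions"
    if not extractions_dir.is_dir():
        return {}
    results = {}
    for path in sorted(extractions_dir.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.stem] = json.load(handle)
    return results


# "output" is a placeholder; substitute the output directory you configured.
for name, data in load_extractions("output").items():
    # Show the top-level structure without assuming a particular schema.
    summary = sorted(data) if isinstance(data, dict) else type(data).__name__
    print(name, summary)
```

Inspecting the top-level keys first is a safe way to learn the file structure before writing code that depends on specific fields.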
Examples
Basic extraction
Extracts from ./documents using default settings.
With custom domain
With OCR for scanned documents
Cost-limited extraction
High-performance extraction
Output Summary
After completion, displays:
- Documents processed
- Total entities extracted
- Total relations extracted
- Total cost in USD
- Output location