
Overview

Extract entities and relations from documents using LLM-based extraction. The command processes every document in a directory, splits each into chunks, and uses the configured LLM to identify entities and relationships.

Usage

sift extract DIRECTORY [OPTIONS]

Arguments

directory
string
required
Directory containing documents to process. Must be a valid directory path.

Options

Model & Domain

--model
string
LLM model to use for extraction (e.g., openai/gpt-4o-mini, anthropic/claude-3-5-sonnet-20241022). Overrides default from config.
--domain
string
Path to custom domain YAML file defining entity and relation types.
--domain-name
string
default:"schema-free"
Bundled domain name to use (e.g., general, osint, academic). Use -d as shorthand.
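
As an illustration, a custom domain file passed via --domain defines entity and relation types. The field names below (entity_types, relation_types, source, target) are an assumption for illustration, not taken from this page; check the bundled domains for the canonical schema.

```yaml
# Hypothetical domain file (field names are illustrative assumptions).
name: corporate
entity_types:
  - name: Company
    description: A legal business entity
  - name: Person
    description: An individual mentioned in the documents
relation_types:
  - name: EMPLOYED_BY
    description: Person works for Company
    source: Person
    target: Company
```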

Performance & Cost

--chunk-size
integer
default:"10000"
Characters per chunk. Larger chunks mean fewer API calls and lower cost, but extraction quality may degrade when chunks grow very large.
--concurrency
integer
default:"4"
Concurrent LLM calls per document. Use -c as shorthand. Higher values speed up processing but may hit rate limits.
--rpm
integer
default:"40"
Maximum requests per minute. Throttles API calls so extraction stays under provider rate limits rather than wasting requests on rate-limit errors.
--max-cost
float
Maximum cost budget in USD. Extraction stops when this limit is reached.
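
The trade-off behind --chunk-size can be sketched with simple arithmetic (illustrative only; actual token counts and per-call pricing vary by model):

```python
def estimate_calls(total_chars: int, chunk_size: int) -> int:
    """Approximate number of LLM extraction calls to cover a corpus,
    assuming one call per chunk."""
    return -(-total_chars // chunk_size)  # ceiling division

# A 1,000,000-character corpus:
print(estimate_calls(1_000_000, 10_000))  # 100 calls at the default chunk size
print(estimate_calls(1_000_000, 15_000))  # 67 calls with --chunk-size 15000
```

Doubling the chunk size roughly halves the number of calls, which is why the high-performance example below raises --chunk-size alongside --concurrency.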

Extraction Settings

--force
boolean
default:"false"
Re-extract all documents, ignoring cached results. Use -f as shorthand.
--extractor
string
Extraction backend to use. kreuzberg supports 75+ formats; pdfplumber is PDF-only.
--ocr
boolean
default:"false"
Enable OCR for scanned documents and images.
--ocr-backend
string
OCR backend to use. gcv = Google Cloud Vision (requires API key).
--ocr-language
string
OCR language code in ISO 639-3 format (e.g., eng for English, fra for French, deu for German).

Output

--output
string
Output directory for extraction results. Use -o as shorthand. Defaults to value in config.
--verbose
boolean
default:"false"
Enable verbose logging with debug-level output. Use -v as shorthand.

Output

Extraction results are saved to {output_dir}/extractions/ with one JSON file per document. In schema-free mode, discovered entity and relation types are saved to {output_dir}/discovered_domain.yaml.
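
For example, extracting two documents named report1.pdf and report2.pdf (hypothetical names) in schema-free mode would leave a layout like:

```
{output_dir}/
├── discovered_domain.yaml      # schema-free mode only
└── extractions/
    ├── report1.json
    └── report2.json
```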

Examples

Basic extraction

sift extract ./documents
Extracts entities from all documents in ./documents using default settings.

With custom domain

sift extract ./documents --domain-name osint
Uses the bundled OSINT domain for specialized entity extraction.

With OCR for scanned documents

sift extract ./scanned_pdfs --ocr --ocr-backend tesseract --ocr-language eng
Enables OCR using Tesseract for English documents.

Cost-limited extraction

sift extract ./documents --max-cost 5.0 --model openai/gpt-4o-mini
Extracts with a $5 budget using GPT-4o-mini.

High-performance extraction

sift extract ./documents -c 8 --chunk-size 15000 --rpm 60
Uses 8 concurrent workers, larger chunks, and higher rate limit.

Output Summary

After completion, displays:
  • Documents processed
  • Total entities extracted
  • Total relations extracted
  • Total cost in USD
  • Output location

Next Steps

After extraction, run:
sift build
to construct the knowledge graph from the extraction results.

See Also

  • build - Build knowledge graph from extractions
  • domains - List available bundled domains
