The sift extract command processes documents and uses LLMs to identify entities (people, organizations, concepts, etc.) and relationships between them. This is the first step in building your knowledge graph.

Quick Start

sift extract ./documents
This processes all supported documents in the ./documents directory using default settings.

Command Options

Basic Options

directory (string, required)
  Directory containing documents to process.

--model (string)
  LLM model to use (e.g., openai/gpt-4o-mini, anthropic/claude-3-5-sonnet). Defaults to SIFT_DEFAULT_MODEL from config.

--domain (path)
  Path to a custom domain YAML file defining entity and relation types.

--domain-name (string, default: schema-free)
  Use a bundled domain (e.g., general, osint, academic). Run sift domains to see all available domains.

-o, --output (path)
  Output directory for extraction results. Defaults to output/ in the current directory.

Extraction Configuration

--chunk-size (integer, default: 10000)
  Characters per text chunk. Larger chunks mean fewer API calls and lower cost, but may reduce extraction quality for long documents.

-c, --concurrency (integer, default: 4)
  Number of concurrent LLM API calls per document. Increase for faster processing (watch rate limits).

--rpm (integer, default: 40)
  Maximum requests per minute, to prevent rate-limit errors.

-f, --force (boolean)
  Re-extract all documents, ignoring cached results. Normally, sift skips documents that were already processed with the same model, domain, and chunk size.

--max-cost (float)
  Maximum cost budget in USD. Extraction stops when this limit is reached.
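As a quick sanity check on these settings, the number of LLM calls a document needs is roughly its character count divided by --chunk-size. A rough sketch (sift's actual chunker may split on sentence or paragraph boundaries, so treat this as an approximation):

```python
import math

def estimate_calls(char_count: int, chunk_size: int = 10000) -> int:
    """Rough number of LLM calls needed to process one document."""
    return math.ceil(char_count / chunk_size)

# A ~50,000-character document at the default chunk size:
print(estimate_calls(50_000))          # 5 calls
# Doubling the chunk size cuts the call count to 3:
print(estimate_calls(50_000, 20_000))  # 3 calls
```

Multiply the per-document call count by your corpus size to gauge how --rpm and --concurrency will bound total runtime.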

Document Processing

--extractor
choice
Document extraction backend:
  • kreuzberg (default): Supports 75+ formats including PDF, DOCX, HTML, Markdown
  • pdfplumber: PDF-only, better for tables
--ocr
boolean
Enable OCR for scanned documents and images
--ocr-backend
choice
OCR engine when --ocr is enabled:
  • tesseract (default): Free, local
  • easyocr: Deep learning-based, good accuracy
  • paddleocr: Fast, multilingual
  • gcv: Google Cloud Vision (requires API key)
--ocr-language
string
default:"eng"
OCR language code (ISO 639-3), e.g., eng, fra, deu, spa
-v, --verbose
boolean
Enable verbose logging for debugging

Examples

Basic Extraction

# Extract with default settings
sift extract ./documents

# Use a specific model
sift extract ./documents --model anthropic/claude-3-5-sonnet

# Use a bundled domain
sift extract ./documents --domain-name osint

Advanced Configuration

# Smaller chunks, high concurrency
sift extract ./documents \
  --chunk-size 5000 \
  --concurrency 8 \
  --model openai/gpt-4o

Working with Custom Domains

# Use your own domain definition
sift extract ./documents --domain ./my-domain.yaml

# Schema-free mode (LLM discovers entity types)
sift extract ./documents --domain-name schema-free

Output Structure

Extractions are saved to output/extractions/ with one JSON file per document:
output/
├── extractions/
│   ├── document1.json
│   ├── document2.json
│   └── document3.json
└── discovered_domain.yaml  # Only in schema-free mode

Extraction File Format

Each extraction file contains:
{
  "document_id": "document1",
  "document_path": "/path/to/document1.pdf",
  "chunks_processed": 5,
  "entities": [
    {
      "name": "John Smith",
      "entity_type": "PERSON",
      "confidence": 0.95,
      "context": "John Smith was appointed CEO in 2020",
      "attributes": {
        "role": "CEO",
        "aliases": ["J. Smith"]
      }
    }
  ],
  "relations": [
    {
      "source_entity": "John Smith",
      "target_entity": "Acme Corp",
      "relation_type": "WORKS_FOR",
      "confidence": 0.9,
      "evidence": "John Smith was appointed CEO of Acme Corp"
    }
  ],
  "cost_usd": 0.15,
  "model_used": "openai/gpt-4o-mini",
  "domain_name": "general",
  "chunk_size": 10000,
  "extracted_at": "2024-03-15T10:30:00Z"
}
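Because each extraction file is plain JSON, downstream tooling can consume it directly. A minimal sketch, using the field names from the format above on an inline sample, that filters entities by confidence:

```python
import json

# Inline sample matching the documented extraction schema.
extraction_json = """
{
  "document_id": "document1",
  "entities": [
    {"name": "John Smith", "entity_type": "PERSON", "confidence": 0.95},
    {"name": "Acme Corp", "entity_type": "ORGANIZATION", "confidence": 0.6}
  ],
  "relations": [
    {"source_entity": "John Smith", "target_entity": "Acme Corp",
     "relation_type": "WORKS_FOR", "confidence": 0.9}
  ]
}
"""
extraction = json.loads(extraction_json)

# Keep only entities the model was confident about.
confident = [e["name"] for e in extraction["entities"] if e["confidence"] >= 0.8]
print(confident)  # ['John Smith']
```

In practice you would json.load() each file under output/extractions/ instead of an inline string.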

Incremental Extraction

By default, sift extract is incremental: it skips documents that were already processed with the same:
  • Model
  • Domain
  • Chunk size
This makes it safe to run repeatedly as you add documents to your corpus.
Use --force to re-extract all documents, for example after updating your domain definition.
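Conceptually, the skip check behaves like a cache keyed on those settings. The sketch below is illustrative only (sift's real cache implementation is internal); it just shows why changing any one of the three settings triggers a re-extract:

```python
def cache_key(document_id: str, model: str, domain: str, chunk_size: int) -> tuple:
    """Illustrative cache key: changes whenever any extraction-affecting
    setting changes, so the document is processed again."""
    return (document_id, model, domain, chunk_size)

# Pretend doc1 was already extracted with these settings.
seen = {cache_key("doc1", "openai/gpt-4o-mini", "general", 10000)}

# Same settings: skipped.
print(cache_key("doc1", "openai/gpt-4o-mini", "general", 10000) in seen)  # True
# New chunk size: re-extracted.
print(cache_key("doc1", "openai/gpt-4o-mini", "general", 5000) in seen)   # False
```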

Schema Discovery

In schema-free mode (--domain-name schema-free), the LLM automatically discovers entity and relation types from your documents:
sift extract ./documents --domain-name schema-free
The discovered schema is saved to output/discovered_domain.yaml and reused for subsequent extractions:
name: schema-free (discovered)
entity_types:
  PERSON:
    description: Individual people
  ORGANIZATION:
    description: Companies, institutions, groups
  CONCEPT:
    description: Abstract ideas and methodologies
relation_types:
  WORKS_FOR:
    source_types: [PERSON]
    target_types: [ORGANIZATION]
  DEVELOPED:
    source_types: [PERSON, ORGANIZATION]
    target_types: [CONCEPT]
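Once discovered, a domain like this constrains which entity types a relation may connect. A small sketch of that check, with the relation_types section above transcribed as a Python dict (the dict structure is for illustration, not sift's internal representation):

```python
relation_types = {
    "WORKS_FOR": {"source_types": ["PERSON"], "target_types": ["ORGANIZATION"]},
    "DEVELOPED": {"source_types": ["PERSON", "ORGANIZATION"], "target_types": ["CONCEPT"]},
}

def relation_is_valid(rel_type: str, source_type: str, target_type: str) -> bool:
    """Check a relation against the discovered schema's type constraints."""
    spec = relation_types.get(rel_type)
    if spec is None:
        return False
    return source_type in spec["source_types"] and target_type in spec["target_types"]

print(relation_is_valid("WORKS_FOR", "PERSON", "ORGANIZATION"))   # True
print(relation_is_valid("WORKS_FOR", "CONCEPT", "ORGANIZATION"))  # False
```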

Supported Document Formats

Kreuzberg Extractor (Default)

Supports 75+ formats including:
  • Documents: PDF, DOCX, ODT, RTF
  • Web: HTML, Markdown, XML
  • Spreadsheets: XLSX, CSV, ODS
  • Presentations: PPTX, ODP
  • Code: Python, Java, JavaScript, etc.
  • Archives: ZIP (extracts contents)
  • Images: PNG, JPG (with --ocr)

PDFPlumber Extractor

PDF-only, optimized for:
  • Tables and structured data
  • Precise text positioning
  • Form extraction
sift extract ./pdfs --extractor pdfplumber

Cost Estimation

Cost depends on:
  • Document size (total characters)
  • Chunk size (fewer large chunks = lower cost)
  • Model used (flagship models such as gpt-4o cost considerably more per token than gpt-4o-mini or claude-3-5-haiku)
Example costs (using openai/gpt-4o-mini):
  • 10-page PDF: ~$0.05-$0.15
  • 100-page PDF: ~$0.50-$1.50
  • 1000 documents (avg 20 pages): ~$100-$300
Use --max-cost to set a budget and prevent runaway costs during experimentation.
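For a back-of-the-envelope estimate of your own, convert characters to tokens (roughly 4 characters per token for English text) and multiply by the model's per-token price. The price used below is a placeholder; check your provider's current pricing:

```python
def estimate_cost_usd(total_chars: int,
                      usd_per_million_input_tokens: float,
                      chars_per_token: float = 4.0) -> float:
    """Very rough input-cost estimate; ignores output tokens and prompt overhead."""
    tokens = total_chars / chars_per_token
    return tokens / 1_000_000 * usd_per_million_input_tokens

# A 100-page PDF at ~3,000 characters per page, with a placeholder
# price of $0.20 per million input tokens:
print(round(estimate_cost_usd(300_000, 0.20), 3))  # 0.015
```

Real costs run higher because each chunk is wrapped in an extraction prompt and the model also produces output tokens.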

Performance Tips

1. Increase chunk size: Larger chunks (15000-20000) reduce API calls and cost, and work well for most documents.
2. Optimize concurrency: Increase --concurrency to 8-10 if your API tier allows it. Watch for rate limit errors.
3. Use faster models: For large corpora, use openai/gpt-4o-mini or anthropic/claude-3-5-haiku instead of flagship models.
4. Process in batches: Run batches with --max-cost limits to track expenses.
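The budget behavior behind tip 4 can be pictured as a running total that halts processing once the next document would exceed the limit. This is an illustration of the --max-cost semantics, not sift's code:

```python
def process_within_budget(doc_costs, max_cost):
    """Process documents in order, stopping before the budget is exceeded."""
    spent = 0.0
    processed = []
    for doc, cost in doc_costs:
        if spent + cost > max_cost:
            break
        spent += cost
        processed.append(doc)
    return processed, spent

docs = [("a.pdf", 0.10), ("b.pdf", 0.25), ("c.pdf", 0.40)]
done, spent = process_within_budget(docs, max_cost=0.50)
print(done)  # ['a.pdf', 'b.pdf']  -- c.pdf would push spending past $0.50
```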

Troubleshooting

"No supported documents found"

The directory contains no files in supported formats. Check:
  • File extensions (must be .pdf, .docx, .md, etc.)
  • Directory path is correct
  • Files aren’t empty

"Rate limit exceeded"

Reduce --concurrency or --rpm:
sift extract ./docs --concurrency 2 --rpm 20
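Lowering --rpm works because the client spaces requests out over time: at --rpm 20 there is on average at most one request every three seconds. A minimal client-side throttle sketch (not sift's internals):

```python
import time

class Throttle:
    """Allow at most `rpm` calls per minute by enforcing a minimum interval."""
    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self.last_call = None

    def wait(self):
        """Sleep just long enough to respect the per-minute cap, then record the call."""
        now = time.monotonic()
        if self.last_call is not None:
            sleep_for = self.min_interval - (now - self.last_call)
            if sleep_for > 0:
                time.sleep(sleep_for)
        self.last_call = time.monotonic()

t = Throttle(rpm=20)
print(t.min_interval)  # 3.0 seconds between requests
```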

Empty or poor quality extractions

  • Try reducing --chunk-size so each LLM call works on a smaller, more focused span of text
  • Use a more capable model (gpt-4o instead of gpt-4o-mini)
  • Check if documents are scanned images (need --ocr)
  • Verify domain schema matches your content

OCR not working

Install OCR dependencies:
# Tesseract (macOS)
brew install tesseract

# Tesseract (Ubuntu)
sudo apt-get install tesseract-ocr

# EasyOCR/PaddleOCR (Python)
pip install sift-kg[ocr]

Next Steps

After extraction completes:

Build Graph

Convert extractions into a unified knowledge graph

Resolve Entities

Find and merge duplicate entities
