The sift extract command processes documents and uses LLMs to identify entities (people, organizations, concepts, etc.) and relationships between them. This is the first step in building your knowledge graph.

Quick Start

sift extract ./documents
This processes all supported documents in the ./documents directory using default settings.

Command Options

Basic Options

directory (string, required)
  Directory containing documents to process.

--model (string)
  LLM model to use (e.g., openai/gpt-4o-mini, anthropic/claude-3-5-sonnet). Defaults to SIFT_DEFAULT_MODEL from config.

--domain (path)
  Path to a custom domain YAML file defining entity and relation types.

--domain-name (string, default: schema-free)
  Use a bundled domain (e.g., general, osint, academic). Run sift domains to see all available domains.

-o, --output (path)
  Output directory for extraction results. Defaults to output/ in the current directory.

Extraction Configuration

--chunk-size (integer, default: 10000)
  Characters per text chunk. Larger chunks mean fewer API calls and lower cost, but may reduce extraction quality for long documents.

-c, --concurrency (integer, default: 4)
  Number of concurrent LLM API calls per document. Increase for faster processing (watch rate limits).

--rpm (integer, default: 40)
  Maximum requests per minute, to prevent rate-limit errors.

-f, --force (boolean)
  Re-extract all documents, ignoring cached results. Normally, sift skips documents that were already processed with the same model, domain, and chunk size.

--max-cost (float)
  Maximum cost budget in USD. Extraction stops when this limit is reached.
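As a quick sanity check on these settings, the number of LLM calls a document needs is roughly its character count divided by --chunk-size. A rough sketch (sift's actual chunker may split on sentence or paragraph boundaries, so treat this as an approximation):

```python
import math

def estimate_calls(char_count: int, chunk_size: int = 10000) -> int:
    """Rough number of LLM calls needed to process one document."""
    return math.ceil(char_count / chunk_size)

# A ~50,000-character document at the default chunk size:
print(estimate_calls(50_000))          # 5 calls
# Doubling the chunk size cuts the call count to 3:
print(estimate_calls(50_000, 20_000))  # 3 calls
```

Multiply the per-document call count by your corpus size to gauge how --rpm and --concurrency will bound total runtime.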

Document Processing

--extractor
choice
Document extraction backend:
  • kreuzberg (default): Supports 75+ formats including PDF, DOCX, HTML, Markdown
  • pdfplumber: PDF-only, better for tables
--ocr
boolean
Enable OCR for scanned documents and images
--ocr-backend
choice
OCR engine when --ocr is enabled:
  • tesseract (default): Free, local
  • easyocr: Deep learning-based, good accuracy
  • paddleocr: Fast, multilingual
  • gcv: Google Cloud Vision (requires API key)
--ocr-language
string
default:"eng"
OCR language code (ISO 639-3), e.g., eng, fra, deu, spa
-v, --verbose
boolean
Enable verbose logging for debugging

Examples

Basic Extraction

# Extract with default settings
sift extract ./documents

# Use a specific model
sift extract ./documents --model anthropic/claude-3-5-sonnet

# Use a bundled domain
sift extract ./documents --domain-name osint

Advanced Configuration

# Smaller chunks, high concurrency
sift extract ./documents \
  --chunk-size 5000 \
  --concurrency 8 \
  --model openai/gpt-4o

Working with Custom Domains

# Use your own domain definition
sift extract ./documents --domain ./my-domain.yaml

# Schema-free mode (LLM discovers entity types)
sift extract ./documents --domain-name schema-free

Output Structure

Extractions are saved to output/extractions/ with one JSON file per document:
output/
├── extractions/
│   ├── document1.json
│   ├── document2.json
│   └── document3.json
└── discovered_domain.yaml  # Only in schema-free mode

Extraction File Format

Each extraction file contains:
{
  "document_id": "document1",
  "document_path": "/path/to/document1.pdf",
  "chunks_processed": 5,
  "entities": [
    {
      "name": "John Smith",
      "entity_type": "PERSON",
      "confidence": 0.95,
      "context": "John Smith was appointed CEO in 2020",
      "attributes": {
        "role": "CEO",
        "aliases": ["J. Smith"]
      }
    }
  ],
  "relations": [
    {
      "source_entity": "John Smith",
      "target_entity": "Acme Corp",
      "relation_type": "WORKS_FOR",
      "confidence": 0.9,
      "evidence": "John Smith was appointed CEO of Acme Corp"
    }
  ],
  "cost_usd": 0.15,
  "model_used": "openai/gpt-4o-mini",
  "domain_name": "general",
  "chunk_size": 10000,
  "extracted_at": "2024-03-15T10:30:00Z"
}
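Because each extraction file is plain JSON, downstream tooling can consume it directly. A minimal sketch, using the field names from the format above on an inline sample, that filters entities by confidence:

```python
import json

# Inline sample matching the documented extraction schema.
extraction_json = """
{
  "document_id": "document1",
  "entities": [
    {"name": "John Smith", "entity_type": "PERSON", "confidence": 0.95},
    {"name": "Acme Corp", "entity_type": "ORGANIZATION", "confidence": 0.6}
  ],
  "relations": [
    {"source_entity": "John Smith", "target_entity": "Acme Corp",
     "relation_type": "WORKS_FOR", "confidence": 0.9}
  ]
}
"""
extraction = json.loads(extraction_json)

# Keep only entities the model was confident about.
confident = [e["name"] for e in extraction["entities"] if e["confidence"] >= 0.8]
print(confident)  # ['John Smith']
```

In practice you would json.load() each file under output/extractions/ instead of an inline string.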

Incremental Extraction

By default, sift extract is incremental: it skips documents that were already processed with the same:
  • Model
  • Domain
  • Chunk size
This makes it safe to run repeatedly as you add documents to your corpus.
Use --force to re-extract all documents, for example after updating your domain definition.
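Conceptually, the skip check behaves like a cache keyed on those settings. The sketch below is illustrative only (sift's real cache implementation is internal); it just shows why changing any one of the three settings triggers a re-extract:

```python
def cache_key(document_id: str, model: str, domain: str, chunk_size: int) -> tuple:
    """Illustrative cache key: changes whenever any extraction-affecting
    setting changes, so the document is processed again."""
    return (document_id, model, domain, chunk_size)

# Pretend doc1 was already extracted with these settings.
seen = {cache_key("doc1", "openai/gpt-4o-mini", "general", 10000)}

# Same settings: skipped.
print(cache_key("doc1", "openai/gpt-4o-mini", "general", 10000) in seen)  # True
# New chunk size: re-extracted.
print(cache_key("doc1", "openai/gpt-4o-mini", "general", 5000) in seen)   # False
```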

Schema Discovery

In schema-free mode (--domain-name schema-free), the LLM automatically discovers entity and relation types from your documents:
sift extract ./documents --domain-name schema-free
The discovered schema is saved to output/discovered_domain.yaml and reused for subsequent extractions:
name: schema-free (discovered)
entity_types:
  PERSON:
    description: Individual people
  ORGANIZATION:
    description: Companies, institutions, groups
  CONCEPT:
    description: Abstract ideas and methodologies
relation_types:
  WORKS_FOR:
    source_types: [PERSON]
    target_types: [ORGANIZATION]
  DEVELOPED:
    source_types: [PERSON, ORGANIZATION]
    target_types: [CONCEPT]
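Once discovered, a domain like this constrains which entity types a relation may connect. A small sketch of that check, with the relation_types section above transcribed as a Python dict (the dict structure is for illustration, not sift's internal representation):

```python
relation_types = {
    "WORKS_FOR": {"source_types": ["PERSON"], "target_types": ["ORGANIZATION"]},
    "DEVELOPED": {"source_types": ["PERSON", "ORGANIZATION"], "target_types": ["CONCEPT"]},
}

def relation_is_valid(rel_type: str, source_type: str, target_type: str) -> bool:
    """Check a relation against the discovered schema's type constraints."""
    spec = relation_types.get(rel_type)
    if spec is None:
        return False
    return source_type in spec["source_types"] and target_type in spec["target_types"]

print(relation_is_valid("WORKS_FOR", "PERSON", "ORGANIZATION"))   # True
print(relation_is_valid("WORKS_FOR", "CONCEPT", "ORGANIZATION"))  # False
```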

Supported Document Formats

Kreuzberg Extractor (Default)

Supports 75+ formats including:
  • Documents: PDF, DOCX, ODT, RTF
  • Web: HTML, Markdown, XML
  • Spreadsheets: XLSX, CSV, ODS
  • Presentations: PPTX, ODP
  • Code: Python, Java, JavaScript, etc.
  • Archives: ZIP (extracts contents)
  • Images: PNG, JPG (with --ocr)

PDFPlumber Extractor

PDF-only, optimized for:
  • Tables and structured data
  • Precise text positioning
  • Form extraction
sift extract ./pdfs --extractor pdfplumber

Cost Estimation

Cost depends on:
  • Document size (total characters)
  • Chunk size (fewer large chunks = lower cost)
  • Model used (flagship models such as gpt-4o cost considerably more per token than gpt-4o-mini or claude-3-5-haiku)
Example costs (using openai/gpt-4o-mini):
  • 10-page PDF: ~$0.05-$0.15
  • 100-page PDF: ~$0.50-$1.50
  • 1000 documents (avg 20 pages): ~$100-$300
Use --max-cost to set a budget and prevent runaway costs during experimentation.
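For a back-of-the-envelope estimate of your own, convert characters to tokens (roughly 4 characters per token for English text) and multiply by the model's per-token price. The price used below is a placeholder; check your provider's current pricing:

```python
def estimate_cost_usd(total_chars: int,
                      usd_per_million_input_tokens: float,
                      chars_per_token: float = 4.0) -> float:
    """Very rough input-cost estimate; ignores output tokens and prompt overhead."""
    tokens = total_chars / chars_per_token
    return tokens / 1_000_000 * usd_per_million_input_tokens

# A 100-page PDF at ~3,000 characters per page, with a placeholder
# price of $0.20 per million input tokens:
print(round(estimate_cost_usd(300_000, 0.20), 3))  # 0.015
```

Real costs run higher because each chunk is wrapped in an extraction prompt and the model also produces output tokens.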

Performance Tips

1. Increase chunk size: Larger chunks (15000-20000) reduce API calls and cost, and work well for most documents.
2. Optimize concurrency: Increase --concurrency to 8-10 if your API tier allows it. Watch for rate limit errors.
3. Use faster models: For large corpora, use openai/gpt-4o-mini or anthropic/claude-3-5-haiku instead of flagship models.
4. Process in batches: Run batches with --max-cost limits to track expenses.
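The budget behavior behind tip 4 can be pictured as a running total that halts processing once the next document would exceed the limit. This is an illustration of the --max-cost semantics, not sift's code:

```python
def process_within_budget(doc_costs, max_cost):
    """Process documents in order, stopping before the budget is exceeded."""
    spent = 0.0
    processed = []
    for doc, cost in doc_costs:
        if spent + cost > max_cost:
            break
        spent += cost
        processed.append(doc)
    return processed, spent

docs = [("a.pdf", 0.10), ("b.pdf", 0.25), ("c.pdf", 0.40)]
done, spent = process_within_budget(docs, max_cost=0.50)
print(done)  # ['a.pdf', 'b.pdf']  -- c.pdf would push spending past $0.50
```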

Troubleshooting

"No supported documents found"

The directory contains no files in supported formats. Check:
  • File extensions (must be .pdf, .docx, .md, etc.)
  • Directory path is correct
  • Files aren’t empty

"Rate limit exceeded"

Reduce --concurrency or --rpm:
sift extract ./docs --concurrency 2 --rpm 20
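Lowering --rpm works because the client spaces requests out over time: at --rpm 20 there is on average at most one request every three seconds. A minimal client-side throttle sketch (not sift's internals):

```python
import time

class Throttle:
    """Allow at most `rpm` calls per minute by enforcing a minimum interval."""
    def __init__(self, rpm: int):
        self.min_interval = 60.0 / rpm
        self.last_call = None

    def wait(self):
        """Sleep just long enough to respect the per-minute cap, then record the call."""
        now = time.monotonic()
        if self.last_call is not None:
            sleep_for = self.min_interval - (now - self.last_call)
            if sleep_for > 0:
                time.sleep(sleep_for)
        self.last_call = time.monotonic()

t = Throttle(rpm=20)
print(t.min_interval)  # 3.0 seconds between requests
```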

Empty or poor quality extractions

  • Try reducing --chunk-size so each LLM call works on a smaller, more focused span of text
  • Use a more capable model (gpt-4o instead of gpt-4o-mini)
  • Check if documents are scanned images (need --ocr)
  • Verify domain schema matches your content

OCR not working

Install OCR dependencies:
# Tesseract (macOS)
brew install tesseract

# Tesseract (Ubuntu)
sudo apt-get install tesseract-ocr

# EasyOCR/PaddleOCR (Python)
pip install sift-kg[ocr]

Next Steps

After extraction completes:

Build Graph

Convert extractions into a unified knowledge graph

Resolve Entities

Find and merge duplicate entities
