The sift extract command processes documents and uses LLMs to identify entities (people, organizations, concepts, etc.) and the relationships between them. This is the first step in building your knowledge graph.
Quick Start
Process all documents in the ./documents directory using default settings.
Command Options
Basic Options
Directory containing documents to process
LLM model to use (e.g., openai/gpt-4o-mini, anthropic/claude-3-5-sonnet). Defaults to SIFT_DEFAULT_MODEL from config.
Path to custom domain YAML file defining entity and relation types
Use a bundled domain (e.g., general, osint, academic). Run sift domains to see all available domains.
Output directory for extraction results. Defaults to output/ in current directory.
Extraction Configuration
Characters per text chunk. Larger chunks = fewer API calls and lower cost, but may reduce extraction quality for long documents.
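The trade-off can be made concrete with a small sketch. It assumes one LLM call per chunk and no overlap between chunks, which may not match how sift actually splits text; chunk_count and the document length are hypothetical.

```python
import math


def chunk_count(total_chars: int, chunk_size: int) -> int:
    """Number of fixed-size chunks needed to cover a document."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(total_chars / chunk_size)


if __name__ == "__main__":
    doc_chars = 120_000  # hypothetical document length in characters
    for size in (5_000, 10_000, 20_000):
        # Doubling the chunk size halves the number of calls in this model.
        print(size, chunk_count(doc_chars, size))
```

Under these assumptions, doubling the chunk size halves the number of calls, which is where the cost saving comes from.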
Number of concurrent LLM API calls per document. Increase for faster processing (watch rate limits).
Maximum requests per minute to prevent rate limit errors
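A requests-per-minute cap can be pictured as a minimum spacing of 60 / rpm seconds between calls. The sketch below illustrates that idea only; RpmLimiter is a hypothetical helper, not sift's implementation, and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


class RpmLimiter:
    """Spaces calls so that at most `rpm` of them start per minute."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.interval = 60.0 / rpm  # minimum seconds between call starts
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        start = now if self._next_allowed is None else max(now, self._next_allowed)
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = start + self.interval
        return delay


if __name__ == "__main__":
    print(RpmLimiter(30).interval)  # seconds between requests at 30 rpm
```

With several concurrent workers, each would call wait() on a shared limiter before sending a request, so raising concurrency alone does not push the request rate past the cap.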
Re-extract all documents, ignoring cached results. Normally, sift skips documents that were already processed with the same model, domain, and chunk size.
Maximum cost budget in USD. Extraction stops when this limit is reached.
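As an illustration of a budget cap, the sketch below stops before any chunk that would push spending over the limit. Whether sift checks the budget before or after each call is not stated here, so this is an assumption; process_with_budget and the per-chunk costs are hypothetical.

```python
def process_with_budget(chunk_costs, max_cost):
    """Process chunks in order, stopping before the budget would be exceeded.

    Returns (chunks_processed, total_spent).
    """
    spent = 0.0
    done = 0
    for cost in chunk_costs:
        if spent + cost > max_cost:
            break  # the next chunk would exceed the budget
        spent += cost
        done += 1
    return done, spent


if __name__ == "__main__":
    # Three chunks at a hypothetical 0.40 USD each against a 1.00 USD budget.
    print(process_with_budget([0.4, 0.4, 0.4], 1.0))
```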
Document Processing
Document extraction backend:
- kreuzberg (default): Supports 75+ formats including PDF, DOCX, HTML, Markdown
- pdfplumber: PDF-only, better for tables
Enable OCR for scanned documents and images
OCR engine when --ocr is enabled:
- tesseract (default): Free, local
- easyocr: Deep learning-based, good accuracy
- paddleocr: Fast, multilingual
- gcv: Google Cloud Vision (requires API key)
OCR language code (ISO 639-3), e.g., eng, fra, deu, spa
Enable verbose logging for debugging
Examples
Basic Extraction
Advanced Configuration
Working with Custom Domains
Output Structure
Extractions are saved to output/extractions/ with one JSON file per document.
Extraction File Format
Each extraction file contains:
Incremental Extraction
By default, sift extract is incremental: it skips documents that were already processed with the same:
- Model
- Domain
- Chunk size
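The skip rule above can be sketched as a comparison of the current settings against those recorded for each document. This is a minimal illustration, not sift's actual cache format; ExtractionSettings, needs_extraction, and the cache dictionary are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionSettings:
    model: str
    domain: str
    chunk_size: int


def needs_extraction(doc_id, settings, cache, force=False):
    """Return True when a document must be (re)processed.

    `cache` maps a document id to the settings used for its last extraction.
    """
    if force:
        return True  # forced runs ignore cached results
    return cache.get(doc_id) != settings


if __name__ == "__main__":
    used = ExtractionSettings("openai/gpt-4o-mini", "general", 10_000)
    cache = {"report.pdf": used}
    print(needs_extraction("report.pdf", used, cache))              # unchanged settings
    print(needs_extraction("report.pdf", used, cache, force=True))  # forced re-run
```

Changing any one of the three settings makes the comparison fail, so the document is extracted again.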
Use --force to re-extract all documents, for example after updating your domain definition.
Schema Discovery
In schema-free mode (--domain-name schema-free), the LLM automatically discovers entity and relation types from your documents. The discovered schema is saved to output/discovered_domain.yaml and reused for subsequent extractions.
Supported Document Formats
Kreuzberg Extractor (Default)
Supports 75+ formats including:
- Documents: PDF, DOCX, ODT, RTF
- Web: HTML, Markdown, XML
- Spreadsheets: XLSX, CSV, ODS
- Presentations: PPTX, ODP
- Code: Python, Java, JavaScript, etc.
- Archives: ZIP (extracts contents)
- Images: PNG, JPG (with --ocr)
PDFPlumber Extractor
PDF-only, optimized for:
- Tables and structured data
- Precise text positioning
- Form extraction
Cost Estimation
Cost depends on:
- Document size (total characters)
- Chunk size (fewer large chunks = lower cost)
- Model used (GPT-4 > GPT-4o-mini > Claude Haiku)
Approximate costs in USD (using openai/gpt-4o-mini):
- 10-page PDF: ~$0.15
- 100-page PDF: ~$1.50
- 1000 documents (avg 20 pages): ~$300
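The figures above are consistent with a flat rate of roughly 0.015 per page, which gives a quick way to scale the estimate to other corpus sizes. The sketch below assumes the amounts are in USD (in line with the USD cost budget option) and that cost grows linearly with page count; estimate_cost is a hypothetical helper, and real costs depend on page density, chunk size, and model pricing.

```python
# Per-page rate implied by the 10-page figure (~0.15 for 10 pages).
COST_PER_PAGE = 0.15 / 10


def estimate_cost(num_docs: int, avg_pages: float,
                  cost_per_page: float = COST_PER_PAGE) -> float:
    """Rough linear cost estimate: documents x pages x per-page rate."""
    return num_docs * avg_pages * cost_per_page


if __name__ == "__main__":
    print(round(estimate_cost(1, 100), 2))    # one 100-page PDF
    print(round(estimate_cost(1000, 20), 2))  # 1000 documents of 20 pages each
```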
Performance Tips
Increase chunk size
Larger chunks (15000-20000) reduce API calls and cost, suitable for most documents.
Optimize concurrency
Increase --concurrency to 8-10 if your API tier allows it. Watch for rate limit errors.
Use faster models
For large corpora, use openai/gpt-4o-mini or anthropic/claude-3-5-haiku instead of flagship models.
Troubleshooting
"No supported documents found"
The directory contains no files in supported formats. Check:
- File extensions (must be .pdf, .docx, .md, etc.)
- Directory path is correct
- Files aren’t empty
"Rate limit exceeded"
Reduce --concurrency or --rpm.
Empty or poor quality extractions
- Try reducing --chunk-size for better context
- Use a more capable model (gpt-4o instead of gpt-4o-mini)
- Check if documents are scanned images (need --ocr)
- Verify domain schema matches your content
OCR not working
Install OCR dependencies:
Next Steps
After extraction completes:
Build Graph
Convert extractions into a unified knowledge graph
Resolve Entities
Find and merge duplicate entities