Requirements

sift-kg requires Python 3.11 or higher. You can verify your Python version:
python --version
If you need to upgrade Python, visit python.org/downloads.
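The same check can be scripted, for example at the top of a setup script. This is a minimal sketch using only the standard library; the helper name `python_is_supported` is illustrative and not part of sift-kg.

```python
import sys

# Minimum version stated in the requirements above.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old for sift-kg"
    print(f"Python {current}: {status}")
```

Comparing tuples avoids the string-comparison trap where "3.9" would sort after "3.11".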

Install sift-kg

1. Install via pip

Install the base package:
pip install sift-kg
This installs the core sift-kg CLI with support for 75+ document formats (PDF, DOCX, XLSX, PPTX, HTML, EPUB, images, and more).
2. Verify installation

Check that sift-kg is installed correctly:
sift --help
You should see the available commands: extract, build, resolve, review, apply-merges, narrate, view, and more.
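If you would rather check from Python, for instance inside a CI job, the following sketch reports the installed package version and where the `sift` executable lives. It assumes the distribution name is `sift-kg` (taken from the pip command above) and that the `sift` entry point is on your PATH; the helper names are illustrative.

```python
import shutil
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="sift-kg"):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def cli_path(command="sift"):
    """Return the full path of an executable on PATH, or None if not found."""
    return shutil.which(command)


if __name__ == "__main__":
    print("sift-kg version:", installed_version() or "not installed")
    print("sift executable:", cli_path() or "not found on PATH")
```

If the version is reported but the executable is not found, the usual cause is that pip's script directory is missing from PATH.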

Optional Dependencies

sift-kg has several optional features you can enable depending on your needs.

OCR Support (Scanned PDFs)

If you need to process scanned PDFs or images, install Tesseract OCR on your system. With Homebrew, for example:
brew install tesseract
Once installed, enable OCR with the --ocr flag:
sift extract ./documents/ --ocr
sift-kg autodetects which PDFs need OCR: text-rich PDFs use standard extraction, and only near-empty pages fall back to OCR.
By default, sift-kg uses Tesseract (local, no API keys needed). You can switch OCR engines with --ocr-backend:
  • tesseract — Default, local
  • easyocr — Local, more accurate but slower
  • paddleocr — Local, fast for Asian languages
  • gcv — Google Cloud Vision (requires credentials and sift-kg[ocr] extras)
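When scripting an OCR run, it can help to confirm that Tesseract is installed and to validate the backend name before calling the CLI. The sketch below is a hypothetical wrapper, not part of sift-kg: it uses only the `--ocr` and `--ocr-backend` flags and the four backend names listed above, and it returns the argument list without executing it.

```python
import shutil

# Backend names as listed above.
OCR_BACKENDS = ("tesseract", "easyocr", "paddleocr", "gcv")


def tesseract_on_path():
    """Return True if the `tesseract` executable can be found on PATH."""
    return shutil.which("tesseract") is not None


def build_ocr_command(doc_dir, backend=None):
    """Build the `sift extract` argument list for an OCR run.

    With backend=None the flag is omitted, so sift-kg uses its default.
    """
    command = ["sift", "extract", doc_dir, "--ocr"]
    if backend is not None:
        if backend not in OCR_BACKENDS:
            raise ValueError(f"unknown OCR backend: {backend!r}")
        command += ["--ocr-backend", backend]
    return command


if __name__ == "__main__":
    if not tesseract_on_path():
        print("Tesseract not found; install it before using the default backend.")
    print(" ".join(build_ocr_command("./documents/")))
```

The returned list can be passed to `subprocess.run(command, check=True)` once you are happy with it.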

Google Cloud Vision OCR (Optional)

For Google Cloud Vision as an alternative OCR backend:
pip install sift-kg[ocr]
Then use:
sift extract ./documents/ --ocr --ocr-backend gcv
Google Cloud Vision requires setting up GCP credentials. For most users, local Tesseract OCR (the default backend) is sufficient.

Semantic Clustering (Optional)

For improved entity resolution using semantic embeddings (~2GB download for PyTorch):
pip install sift-kg[embeddings]
This enables semantic clustering during entity resolution, which groups similar entities together even if they have different spellings (e.g., “Robert Smith” and “Bob Smith”). Use it with:
sift resolve --embeddings
The embeddings feature is most useful for large graphs (1000+ entities) or when dealing with many name variations. For smaller graphs, the default alphabetical batching works well.
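The guidance above can be turned into a small decision helper. This is only a sketch: the 1000-entity threshold is the rule of thumb from the text rather than a setting in sift-kg, the function names are hypothetical, and the check assumes the extra installs the `sentence_transformers` module.

```python
import importlib.util

# Rule of thumb from the text above, not a sift-kg setting.
LARGE_GRAPH_THRESHOLD = 1000


def embeddings_installed():
    """Return True if the sentence_transformers module can be imported."""
    return importlib.util.find_spec("sentence_transformers") is not None


def build_resolve_command(entity_count, have_embeddings):
    """Add --embeddings only for large graphs when the extra is installed."""
    command = ["sift", "resolve"]
    if have_embeddings and entity_count >= LARGE_GRAPH_THRESHOLD:
        command.append("--embeddings")
    return command


if __name__ == "__main__":
    # 2500 is an example entity count, not a measured value.
    print(" ".join(build_resolve_command(2500, embeddings_installed())))
```

Graphs with many name variations may benefit from `--embeddings` even below the threshold, so treat the count as a starting point.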

Install All Optional Dependencies

To install everything at once:
pip install sift-kg[all]
This includes:
  • Google Cloud Vision OCR support
  • Semantic clustering with sentence-transformers

LLM Provider Setup

sift-kg works with any LLM provider supported by LiteLLM, including OpenAI, Anthropic, Mistral, and local models through Ollama.
1. Get an API key

Choose your LLM provider and get an API key from it.
For local/private deployment, use Ollama to run models on your own machine; no API keys or cloud services are required.
2. Initialize your project

Run sift init to create configuration files:
sift init
This creates two files:
  • .env.example — Template for API keys
  • sift.yaml — Project configuration
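To confirm that initialization produced both files, a short check like the following can be used. It is a sketch with an illustrative helper name, and it assumes `sift init` writes the files into the directory where it was run.

```python
from pathlib import Path

# File names listed above as the output of `sift init`.
EXPECTED_FILES = (".env.example", "sift.yaml")


def missing_init_files(project_dir="."):
    """Return the expected files that are not present in project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_init_files()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Project initialized.")
```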
3. Configure your API key

Copy .env.example to .env and add your API key:
cp .env.example .env
Edit .env and add your key:
.env
# Choose your provider:
SIFT_OPENAI_API_KEY=sk-...
# SIFT_ANTHROPIC_API_KEY=sk-ant-...
# SIFT_MISTRAL_API_KEY=...

# Set your default model
SIFT_DEFAULT_MODEL=openai/gpt-4o-mini
Never commit your .env file to version control. The .env.example file is safe to commit — it contains no secrets.
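A common mistake is setting a default model for one provider while leaving that provider's key commented out. The sketch below parses `.env` text and reports the key that appears to be missing. It assumes the `SIFT_<PROVIDER>_API_KEY` naming pattern seen in the example above holds for the provider prefix of `SIFT_DEFAULT_MODEL`, which is an inference from the example and not confirmed behavior. The parser is deliberately simple and is not how sift-kg itself loads the file.

```python
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def missing_setting(values):
    """Return the name of the first missing setting, or None if all are set."""
    model = values.get("SIFT_DEFAULT_MODEL", "")
    if "/" not in model:
        return "SIFT_DEFAULT_MODEL"
    provider = model.split("/", 1)[0]
    # Assumed naming pattern, inferred from the example keys above.
    key_name = f"SIFT_{provider.upper()}_API_KEY"
    return None if values.get(key_name) else key_name


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.is_file():
        problem = missing_setting(parse_env(env_file.read_text()))
        print("OK" if problem is None else f"Not set: {problem}")
    else:
        print("No .env file found; copy .env.example to .env first.")
```

Local providers such as Ollama need no key, so a "not set" result for them can be ignored.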

Development Installation

If you want to contribute to sift-kg or modify the source code:
git clone https://github.com/juanceresa/sift-kg.git
cd sift-kg
pip install -e ".[dev]"
This installs sift-kg in editable mode with development dependencies (pytest, ruff).

Verify Your Setup

Check your installation and configuration:
sift info
This displays your current configuration, including:
  • Domain settings
  • Default model
  • Output directory
  • Processing stats (if you’ve run the pipeline)

Next Steps

Quick Start

Build your first knowledge graph in 5 minutes

CLI Reference

Explore all available commands