Quick Start

Overview

This guide walks you through creating a knowledge graph from a collection of documents. You’ll extract entities and relationships, deduplicate them, and explore the results in an interactive browser viewer.

Before you begin, make sure you’ve installed sift-kg and configured your API key.

Your First Knowledge Graph

Let’s build a knowledge graph from a folder of documents. For this example, we’ll assume you have some PDFs or text files in a ./documents/ folder.

Initialize your project

Create configuration files in your current directory:

sift init

This creates:

.env.example — Template for your API keys
sift.yaml — Project settings

Copy .env.example to .env:

cp .env.example .env

Edit .env and set your API key:

.env

SIFT_OPENAI_API_KEY=sk-proj-your-key-here
SIFT_DEFAULT_MODEL=openai/gpt-4o-mini
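
Before running an extraction, it can help to confirm that both values are actually present in .env. The following is a minimal sketch in plain Python and is not part of sift-kg: the helper names parse_env and missing_keys are hypothetical, and the parser only handles simple KEY=VALUE lines.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


# Sample content copied from the .env shown above
sample = """
SIFT_OPENAI_API_KEY=sk-proj-your-key-here
SIFT_DEFAULT_MODEL=openai/gpt-4o-mini
"""
env = parse_env(sample)
print(missing_keys(env, ["SIFT_OPENAI_API_KEY", "SIFT_DEFAULT_MODEL"]))  # prints []
```

To check a real file, pass its contents instead of the sample string, for example parse_env(open(".env").read()).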

You can override the model for any command with the --model flag:

sift extract ./documents/ --model anthropic/claude-haiku-4.5

Extract entities and relations

Point sift-kg at your documents folder:

sift extract ./documents/

This will:

Read all supported files (PDF, DOCX, XLSX, HTML, images, and more; 75+ formats in total)
Chunk the text into manageable pieces
Use your LLM to extract entities and relationships
Save results to output/extractions/
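
The chunking step in the list above can be pictured with a short sketch. This is an illustration only: this guide does not describe how sift-kg splits text, so the function name chunk_text, the character-based window size, and the overlap are all assumptions.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of at most `size` characters.

    Consecutive windows share `overlap` characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


pieces = chunk_text("abcdefghij", size=4, overlap=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Overlap trades a little extra LLM cost for a lower chance of losing a relationship whose two ends fall on either side of a chunk boundary.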

Example output:

Domain: schema-free (schema-free)
Model: openai/gpt-4o-mini
Documents: 9

Extraction complete!
  Documents processed: 9/9
  Entities extracted: 777
  Relations extracted: 1201
  Total cost: $0.18
  Output: output/extractions

Next: sift build to construct the knowledge graph

Schema-free mode (the default) runs a schema discovery step — the LLM samples your documents and designs entity/relation types tailored to your corpus. The discovered schema is saved to output/discovered_domain.yaml and reused on subsequent runs.

Build the knowledge graph

Construct a NetworkX graph from all extractions:

sift build

This will:

Load all extraction results
Automatically deduplicate near-identical names (plurals, Unicode variants, case differences)
Fix reversed edge directions when the LLM swaps source/target types
Flag low-confidence relations for review
Save the graph to output/graph_data.json
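
The automatic deduplication in the list above targets names that differ only by plurals, Unicode variants, or case. One way to picture it is a normalization key shared by all variants. The sketch below is an assumption about the general technique, not sift-kg's actual code; normalize_name and group_duplicates are hypothetical helpers, and the plural rule is deliberately naive.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Build a comparison key that ignores Unicode variants, case, and a trailing plural 's'."""
    key = unicodedata.normalize("NFKC", name).casefold()
    key = " ".join(key.split())  # collapse repeated whitespace
    # Naive plural handling: drop one trailing "s", but leave "ss" endings alone
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def group_duplicates(names: list[str]) -> list[list[str]]:
    """Group names that share a key; return only groups with two or more members."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return [members for members in groups.values() if len(members) > 1]


# "\uff28" is a full-width "H", a typical Unicode variant in extracted text
names = ["Hedge Fund", "hedge funds", "\uff28edge Fund", "Congress", "FTX"]
duplicates = group_duplicates(names)
print(len(duplicates), len(duplicates[0]))  # 1 3
```

Names that need judgment, such as an abbreviation and its full form, are out of reach for this kind of key, which is why the LLM-based sift resolve step exists.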

Example output:

Loading: 9 extraction files

Graph built!
  Entities: 432
  Relations: 1201
  Flagged for review: 23 relations
  Output: output/graph_data.json

Next: sift resolve to find duplicate entities

Find and resolve duplicates

Use the LLM to find entities that likely refer to the same real-world thing:

sift resolve

This creates output/merge_proposals.yaml with proposed entity merges.

Example output:

Domain: schema-free
Graph: 432 entities, 1201 relations

Found 58 merge proposals
  Cost: $0.04
  Output: output/

Next: sift review to approve/reject merges and relations
  Then: sift apply-merges

For large graphs (1000+ entities), use --embeddings to group similar entities semantically:

sift resolve --embeddings

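This requires the optional embeddings extra, installed with pip install "sift-kg[embeddings]" (the quotes keep shells such as zsh from treating the brackets as a glob pattern).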

Review and approve merges

Review the proposed merges interactively:

sift review

This walks through each proposal, showing:

The canonical entity
The proposed merge members
The LLM’s confidence and reasoning

For each proposal, you can:

Approve — Mark as CONFIRMED
Reject — Mark as REJECTED
Skip — Leave as DRAFT to review later

By default, proposals with a confidence above 0.85 are auto-approved. You can adjust both thresholds:

sift review --auto-approve 0.90 --auto-reject 0.3
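
The effect of the two thresholds can be sketched as a three-way triage. This is a hedged illustration rather than sift-kg's implementation: the function name triage is hypothetical, and whether the real comparisons are strict or inclusive at the boundary is an assumption.

```python
def triage(confidence: float, auto_approve: float, auto_reject: float) -> str:
    """Map a proposal's confidence to a status using two thresholds."""
    if confidence > auto_approve:
        return "CONFIRMED"   # approved without prompting
    if confidence < auto_reject:
        return "REJECTED"    # rejected without prompting
    return "DRAFT"           # left for manual review


# Thresholds taken from the command above: --auto-approve 0.90 --auto-reject 0.3
statuses = [triage(c, auto_approve=0.90, auto_reject=0.3) for c in (0.95, 0.90, 0.5, 0.1)]
print(statuses)  # ['CONFIRMED', 'DRAFT', 'DRAFT', 'REJECTED']
```

Raising the approve threshold and lowering the reject threshold widens the middle band, so more proposals reach you for a manual decision.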

Alternative: Edit output/merge_proposals.yaml directly in your text editor. Change status: DRAFT to CONFIRMED or REJECTED. This is recommended for high-accuracy use cases (genealogy, legal review).

Apply your decisions

Merge the confirmed entities and remove rejected relations:

sift apply-merges

Example output:

Graph: 432 entities, 1201 relations
  Entity merges applied: 45
  Relations rejected: 8

Graph updated!
  Entities: 387
  Relations: 1193

Next: sift narrate to generate narrative summary
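
In the example output above, 45 merges take the graph from 432 to 387 entities, and 8 rejected relations take it from 1201 to 1193 relations. The sketch below shows the same bookkeeping on a toy graph. It is an assumption about the general technique, not sift-kg's code: apply_decisions is a hypothetical helper, the RELATED_TO relation type is made up, and the data structures are simplified to plain sets and tuples.

```python
def apply_decisions(nodes, edges, merges, rejected):
    """Rewrite merged entities to their canonical id and drop rejected relations.

    nodes:    set of entity ids
    edges:    list of (source, relation, target) triples
    merges:   dict mapping each merged-away id to its canonical id
    rejected: set of edge triples the reviewer rejected
    """
    def resolve(entity):
        return merges.get(entity, entity)

    kept_nodes = {resolve(n) for n in nodes}
    kept_edges = [
        (resolve(source), relation, resolve(target))
        for (source, relation, target) in edges
        if (source, relation, target) not in rejected
    ]
    return kept_nodes, kept_edges


# Toy data for illustration only
nodes = {"SBF", "Sam Bankman-Fried", "FTX"}
edges = [
    ("SBF", "RELATED_TO", "FTX"),
    ("FTX", "RELATED_TO", "Sam Bankman-Fried"),
]
merges = {"SBF": "Sam Bankman-Fried"}
rejected = {("FTX", "RELATED_TO", "Sam Bankman-Fried")}

new_nodes, new_edges = apply_decisions(nodes, edges, merges, rejected)
print(len(new_nodes), len(new_edges))  # 2 1
```

Each merge removes one entity but keeps its relations by pointing them at the canonical entity, which matches the example output, where only the rejected relations reduce the relation count.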

Generate a narrative summary

Create a prose report with entity profiles and relationship chains:

sift narrate

This produces output/narrative.md with:

An overview of the graph
Key relationship chains between top entities
A timeline (when dates exist in the data)
Entity profiles grouped by thematic community

Example output:

Graph: 387 entities, 1193 relations
Model: openai/gpt-4o-mini

Narrative generated!
  Output: output/narrative.md
  Cost: $0.06

Pipeline complete! Review the narrative at:
  output/narrative.md

Explore in your browser

Open an interactive graph viewer:

sift view

This opens output/graph.html in your browser with:

Community regions — Colored zones grouping related entities
Hover preview — See entity names and connections
Focus mode — Double-click to isolate neighborhoods
Search — Find entities by name
Filters — Toggle by type, community, source document, confidence
Trail breadcrumb — Track your exploration path

Focus Mode Navigation:

Double-click any entity to enter focus mode
Arrow keys to step through connections
Enter/Right to shift focus to a neighbor
Backspace/Left to go back along your path
Escape to exit focus mode

Search Entities from the CLI

You can search your knowledge graph directly from the terminal:

sift search "Sam Bankman"

Example output:

1 result

  PERSON: Sam Bankman-Fried
    aka: SBF, Bankman-Fried
    Connections: 47
    Sources: ftx_company.txt, sam_bankman_fried.txt, collapse_of_ftx.txt

Show connected entities:

sift search "Caroline" -r

Show descriptions (requires running sift narrate first):

sift search "FTX" -d -t ORGANIZATION

Export Your Graph

Export to various formats for use in other tools:

sift export graphml

Use GraphML/GEXF when you want to control node sizing, edge weighting, custom color schemes, or apply graph algorithms (centrality, community detection) in dedicated tools. SQLite is useful for ad-hoc SQL queries or publishing with Datasette.

Complete Pipeline Example

Here’s the full workflow in one go:

# Initialize project
sift init
cp .env.example .env
# (Edit .env with your API key)

# Extract entities and relations
sift extract ./documents/

# Build the knowledge graph
sift build

# Find duplicate entities
sift resolve

# Review and approve merges interactively
sift review

# Apply confirmed merges
sift apply-merges

# Generate narrative summary
sift narrate

# Explore in browser
sift view

# Export to GraphML for Gephi/Cytoscape
sift export graphml
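
If you prefer to drive the same steps from a script, the commands above can be wrapped with Python's subprocess module. This is a minimal sketch under stated assumptions: pipeline_commands and run_pipeline are hypothetical helpers, the script assumes sift init and the .env edit are already done, and it assumes each sift command exits with a non-zero status on failure.

```python
import subprocess


def pipeline_commands(docs_dir: str = "./documents/") -> list[list[str]]:
    """Return the quick-start commands in order, as argument lists."""
    return [
        ["sift", "extract", docs_dir],
        ["sift", "build"],
        ["sift", "resolve"],
        ["sift", "review"],          # interactive: prompts in the terminal
        ["sift", "apply-merges"],
        ["sift", "narrate"],
        ["sift", "view"],
        ["sift", "export", "graphml"],
    ]


def run_pipeline(docs_dir: str = "./documents/") -> None:
    """Run each step in turn and stop at the first failing command."""
    for command in pipeline_commands(docs_dir):
        print("$", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(command, check=True)


# Usage:
# run_pipeline("./documents/")
```

Because subprocess.run inherits the terminal by default, the interactive sift review step can still prompt you as usual.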

Real-World Examples

Explore these knowledge graphs generated entirely by sift-kg:

Transformers Papers

12 foundational AI papers mapped as a concept graph

425 entities
Cost: ~$0.72
Domain: academic

FTX Collapse

The FTX cryptocurrency exchange collapse from 9 articles

431 entities
Domain: osint

Epstein Depositions

Giuffre v. Maxwell depositions extracted from a scanned PDF

190 entities
Used OCR for scanned documents

Explore all three live — no install, no API key required.

Advanced Configuration

You can configure defaults in sift.yaml so you don’t need flags on every command:

sift.yaml

# Domain — bundled name or path to custom YAML
domain: osint

# Default LLM model
model: openai/gpt-4o-mini

# Output directory
output: output

# Enable OCR for scanned PDFs
ocr: true

# Extraction settings
extraction:
  backend: kreuzberg         # kreuzberg (default, 75+ formats) | pdfplumber
  ocr_backend: tesseract     # tesseract | easyocr | paddleocr | gcv
  ocr_language: eng          # ISO 639-3 language code

Settings priority: CLI flags > env vars > .env > sift.yaml > defaults

You can override anything from sift.yaml with a flag:

sift extract ./documents/ --domain-name academic --model anthropic/claude-haiku-4.5
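
The priority order can be modelled as layered lookups, where the first layer that defines a setting wins. The sketch below uses the standard library's ChainMap to show the idea. It is not how sift-kg is known to implement it, and resolve_settings is a hypothetical helper.

```python
from collections import ChainMap


def resolve_settings(cli, env_vars, dotenv, yaml_file, defaults):
    """Layer settings so earlier sources win: CLI > env vars > .env > sift.yaml > defaults."""
    # Drop unset (None) values so they do not shadow lower-priority layers
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (cli, env_vars, dotenv, yaml_file, defaults)
    ]
    return ChainMap(*layers)


settings = resolve_settings(
    cli={"model": "anthropic/claude-haiku-4.5", "domain": None},  # --model given, no domain flag
    env_vars={},
    dotenv={"model": "openai/gpt-4o-mini"},
    yaml_file={"domain": "osint", "model": "openai/gpt-4o-mini"},
    defaults={"domain": "schema-free"},
)
print(settings["model"])   # anthropic/claude-haiku-4.5
print(settings["domain"])  # osint
```

Here the --model flag beats both .env and sift.yaml, while the domain falls through to sift.yaml because no flag or environment variable sets it.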

Using Bundled Domains

sift-kg ships with specialized domains:

schema-free (default)

Auto-discovers entity and relation types from your data

general

PERSON, ORGANIZATION, LOCATION, EVENT, DOCUMENT

osint

Investigations: SHELL_COMPANY, FINANCIAL_ACCOUNT, beneficial ownership

academic

Literature review: CONCEPT, THEORY, METHOD, FINDING, PUBLICATION

List all available domains:

sift domains

Use a bundled domain:

sift extract ./documents/ --domain-name osint

Or set it in sift.yaml:

sift.yaml

domain: academic

Next Steps

Core Concepts

Understand how sift-kg processes documents

CLI Reference

Explore all available commands

Domains

Learn about schema-free vs. structured domains

Entity Resolution

Deep dive into the deduplication workflow
