Overview

This guide walks you through creating a knowledge graph from a collection of documents. You’ll extract entities and relationships, deduplicate them, and explore the results in an interactive browser viewer.
Before you begin, make sure you’ve installed sift-kg and configured your API key.

Your First Knowledge Graph

Let’s build a knowledge graph from a folder of documents. For this example, we’ll assume you have some PDFs or text files in a ./documents/ folder.
1. Initialize your project

Create configuration files in your current directory:
sift init
This creates:
  • .env.example — Template for your API keys
  • sift.yaml — Project settings
Copy .env.example to .env:
cp .env.example .env
Edit .env and set your API key:
.env
SIFT_OPENAI_API_KEY=sk-proj-your-key-here
SIFT_DEFAULT_MODEL=openai/gpt-4o-mini
You can override the model for any command with the --model flag:
sift extract ./documents/ --model anthropic/claude-haiku-4.5
2. Extract entities and relations

Point sift-kg at your documents folder:
sift extract ./documents/
This will:
  • Read all supported files (PDF, DOCX, XLSX, HTML, images, and more; 75+ formats in total)
  • Chunk the text into manageable pieces
  • Use your LLM to extract entities and relationships
  • Save results to output/extractions/
Example output:
Domain: schema-free (schema-free)
Model: openai/gpt-4o-mini
Documents: 9

Extraction complete!
  Documents processed: 9/9
  Entities extracted: 777
  Relations extracted: 1201
  Total cost: $0.18
  Output: output/extractions

Next: sift build to construct the knowledge graph
Schema-free mode (the default) runs a schema discovery step — the LLM samples your documents and designs entity/relation types tailored to your corpus. The discovered schema is saved to output/discovered_domain.yaml and reused on subsequent runs.
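The chunking step listed above can be pictured with a minimal sketch. This guide does not describe how sift-kg actually splits text, so the function name `chunk_text`, the `max_chars` and `overlap` parameters, and the fixed-width strategy are all assumptions made for illustration.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-width windows that overlap by `overlap` characters.

    Hypothetical helper for illustration; not part of sift-kg.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in [0, max_chars)")

    step = max_chars - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, so an entity mentioned there is less likely to be missed.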
3. Build the knowledge graph

Construct a NetworkX graph from all extractions:
sift build
This will:
  • Load all extraction results
  • Automatically deduplicate near-identical names (plurals, Unicode variants, case differences)
  • Fix reversed edge directions when the LLM swaps source/target types
  • Flag low-confidence relations for review
  • Save the graph to output/graph_data.json
Example output:
Loading: 9 extraction files

Graph built!
  Entities: 432
  Relations: 1201
  Flagged for review: 23 relations
  Output: output/graph_data.json

Next: sift resolve to find duplicate entities
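To make the automatic deduplication of near-identical names more concrete, here is a rough sketch of one way such matching could work. It is not sift-kg's implementation: `normalize_name` and `group_near_duplicates` are hypothetical helpers, and the plural rule is deliberately naive.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Build a comparison key that ignores case, Unicode variants, and a trailing plural 's'."""
    # NFKC folds compatibility characters (e.g. fullwidth letters) into their plain forms.
    key = unicodedata.normalize("NFKC", name).casefold().strip()
    # Naive plural handling (assumption): drop one trailing "s", but leave "ss" endings alone.
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def group_near_duplicates(names: list[str]) -> list[list[str]]:
    """Return groups of names that share a normalized key, in first-seen order."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return [members for members in groups.values() if len(members) > 1]
```

Names that differ in meaning rather than spelling (for example "SBF" and "Sam Bankman-Fried") are out of reach for this kind of key and are left to the LLM-based sift resolve step.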
4. Find and resolve duplicates

Use the LLM to find entities that likely refer to the same real-world thing:
sift resolve
This creates output/merge_proposals.yaml with proposed entity merges.
Example output:
Domain: schema-free
Graph: 432 entities, 1201 relations

Found 58 merge proposals
  Cost: $0.04
  Output: output/

Next: sift review to approve/reject merges and relations
  Then: sift apply-merges
For large graphs (1000+ entities), use --embeddings to group similar entities semantically:
sift resolve --embeddings
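This requires the optional embeddings extra: pip install "sift-kg[embeddings]" (the quotes keep shells such as zsh from treating the square brackets as a glob pattern).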
5. Review and approve merges

Review the proposed merges interactively:
sift review
This walks through each proposal, showing:
  • The canonical entity
  • The proposed merge members
  • The LLM’s confidence and reasoning
For each proposal, you can:
  • Approve — Mark as CONFIRMED
  • Reject — Mark as REJECTED
  • Skip — Leave as DRAFT to review later
Proposals with a confidence above 0.85 are auto-approved. You can adjust the thresholds:
sift review --auto-approve 0.90 --auto-reject 0.3
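The threshold logic can be sketched as a small triage function. This is only an illustration: the function name `triage` is hypothetical, and two details are assumptions rather than documented behavior, namely that auto-reject is off unless a threshold is given and that both comparisons are strict.

```python
from typing import Optional


def triage(confidence: float,
           auto_approve: float = 0.85,
           auto_reject: Optional[float] = None) -> str:
    """Map a proposal's confidence to a status.

    Hypothetical helper; boundary handling is an assumption.
    """
    if confidence > auto_approve:
        return "CONFIRMED"  # high confidence: approved without asking
    if auto_reject is not None and confidence < auto_reject:
        return "REJECTED"   # low confidence: rejected without asking
    return "DRAFT"          # everything in between waits for a human decision
```

With the flags shown above (0.90 and 0.3), a proposal at 0.5 would stay in DRAFT and be presented for manual review.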
Alternative: Edit output/merge_proposals.yaml directly in your text editor. Change status: DRAFT to CONFIRMED or REJECTED. This is recommended for high-accuracy use cases (genealogy, legal review).
6. Apply your decisions

Merge the confirmed entities and remove rejected relations:
sift apply-merges
Example output:
Graph: 432 entities, 1201 relations
  Entity merges applied: 45
  Relations rejected: 8

Graph updated!
  Entities: 387
  Relations: 1193

Next: sift narrate to generate narrative summary
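The counts in the example output follow directly from the decisions: 432 entities minus 45 merged-away entities leaves 387, and 1201 relations minus 8 rejected leaves 1193. The sketch below shows the general idea of rewiring relations onto canonical entities. It is a simplified model, not sift-kg's code: `apply_merges` here is a hypothetical Python function, and the tuple-based graph representation is an assumption.

```python
def apply_merges(entities: set[str],
                 relations: list[tuple[str, str, str]],
                 merges: dict[str, str]) -> tuple[set[str], list[tuple[str, str, str]]]:
    """Rewire a toy graph so merged-away entities point at their canonical entity.

    `merges` maps each merged-away name to its canonical name.
    Relations are (source, relation_type, target) tuples.
    """
    def canonical(name: str) -> str:
        return merges.get(name, name)

    merged_entities = {canonical(name) for name in entities}
    # Every relation is kept; only its endpoints change. This mirrors the example
    # output, where the relation count drops only by the rejected relations.
    merged_relations = [(canonical(s), rel, canonical(t)) for s, rel, t in relations]
    return merged_entities, merged_relations
```

Each confirmed merge member removes exactly one entity from the graph, which is why the entity count falls by the number of merges applied.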
7. Generate a narrative summary

Create a prose report with entity profiles and relationship chains:
sift narrate
This produces output/narrative.md with:
  • An overview of the graph
  • Key relationship chains between top entities
  • A timeline (when dates exist in the data)
  • Entity profiles grouped by thematic community
Example output:
Graph: 387 entities, 1193 relations
Model: openai/gpt-4o-mini

Narrative generated!
  Output: output/narrative.md
  Cost: $0.06

Pipeline complete! Review the narrative at:
  output/narrative.md
8. Explore in your browser

Open an interactive graph viewer:
sift view
This opens output/graph.html in your browser with:
  • Community regions — Colored zones grouping related entities
  • Hover preview — See entity names and connections
  • Focus mode — Double-click to isolate neighborhoods
  • Search — Find entities by name
  • Filters — Toggle by type, community, source document, confidence
  • Trail breadcrumb — Track your exploration path
Focus Mode Navigation:
  • Double-click any entity to enter focus mode
  • Arrow keys to step through connections
  • Enter/Right to shift focus to a neighbor
  • Backspace/Left to go back along your path
  • Escape to exit focus mode

Search Entities from the CLI

You can search your knowledge graph directly from the terminal:
sift search "Sam Bankman"
Example output:
1 result

  PERSON: Sam Bankman-Fried
    aka: SBF, Bankman-Fried
    Connections: 47
    Sources: ftx_company.txt, sam_bankman_fried.txt, collapse_of_ftx.txt
Show connected entities:
sift search "Caroline" -r
Show descriptions (requires running sift narrate first):
sift search "FTX" -d -t ORGANIZATION
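As a rough mental model of the search above, a partial query such as "Sam Bankman" matching "Sam Bankman-Fried" behaves like a case-insensitive substring match. The sketch below is an assumption-laden illustration, not the CLI's actual logic: the `Entity` class, the `search` function, matching against aliases, and the type filter (suggested by the `-t ORGANIZATION` example) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """Toy entity record (hypothetical; not sift-kg's data model)."""
    name: str
    entity_type: str
    aliases: list[str] = field(default_factory=list)


def search(entities: list[Entity], query: str,
           entity_type: Optional[str] = None) -> list[Entity]:
    """Return entities whose name or any alias contains the query, ignoring case."""
    needle = query.casefold()
    hits = []
    for entity in entities:
        if entity_type is not None and entity.entity_type != entity_type:
            continue  # optional type filter
        labels = [entity.name, *entity.aliases]
        if any(needle in label.casefold() for label in labels):
            hits.append(entity)
    return hits
```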

Export Your Graph

Export to various formats for use in other tools:
sift export graphml
Use GraphML/GEXF when you want to control node sizing, edge weighting, custom color schemes, or apply graph algorithms (centrality, community detection) in dedicated tools. SQLite is useful for ad-hoc SQL queries or publishing with Datasette.
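Before loading an export into another tool, it can help to sanity-check it. The sketch below counts nodes and edges in GraphML content using only the Python standard library. It assumes the export uses the standard GraphML XML namespace; `graphml_counts` is a hypothetical helper, and this guide does not state where the exported file is written.

```python
import xml.etree.ElementTree as ET
from typing import Union

# Standard GraphML namespace; assumed to be what the exported file declares.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def graphml_counts(xml_data: Union[str, bytes]) -> tuple[int, int]:
    """Return (node_count, edge_count) for GraphML content."""
    root = ET.fromstring(xml_data)
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    return len(nodes), len(edges)
```

For a real file, pass its bytes, for example `graphml_counts(Path(export_path).read_bytes())` with `pathlib.Path`, where `export_path` is wherever your export was saved. The counts should match the entity and relation totals reported by the pipeline.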

Complete Pipeline Example

Here’s the full workflow in one go:
# Initialize project
sift init
cp .env.example .env
# (Edit .env with your API key)

# Extract entities and relations
sift extract ./documents/

# Build the knowledge graph
sift build

# Find duplicate entities
sift resolve

# Review and approve merges interactively
sift review

# Apply confirmed merges
sift apply-merges

# Generate narrative summary
sift narrate

# Explore in browser
sift view

# Export to GraphML for Gephi/Cytoscape
sift export graphml
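If you prefer to drive the workflow from a script, one possible approach is a thin wrapper that runs each command in order and stops at the first failure. This is a sketch under assumptions: it presumes the sift commands exit with a non-zero status on error, and `PIPELINE` and `run_pipeline` are hypothetical names. Note that sift review is interactive, so it will still prompt in the terminal.

```python
import subprocess
from typing import Callable, Sequence

# Commands taken from the workflow above, in order.
PIPELINE: list[list[str]] = [
    ["sift", "extract", "./documents/"],
    ["sift", "build"],
    ["sift", "resolve"],
    ["sift", "review"],        # interactive: asks for approve/reject decisions
    ["sift", "apply-merges"],
    ["sift", "narrate"],
]


def run_pipeline(steps: Sequence[Sequence[str]] = PIPELINE,
                 runner: Callable = subprocess.run) -> int:
    """Run each step in order and stop at the first non-zero exit code.

    Returns the number of steps that completed successfully.
    """
    completed = 0
    for command in steps:
        result = runner(command)
        if result.returncode != 0:
            break  # later steps depend on earlier output, so do not continue
        completed += 1
    return completed
```

Calling `run_pipeline()` executes the real commands; passing a different `runner` makes it possible to dry-run or test the sequence without invoking the CLI.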

Real-World Examples

Explore these knowledge graphs generated entirely by sift-kg:

Transformers Papers

12 foundational AI papers mapped as a concept graph
  • 425 entities
  • Cost: ~$0.72
  • Domain: academic

FTX Collapse

The FTX cryptocurrency exchange collapse from 9 articles
  • 431 entities
  • Domain: osint

Epstein Depositions

Giuffre v. Maxwell depositions extracted from a scanned PDF
  • 190 entities
  • Used OCR for scanned documents
Explore all three live — no install, no API key required.

Advanced Configuration

You can configure defaults in sift.yaml so you don’t need flags on every command:
sift.yaml
# Domain — bundled name or path to custom YAML
domain: osint

# Default LLM model
model: openai/gpt-4o-mini

# Output directory
output: output

# Enable OCR for scanned PDFs
ocr: true

# Extraction settings
extraction:
  backend: kreuzberg         # kreuzberg (default, 75+ formats) | pdfplumber
  ocr_backend: tesseract     # tesseract | easyocr | paddleocr | gcv
  ocr_language: eng          # ISO 639-3 language code
Settings priority: CLI flags > env vars > .env > sift.yaml > defaults
You can override anything from sift.yaml with a flag:
sift extract ./documents/ --domain-name academic --model anthropic/claude-haiku-4.5
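The priority order can be modeled as layered lookups in which the first layer that defines a key wins. The sketch below uses `collections.ChainMap` from the Python standard library to show the idea. It is not how sift-kg loads its configuration: `resolve_settings` is a hypothetical function, and the keys and values are illustrative.

```python
from collections import ChainMap


def resolve_settings(cli: dict, env: dict, dotenv: dict,
                     yaml_cfg: dict, defaults: dict) -> dict:
    """Merge setting layers so that earlier layers win.

    Order: CLI flags > env vars > .env > sift.yaml > defaults.
    """
    # Drop unset (None) values so an omitted flag does not mask a lower layer.
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (cli, env, dotenv, yaml_cfg, defaults)
    ]
    return dict(ChainMap(*layers))
```

For example, a `--model` flag on the command line beats the model set in .env or sift.yaml, while a domain set only in sift.yaml still applies because no higher layer defines it.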

Using Bundled Domains

sift-kg ships with specialized domains:

schema-free (default)

Auto-discovers entity and relation types from your data

general

PERSON, ORGANIZATION, LOCATION, EVENT, DOCUMENT

osint

Investigations: SHELL_COMPANY, FINANCIAL_ACCOUNT, beneficial ownership

academic

Literature review: CONCEPT, THEORY, METHOD, FINDING, PUBLICATION
List all available domains:
sift domains
Use a bundled domain:
sift extract ./documents/ --domain-name osint
Or set it in sift.yaml:
sift.yaml
domain: academic

Next Steps

Core Concepts

Understand how sift-kg processes documents

CLI Reference

Explore all available commands

Domains

Learn about schema-free vs. structured domains

Entity Resolution

Deep dive into the deduplication workflow
