Complete sift-kg pipeline output from the unsealed Giuffre v. Maxwell court deposition, demonstrating OCR extraction from a scanned PDF and producing a graph of 190 entities and 387 relations.

Overview

This example demonstrates knowledge graph extraction from a scanned legal document, showing how sift-kg handles PDFs whose text must be recovered with optical character recognition (OCR).

View Interactive Graph

Open examples/epstein/output/graph.html in your browser

Source Document

1 PDF with 36 sections (unsealed court deposition)

Quick Start

# View the interactive graph (no installation needed)
open examples/epstein/output/graph.html     # macOS
xdg-open examples/epstein/output/graph.html # Linux

# Or use sift's built-in viewer
sift view -o examples/epstein/output

Pipeline Output

The epstein/ example directory contains the source PDF and, under output/, the complete pipeline results:
epstein/
├── docs/                          # Source PDF (unsealed court deposition)
└── output/
    ├── extractions/               # Per-document entity+relation JSON from LLM
    ├── graph_data.json            # Knowledge graph (226 entities, 708 relations)
    ├── merge_proposals.yaml       # Entity merge decisions (16 confirmed merges)
    ├── relation_review.yaml       # Flagged relations reviewed
    ├── communities.json           # Detected graph communities
    ├── entity_descriptions.json   # AI-generated entity descriptions
    ├── narrative.md               # Prose narrative with entity profiles
    └── graph.html                 # Interactive pyvis graph viewer
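
To confirm that a run produced every artifact listed in the tree above, a minimal Python sketch can check the output directory. The file names are taken from the tree; the helper name `missing_artifacts` is hypothetical and not part of sift-kg.

```python
from pathlib import Path

# Artifact names taken from the directory tree above.
EXPECTED_ARTIFACTS = [
    "extractions",
    "graph_data.json",
    "merge_proposals.yaml",
    "relation_review.yaml",
    "communities.json",
    "entity_descriptions.json",
    "narrative.md",
    "graph.html",
]


def missing_artifacts(output_dir):
    """Return the expected artifact names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_artifacts("examples/epstein/output")
    print("missing:", missing or "none")
```

An empty result means the directory has the same top-level layout as this example; it says nothing about whether the file contents are valid.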

Pipeline Statistics

Metric            Value
Input Document    1 PDF, 36 sections
Document Type     Unsealed court deposition (OCR-scanned)
Initial Entities  226 entities, 708 relations (after build + postprocess)
Entity Types      93 persons, 56 locations, 24 organizations, 15 events, 2 vehicles
After Resolution  190 entities, 387 relations (16 entity merges confirmed)
Model Used        claude-haiku-4-5-20251001
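
As a quick sanity check on these figures, the arithmetic can be worked through in a few lines of Python. All numbers are copied from the statistics above; nothing is read from the output files. The per-type counts sum to the post-resolution total, which suggests (but does not prove) that the type breakdown describes the resolved graph rather than the initial one.

```python
# Figures copied from the pipeline statistics above.
initial = {"entities": 226, "relations": 708}
resolved = {"entities": 190, "relations": 387}
type_counts = {
    "persons": 93,
    "locations": 56,
    "organizations": 24,
    "events": 15,
    "vehicles": 2,
}

# How much the graph shrank between the initial build and resolution.
entities_removed = initial["entities"] - resolved["entities"]
relations_removed = initial["relations"] - resolved["relations"]

# Total of the per-type entity counts.
type_total = sum(type_counts.values())

print(f"entities removed:   {entities_removed}")
print(f"relations removed:  {relations_removed}")
print(f"sum of type counts: {type_total}")
```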

OCR Processing

This example requires the --ocr flag during extraction to handle scanned PDFs:
sift extract examples/epstein/docs --model openai/gpt-4o-mini -o my-output --ocr
The --ocr flag:
  • Uses optical character recognition to extract text from scanned images
  • Handles PDFs where text is not digitally encoded
  • Works with low-quality scans, handwritten documents, and mixed content
  • Automatically falls back to OCR when native text extraction fails
OCR processing may increase extraction time and cost. Ensure your documents actually require OCR before using this flag.
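
One way to decide whether a document actually requires OCR is to look at how much text a native extractor recovers per page. The sketch below is a heuristic under stated assumptions, not how sift-kg makes the decision: the function name `needs_ocr`, the 50-character threshold, and the "more than half the pages" rule are all illustrative choices. It expects a list of per-page strings that you have already extracted with a PDF library of your choice.

```python
def needs_ocr(page_texts, min_chars_per_page=50):
    """Guess whether a PDF is scanned, given natively extracted text per page.

    page_texts: list of strings, one per page, from any PDF text extractor.
    Returns True when most pages carry little or no embedded text.
    The threshold and the majority rule are assumptions for illustration.
    """
    if not page_texts:
        # No pages yielded any text at all: treat as scanned.
        return True
    sparse_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    return sparse_pages / len(page_texts) > 0.5


# A scanned PDF typically yields empty or near-empty strings per page.
print(needs_ocr(["", "  ", ""]))
# A digitally encoded PDF yields substantial text on most pages.
print(needs_ocr(["Q. Please state your name for the record. " * 5] * 3))
```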

Entity Types

The deposition primarily contains:
  • Persons (93) — Named individuals mentioned in testimony
  • Locations (56) — Places referenced in the deposition
  • Organizations (24) — Companies, institutions, and groups
  • Events (15) — Specific incidents and occasions
  • Vehicles (2) — Named aircraft or vehicles
These entity types are automatically detected based on the content — no custom domain configuration was needed.

Resolution and Review

This example includes reviewed entity merges and reviewed relations.

Entity Merges

16 entity merges were confirmed via sift resolve and human review:
merges:
  - primary: "Jeffrey Epstein"
    duplicates:
      - "Epstein"
      - "Mr. Epstein"
    status: CONFIRMED
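
To illustrate what a confirmed merge implies, the sketch below builds an alias-to-primary lookup from data shaped like the snippet above. It is not sift-kg's implementation of apply-merges. The input is a plain dictionary mirroring the YAML structure (load the real file with a YAML parser of your choice); the second entry, including its `PENDING` status, is hypothetical and only shows that unconfirmed proposals are skipped.

```python
def build_alias_map(proposals):
    """Map each duplicate name to its primary name, for CONFIRMED merges only."""
    alias_to_primary = {}
    for merge in proposals.get("merges", []):
        if merge.get("status") != "CONFIRMED":
            continue  # skip proposals that have not been confirmed
        for duplicate in merge.get("duplicates", []):
            alias_to_primary[duplicate] = merge["primary"]
    return alias_to_primary


proposals = {
    "merges": [
        {
            "primary": "Jeffrey Epstein",
            "duplicates": ["Epstein", "Mr. Epstein"],
            "status": "CONFIRMED",
        },
        {
            # Hypothetical entry and status, for illustration only.
            "primary": "Example Person",
            "duplicates": ["E. Person"],
            "status": "PENDING",
        },
    ]
}

alias_map = build_alias_map(proposals)
print(alias_map)
```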

Relation Review

The relation_review.yaml file contains flagged relations that were reviewed for accuracy:
relations:
  - source: "Person A"
    target: "Location B"
    relation_type: "VISITED"
    status: CONFIRMED
    notes: "Explicitly stated in testimony"
This demonstrates sift-kg’s support for reviewing potentially sensitive or ambiguous relationships.
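
A reviewed file of this shape can be summarized with a short script. The sketch below counts confirmed relations per relation type; it assumes the structure shown above, loaded into a dictionary by a YAML parser of your choice. The second and third entries, including the `REJECTED` status and the `EMPLOYED_BY` type, are hypothetical and not taken from the actual file.

```python
from collections import Counter


def confirmed_by_type(review):
    """Count CONFIRMED relations per relation_type."""
    return Counter(
        relation["relation_type"]
        for relation in review.get("relations", [])
        if relation.get("status") == "CONFIRMED"
    )


review = {
    "relations": [
        {
            "source": "Person A",
            "target": "Location B",
            "relation_type": "VISITED",
            "status": "CONFIRMED",
            "notes": "Explicitly stated in testimony",
        },
        # Hypothetical entries, for illustration only.
        {
            "source": "Person C",
            "target": "Location B",
            "relation_type": "VISITED",
            "status": "REJECTED",
        },
        {
            "source": "Person A",
            "target": "Organization D",
            "relation_type": "EMPLOYED_BY",
            "status": "CONFIRMED",
        },
    ]
}

counts = confirmed_by_type(review)
print(dict(counts))
```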

Key Insights from the Graph

The generated narrative (narrative.md) provides:
  • Overview — High-level summary of testimony content
  • Entity descriptions — AI-generated profiles for key persons, locations, and organizations
  • Community detection — Clustered subgraphs showing relationship networks
This example demonstrates technical capabilities for processing legal documents. The content itself is a matter of public record from unsealed court proceedings.

Re-running the Example

Option 1: Build from Existing Extractions (Free)

Use the pre-extracted entities — no LLM API calls:
pip install sift-kg
sift build -o examples/epstein/output
sift view -o examples/epstein/output

Option 2: Full Pipeline from Scratch

Re-extract entities from the source PDF with OCR:
# Extract entities with OCR support
sift extract examples/epstein/docs \
  --model openai/gpt-4o-mini \
  -o my-output \
  --ocr

# Build the knowledge graph
sift build -o my-output

# Resolve duplicate entities
sift resolve -o my-output --model openai/gpt-4o-mini

# Review proposed merges
sift review -o my-output

# Apply confirmed merges
sift apply-merges -o my-output

# Generate narrative
sift narrate -o my-output --model openai/gpt-4o-mini

# View the result
sift view -o my-output
Use the --ocr flag for scanned documents, including scanned court filings, and for any PDF where you cannot select and copy text.
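
The same sequence can be scripted. The Python sketch below assembles the commands exactly as listed above and, by default, only prints them (a dry run). The function names `pipeline_commands` and `run_pipeline` are hypothetical. Note that sift review appears as its own step between resolve and apply-merges, so a non-dry run may pause there for human input.

```python
import shlex
import subprocess


def pipeline_commands(docs_dir, output_dir, model):
    """Return the pipeline steps from this section as argument lists."""
    return [
        ["sift", "extract", docs_dir, "--model", model, "-o", output_dir, "--ocr"],
        ["sift", "build", "-o", output_dir],
        ["sift", "resolve", "-o", output_dir, "--model", model],
        ["sift", "review", "-o", output_dir],
        ["sift", "apply-merges", "-o", output_dir],
        ["sift", "narrate", "-o", output_dir, "--model", model],
        ["sift", "view", "-o", output_dir],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print("$", shlex.join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


commands = pipeline_commands(
    "examples/epstein/docs", "my-output", "openai/gpt-4o-mini"
)
run_pipeline(commands)  # dry run: prints the commands without executing them
```

Pass `dry_run=False` to execute the steps; extraction, resolution, and narration call an LLM and may incur API costs.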

Use Cases

This example pattern works well for:
  • Legal discovery — Map entities and relationships from depositions and court documents
  • Investigative research — Extract structured data from public records
  • Historical archives — Process scanned documents and old records
  • Due diligence — Analyze legal filings and testimony

Processing Scanned Documents

When working with OCR documents:
  1. Test OCR quality — Preview extracted text to ensure readability
  2. Adjust models — Consider using more powerful models (GPT-4) for complex layouts
  3. Review carefully — OCR errors can propagate to entity extraction
  4. Use resolution — Merge entities that may have OCR-induced variations
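
To show how OCR-induced variations can surface as near-duplicate entities (steps 3 and 4 above), the sketch below flags similar names using Python's standard-library difflib. This is a stand-in for illustration, not the method sift resolve uses. The misspelled name is an invented OCR misread, not a value from the deposition, and the 0.85 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.85):
    """Return (name_a, name_b, score) for pairs at or above the threshold."""
    pairs = []
    for name_a, name_b in combinations(names, 2):
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 2)))
    return pairs


# "Jeffrey Epsteln" simulates an OCR misread of "i" as "l" (invented example).
names = ["Jeffrey Epstein", "Jeffrey Epsteln", "Palm Beach"]

candidates = near_duplicates(names)
for name_a, name_b, score in candidates:
    print(f"{name_a!r} ~ {name_b!r} (similarity {score})")
```

Pairs flagged this way are candidates for human review, not automatic merges: short names and genuinely distinct people with similar names can also score highly.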

Next Steps

Explore Other Examples

See Transformers and FTX examples

OCR Guide

Learn more about processing scanned documents
