Overview
This example demonstrates knowledge graph extraction from legal documents with OCR-scanned content, showing how sift-kg handles complex PDFs where text must be extracted via optical character recognition.
View Interactive Graph
Open examples/epstein/output/graph.html in your browser.
Source Document
1 PDF with 36 sections (unsealed court deposition)
Quick Start
Pipeline Output
The output/ directory contains the complete pipeline results, including the interactive graph (graph.html).
Pipeline Statistics
| Metric | Value |
|---|---|
| Input Document | 1 PDF, 36 sections |
| Document Type | Unsealed court deposition (OCR-scanned) |
| Initial Entities | 226 entities, 708 relations (after build + postprocess) |
| Entity Types | 93 persons, 56 locations, 24 organizations, 15 events, 2 vehicles |
| After Resolution | 190 entities, 387 relations (16 entity merges confirmed) |
| Model Used | claude-haiku-4-5-20251001 |
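The figures in the table can be cross-checked with a little arithmetic. The sketch below only reuses numbers from the table; the variable and function names are illustrative. It shows that the per-type counts sum to 190, which matches the post-resolution entity total rather than the initial 226, so that row most likely describes the resolved graph.

```python
# Figures copied from the Pipeline Statistics table.
initial = {"entities": 226, "relations": 708}
resolved = {"entities": 190, "relations": 387}
type_counts = {
    "persons": 93,
    "locations": 56,
    "organizations": 24,
    "events": 15,
    "vehicles": 2,
}


def reduction(before: int, after: int) -> float:
    """Return the percentage drop from `before` to `after`."""
    return 100.0 * (before - after) / before


entity_drop = reduction(initial["entities"], resolved["entities"])
relation_drop = reduction(initial["relations"], resolved["relations"])
type_total = sum(type_counts.values())

print(f"Entities:  {initial['entities']} -> {resolved['entities']} ({entity_drop:.1f}% fewer)")
print(f"Relations: {initial['relations']} -> {resolved['relations']} ({relation_drop:.1f}% fewer)")
print(f"Sum of per-type counts: {type_total}")
```

Resolution and postprocessing remove roughly 16% of entities and 45% of relations in this example.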
OCR Processing
This example requires the --ocr flag during extraction to handle scanned PDFs.
The --ocr flag:
- Uses optical character recognition to extract text from scanned images
- Handles PDFs where text is not digitally encoded
- Works with low-quality scans, handwritten documents, and mixed content
- Automatically falls back to OCR when native text extraction fails
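The fallback behavior described in the last bullet can be pictured as a simple decision per page. The following is a minimal sketch of that idea, not sift-kg's actual implementation: the function name, the character threshold, and the stand-in extractors are all assumptions made for illustration.

```python
from typing import Callable

MIN_CHARS_PER_PAGE = 50  # hypothetical threshold, not a sift-kg setting


def extract_page_text(
    page: bytes,
    native_extract: Callable[[bytes], str],
    ocr_extract: Callable[[bytes], str],
    min_chars: int = MIN_CHARS_PER_PAGE,
) -> tuple[str, str]:
    """Return (text, method) for one page, preferring the embedded text layer."""
    try:
        text = native_extract(page)
    except Exception:
        text = ""  # treat a failed native extraction as "no text layer"
    if len(text.strip()) >= min_chars:
        return text, "native"
    # Too little native text: assume the page is a scanned image and run OCR.
    return ocr_extract(page), "ocr"


# Stand-in extractors so the sketch runs without a PDF or OCR library.
def fake_native(page: bytes) -> str:
    return page.decode("utf-8") if page.startswith(b"TEXT:") else ""


def fake_ocr(page: bytes) -> str:
    return "text recovered by OCR"


digital_page = b"TEXT:" + b"This page has an embedded text layer. " * 3
scanned_page = b"\x89PNG...image bytes..."

digital_result = extract_page_text(digital_page, fake_native, fake_ocr)
scanned_result = extract_page_text(scanned_page, fake_native, fake_ocr)
print(digital_result[1], scanned_result[1])
```

Passing the extractors in as callables keeps the decision logic separate from any particular PDF or OCR library.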
Entity Types
The deposition primarily contains:
- Persons (93): Named individuals mentioned in testimony
- Locations (56): Places referenced in the deposition
- Organizations (24): Companies, institutions, and groups
- Events (15): Specific incidents and occasions
- Vehicles (2): Named aircraft or vehicles
Resolution and Review
This example includes both confirmed entity merges and reviewed relations.
Entity Merges
Sixteen entity merges were confirmed via sift resolve and human review.
Relation Review
The relation_review.yaml file contains flagged relations that were reviewed for accuracy.
Key Insights from the Graph
The generated narrative (narrative.md) provides:
- Overview — High-level summary of testimony content
- Entity descriptions — AI-generated profiles for key persons, locations, and organizations
- Community detection — Clustered subgraphs showing relationship networks
This example demonstrates technical capabilities for processing legal documents. The content itself is a matter of public record from unsealed court proceedings.
Re-running the Example
Option 1: Build from Existing Extractions (Free)
Use the pre-extracted entities; no LLM API calls are needed.
Option 2: Full Pipeline from Scratch
Re-extract entities from the source PDF, passing the --ocr flag during extraction.
Use Cases
This example pattern works well for:
- Legal discovery: Map entities and relationships from depositions and court documents
- Investigative research: Extract structured data from public records
- Historical archives: Process scanned documents and old records
- Due diligence: Analyze legal filings and testimony
Processing Scanned Documents
When working with OCR documents:
- Test OCR quality: Preview the extracted text to confirm it is readable before running extraction
- Adjust models: Consider using a more powerful model (for example, GPT-4) for complex layouts
- Review carefully: OCR errors can propagate to entity extraction
- Use resolution: Merge entities that may have OCR-induced variations
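To make the last point concrete, OCR often produces near-identical spellings of the same name (a zero for the letter "o", or "rn" for "m"). The sketch below is one simple way to pre-screen for such pairs using string similarity from Python's standard library. It is an illustration only and does not describe how sift resolve works; the entity names and the 0.85 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(name.lower().split())


def merge_candidates(
    names: list[str], threshold: float = 0.85
) -> list[tuple[str, str, float]]:
    """Return pairs of names whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical entity names, not taken from the example graph.
names = ["Jane Doe", "Jane D0e", "Acme Holdings", "Acrne Holdings", "John Smith"]
candidates = merge_candidates(names)
for a, b, score in candidates:
    print(f"{a!r} ~ {b!r} (similarity {score})")
```

Pairs flagged this way are only candidates. As with the merges in this example, a human should confirm each one before it is applied.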
Next Steps
Explore Other Examples
See Transformers and FTX examples
OCR Guide
Learn more about processing scanned documents