Complete sift-kg pipeline output from 12 foundational academic papers on transformer architectures and large language models.

Overview

This example demonstrates knowledge graph extraction from academic research papers, producing a graph of 425 entities and 1,122 relations across systems, concepts, researchers, methods, phenomena, and findings.

View Interactive Graph

Open examples/transformers/output/graph.html in your browser

Source Papers

12 PDFs, including “Attention Is All You Need”, BERT, GPT-2, GPT-3, ViT, and DALL-E

Quick Start

# View the interactive graph (no installation needed)
open examples/transformers/output/graph.html     # macOS
xdg-open examples/transformers/output/graph.html # Linux

# Or use sift's built-in viewer
sift view -o examples/transformers/output

Pipeline Output

The example directory holds the source papers, the domain configuration, and the complete pipeline results under output/:
transformers/
├── docs/                          # 12 source papers (PDF)
├── sift.yaml                      # Domain configuration (academic)
└── output/
    ├── extractions/               # Per-document entity+relation JSON from LLM
    ├── graph_data.json            # Knowledge graph (425 entities, 1122 relations)
    ├── communities.json           # Detected graph communities
    ├── entity_descriptions.json   # AI-generated entity descriptions
    ├── narrative.md               # Prose narrative with entity profiles
    └── graph.html                 # Interactive pyvis graph viewer
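
To get a feel for these files before opening them, a small script can report the top-level shape of each JSON file. This is a minimal sketch: it deliberately assumes nothing about the schema sift-kg writes, and the helper names `summarize_json` and `summarize_output` are illustrative, not part of sift-kg.

```python
import json
from pathlib import Path


def summarize_json(path: Path) -> str:
    """Describe the top-level shape of a JSON file without assuming its schema."""
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, dict):
        # Show at most ten keys so large objects stay readable.
        keys = ", ".join(sorted(data)[:10])
        return f"object with {len(data)} keys: {keys}"
    if isinstance(data, list):
        return f"array with {len(data)} items"
    return f"scalar of type {type(data).__name__}"


def summarize_output(output_dir: str) -> dict[str, str]:
    """Map each top-level *.json file in output_dir to a one-line summary."""
    root = Path(output_dir)
    return {p.name: summarize_json(p) for p in sorted(root.glob("*.json"))}


if __name__ == "__main__":
    # Path taken from the directory listing above; prints nothing if it is missing.
    for name, summary in summarize_output("examples/transformers/output").items():
        print(f"{name}: {summary}")
```

Run from the repository root, this should list graph_data.json, communities.json, and entity_descriptions.json, which is usually enough to decide which keys to explore next.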

Pipeline Statistics

Metric            Value
Input Documents   12 PDFs
Source Papers     Attention Is All You Need, BERT, GPT-2, GPT-3, ViT, DALL-E, and more
Total Entities    425
Total Relations   1,122
Entity Types      118 systems, 73 concepts, 71 researchers, 70 methods, 34 phenomena, 25 findings
Domain            academic (bundled with sift-kg)
Model Used        claude-haiku-4-5-20251001
Total Cost        ~$0.72
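
A few derived figures help put these statistics in context. The sketch below only does arithmetic on the numbers reported above. Note that the six listed entity types account for 391 of the 425 entities; the remaining 34 presumably belong to types not listed here, which is an inference rather than something the example states.

```python
# Figures copied from the pipeline statistics; nothing here is read from disk.
total_entities = 425
total_relations = 1122
documents = 12

type_counts = {
    "systems": 118,
    "concepts": 73,
    "researchers": 71,
    "methods": 70,
    "phenomena": 34,
    "findings": 25,
}

listed = sum(type_counts.values())             # entities covered by the six listed types
unlisted = total_entities - listed             # entities not attributed to a listed type
relations_per_entity = total_relations / total_entities
entities_per_document = total_entities / documents

print(f"listed types cover {listed} of {total_entities} entities ({unlisted} unlisted)")
print(f"relations per entity: {relations_per_entity:.2f}")
print(f"entities per document: {entities_per_document:.1f}")
```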

Domain Configuration

This example uses the built-in academic domain, which is selected in sift.yaml:
domain: academic
The academic domain includes entity types optimized for research papers:
  • System — Models, frameworks, architectures (e.g., “GPT-3”, “BERT”)
  • Concept — Theoretical ideas and principles
  • Researcher — Authors and cited researchers
  • Method — Techniques and approaches
  • Phenomenon — Observed patterns and behaviors
  • Finding — Research results and conclusions
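
Once entities are loaded as a list of records, tallying them by type is a one-liner with `collections.Counter`. The sketch below is hedged: the field name `entity_type` and the sample records are assumptions made for illustration, not the documented layout of graph_data.json, so pass whichever key your file actually uses.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_type(
    entities: Iterable[Mapping[str, object]],
    type_key: str = "entity_type",  # hypothetical field name; adjust to the real schema
) -> Counter:
    """Tally entities by the value under type_key; records without it count as UNKNOWN."""
    return Counter(str(e.get(type_key, "UNKNOWN")) for e in entities)


# Illustrative records only; these are not taken from the real graph_data.json.
sample = [
    {"name": "GPT-3", "entity_type": "System"},
    {"name": "BERT", "entity_type": "System"},
    {"name": "placeholder concept", "entity_type": "Concept"},
    {"name": "untyped placeholder"},
]

counts = count_by_type(sample)
for entity_type, n in counts.most_common():
    print(f"{entity_type}: {n}")
```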

Key Insights from the Graph

The generated narrative (narrative.md) provides:
  • Overview — Synthesis of how transformer architectures evolved
  • Entity descriptions — AI-generated profiles for key systems, researchers, and concepts
  • Community detection — Clustered subgraphs (e.g., vision transformers, language models, attention mechanisms)

Re-running the Example

Option 1: Build from Existing Extractions (Free)

Use the pre-extracted entities — no LLM API calls:
pip install sift-kg
sift build -o examples/transformers/output
sift view -o examples/transformers/output

Option 2: Full Pipeline from Scratch

Re-extract entities from the source PDFs:
# Extract entities from papers
sift extract examples/transformers/docs \
  --model openai/gpt-4o-mini \
  -o my-output \
  --domain academic

# Build the knowledge graph
sift build -o my-output

# Resolve duplicate entities
sift resolve -o my-output --model openai/gpt-4o-mini

# Review and apply merges
sift review -o my-output
sift apply-merges -o my-output

# Generate narrative
sift narrate -o my-output --model openai/gpt-4o-mini

# View the result
sift view -o my-output
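
When the same sequence is run repeatedly, it can be convenient to generate the commands from a few parameters. The following is a minimal sketch that uses only the subcommands and flags shown above; the function names `pipeline_commands` and `run_pipeline` are illustrative. It defaults to a dry run that prints each command, since `sift review` appears to be an interactive step and is best run from a terminal.

```python
import shlex
import subprocess


def pipeline_commands(
    docs_dir: str, output_dir: str, model: str, domain: str = "academic"
) -> list[list[str]]:
    """Return the full-pipeline commands, in order, as argument lists."""
    return [
        ["sift", "extract", docs_dir, "--model", model, "-o", output_dir, "--domain", domain],
        ["sift", "build", "-o", output_dir],
        ["sift", "resolve", "-o", output_dir, "--model", model],
        ["sift", "review", "-o", output_dir],
        ["sift", "apply-merges", "-o", output_dir],
        ["sift", "narrate", "-o", output_dir, "--model", model],
        ["sift", "view", "-o", output_dir],
    ]


def run_pipeline(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it too unless dry_run is set."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(
        pipeline_commands("examples/transformers/docs", "my-output", "openai/gpt-4o-mini")
    )
```

Passing `dry_run=False` executes the steps for real, which incurs LLM API costs for the extract, resolve, and narrate stages.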

Option 3: Use Your Own Papers

mkdir my-papers
cp your-pdfs/*.pdf my-papers/

sift extract my-papers --model openai/gpt-4o-mini -o my-output --domain academic
sift build -o my-output
sift resolve -o my-output --model openai/gpt-4o-mini
sift review -o my-output
sift apply-merges -o my-output
sift narrate -o my-output --model openai/gpt-4o-mini
sift view -o my-output
Use --domain academic when extracting from research papers to get entity types optimized for scholarly content.

Cost Breakdown

The ~$0.72 total cost includes:
  • Extraction — LLM calls to extract entities and relations from each PDF
  • Resolution — LLM calls to identify duplicate entities
  • Narration — LLM calls to generate entity descriptions and overview
Costs will vary based on:
  • Document length and complexity
  • Model chosen (Haiku vs GPT-4o-mini vs others)
  • Number of resolution passes needed
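
For a rough budget on a different corpus, the reported total can be scaled by document count. This is a back-of-the-envelope sketch that assumes cost grows linearly with the number of documents, which ignores the length, model, and resolution-pass factors listed above. The reference figure comes from the 12-paper run with claude-haiku-4-5-20251001, so another model will shift it.

```python
def estimate_cost(
    n_documents: int,
    reference_cost: float = 0.72,   # approximate total reported for this example (USD)
    reference_documents: int = 12,  # number of papers in this example
) -> float:
    """Linearly extrapolate the example's total cost to n_documents."""
    return n_documents * reference_cost / reference_documents


for n in (12, 50, 200):
    print(f"{n} documents: ~${estimate_cost(n):.2f}")
```

Treat the result as an order-of-magnitude guide and confirm with a small trial run before processing a large collection.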

Use Cases

This example pattern works well for:
  • Literature reviews — Map research landscape across papers
  • Citation analysis — Track how researchers and concepts connect
  • Technology evolution — See how systems and methods build on each other
  • Knowledge synthesis — Generate summaries across multiple papers

Next Steps

Explore Other Examples

See FTX and Epstein examples

Domain Configuration

Learn how to customize entity types
