Entity resolution identifies when multiple extracted entities refer to the same real-world thing (e.g., “Dr. Smith”, “John Smith”, and “J. Smith” are all the same person). This workflow uses LLMs to propose merges, then lets you review and apply them.
Workflow Overview
Find Duplicates
sift resolve uses LLM analysis to identify likely duplicates
Review Proposals
sift review presents proposals interactively for approval/rejection
Apply Merges
sift apply-merges consolidates approved entities and cleans up rejected relations
Step 1: Find Duplicates
This analyzes entities in graph_data.json and generates merge_proposals.yaml.
Command Options
--model: LLM model for resolution (defaults to SIFT_DEFAULT_MODEL)
Path to domain YAML (for system context)
--domain-name (string, default: "schema-free"): Bundled domain name
Concurrent LLM calls for faster resolution
--embeddings: Use semantic clustering instead of alphabetical batching (requires pip install sift-kg[embeddings])
Output directory (defaults to output/)
How It Works
Group by Type
Entities are grouped by type (PERSON, ORGANIZATION, etc.)
Sort for Clustering
PERSON entities sorted by surname (“Bradley Edwards”, “Detective Edwards”, “Mr. Edwards” cluster together)
Other types sorted alphabetically
Batch Processing
Large entity lists are split into overlapping batches (100 entities per batch, with a 20-entity overlap between consecutive batches)
LLM Analysis
Each batch is sent to the LLM, which identifies:
Duplicates: Same entity with name variations
Variants: Related entities with an EXTENDS relationship (e.g., “deep learning” extends “machine learning”)
Cross-Type Dedup
Finds entities with identical names but different types (no LLM needed)
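The sorting and batching steps above can be sketched in a few lines of Python (an illustrative sketch, not sift's actual implementation; the batch size and overlap values mirror the defaults described above):

```python
def sort_key(entity):
    # PERSON entities sort by surname so name variants land in the same batch;
    # all other types sort alphabetically by full name.
    if entity["type"] == "PERSON":
        return entity["name"].split()[-1].lower()
    return entity["name"].lower()

def overlapping_batches(entities, batch_size=100, overlap=20):
    """Split a sorted entity list into batches that share `overlap`
    entities with the previous batch."""
    step = batch_size - overlap
    batches = []
    for start in range(0, len(entities), step):
        batches.append(entities[start:start + batch_size])
        if start + batch_size >= len(entities):
            break
    return batches

people = [
    {"name": "Bradley Edwards", "type": "PERSON"},
    {"name": "Mr. Edwards", "type": "PERSON"},
    {"name": "Alice Zhang", "type": "PERSON"},
]
people.sort(key=sort_key)
# The two "Edwards" entries are now adjacent, so they end up in the same batch.
```

The overlap means a duplicate pair that straddles a batch boundary still gets compared together in at least one batch.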
Example Output
$ sift resolve
Domain: schema-free (discovered)
Graph: 1,247 entities, 3,891 relations
Resolving 342 PERSON entities
Batch 1/4: 100 entities
Batch 2/4: 100 entities
Batch 3/4: 100 entities
Batch 4/4: 62 entities
Resolving 189 ORGANIZATION entities
Batch 1/2: 100 entities
Batch 2/2: 109 entities
Found 47 merge proposals
Found 12 variant relationships (EXTENDS)
Cost: $1.23
Output: output/
Next: sift review to approve/reject merges and relations
Then: sift apply-merges
Step 2: Review Proposals
Interactive terminal UI for reviewing merge proposals and flagged relations.
Command Options
--auto-approve: Auto-confirm proposals where all members have confidence ≥ this threshold. Set to 1.0 to disable.
Auto-reject relations with confidence below this threshold. Set to 0.0 to disable.
Output directory containing merge_proposals.yaml and relation_review.yaml
Interactive Review
For each proposal, you see:
╭─ Merge 1/47 ─────────────────────────────────────────────╮
│ Merge into: John Smith (person:john_smith) │
│ Type: PERSON │
│ │
│ Members to merge │
│ Member ID Confidence │
│ J. Smith person:j_smith 95% │
│ Dr. Smith person:dr_smith 90% │
│ Smith person:smith 85% │
│ │
│ Reason: Same person with title/initial variations │
╰──────────────────────────────────────────────────────────╯
[a]pprove [r]eject [s]kip [q]uit →
Controls:
a — Approve merge (status → CONFIRMED)
r — Reject merge (status → REJECTED)
s — Skip for now (status stays DRAFT)
q — Quit and save progress
Auto-Approve/Reject
High-confidence proposals are auto-approved before interactive review:
Auto-approved 23 proposals (all members ≥ 85% confidence)
Entity Merge Review — 24 proposals to review
Similarly, very low-confidence relations are auto-rejected.
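Conceptually, the auto-triage pass is a simple threshold filter over both files (a sketch of the behavior described above, not sift's code; the 0.85 approve threshold matches the example output, while the 0.3 reject default here is hypothetical):

```python
def triage(proposals, relations, auto_approve=0.85, auto_reject=0.3):
    """Auto-confirm merge proposals whose every member clears the approve
    threshold; auto-reject relations below the reject threshold.
    Everything else stays DRAFT for interactive review."""
    for p in proposals:
        if all(m["confidence"] >= auto_approve for m in p["members"]):
            p["status"] = "CONFIRMED"
    for r in relations:
        if r["confidence"] < auto_reject:
            r["status"] = "REJECTED"

proposals = [
    {"status": "DRAFT", "members": [{"confidence": 0.95}, {"confidence": 0.9}]},
    {"status": "DRAFT", "members": [{"confidence": 0.85}, {"confidence": 0.8}]},
]
relations = [{"status": "DRAFT", "confidence": 0.2}]
triage(proposals, relations)
# The first proposal is CONFIRMED; the 0.8-confidence member keeps the second DRAFT.
```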
Merge Proposal File
Proposals are saved to merge_proposals.yaml:
proposals:
  - canonical_id: person:john_smith
    canonical_name: John Smith
    entity_type: PERSON
    status: CONFIRMED
    members:
      - id: person:j_smith
        name: J. Smith
        confidence: 0.95
      - id: person:dr_smith
        name: Dr. Smith
        confidence: 0.9
    reason: Same person with title/initial variations
  - canonical_id: organization:acme_corporation
    canonical_name: Acme Corporation
    entity_type: ORGANIZATION
    status: DRAFT
    members:
      - id: organization:acme_corp
        name: Acme Corp
        confidence: 0.85
      - id: organization:acme
        name: ACME
        confidence: 0.8
    reason: Acronym and abbreviation of same company
Status values:
DRAFT — Not yet reviewed
CONFIRMED — User approved, will be applied
REJECTED — User rejected, will be ignored
Manual Editing
You can edit the YAML file directly:
# Change status to approve/reject
status: CONFIRMED

# Remove members that shouldn't merge
members:
  - id: person:j_smith
    name: J. Smith
    confidence: 0.95
# Removed person:dr_smith - actually a different person

# Add your own reason
reason: Verified in company directory - same person
Relation Review
Flagged relations (from sift build or sift resolve) appear in relation_review.yaml:
review_threshold: 0.7
relations:
  - source_id: concept:deep_learning
    source_name: deep learning
    target_id: concept:machine_learning
    target_name: machine learning
    relation_type: EXTENDS
    confidence: 0.75
    evidence: "Deep learning is a subset of machine learning"
    status: DRAFT
    flag_reason: "Variant relationship discovered during entity resolution"
  - source_id: person:john_smith
    source_name: John Smith
    target_id: organization:acme_corp
    target_name: Acme Corp
    relation_type: WORKS_FOR
    confidence: 0.62
    evidence: "Smith mentioned Acme in passing"
    source_document: document3
    status: DRAFT
    flag_reason: "Low confidence (0.62 < 0.7)"
During sift review, you approve or reject each relation:
╭─ Relation 1/12 ──────────────────────────────────────────╮
│ deep learning —[EXTENDS]→ machine learning │
╰─ Variant relationship | confidence: 75% | from: doc1 ────╯
Evidence: Deep learning is a subset of machine learning
[a]pprove [r]eject [s]kip [q]uit →
Step 3: Apply Merges
Applies all CONFIRMED merges and removes REJECTED relations.
What Happens During Apply
Merge Entities
All member entities merged into canonical entity
Attributes combined (lists merged, highest confidence for conflicts)
Source documents tracked from all members
Redirect Relations
Relations pointing to merged entities updated to point to canonical
Duplicate relations consolidated with highest confidence
Remove Rejected Relations
Relations marked REJECTED in relation_review.yaml are deleted
Update Graph
Modified graph saved to graph_data.json
Statistics displayed
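The merge and redirect steps can be sketched as follows (a simplified illustration, not sift's implementation; attribute merging and source-document tracking are omitted for brevity):

```python
def apply_merges(entities, relations, proposals):
    """Fold CONFIRMED members into their canonical entity, redirect
    relations to the canonical ID, and keep only the highest-confidence
    copy of each (source, target, type) triple."""
    redirect = {}
    for p in proposals:
        if p["status"] == "CONFIRMED":
            for m in p["members"]:
                redirect[m["id"]] = p["canonical_id"]
    entities = [e for e in entities if e["id"] not in redirect]
    best = {}
    for r in relations:
        r = {**r,
             "source_id": redirect.get(r["source_id"], r["source_id"]),
             "target_id": redirect.get(r["target_id"], r["target_id"])}
        key = (r["source_id"], r["target_id"], r["relation_type"])
        if key not in best or r["confidence"] > best[key]["confidence"]:
            best[key] = r
    return entities, list(best.values())

entities = [
    {"id": "person:john_smith"},
    {"id": "person:j_smith"},
    {"id": "organization:acme"},
]
relations = [
    {"source_id": "person:john_smith", "target_id": "organization:acme",
     "relation_type": "WORKS_FOR", "confidence": 0.9},
    {"source_id": "person:j_smith", "target_id": "organization:acme",
     "relation_type": "WORKS_FOR", "confidence": 0.7},
]
proposals = [{"canonical_id": "person:john_smith", "status": "CONFIRMED",
              "members": [{"id": "person:j_smith"}]}]
entities, relations = apply_merges(entities, relations, proposals)
# One person entity remains, and the duplicate relation collapses to confidence 0.9.
```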
Example Output
$ sift apply-merges
Graph: 1,247 entities, 3,891 relations
Entity merges applied: 47
Relations rejected: 8
Graph updated!
Entities: 1,200 (47 merged)
Relations: 3,883 (8 rejected)
Next: sift narrate to generate narrative summary
Complete Example Workflow
Full Workflow
# 1. Extract entities from documents
sift extract ./documents
# 2. Build knowledge graph
sift build
# 3. Find duplicates
sift resolve
# 4. Review and approve/reject
sift review
# 5. Apply approved merges
sift apply-merges
# 6. Visualize cleaned graph
sift view
Advanced: Semantic Clustering
Use embeddings for smarter entity grouping:
# Install embedding dependencies
pip install sift-kg[embeddings]
# Use semantic clustering
sift resolve --embeddings
This groups entities by meaning rather than alphabetically:
“neural networks” clusters with “deep learning” (not with “networks”)
“CEO” clusters with “chief executive” (not with “CFO”)
Semantic clustering is more accurate, but slower, and it requires a ~500 MB model download.
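Conceptually, semantic clustering groups entities whose embedding vectors are close, regardless of spelling. A toy sketch with hand-made two-dimensional vectors (the real feature uses a downloaded embedding model; the greedy algorithm and 0.8 threshold here are only illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cluster_by_similarity(embeddings, threshold=0.8):
    """Greedy clustering: join the first cluster whose seed vector is
    similar enough, otherwise start a new cluster."""
    clusters = []  # list of (seed_vector, member_names)
    for name, vec in embeddings.items():
        for seed, names in clusters:
            if cosine(seed, vec) >= threshold:
                names.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [names for _, names in clusters]

# Toy embeddings: "neural networks" sits near "deep learning", not "networks".
toy = {
    "neural networks": [0.9, 0.1],
    "deep learning":   [0.85, 0.15],
    "networks":        [0.1, 0.9],
}
clusters = cluster_by_similarity(toy)
```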
Tips for Better Resolution
Use Specific Models
Better models produce more accurate merge proposals: sift resolve --model anthropic/claude-3-5-sonnet
Provide Domain Context
Add system context to your domain YAML so the LLM understands your entity types:

system_context: |
  This is a corporate knowledge base tracking executives,
  companies, and M&A transactions in the technology sector.
Iterative Resolution
Run sift resolve → sift review → sift apply-merges multiple times.
Each iteration improves graph quality.
Manual YAML Edits
For bulk operations, edit merge_proposals.yaml directly in your editor.
Change all matching patterns to CONFIRMED or REJECTED.
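For instance, a few lines of Python can flip the status of every proposal whose canonical name matches a pattern (a sketch operating on the parsed proposal list; `bulk_set_status` is a hypothetical helper, not part of sift, and loading/saving merge_proposals.yaml with a YAML library is left out):

```python
import re

def bulk_set_status(proposals, name_pattern, status):
    """Set `status` on every proposal whose canonical_name matches the
    regex `name_pattern` (case-insensitive); return the count changed."""
    pat = re.compile(name_pattern, re.IGNORECASE)
    changed = 0
    for p in proposals:
        if pat.search(p["canonical_name"]):
            p["status"] = status
            changed += 1
    return changed

proposals = [
    {"canonical_name": "Acme Corporation", "status": "DRAFT"},
    {"canonical_name": "John Smith", "status": "DRAFT"},
]
changed = bulk_set_status(proposals, r"acme", "CONFIRMED")
```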
Troubleshooting
No proposals found
This is normal for:
Small graphs (fewer than 50 entities)
Consistent entity naming in sources
After previous resolution passes
Too many false positives
Raise the auto-approve threshold so nothing is auto-confirmed, or use a more accurate model:
sift review --auto-approve 1.0 # Review everything manually
sift resolve --model openai/gpt-4o # More accurate proposals
Embeddings import error
pip install sift-kg[embeddings]
Or fall back to alphabetical batching:
sift resolve # Works without embeddings
“Graph not found”
Run sift build first to create graph_data.json.
Next Steps
Visualize Graph Explore your cleaned knowledge graph
Generate Narrative Create human-readable summaries
Export Data Export to external tools for analysis