Entity resolution identifies when multiple extracted entities refer to the same real-world thing (e.g., “Dr. Smith”, “John Smith”, and “J. Smith” are all the same person). This workflow uses LLMs to propose merges, then lets you review and apply them.

Workflow Overview

1. Find Duplicates: sift resolve uses LLM analysis to identify likely duplicates.
2. Review Proposals: sift review presents proposals interactively for approval/rejection.
3. Apply Merges: sift apply-merges consolidates approved entities and cleans up rejected relations.

Step 1: Find Duplicates

sift resolve
This analyzes entities in graph_data.json and generates merge_proposals.yaml.

Command Options

  • --model (string): LLM model for resolution (defaults to SIFT_DEFAULT_MODEL)
  • --domain (path): Path to domain YAML (for system context)
  • --domain-name (string, default: schema-free): Bundled domain name
  • -c, --concurrency (integer, default: 4): Concurrent LLM calls for faster resolution
  • --rpm (integer, default: 40): Max requests per minute
  • --embeddings (boolean): Use semantic clustering instead of alphabetical batching (requires pip install sift-kg[embeddings])
  • -o, --output (path): Output directory (defaults to output/)
  • -v, --verbose (boolean): Verbose logging

How It Works

1. Group by Type: Entities are grouped by type (PERSON, ORGANIZATION, etc.).
2. Sort for Clustering:
  • PERSON entities are sorted by surname (“Bradley Edwards”, “Detective Edwards”, and “Mr. Edwards” cluster together)
  • Other types are sorted alphabetically
3. Batch Processing: Large entity lists are split into overlapping batches (100 entities per batch, 20-entity overlap).
4. LLM Analysis: Each batch is sent to the LLM, which identifies:
  • Duplicates: Same entity with name variations
  • Variants: Related entities with an EXTENDS relationship (e.g., “deep learning” extends “machine learning”)
5. Cross-Type Dedup: Finds entities with identical names but different types (no LLM needed).
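The sorting and batching steps above can be sketched in Python. This is a simplified illustration of the scheme described, not sift's actual code; `surname_key` and the batch parameters are assumptions mirroring the text:

```python
def surname_key(name: str) -> str:
    """Sort PERSON entities by their last word so surname variants cluster."""
    return name.split()[-1].lower()

def overlapping_batches(items, size=100, overlap=20):
    """Split a sorted list into overlapping batches so duplicates near a
    batch boundary still appear together in at least one batch."""
    step = size - overlap
    batches = []
    for start in range(0, len(items), step):
        batches.append(items[start:start + size])
        if start + size >= len(items):
            break
    return batches

people = sorted(["John Smith", "Bradley Edwards", "Detective Edwards",
                 "Mr. Edwards", "J. Smith"], key=surname_key)
# All three "Edwards" entries are now adjacent, so they land in the same batch.
batches = overlapping_batches(people, size=3, overlap=1)
```

Because each batch shares its tail with the next batch's head, a duplicate pair split across a boundary is still seen together by the LLM at least once.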

Example Output

$ sift resolve

Domain: schema-free (discovered)
Graph: 1,247 entities, 3,891 relations

Resolving 342 PERSON entities
  Batch 1/4: 100 entities
  Batch 2/4: 100 entities
  Batch 3/4: 100 entities
  Batch 4/4: 62 entities
Resolving 189 ORGANIZATION entities
  Batch 1/2: 100 entities
  Batch 2/2: 109 entities

Found 47 merge proposals
Found 12 variant relationships (EXTENDS)
  Cost: $1.23
  Output: output/

Next: sift review to approve/reject merges and relations
  Then: sift apply-merges

Step 2: Review Proposals

sift review
Interactive terminal UI for reviewing merge proposals and flagged relations.

Command Options

  • --auto-approve (float, default: 0.85): Auto-confirm proposals where all members have confidence ≥ this threshold. Set to 1.0 to disable.
  • --auto-reject (float, default: 0.5): Auto-reject relations with confidence below this threshold. Set to 0.0 to disable.
  • -o, --output (path): Output directory containing merge_proposals.yaml and relation_review.yaml

Interactive Review

For each proposal, you see:
╭─ Merge 1/47 ─────────────────────────────────────────────╮
│ Merge into: John Smith  (person:john_smith)              │
│ Type: PERSON                                             │
│                                                          │
│       Members to merge                                   │
│   Member               ID                  Confidence    │
│   J. Smith             person:j_smith          95%       │
│   Dr. Smith            person:dr_smith         90%       │
│   Smith                person:smith            85%       │
│                                                          │
│ Reason: Same person with title/initial variations        │
╰──────────────────────────────────────────────────────────╯

  [a]pprove  [r]eject  [s]kip  [q]uit →
Controls:
  • a — Approve merge (status → CONFIRMED)
  • r — Reject merge (status → REJECTED)
  • s — Skip for now (status stays DRAFT)
  • q — Quit and save progress

Auto-Approve/Reject

High-confidence proposals are auto-approved before interactive review:
Auto-approved 23 proposals (all members ≥ 85% confidence)

Entity Merge Review — 24 proposals to review
Similarly, very low-confidence relations are auto-rejected.
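The auto-approve rule amounts to a simple filter over the proposals. A minimal sketch, assuming each proposal is a dict whose members carry confidence scores (not sift's internal representation):

```python
def triage(proposals, auto_approve=0.85):
    """Auto-confirm proposals whose members all meet the threshold;
    everything else stays DRAFT for interactive review."""
    for p in proposals:
        if all(m["confidence"] >= auto_approve for m in p["members"]):
            p["status"] = "CONFIRMED"
    return [p for p in proposals if p["status"] == "DRAFT"]

proposals = [
    {"status": "DRAFT", "members": [{"confidence": 0.95}, {"confidence": 0.90}]},
    {"status": "DRAFT", "members": [{"confidence": 0.85}, {"confidence": 0.80}]},
]
to_review = triage(proposals)  # only the second proposal still needs review
```

Note that the threshold applies to every member: a single low-confidence member keeps the whole proposal in the interactive queue.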

Merge Proposal File

Proposals are saved to merge_proposals.yaml:
proposals:
  - canonical_id: person:john_smith
    canonical_name: John Smith
    entity_type: PERSON
    status: CONFIRMED
    members:
      - id: person:j_smith
        name: J. Smith
        confidence: 0.95
      - id: person:dr_smith
        name: Dr. Smith
        confidence: 0.9
    reason: Same person with title/initial variations

  - canonical_id: organization:acme_corporation
    canonical_name: Acme Corporation
    entity_type: ORGANIZATION
    status: DRAFT
    members:
      - id: organization:acme_corp
        name: Acme Corp
        confidence: 0.85
      - id: organization:acme
        name: ACME
        confidence: 0.8
    reason: Acronym and abbreviation of same company
Status values:
  • DRAFT — Not yet reviewed
  • CONFIRMED — User approved, will be applied
  • REJECTED — User rejected, will be ignored

Manual Editing

You can edit the YAML file directly:
# Change status to approve/reject
status: CONFIRMED

# Remove members that shouldn't merge
members:
  - id: person:j_smith
    name: J. Smith
    confidence: 0.95
  # Removed person:dr_smith - actually a different person

# Add your own reason
reason: Verified in company directory - same person
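For bulk changes, a short script can be quicker than hand-editing. A sketch using PyYAML on an inline snippet of the file format above; the filter condition is an example, and you would load and re-save your real merge_proposals.yaml instead of a string:

```python
import yaml  # pip install pyyaml

doc = """
proposals:
  - canonical_name: John Smith
    entity_type: PERSON
    status: DRAFT
  - canonical_name: Acme Corporation
    entity_type: ORGANIZATION
    status: DRAFT
"""

data = yaml.safe_load(doc)
# Bulk-approve every PERSON proposal still in DRAFT.
for p in data["proposals"]:
    if p["entity_type"] == "PERSON" and p["status"] == "DRAFT":
        p["status"] = "CONFIRMED"
```

The same loop with a different condition handles bulk rejection, e.g. rejecting every proposal below a confidence floor.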

Relation Review

Flagged relations (from sift build or sift resolve) appear in relation_review.yaml:
review_threshold: 0.7
relations:
  - source_id: concept:deep_learning
    source_name: deep learning
    target_id: concept:machine_learning
    target_name: machine learning
    relation_type: EXTENDS
    confidence: 0.75
    evidence: "Deep learning is a subset of machine learning"
    status: DRAFT
    flag_reason: "Variant relationship discovered during entity resolution"

  - source_id: person:john_smith
    source_name: John Smith
    target_id: organization:acme_corp
    target_name: Acme Corp
    relation_type: WORKS_FOR
    confidence: 0.62
    evidence: "Smith mentioned Acme in passing"
    source_document: document3
    status: DRAFT
    flag_reason: "Low confidence (0.62 < 0.7)"
During sift review, you approve or reject each relation:
╭─ Relation 1/12 ──────────────────────────────────────────╮
│ deep learning  —[EXTENDS]→  machine learning             │
╰─ Variant relationship | confidence: 75% | from: doc1 ────╯
  Evidence: Deep learning is a subset of machine learning

  [a]pprove  [r]eject  [s]kip  [q]uit →

Step 3: Apply Merges

sift apply-merges
Applies all CONFIRMED merges and removes REJECTED relations.

What Happens During Apply

1. Merge Entities:
  • All member entities are merged into the canonical entity
  • Attributes are combined (lists merged, highest confidence wins for conflicts)
  • Source documents are tracked from all members
2. Redirect Relations:
  • Relations pointing to merged entities are updated to point to the canonical entity
  • Duplicate relations are consolidated, keeping the highest confidence
3. Remove Rejected Relations:
  • Relations marked REJECTED in relation_review.yaml are deleted
4. Update Graph:
  • The modified graph is saved to graph_data.json
  • Statistics are displayed
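The merge-and-redirect steps can be illustrated with a small Python sketch. This is a simplified model; the entity and relation shapes here are assumptions, not sift's data model:

```python
def apply_merge(entities, relations, canonical_id, member_ids):
    """Fold member entities into the canonical one, repoint their
    relations, and consolidate duplicates by highest confidence."""
    redirect = {m: canonical_id for m in member_ids}
    canon = entities[canonical_id]
    for mid in member_ids:
        member = entities.pop(mid)
        canon.setdefault("sources", []).extend(member.get("sources", []))
    best = {}
    for r in relations:
        r["source_id"] = redirect.get(r["source_id"], r["source_id"])
        r["target_id"] = redirect.get(r["target_id"], r["target_id"])
        key = (r["source_id"], r["relation_type"], r["target_id"])
        # After redirection, duplicates collapse to one relation.
        if key not in best or r["confidence"] > best[key]["confidence"]:
            best[key] = r
    return entities, list(best.values())

entities = {
    "person:john_smith": {"sources": ["doc1"]},
    "person:j_smith": {"sources": ["doc2"]},
}
relations = [
    {"source_id": "person:j_smith", "relation_type": "WORKS_FOR",
     "target_id": "org:acme", "confidence": 0.6},
    {"source_id": "person:john_smith", "relation_type": "WORKS_FOR",
     "target_id": "org:acme", "confidence": 0.9},
]
entities, relations = apply_merge(
    entities, relations, "person:john_smith", ["person:j_smith"]
)
# One entity remains; the duplicate relation keeps confidence 0.9.
```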

Example Output

$ sift apply-merges

Graph: 1,247 entities, 3,891 relations
  Entity merges applied: 47
  Relations rejected: 8

Graph updated!
  Entities: 1,200 (47 merged)
  Relations: 3,883 (8 rejected)

Next: sift narrate to generate narrative summary

Complete Example Workflow

# 1. Extract entities from documents
sift extract ./documents

# 2. Build knowledge graph
sift build

# 3. Find duplicates
sift resolve

# 4. Review and approve/reject
sift review

# 5. Apply approved merges
sift apply-merges

# 6. Visualize cleaned graph
sift view

Advanced: Semantic Clustering

Use embeddings for smarter entity grouping:
# Install embedding dependencies
pip install sift-kg[embeddings]

# Use semantic clustering
sift resolve --embeddings
This groups entities by meaning rather than alphabetically:
  • “neural networks” clusters with “deep learning” (not with “networks”)
  • “CEO” clusters with “chief executive” (not with “CFO”)
Semantic clustering is more accurate but slower, and it requires a ~500MB model download.
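A toy sketch of what semantic clustering buys you, using hand-made 2-D vectors in place of a real embedding model (the vectors, the 0.8 threshold, and the greedy grouping are illustrative assumptions only):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 2-D "embeddings" standing in for a real model's vectors.
vecs = {
    "neural networks":   (0.90, 0.10),
    "deep learning":     (0.85, 0.20),
    "computer networks": (0.10, 0.95),
}

# Greedy clustering: join an entity to the first cluster whose
# representative is similar enough, else start a new cluster.
clusters = []
for name, v in vecs.items():
    for c in clusters:
        if cosine(vecs[c[0]], v) > 0.8:
            c.append(name)
            break
    else:
        clusters.append([name])
# "neural networks" and "deep learning" share a cluster despite having
# no words in common; "computer networks" gets its own despite the
# shared word "networks".
```

Alphabetical batching would have put the two “networks” entries side by side; meaning-based grouping avoids exactly that false neighbour.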

Tips for Better Resolution

1. Use Specific Models: Better models produce more accurate merge proposals:
   sift resolve --model anthropic/claude-3-5-sonnet
2. Provide Domain Context: Add system context to your domain YAML so the LLM understands your entity types:
   system_context: |
     This is a corporate knowledge base tracking executives,
     companies, and M&A transactions in the technology sector.
3. Iterative Resolution: Run sift resolve → sift review → sift apply-merges multiple times. Each pass improves graph quality.
4. Manual YAML Edits: For bulk operations, edit merge_proposals.yaml directly in your editor, changing all matching proposals to CONFIRMED or REJECTED.

Troubleshooting

No proposals found

This is normal for:
  • Small graphs (fewer than 50 entities)
  • Consistent entity naming in sources
  • After previous resolution passes

Too many false positives

Disable auto-approval so every proposal gets manual review, or use a more accurate model:
sift review --auto-approve 1.0  # Review every proposal manually
sift resolve --model openai/gpt-4o  # More accurate proposals

Embeddings import error

pip install sift-kg[embeddings]
Or fall back to alphabetical batching:
sift resolve  # Works without embeddings

“Graph not found”

Run sift build first to create graph_data.json.

Next Steps

Visualize Graph

Explore your cleaned knowledge graph

Generate Narrative

Create human-readable summaries

Export Data

Export to external tools for analysis
