The Deduplication Challenge
When extracting entities from multiple documents, the same real-world entity often appears under different names (for example, “Joseph Recarey” vs “Joe Recarey”).
Layer 1: Automatic Pre-Deduplication
Before entities become graph nodes, sift-kg runs automatic deduplication during `sift build` to catch obvious duplicates.
Phase 1: Deterministic Merging
Entities are normalized and grouped by a set of deterministic rules, including Unicode normalization. See `src/sift_kg/graph/prededup.py:30` for the complete list.
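The exact rule set lives in the file above; as a rough sketch (the helper names here are illustrative, not the tool's API), deterministic grouping might look like:

```python
import unicodedata
from collections import defaultdict

def normalize(name: str) -> str:
    # Illustrative rules: Unicode NFKC, case folding, whitespace collapsing
    return " ".join(unicodedata.normalize("NFKC", name).casefold().split())

def group_by_normal_form(names: list[str]) -> dict[str, list[str]]:
    # Variants that normalize to the same string become merge candidates
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)
```

Under these rules, “Café Müller”, “CAFÉ  MÜLLER”, and “café müller” all collapse into one group.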
Canonical Selection
When multiple variants normalize to the same form, the canonical is chosen by:
- Frequency: Most common variant wins
- Length: Longest variant (likely more complete)
- Alphabetical: First alphabetically (tiebreaker)
See `src/sift_kg/graph/prededup.py:177`.
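That selection order can be expressed as a single sort key. A minimal illustration (not the actual implementation):

```python
from collections import Counter

def choose_canonical(variants: list[str]) -> str:
    # Rank by frequency (desc), then length (desc), then alphabetical (asc)
    counts = Counter(variants)
    return min(set(variants), key=lambda v: (-counts[v], -len(v), v))
```

Because `min` compares the key tuples left to right, frequency dominates, length breaks frequency ties, and alphabetical order is the final tiebreaker.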
Phase 2: Fuzzy Semantic Matching
After deterministic grouping, remaining unique forms are compared using SemHash semantic similarity.
How SemHash Works
- Convert entity names to embeddings using Model2Vec (lightweight, no GPU needed)
- Compute pairwise cosine similarities
- Merge entities with similarity ≥ 0.95 threshold
See `src/sift_kg/graph/prededup.py:153`.
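Conceptually, the fuzzy step reduces to thresholded pairwise cosine similarity. The sketch below uses hand-made toy vectors in place of real Model2Vec embeddings:

```python
import math

THRESHOLD = 0.95  # merge when similarity >= 0.95

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def fuzzy_merge_pairs(embeddings: dict[str, list[float]]) -> list[tuple[str, str]]:
    # Compare every pair of remaining unique forms; report those above threshold
    names = list(embeddings)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine(embeddings[a], embeddings[b]) >= THRESHOLD
    ]
```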
Results
Pre-dedup typically reduces entity counts by 10–30%.
Layer 2: LLM-Based Entity Resolution
After pre-dedup, `sift resolve` finds duplicates that require semantic understanding.
Why LLM Resolution?
Pre-dedup catches mechanical duplicates. LLM resolution catches:
- Partial names: “Joseph Recarey” vs “Joe Recarey”
- Nicknames: “William” vs “Bill”
- Context-dependent equivalence: “the Transformer architecture” vs “Transformers”
- Professional vs personal names: “Dr. Alice Smith” vs “Alice”
Batching Strategy
Entities are processed in type-specific batches to improve accuracy.
Sorting
Entities are sorted before batching:
- PERSON entities: Sorted by surname so name variants cluster
- Other types: Alphabetically by name
Batching parameters (see `src/sift_kg/resolve/resolver.py:40`):
- Batch size: 100 entities max
- Overlap: 20 entities between consecutive batches
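Putting the sorting and windowing together (a sketch under the stated parameters; the surname heuristic here simply uses the last whitespace-separated token):

```python
def sort_key(name: str, entity_type: str) -> str:
    if entity_type == "PERSON":
        parts = name.split()
        # Surname-first key so "Joe Recarey" and "Joseph Recarey" sort adjacently
        return " ".join([parts[-1]] + parts[:-1]).casefold()
    return name.casefold()

def make_batches(names: list[str], entity_type: str,
                 batch_size: int = 100, overlap: int = 20) -> list[list[str]]:
    ordered = sorted(names, key=lambda n: sort_key(n, entity_type))
    step = batch_size - overlap
    batches = []
    for start in range(0, len(ordered), step):
        batches.append(ordered[start:start + batch_size])
        if start + batch_size >= len(ordered):
            break  # the last window already covers the tail
    return batches
```

The 20-entity overlap means a duplicate pair split across a batch boundary still lands together in at least one batch.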
LLM Resolution Prompt
For each batch, the LLM receives entity data and returns two kinds of findings:
- Duplicates: Same entity with different names → merge proposals
- Variants: Parent-child relationships → EXTENDS relations
See `src/sift_kg/resolve/resolver.py:393`.
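The two kinds of findings can be modeled roughly like this (the class and field names are illustrative assumptions, not the tool's schema):

```python
from dataclasses import dataclass

@dataclass
class MergeProposal:
    canonical: str
    members: list[str]
    reason: str
    status: str = "DRAFT"  # awaits human review

@dataclass
class ExtendsRelation:
    parent: str
    child: str

def interpret(findings: list[dict]) -> tuple[list[MergeProposal], list[ExtendsRelation]]:
    # Duplicates become merge proposals; variants become EXTENDS relations
    proposals, relations = [], []
    for f in findings:
        if f["kind"] == "duplicate":
            proposals.append(MergeProposal(f["canonical"], f["members"], f["reason"]))
        else:  # "variant"
            relations.append(ExtendsRelation(f["parent"], f["child"]))
    return proposals, relations
```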
Domain Context
When a domain provides `system_context`, it’s prepended to the prompt.
Cross-Type Deduplication
After per-type resolution, sift-kg finds entities with identical names but different types:
- Canonical type: The one with more connections (higher degree)
- Reason: “Same name across types (CONCEPT vs PHENOMENON). Relations will be combined.”
See `src/sift_kg/resolve/resolver.py:190`.
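A minimal sketch of that pass, assuming each node's degree is already known (function and field names are illustrative):

```python
from collections import defaultdict

def cross_type_proposals(degrees: dict[tuple[str, str], int]) -> list[dict]:
    # degrees maps (name, entity_type) -> connection count
    by_name: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for (name, etype), degree in degrees.items():
        by_name[name].append((etype, degree))
    proposals = []
    for name, typed in by_name.items():
        if len(typed) < 2:
            continue  # name is unique to one type
        typed.sort(key=lambda t: -t[1])  # highest degree wins
        canonical, others = typed[0][0], [t for t, _ in typed[1:]]
        proposals.append({
            "name": name,
            "canonical_type": canonical,
            "reason": f"Same name across types ({canonical} vs {', '.join(others)}). "
                      "Relations will be combined.",
        })
    return proposals
```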
Output
`sift resolve` produces:
- `merge_proposals.yaml`
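An entry in that file might look roughly like this (the field names below are illustrative assumptions, not the exact schema):

```yaml
- canonical: Joseph Recarey
  members:
    - Joe Recarey
  entity_type: PERSON
  reason: '"Joe" is a common short form of "Joseph"; same person in context.'
  status: DRAFT
```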
Layer 3: Human Review
LLMs make mistakes. Human review validates proposals before applying changes.
Interactive Review Process
The `sift review` command presents each proposal:
- Approve: Status changes to CONFIRMED, will be applied
- Reject: Status changes to REJECTED, ignored
- Skip: Stays DRAFT, can review later
- Quit: Saves progress, remaining proposals stay DRAFT
See `src/sift_kg/resolve/reviewer.py:39`.
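The decision loop reduces to a small state machine over proposal statuses. A sketch, not the actual CLI code:

```python
def apply_decisions(proposals: list[dict], decisions: list[str]) -> list[dict]:
    # Each proposal starts as DRAFT; decisions arrive in review order
    for proposal, decision in zip(proposals, decisions):
        if decision == "approve":
            proposal["status"] = "CONFIRMED"
        elif decision == "reject":
            proposal["status"] = "REJECTED"
        elif decision == "quit":
            break  # progress saved; the rest stay DRAFT
        # "skip" leaves the proposal as DRAFT
    return proposals
```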
Auto-Approval
High-confidence proposals can be auto-confirmed.
Relation Review
Flagged relations are also reviewed.
Review Files After Changes
After review, files are updated in place: `merge_proposals.yaml` carries the new statuses, ready for `sift apply-merges`.
Layer 4: Apply Merges
The final stage executes confirmed changes to the graph.
Merge Operation
For each CONFIRMED proposal, sift-kg:
1. Merge Node Data
The canonical entity accumulates data from the merged members.
See `src/sift_kg/resolve/engine.py:95`.
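Conceptually (field names assumed for illustration), the accumulation looks like:

```python
def merge_node_data(canonical: dict, members: list[dict]) -> dict:
    # Copy the canonical so the merge is non-destructive
    merged = {
        **canonical,
        "aliases": list(canonical.get("aliases", [])),
        "sources": list(canonical.get("sources", [])),
    }
    for member in members:
        if member["name"] not in merged["aliases"]:
            merged["aliases"].append(member["name"])  # keep the variant as an alias
        for src in member.get("sources", []):
            if src not in merged["sources"]:
                merged["sources"].append(src)  # union of provenance
    return merged
```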
2. Rewrite Edges
All relations pointing to/from merged members are redirected to the canonical entity. Relations marked REJECTED in `relation_review.yaml` are deleted.
See `src/sift_kg/resolve/engine.py:140`.
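Edge rewriting can be sketched as a pure function over (source, relation, target) triples (illustrative, not the engine's actual code):

```python
def rewrite_edges(edges, merged_into, rejected=frozenset()):
    # merged_into maps a merged member's name to its canonical name;
    # rejected holds triples marked REJECTED during relation review
    redirected, seen = [], set()
    for src, rel, dst in edges:
        if (src, rel, dst) in rejected:
            continue  # deleted outright
        edge = (merged_into.get(src, src), rel, merged_into.get(dst, dst))
        if edge[0] != edge[1] and edge not in seen:  # drop self-loops and duplicates
            seen.add(edge)
            redirected.append(edge)
    return redirected
```

Note that redirection can create duplicate edges (both variants had the same relation) or self-loops (a member related to its own canonical); this sketch drops both.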
Output Statistics
After merging, the updated graph is written to `graph_data.json`.
Real-World Example
Let’s trace a person entity through all four layers.
Initial Extractions
Three documents extract variations of the same person.
Layer 1: Pre-Dedup
Normalization merges the mechanically identical variants.
Layer 2: LLM Resolution
The remaining entities are sent to the LLM, which proposes a merge.
Layer 3: Human Review
The reviewer approves the proposal; its status becomes CONFIRMED.
Layer 4: Apply Merge
Before the merge, the variants exist as separate nodes; afterward, a single canonical entity carries all of their data and relations.
Advanced Features
Semantic Clustering
For very large entity sets (1000+), use embedding-based clustering instead of alphabetical batching:
- Generate embeddings for all entity names
- Cluster by cosine similarity
- Each cluster becomes a batch sent to the LLM
Benefits:
- Semantically similar entities are batched together
- Reduces false negatives (missed duplicates)
- Better for multilingual entities
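A greedy seed-based sketch of that batching idea, with toy 2-D vectors standing in for real embeddings (production clustering would use actual embeddings and a proper clustering algorithm):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cluster_batches(embeddings: dict[str, list[float]], min_sim: float = 0.5) -> list[list[str]]:
    # Each name joins the first cluster whose seed it resembles, else starts a new one
    clusters: list[list[str]] = []
    for name, vec in embeddings.items():
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], vec) >= min_sim:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```

Each resulting cluster would then be sent to the LLM as one batch.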
Domain-Specific Resolution Context
Provide domain context to help the LLM make better decisions.
Reviewing Previous Decisions
If you want to re-review previously confirmed/rejected proposals:
- Edit `merge_proposals.yaml` and change the status back to DRAFT
- Run `sift review` again
Manual Merge Proposals
You can manually add merge proposals to `merge_proposals.yaml`, then run `sift apply-merges` to execute them.
Performance Considerations
Cost Estimation
Entity resolution costs scale with entity count: each batch of up to 100 entities is one LLM call.
Speed Optimization
Use concurrency to parallelize LLM calls.
Skipping Resolution
If you’re confident in your extractions or have very clean data, you can skip this stage and run `sift resolve` later if you discover duplicates.
Best Practices
1. Review High-Impact Merges First
Sort proposals by degree (connection count) to focus on central entities.
2. Use Auto-Approval Conservatively
Start with a high threshold and inspect the auto-confirmed entries in `merge_proposals.yaml`. If quality is good, lower the threshold.
3. Iterate on Domain Context
If the LLM makes systematic mistakes, add guidance to `system_context`.
4. Combine with Pre-Dedup Tuning
If pre-dedup misses obvious duplicates, you can adjust the SemHash threshold in code.
Next Steps
- How It Works: Understand the full pipeline from extraction to visualization
- Domains: Learn about bundled domains and creating custom schemas