Pipeline Overview
sift-kg transforms unstructured documents into structured knowledge graphs through a multi-stage pipeline. Each stage can be run independently or as part of the full workflow.

Stage 1: Extract
The extraction stage processes documents and identifies entities and relationships.

Document Ingestion
sift-kg supports 75+ file formats through two extraction backends:
- kreuzberg (default): Handles PDF, Word, HTML, Markdown, plain text, and more
- pdfplumber: Alternative PDF extraction with different text recovery strategies
Chunking Strategy
Large documents are split into overlapping chunks:
- Default chunk size: 10,000 characters
- Chunks processed concurrently for speed
- Document context generated from first chunk and passed to all subsequent chunks
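The chunking strategy above can be sketched as follows. The 10,000-character default comes from the docs; the overlap size is not stated, so the 500-character default here is an assumed placeholder.

```python
def chunk_text(text: str, size: int = 10_000, overlap: int = 500) -> list[str]:
    """Split text into fixed-size chunks that overlap by `overlap` characters.

    The overlap default is an assumption; sift-kg documents only the chunk size.
    """
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # final chunk reached the end of the document
        start += size - overlap  # step forward, keeping `overlap` chars shared
    return chunks
```

The overlap ensures entities mentioned near a chunk boundary appear whole in at least one chunk.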
Entity and Relation Extraction
For each chunk, the LLM:
- Identifies entities matching your domain schema
- Extracts relationships between entities
- Assigns confidence scores (0.0-1.0)
- Records source document provenance
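A per-chunk extraction result carries the fields listed above. The record shape below is illustrative, not sift-kg's actual data model:

```python
from dataclasses import dataclass

@dataclass
class ExtractedRelation:
    """One relation proposed by the LLM for a chunk (field names are illustrative)."""
    source: str            # source entity name
    target: str            # target entity name
    relation_type: str     # type from the domain schema
    confidence: float      # LLM-assigned score in [0.0, 1.0]
    source_document: str   # provenance: which document this came from

    def __post_init__(self):
        # Reject scores outside the documented 0.0-1.0 range
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
```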
Schema Modes
sift-kg operates in two modes: discovery and structured. In discovery mode, the schema is inferred from your documents and written to discovered_domain.yaml.
Structured domains provide predefined types with constraints, extraction hints, and review requirements.
Output
Extraction produces one JSON file per document in output/extractions/.
Stage 2: Build
The build stage constructs the knowledge graph from extraction results.

Pre-Deduplication
Before creating graph nodes, sift-kg runs automatic deduplication to catch obvious duplicates.

Layer 1: Deterministic Merging
- Unicode normalization (café → cafe)
- Singularization (researchers → researcher)
- Title stripping (Dr. Smith → Smith)
- Exact matches after normalization are merged
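The deterministic layer can be sketched as a normalization key; the title list and the trailing-"s" singularization here are simplified stand-ins for whatever sift-kg actually uses:

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Deterministic dedup key: strip accents, titles, and a trailing plural 's'."""
    # Unicode normalization: café -> cafe
    s = unicodedata.normalize("NFKD", name)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = s.lower().strip()
    # Title stripping: "dr. smith" -> "smith" (title list is illustrative)
    for title in ("dr.", "dr", "prof.", "prof", "mr.", "ms.", "mrs."):
        if s.startswith(title + " "):
            s = s[len(title) + 1:]
            break
    # Naive singularization: researchers -> researcher
    if s.endswith("s") and not s.endswith("ss"):
        s = s[:-1]
    return s

def dedup(names: list[str]) -> dict[str, list[str]]:
    """Group names whose normalized keys match exactly."""
    groups: dict[str, list[str]] = {}
    for n in names:
        groups.setdefault(normalize_name(n), []).append(n)
    return groups
```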
Layer 2: Semantic Merging
- SemHash semantic similarity at 0.95 threshold
- Catches typos, abbreviations, and transliteration variants
- Uses Model2Vec for fast, lightweight embeddings
See src/sift_kg/graph/prededup.py:75 for implementation details.
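The semantic layer uses SemHash with Model2Vec embeddings; the sketch below substitutes a toy character-trigram embedding so it stays self-contained, keeping only the cosine-threshold idea:

```python
from collections import Counter
import math

def trigram_vector(text: str) -> Counter:
    """Toy embedding: character-trigram counts (a stand-in for Model2Vec)."""
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_pairs(names: list[str], threshold: float = 0.95) -> list[tuple[str, str]]:
    """Candidate merges: name pairs whose similarity clears the threshold."""
    vecs = {n: trigram_vector(n) for n in names}
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if cosine(vecs[a], vecs[b]) >= threshold]
```

With real embeddings, near-matches like typos and transliteration variants clear the same 0.95 bar.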
Graph Construction
Entities become nodes with stable IDs. Relations become edges that carry:
- Source document provenance
- Confidence scores
- Relation type and attributes
Postprocessing
Optional cleanup steps:
- Normalize relation types: Map undefined types to domain schema
- Fix directions: Ensure source/target match domain constraints
- Activate passive relations: Convert passive voice to active
- Remove redundant edges: Prune transitive redundancies
- Prune isolated entities: Remove disconnected nodes
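The transitive-redundancy step can be sketched as follows; the real redundancy criterion may differ, but the idea is to drop an edge already implied by a two-hop path of the same relation type:

```python
def prune_transitive(edges: set[tuple[str, str, str]]) -> set[tuple[str, str, str]]:
    """Drop (a, c, rel) when (a, b, rel) and (b, c, rel) both exist.

    Edges are (source, target, relation_type) triples.
    """
    # Index outgoing targets per (source, relation_type)
    out: dict[tuple[str, str], set[str]] = {}
    for s, t, r in edges:
        out.setdefault((s, r), set()).add(t)
    kept = set()
    for s, t, r in edges:
        # Is there an intermediate node making this edge redundant?
        implied = any(
            t in out.get((mid, r), set())
            for mid in out.get((s, r), set())
            if mid != t
        )
        if not implied:
            kept.add((s, t, r))
    return kept
```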
Relation Flagging
Relations are flagged for human review if:
- Confidence below review threshold (default: 0.7)
- Relation type has review_required: true in the domain config
Flagged relations are written to relation_review.yaml.
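The flagging rule reduces to a small predicate. The dict shapes below are illustrative, not sift-kg's actual data model:

```python
def needs_review(relation: dict, domain: dict, threshold: float = 0.7) -> bool:
    """Flag a relation for human review (field names are illustrative)."""
    # Rule 1: confidence below the review threshold
    if relation["confidence"] < threshold:
        return True
    # Rule 2: the relation type is marked review_required in the domain config
    type_cfg = domain.get("relation_types", {}).get(relation["type"], {})
    return type_cfg.get("review_required", False)
```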
Output
The build stage produces:
- graph_data.json: Complete knowledge graph
- relation_review.yaml: Flagged relations needing review (if any)
Stage 3: Resolve
Entity resolution finds duplicates that pre-dedup missed: different names for the same entity.

Batching Strategy
Entities are grouped by type and sorted intelligently:
- PERSON entities: Sorted by surname so name variants cluster together
- Other entities: Sorted alphabetically
- Max batch size: 100 entities
- Overlap: 20 entities between consecutive batches
- Prevents duplicates from being missed at batch boundaries
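The batching scheme uses the parameters stated above (batch size 100, overlap 20, surname sort for people). A sketch, with a simplified surname heuristic (last whitespace-separated token):

```python
def surname_key(name: str) -> str:
    """Sort PERSON names by their last token so surname variants cluster."""
    return name.split()[-1].lower()

def make_batches(entities: list[str], entity_type: str,
                 batch_size: int = 100, overlap: int = 20) -> list[list[str]]:
    """Sorted, overlapping batches; overlap guards against boundary misses."""
    key = surname_key if entity_type == "PERSON" else str.lower
    ordered = sorted(entities, key=key)
    if len(ordered) <= batch_size:
        return [ordered]
    batches, start = [], 0
    while start < len(ordered):
        batches.append(ordered[start:start + batch_size])
        if start + batch_size >= len(ordered):
            break
        start += batch_size - overlap  # consecutive batches share 20 entities
    return batches
```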
LLM-Based Resolution
For each batch, the LLM identifies:
- Duplicates: Same entity with different names → merge proposals
- Variants: Parent-child relationships (e.g., “Transformer” vs “GPT-2”) → EXTENDS relations
See src/sift_kg/resolve/resolver.py:338 for the resolution prompt.
Cross-Type Deduplication
After per-type resolution, sift-kg finds entities with identical names but different types.

Output
Resolve generates:
- merge_proposals.yaml: DRAFT merge proposals
- relation_review.yaml: Updated with EXTENDS variant relations
Stage 4: Review
Human review validates LLM proposals before applying changes.

Interactive Review
The sift review command presents proposals one by one.
See src/sift_kg/resolve/reviewer.py:39 for the review implementation.
Auto-Approval
High-confidence proposals are auto-approved.

Status Tracking
Proposals move through three states:
- DRAFT: Awaiting review
- CONFIRMED: Approved by human
- REJECTED: Rejected by human
CONFIRMED proposals are applied in the next stage.
Stage 5: Apply
The apply stage executes confirmed changes to the graph.

Entity Merging
For each confirmed merge proposal:
- Merge node data: Combine attributes, keeping canonical values
- Rewrite edges: Point all relations to canonical entity
- Remove merged nodes: Delete duplicate entities
- Drop self-loops: Remove relations created by merging
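The four merge steps above can be sketched in order; the node/edge shapes are illustrative, not sift-kg's actual data model:

```python
def apply_merge(nodes: dict, edges: list, canonical: str, duplicates: set[str]):
    """Apply one confirmed merge proposal (data shapes are illustrative)."""
    # 1. Merge node data, keeping the canonical entity's values on conflict
    for dup in duplicates:
        for key, value in nodes[dup].items():
            nodes[canonical].setdefault(key, value)
    # 2. Rewrite edges so all relations point at the canonical entity
    rewritten = [
        (canonical if s in duplicates else s,
         canonical if t in duplicates else t,
         rel)
        for s, t, rel in edges
    ]
    # 3. Remove the merged duplicate nodes
    for dup in duplicates:
        del nodes[dup]
    # 4. Drop self-loops created by the rewrite
    return nodes, [(s, t, r) for s, t, r in rewritten if s != t]
```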
See src/sift_kg/resolve/engine.py:11 for the merge implementation.
Relation Rejections
Rejected relations are removed from the graph, including symmetric counterparts.

Output
The graph is updated in place:
- graph_data.json: Updated with merges and rejections applied

The stage reports a summary of changes:
- Entities merged
- Nodes removed
- Relations rejected
- Self-loops dropped
Stage 6: Narrate
The narrate stage generates human-readable summaries using the knowledge graph structure.

Community Detection
Entities are grouped into communities using the Louvain algorithm for modularity maximization. The LLM then generates descriptive labels for each community based on member entities.

Entity Descriptions
For entities with degree ≥ 3, the LLM generates concise descriptions based on:
- Entity type and attributes
- Connected entities and relationships
- Source documents
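The degree-3 cutoff is a simple filter over the graph's edge list; a minimal sketch:

```python
from collections import Counter

def describable_entities(edges: list[tuple[str, str]], min_degree: int = 3) -> set[str]:
    """Return entities connected enough (degree >= min_degree) to merit an LLM description."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {entity for entity, d in degree.items() if d >= min_degree}
```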
Descriptions are saved to entity_descriptions.json and used in visualizations.
Narrative Generation
The LLM synthesizes:
- Overview of the knowledge graph
- Key themes and communities
- Important entities and their relationships
- Timeline of events (if dates are present)
The narrative is written to narrative.md as a structured markdown document.
Cost Control
Stage 7: View & Export
Interactive Visualization
Generate an interactive HTML graph:
- Physics-based layout
- Click-to-expand neighborhoods
- Filtering by entity type, confidence, source document
- Entity descriptions on hover (if generated)
Export Formats
sift-kg exports to standard graph formats. Exports preserve:
- Entity types and attributes
- Relation types and confidence scores
- Source document provenance
- Entity descriptions (if generated)
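The docs don't name the specific formats; GraphML is one common target. A standard-library sketch that keeps only entity and relation types (a real export would presumably also carry confidence scores and provenance):

```python
import xml.etree.ElementTree as ET

def to_graphml(nodes: dict[str, dict], edges: list[tuple[str, str, str]]) -> str:
    """Serialize the graph to minimal GraphML (attribute handling simplified)."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    # Declare one attribute key each for nodes (entity type) and edges (relation type)
    ET.SubElement(root, "key",
                  {"id": "d0", "for": "node", "attr.name": "entity_type", "attr.type": "string"})
    ET.SubElement(root, "key",
                  {"id": "d1", "for": "edge", "attr.name": "relation_type", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for node_id, attrs in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="d0").text = attrs.get("type", "")
    for source, target, rel in edges:
        edge = ET.SubElement(graph, "edge", source=source, target=target)
        ET.SubElement(edge, "data", key="d1").text = rel
    return ET.tostring(root, encoding="unicode")
```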
Incremental Processing
sift-kg tracks extraction metadata to avoid re-processing. A document is re-extracted only when:
- Model changed
- Domain changed
- Chunk size changed
Library Usage
All pipeline stages are available as library functions. See src/sift_kg/pipeline.py for complete API documentation.
Next Steps
Domains
Learn about bundled domains and creating custom schemas
Entity Resolution
Deep dive into the 4-layer deduplication approach