The sift build command consolidates extraction results from multiple documents into a single knowledge graph, handling entity merging, relation normalization, and quality checks.
Quick Start
Running sift build reads the extraction results in output/extractions/ and saves the resulting graph to output/graph_data.json.
Command Options
Path to custom domain YAML (must match extraction domain)
Bundled domain name (must match extraction domain)
Output directory. Defaults to output/.
Relations below this confidence are flagged for review
Skip redundancy removal and relation normalization
Enable verbose logging
What Graph Building Does
Deduplicate Entities
Merges near-identical entity names within documents (e.g., “Dr. Smith” → “Smith”)
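As a rough illustration of this kind of merge, the sketch below strips a leading honorific and groups names case-insensitively. The title list and the helper names canonical_name and merge_names are assumptions made for the example; the matching rules sift actually applies are not documented here.

```python
import re

# Hypothetical list of honorifics; the real matching rules may differ.
TITLES = re.compile(r"^(dr|mr|mrs|ms|prof)\.?\s+", re.IGNORECASE)


def canonical_name(name: str) -> str:
    """Strip a leading honorific so 'Dr. Smith' and 'Smith' compare equal."""
    return TITLES.sub("", name.strip())


def merge_names(names: list[str]) -> dict[str, list[str]]:
    """Group surface forms that share the same case-folded canonical name."""
    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(canonical_name(name).casefold(), []).append(name)
    return groups


groups = merge_names(["Dr. Smith", "Smith", "smith", "Jones"])
```

Here the three Smith variants end up in one group, while Jones stays separate.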
Create Graph Structure
- Adds entity nodes with attributes and confidence scores
- Creates relation edges between entities
- Tracks source documents for provenance
Postprocessing
Runs unless --no-postprocess is specified:
- Normalizes relation types to match domain schema
- Fixes passive voice relations (“is authored by” → “AUTHORED”)
- Removes redundant edges (duplicate relations)
- Prunes isolated entities with no connections
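The normalization, passive-voice, and pruning steps listed above might look roughly like the following. This is a minimal sketch, not sift's implementation: the Relation class, the lookup tables, and the function names are all hypothetical, and a real run would take its valid types from the domain schema. DOCUMENT nodes are kept during pruning, matching the rule that document nodes are never pruned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    rel_type: str


# Hypothetical lookup tables; in practice these would come from the domain schema.
VALID_TYPES = {"AUTHORED", "WORKS_FOR"}
TYPE_ALIASES = {"wrote": "AUTHORED", "employed_by": "WORKS_FOR"}
PASSIVE_FORMS = {"is authored by": "AUTHORED"}


def normalize(rel: Relation) -> Relation:
    """Rewrite passive forms to active voice, otherwise map aliases to schema types."""
    key = rel.rel_type.strip().lower()
    if key in PASSIVE_FORMS:
        # Passive voice reverses direction: "B is authored by A" means A AUTHORED B.
        return Relation(rel.target, rel.source, PASSIVE_FORMS[key])
    if rel.rel_type in VALID_TYPES:
        return rel
    # Types with no known alias are left unchanged.
    return Relation(rel.source, rel.target, TYPE_ALIASES.get(key, rel.rel_type))


def prune_isolated(entities: dict[str, str], relations: list[Relation]) -> dict[str, str]:
    """Drop entities that appear in no relation, but always keep DOCUMENT nodes."""
    connected = {r.source for r in relations} | {r.target for r in relations}
    return {
        name: etype
        for name, etype in entities.items()
        if name in connected or etype == "DOCUMENT"
    }


entities = {"smith": "PERSON", "paper": "WORK", "orphan": "PERSON", "doc1": "DOCUMENT"}
raw = [Relation("paper", "smith", "is authored by")]
relations = [normalize(r) for r in raw]
kept = prune_isolated(entities, relations)
```

In this example the passive relation becomes smith AUTHORED paper, and only the unconnected "orphan" entity is removed.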
Examples
Basic Usage
Skipping Postprocessing
Graph Structure
The output graph_data.json is a NetworkX MultiDiGraph serialized to JSON.
Entity IDs
Entities get stable IDs based on normalized name + type.
Provenance Tracking
Every entity and relation tracks:
- source_documents: Which documents mention this entity
- support_count: How many times this relation appears
- support_documents: Which documents support this relation
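To make the ID and provenance ideas concrete, here is a small sketch using plain dictionaries. The "type:slug" ID format and the helpers entity_id and add_mention are assumptions chosen for illustration; only the source_documents field name comes from the description above.

```python
import re


def entity_id(name: str, entity_type: str) -> str:
    """Build a deterministic ID from a normalized name and a type (hypothetical format)."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return f"{entity_type.lower()}:{slug}"


def add_mention(nodes: dict, name: str, entity_type: str, doc: str) -> str:
    """Register one mention and record which document it came from."""
    node_id = entity_id(name, entity_type)
    node = nodes.setdefault(
        node_id, {"name": name, "type": entity_type, "source_documents": []}
    )
    if doc not in node["source_documents"]:
        node["source_documents"].append(doc)
    return node_id


nodes: dict = {}
add_mention(nodes, "Jane  Smith", "PERSON", "a.json")
add_mention(nodes, "jane smith", "PERSON", "b.json")
```

Because both spellings normalize to the same ID, they resolve to a single node whose source_documents lists both files.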
Postprocessing Details
Relation Normalization
Maps undefined relation types to valid domain types.
Passive Voice Activation
Rewrites passive relations to active voice (for example, “is authored by” → “AUTHORED”).
Redundancy Removal
When multiple documents extract the same relation:
- Keeps highest confidence
- Merges evidence strings
- Tracks all supporting documents
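The merge rules above can be sketched as follows. This assumes each extracted relation is a dictionary with source, target, type, confidence, evidence, and document keys; that input shape and the merge_duplicates name are illustrative, not sift's internal representation.

```python
def merge_duplicates(relations: list[dict]) -> list[dict]:
    """Collapse relations that share (source, target, type) into one edge."""
    merged: dict[tuple, dict] = {}
    for rel in relations:
        key = (rel["source"], rel["target"], rel["type"])
        if key not in merged:
            merged[key] = {
                "source": rel["source"],
                "target": rel["target"],
                "type": rel["type"],
                "confidence": rel["confidence"],
                "evidence": [rel["evidence"]],
                "support_documents": [rel["document"]],
                "support_count": 1,
            }
            continue
        kept = merged[key]
        # Keep the highest confidence seen across documents.
        kept["confidence"] = max(kept["confidence"], rel["confidence"])
        # Merge evidence strings and supporting documents without duplicates.
        if rel["evidence"] not in kept["evidence"]:
            kept["evidence"].append(rel["evidence"])
        if rel["document"] not in kept["support_documents"]:
            kept["support_documents"].append(rel["document"])
        kept["support_count"] += 1
    return list(merged.values())


edges = merge_duplicates([
    {"source": "smith", "target": "paper", "type": "AUTHORED",
     "confidence": 0.6, "evidence": "Smith wrote it.", "document": "a.json"},
    {"source": "smith", "target": "paper", "type": "AUTHORED",
     "confidence": 0.9, "evidence": "Authored by Smith.", "document": "b.json"},
])
```

The two inputs collapse into one edge that carries the higher confidence, both evidence strings, and both supporting documents.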
Isolated Node Pruning
Removes entities with no relations (often extraction errors).
Document nodes (DOCUMENT type) are never pruned; they’re needed for sift view --source-doc.
Review Files
If relations are flagged, sift build creates relation_review.yaml.
Use sift review to approve or reject these relations.
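The flagging rule itself is a simple threshold comparison. The sketch below shows one way to express it; the 0.7 threshold is an arbitrary value picked for the example (the actual default is not stated here), and partition_by_confidence is a hypothetical helper. It does not model the layout of relation_review.yaml.

```python
def partition_by_confidence(relations: list[dict], threshold: float) -> tuple[list[dict], list[dict]]:
    """Split relations into those at or above the threshold and those flagged for review."""
    passed = [r for r in relations if r["confidence"] >= threshold]
    flagged = [r for r in relations if r["confidence"] < threshold]
    return passed, flagged


relations = [
    {"type": "AUTHORED", "confidence": 0.95},
    {"type": "WORKS_FOR", "confidence": 0.70},
    {"type": "FUNDED", "confidence": 0.40},
]
# 0.7 is an assumed threshold for illustration only.
passed, flagged = partition_by_confidence(relations, threshold=0.7)
```

Only relations strictly below the threshold are flagged, so the 0.70 relation passes. Lowering the threshold shrinks the review list.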
Schema-Free Mode
When using --domain-name schema-free, the graph builder:
- Loads the discovered schema from output/discovered_domain.yaml
- Uses discovered types for normalization
- Allows any entity/relation types not in the schema
Performance
Graph building is fast (no LLM calls):
- 100 documents: ~5 seconds
- 1,000 documents: ~30 seconds
- 10,000 documents: ~5 minutes
Output Summary
Troubleshooting
“No extractions found”
Run sift extract first. The extraction directory should contain JSON files.
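A quick way to confirm that extraction output exists is to list the JSON files yourself. This sketch assumes the default output/extractions/ location mentioned earlier and a flat directory of .json files; list_extractions is a hypothetical helper, not part of sift.

```python
from pathlib import Path


def list_extractions(extraction_dir: str = "output/extractions") -> list[Path]:
    """Return the JSON files in the extraction directory, or an empty list if it is missing."""
    directory = Path(extraction_dir)
    if not directory.is_dir():
        return []
    return sorted(directory.glob("*.json"))


files = list_extractions()
print(f"{len(files)} extraction file(s) found")
```

If this reports zero files, the build step has nothing to read.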
Domain mismatch errors
This happens if you change domains between extraction and build. The domain passed to sift build must match the one used for extraction.
Too many flagged relations
Lower the review threshold.
Graph too large to visualize
Use filters when viewing.
Next Steps
Resolve Duplicates
Find and merge duplicate entities across documents
Visualize Graph
Explore your knowledge graph interactively
Export Data
Export to GraphML, CSV, or SQLite for analysis