Document Ingestion
Ingestion is the process of loading content from a connector, splitting it into chunks, generating embeddings, and storing them for later retrieval.
Basic Ingestion
Ingestion Process
The ingestion pipeline performs these steps:
- Fetch Content - Connector yields documents with id, content, and metadata
- Content Hashing - Generate SHA-256 hash (CID) to detect changes
- Skip Unchanged - Skip documents with matching CID (no changes)
- Split into Chunks - Use text splitter to break content into smaller pieces
- Generate Embeddings - Create vector embeddings for each chunk
- Store Vectors - Save embeddings and metadata to SQLite
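The six steps can be sketched end to end as follows. This is a minimal illustration, not the library's implementation: `SourceDocument`, `MemoryStore`, and `ingestSketch` are hypothetical names, the store is an in-memory stand-in for SQLite, and the embedder is synchronous for brevity where a real one would normally be asynchronous.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes for illustration; the library's real types may differ.
interface SourceDocument {
  id: string;
  content: string;
  metadata?: Record<string, unknown>;
}

interface StoredChunk {
  documentId: string;
  text: string;
  embedding: number[];
}

type Splitter = (content: string) => string[];
type Embedder = (texts: string[]) => number[][]; // synchronous only to keep the sketch short

// In-memory stand-in for the SQLite-backed store.
class MemoryStore {
  cids = new Map<string, string>();
  chunks: StoredChunk[] = [];

  replaceDocument(id: string, cid: string, chunks: StoredChunk[]): void {
    // Drop chunks from the previous version before saving the new ones.
    this.chunks = this.chunks.filter((c) => c.documentId !== id).concat(chunks);
    this.cids.set(id, cid);
  }
}

function ingestSketch(
  documents: Iterable<SourceDocument>,
  store: MemoryStore,
  split: Splitter,
  embed: Embedder,
): { ingested: number; skipped: number } {
  let ingested = 0;
  let skipped = 0;
  for (const doc of documents) {
    // Steps 1-2: take the fetched content and hash it into a CID.
    const cid = createHash("sha256").update(doc.content).digest("hex");
    // Step 3: skip documents whose stored CID matches.
    if (store.cids.get(doc.id) === cid) {
      skipped++;
      continue;
    }
    // Steps 4-5: split into chunks and embed each chunk.
    const texts = split(doc.content);
    const vectors = embed(texts);
    // Step 6: store the embeddings alongside the chunk text.
    const chunks = texts.map((text, i) => ({
      documentId: doc.id,
      text,
      embedding: vectors[i],
    }));
    store.replaceDocument(doc.id, cid, chunks);
    ingested++;
  }
  return { ingested, skipped };
}
```

Running the sketch twice over the same documents ingests them on the first pass and skips them on the second, because the stored CIDs match.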
Configuration Options
Connector
Any connector that implements the Connector interface:
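As a rough illustration, a connector might look like the sketch below. Only `sourceId`, `ingestWhen`, and the document fields (id, content, metadata) come from this page; the method name `sources` and the helper `inMemoryConnector` are assumptions, and a real connector would likely yield documents asynchronously.

```typescript
// Hypothetical shape of a connector; consult the library's own type
// definitions for the real interface.
interface SourceDocument {
  id: string;
  content: string;
  metadata?: Record<string, unknown>;
}

interface Connector {
  sourceId: string;
  ingestWhen?: "contentChanged" | "never" | "expired";
  sources(): Iterable<SourceDocument>; // assumed method name
}

// A toy connector that serves documents from an in-memory array.
function inMemoryConnector(sourceId: string, docs: SourceDocument[]): Connector {
  return {
    sourceId,
    ingestWhen: "contentChanged",
    *sources() {
      for (const doc of docs) {
        yield doc;
      }
    },
  };
}
```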
Store
The vector store where embeddings are saved:
Embedder
Function that converts text to vector embeddings:
Splitter (Optional)
Custom text splitting function:
Text Splitting
By default, ingestion uses MarkdownTextSplitter from LangChain:
TypeScript Splitting
For code files, use language-aware splitting:
- Uses recursive character splitting with 512 character chunks
- Includes 100 character overlap between chunks
- Preserves code structure and context
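To make the idea of recursive character splitting with overlap concrete, here is a dependency-free sketch. It is a simplified stand-in, not the library's or LangChain's code: the separator list and the names `splitRecursive` and `splitCode` are assumptions chosen for illustration. The text is split on coarse separators first, falling back to finer ones only when a piece is still too long, and pieces are then merged into chunks of at most 512 characters that carry over up to 100 characters from the previous chunk.

```typescript
// Illustrative separators, coarse to fine; not the list any particular library uses.
const SEPARATORS = ["\nclass ", "\nfunction ", "\n\n", "\n", " ", ""];

// Break text into pieces of at most chunkSize characters, trying coarse
// separators first. Concatenating the pieces reproduces the input exactly.
function splitRecursive(
  text: string,
  chunkSize: number,
  separators: string[] = SEPARATORS,
): string[] {
  if (text.length === 0) return [];
  if (text.length <= chunkSize) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined || sep === "") {
    // No separators left: fall back to fixed-width cuts.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      cuts.push(text.slice(i, i + chunkSize));
    }
    return cuts;
  }
  const pieces: string[] = [];
  text.split(sep).forEach((part, i) => {
    // Keep the separator at the start of the piece it introduces.
    pieces.push(...splitRecursive(i === 0 ? part : sep + part, chunkSize, rest));
  });
  return pieces;
}

// Merge pieces into chunks no longer than chunkSize, repeating up to
// `overlap` trailing characters of the previous chunk at the start of the next.
function splitCode(text: string, chunkSize = 512, overlap = 100): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const piece of splitRecursive(text, chunkSize)) {
    if (current.length > 0 && current.length + piece.length > chunkSize) {
      chunks.push(current);
      // Shrink the carried-over tail if the next piece leaves less room.
      const room = Math.min(overlap, chunkSize - piece.length);
      current = room > 0 ? current.slice(-room) : "";
    }
    current += piece;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Because the separator stays attached to the piece it introduces, a chunk boundary tends to fall just before a `function` or `class` keyword rather than in the middle of a declaration.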
Custom Splitting
Create your own splitter:
Change Detection
Ingestion automatically detects content changes using SHA-256 hashing:
- Calculate CID from content
- Compare with stored CID
- Skip if CID matches (no changes)
- Re-process if CID differs (content changed)
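The comparison itself is small. The sketch below uses Node's built-in `crypto` module; the function names `computeCid` and `needsIngestion` are hypothetical, and the hex encoding of the digest is an assumption about how the CID is represented.

```typescript
import { createHash } from "node:crypto";

// Derive a content identifier (CID) as the hex-encoded SHA-256 digest.
function computeCid(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// A document needs (re-)processing when no CID is stored yet
// or the stored CID differs from the CID of the current content.
function needsIngestion(storedCid: string | undefined, content: string): boolean {
  return storedCid !== computeCid(content);
}
```

Any change to the content, including whitespace, produces a different CID and therefore triggers re-processing.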
Ingestion Strategies
Connectors can specify when to ingest using ingestWhen:
contentChanged (Default)
never
expired
Batching
Ingestion automatically batches embeddings to control memory usage:
Progress Tracking
Track ingestion progress with a callback:
Multiple Sources
Ingest from multiple connectors. Each connector uses its own sourceId for tracking.
Error Handling
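One way to keep a failing source from aborting the whole job is to wrap each ingestion call in its own try-catch and collect the failures. In this sketch, `ingestAll`, `ConnectorLike`, and the `ingest` parameter are hypothetical; pass in whatever function actually starts an ingestion run for a connector.

```typescript
// Minimal stand-in for a connector; only sourceId is needed for reporting.
interface ConnectorLike {
  sourceId: string;
}

interface IngestFailure {
  sourceId: string;
  error: unknown;
}

// Run ingestion for each connector in turn, logging failures
// instead of letting one error stop the remaining connectors.
async function ingestAll(
  connectors: ConnectorLike[],
  ingest: (connector: ConnectorLike) => Promise<void>,
): Promise<IngestFailure[]> {
  const failures: IngestFailure[] = [];
  for (const connector of connectors) {
    try {
      await ingest(connector);
    } catch (error) {
      console.error(`Ingestion failed for ${connector.sourceId}:`, error);
      failures.push({ sourceId: connector.sourceId, error });
    }
  }
  return failures;
}
```

Returning the list of failures lets the caller decide whether to retry, alert, or exit with a non-zero status once every connector has been attempted.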
Best Practices
Choose Appropriate Chunk Sizes
Smaller chunks (512 chars) for code, larger chunks (1000+ chars) for prose.
Use Language-Aware Splitting
For code files, use language-specific splitters like splitTypeScript.
Batch Large Jobs
Ingestion automatically batches, but you can also batch connector sources.
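If you batch sources yourself, a small generic helper is usually enough. The `batches` function below is a hypothetical utility, not part of the library; it slices any list (file paths, URLs, document ids) into fixed-size groups that can be ingested one group at a time.

```typescript
// Yield consecutive groups of at most `size` items.
function* batches<T>(items: readonly T[], size: number): Generator<T[]> {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}
```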
Track Progress
Use the progress callback for long-running ingestion jobs.
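A progress callback might be wired up along these lines. The event shape `IngestProgress` and both function names are assumptions made for this sketch; the arguments the library actually passes to its callback may differ.

```typescript
// Hypothetical progress event for illustration.
interface IngestProgress {
  processed: number;
  total: number;
  documentId: string;
}

// Process each id and report progress after every item.
function processWithProgress(
  ids: string[],
  work: (id: string) => void,
  onProgress?: (event: IngestProgress) => void,
): void {
  ids.forEach((id, index) => {
    work(id);
    onProgress?.({ processed: index + 1, total: ids.length, documentId: id });
  });
}

// Render a progress event as a single log line.
function formatProgress(event: IngestProgress): string {
  const percent = Math.round((event.processed / event.total) * 100);
  return `[${percent}%] ${event.processed}/${event.total} ${event.documentId}`;
}
```

Passing `(event) => console.log(formatProgress(event))` as the callback prints one line per processed document.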
Handle Errors Gracefully
Wrap ingestion in try-catch and log failures without stopping the entire job.
Next Steps
Connectors
Explore available data connectors
Search
Search ingested content
Embeddings
Learn about embedding models