## The chunk size dilemma

When indexing documents, you face a trade-off.

### Small chunks (128-256 tokens)
Pros:

- Precise semantic matching
- Low false positive rate
- Better retrieval accuracy

Cons:

- Missing surrounding context
- Incomplete information
- Poor for generation
### Large chunks (512-1024 tokens)

Pros:

- Rich context for generation
- Complete information
- Better for Q&A

Cons:

- Noisy retrieval
- Higher false positives
- Worse precision
## How it works
1. Split documents into parent and child chunks
   - Parents: large chunks (512-1024 tokens) with full context
   - Children: small chunks (128-256 tokens) for precise matching

2. Index only child chunks
   - Store child embeddings in the vector database
   - Each child carries metadata linking it to its parent ID

3. Store parent documents separately
   - The ParentDocumentStore maintains parent text in memory
   - It maps child IDs to parent IDs and full parent text

4. During search
   - Retrieve the top-k child chunks from the vector DB
   - Map child IDs to parent IDs
   - Return unique parent documents (deduplicated)
## Basic usage
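A minimal, self-contained sketch of the pattern. All names here (`ParentDocumentRetriever`, `split_tokens`, the word-overlap scorer) are illustrative stand-ins, not the library's real API; a real setup would use a tokenizer, an embedding model, and a vector database.

```python
from collections import Counter

def split_tokens(text, size):
    """Greedy whitespace splitter standing in for a real token-based splitter."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

class ParentDocumentRetriever:
    """Indexes small child chunks, returns their large parent chunks."""

    def __init__(self, parent_size=1024, child_size=256):
        self.parent_size, self.child_size = parent_size, child_size
        self.children = {}         # child_id -> child text (what gets embedded)
        self.child_to_parent = {}  # child_id -> parent_id
        self.parents = {}          # parent_id -> full parent text

    def index(self, document):
        for p_id, parent in enumerate(split_tokens(document, self.parent_size)):
            self.parents[p_id] = parent
            for child in split_tokens(parent, self.child_size):
                c_id = len(self.children)
                self.children[c_id] = child
                self.child_to_parent[c_id] = p_id

    def _score(self, query, text):
        # Word-overlap score as a stand-in for vector similarity.
        q, t = Counter(query.lower().split()), Counter(text.lower().split())
        return sum((q & t).values())

    def search(self, query, top_k=5):
        ranked = sorted(self.children,
                        key=lambda c: -self._score(query, self.children[c]))
        seen, results = set(), []
        for c_id in ranked[: top_k * 2]:          # over-fetch children
            p_id = self.child_to_parent[c_id]
            if p_id not in seen:                  # deduplicate parents
                seen.add(p_id)
                results.append(self.parents[p_id])
        return results[:top_k]

retriever = ParentDocumentRetriever(parent_size=4, child_size=2)
retriever.index("alpha beta gamma delta epsilon zeta eta theta")
print(retriever.search("zeta", top_k=1))  # the parent containing the matching child
```

Matching happens against the small child chunks, but the caller only ever sees full parent chunks.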
## Configuration
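A hypothetical configuration object capturing the knobs discussed on this page; the field names are illustrative, and the defaults follow the size ranges given above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrieverConfig:
    parent_chunk_size: int = 1024  # tokens per parent chunk (512-1024 typical)
    child_chunk_size: int = 256    # tokens per child chunk (128-256 typical)
    top_k: int = 5                 # unique parent documents to return
    overfetch_factor: int = 2      # fetch top_k * overfetch_factor children

    def __post_init__(self):
        # Children must be strictly smaller than parents for the scheme to help.
        if self.child_chunk_size >= self.parent_chunk_size:
            raise ValueError("child_chunk_size must be smaller than parent_chunk_size")
```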
## ParentDocumentStore
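A minimal dict-backed sketch of such a store (a hypothetical implementation, including the pickle persistence this page mentions; the real class may differ):

```python
import pickle

class ParentDocumentStore:
    """In-memory store mapping child chunk IDs to full parent documents."""

    def __init__(self):
        self._child_to_parent = {}  # child_id -> parent_id
        self._parents = {}          # parent_id -> full parent text

    def add_parent(self, parent_id, text, child_ids):
        self._parents[parent_id] = text
        for child_id in child_ids:
            self._child_to_parent[child_id] = parent_id

    def get_parent(self, child_id):
        """Resolve a retrieved child chunk to its full parent text."""
        return self._parents[self._child_to_parent[child_id]]

    def save(self, path):
        # Persist both mappings to a pickle file on disk.
        with open(path, "wb") as f:
            pickle.dump((self._child_to_parent, self._parents), f)

    def load(self, path):
        with open(path, "rb") as f:
            self._child_to_parent, self._parents = pickle.load(f)
```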
The parent store maintains chunk-to-parent mappings in memory.

## Indexing pipeline internals
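A condensed sketch of the indexing flow, under the same assumptions as above (whitespace splitting stands in for token-based splitting, and helper names are hypothetical):

```python
def split_tokens(text, size):
    """Greedy whitespace splitter standing in for a real token-based splitter."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def index_document(doc_id, text, parent_size=1024, child_size=256):
    """Return (parent_store, child_records); only child_records get embedded."""
    parent_store = {}   # parent_id -> parent text (kept out of the vector DB)
    child_records = []  # what actually goes into the vector index
    for i, parent in enumerate(split_tokens(text, parent_size)):
        parent_id = f"{doc_id}:parent:{i}"
        parent_store[parent_id] = parent
        for j, child in enumerate(split_tokens(parent, child_size)):
            child_records.append({
                "id": f"{parent_id}:child:{j}",
                "text": child,                         # this text is embedded
                "metadata": {"parent_id": parent_id},  # link back to the parent
            })
    return parent_store, child_records
```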
The LangChain indexing pipeline splits each document into parent chunks, splits each parent into child chunks, records the child-to-parent mapping, and embeds and indexes only the children.

## Search pipeline internals
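A sketch of the search side. `vector_search` here is a stand-in callable for the vector database query, and `parent_store` is any child-ID-to-parent mapping; both names are illustrative.

```python
def search_parents(query, vector_search, parent_store, top_k=5):
    """Retrieve children, map them to parents, deduplicate, return parents."""
    hits = vector_search(query, k=top_k * 2)  # over-fetch children
    seen, parents = set(), []
    for hit in hits:
        parent_id = hit["metadata"]["parent_id"]
        if parent_id not in seen:             # each parent returned once
            seen.add(parent_id)
            parents.append(parent_store[parent_id])
        if len(parents) == top_k:             # enough unique parents
            break
    return parents
```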
The search pipeline retrieves child chunks from the vector database but returns their parent documents.

## Why over-fetch children?
The search pipeline retrieves `top_k * 2` children because:
- Multiple children may belong to the same parent
- After deduplication, you might have fewer than top_k unique parents
- Over-fetching ensures you have enough unique parents
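A toy illustration of the three points above, with hypothetical child and parent IDs:

```python
# Three top-ranked children all belong to the same parent.
child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p1", "c4": "p2"}
ranked_children = ["c1", "c2", "c3", "c4"]  # best-first vector hits

def unique_parents(children):
    """Map children to parents, keeping first-seen order and dropping repeats."""
    seen = []
    for c in children:
        p = child_to_parent[c]
        if p not in seen:
            seen.append(p)
    return seen

top_k = 2
print(unique_parents(ranked_children[:top_k]))       # ['p1'] -- too few parents
print(unique_parents(ranked_children[:top_k * 2]))   # ['p1', 'p2'] -- enough
```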
## Chunk size recommendations
- General purpose
- Technical docs
- Short-form content
- Long-form content
## Trade-offs

### Storage overhead
- Child chunks are indexed in vector DB (normal storage)
- Parent documents stored in ParentDocumentStore (in-memory pickle file)
- Total storage: ~1.5-2x standard indexing
- Mitigation: Parent store can be compressed or moved to Redis/database
### Memory usage
- ParentDocumentStore loads into memory during search
- For 1M parents with 1KB text each: ~1GB RAM
- Mitigation: Use database-backed parent store for production
### Deduplication complexity
- Need to track which parents have been returned
- Over-fetching required to ensure enough unique parents
- Benefit: Handled automatically by ParentDocumentStore
## Production considerations
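The mitigations above suggest moving the parent store out of process memory for production. A sketch of a database-backed store, using SQLite from the standard library as a stand-in (Redis or any key-value store would follow the same shape; the class and column names are illustrative):

```python
import sqlite3

class SQLiteParentStore:
    """Database-backed parent store: parents live on disk, not in RAM."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS parents (id TEXT PRIMARY KEY, text TEXT)")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS children (id TEXT PRIMARY KEY, parent_id TEXT)")

    def add_parent(self, parent_id, text, child_ids):
        self.db.execute("INSERT OR REPLACE INTO parents VALUES (?, ?)",
                        (parent_id, text))
        self.db.executemany("INSERT OR REPLACE INTO children VALUES (?, ?)",
                            [(c, parent_id) for c in child_ids])
        self.db.commit()

    def get_parent(self, child_id):
        """Resolve a retrieved child ID to its parent text with a single join."""
        row = self.db.execute(
            "SELECT p.text FROM parents p "
            "JOIN children c ON c.parent_id = p.id WHERE c.id = ?",
            (child_id,)).fetchone()
        return row[0] if row else None
```

With this shape, memory usage stays flat regardless of corpus size, at the cost of one lookup per returned parent.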
## See also
- Contextual compression - Further reduce parent document tokens
- Query enhancement - Improve child chunk retrieval
- Chunking strategies - Optimize chunk sizes for your content