Why Chunking Matters
Chunking is the process of splitting documents into smaller segments before embedding. It’s crucial for RAG because:
Context Windows
LLMs have token limits (e.g., GPT-4 Turbo: 128K tokens). Entire documents often don’t fit.
Relevance
Smaller chunks = more precise retrieval. “Page 47, paragraph 3” is more useful than “entire manual”.
Embedding Quality
Embeddings capture meaning better for focused segments vs entire documents.
Search Precision
Find exactly relevant sections without irrelevant surrounding text.
How Chunking Works in Arcana
Chunking runs during ingestion; see lib/arcana/ingest.ex:56.
Chunk Structure
Every chunk returned by a chunker must include the fields defined in lib/arcana/chunker.ex:49-58.
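As an illustration only, a chunk might be shaped like the map below. The field names here are assumptions, not Arcana’s actual structure; lib/arcana/chunker.ex:49-58 remains the authoritative reference.

```elixir
# Hypothetical chunk shape -- field names are illustrative assumptions,
# not Arcana's real struct. See lib/arcana/chunker.ex:49-58.
chunk = %{
  text: "Arcana splits documents before embedding.",
  index: 0,
  metadata: %{source: "guide.md"}
}
```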
Default Chunker Configuration
Arcana uses the text_chunker library with smart defaults:
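As an illustration, a global configuration in config/config.exs might look like the sketch below. The option names (`chunk_size`, `chunk_overlap`) and the `Arcana.Chunker.Default` module name are assumptions modeled on text_chunker’s options, not confirmed Arcana settings; verify them against lib/arcana/chunker/default.ex.

```elixir
# config/config.exs -- hypothetical option names; verify against
# Arcana's docs and lib/arcana/chunker/default.ex before relying on them.
import Config

config :arcana,
  chunker: Arcana.Chunker.Default,
  chunk_size: 500,
  chunk_overlap: 50
```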
- Configuration
- Size Units
- Overlap
Defaults come from lib/arcana/chunker/default.ex:26-29; configure them globally or override them per-ingestion.
Format-Aware Chunking
The default chunker preserves document structure:
- Markdown
- Code
- Plain Text
Respects headings and sections.
Benefits:
- Keeps related content together
- Preserves hierarchical structure
- Better semantic coherence
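To make format-awareness concrete, here is a minimal, library-free sketch that splits markdown into sections at level-1 and level-2 headings. This is an illustration of the idea, not Arcana’s or text_chunker’s actual implementation:

```elixir
defmodule HeadingSplitter do
  # Split markdown into sections, each beginning at a `# ` or `## ` heading,
  # so a heading stays attached to the body text beneath it.
  def split(markdown) do
    markdown
    |> String.split(~r/\n(?=##? )/)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
  end
end

HeadingSplitter.split("# Intro\ntext\n## Details\nmore")
# => ["# Intro\ntext", "## Details\nmore"]
```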
Chunking Best Practices
Chunk Size Guidelines
- Small Chunks (200-400 tokens)
- Medium Chunks (400-600 tokens)
- Large Chunks (800-1200 tokens)
Best for:
- Precise fact retrieval
- Question answering
- FAQ documents
- API references

Pros:
- ✅ High precision
- ✅ Fast embedding generation
- ✅ Fits more context chunks in LLM window

Cons:
- ❌ May split related concepts
- ❌ More chunks to search through
- ❌ Less context per chunk
Overlap Recommendations
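To illustrate what overlap buys you, here is a character-based sketch: each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. Arcana’s real chunker works in its configured size units, not raw characters:

```elixir
defmodule OverlapChunker do
  # Fixed-size character chunks where consecutive chunks share `overlap`
  # characters; the step between chunk starts is size - overlap.
  def chunk(text, size, overlap) when size > overlap do
    text
    |> String.graphemes()
    |> Enum.chunk_every(size, size - overlap)
    |> Enum.map(&Enum.join/1)
  end
end

OverlapChunker.chunk("abcdefghij", 4, 2)
# => ["abcd", "cdef", "efgh", "ghij", "ij"]
```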
Context Window Calculation
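As a back-of-the-envelope check (the budget numbers below are illustrative assumptions, not recommendations):

```elixir
context_window = 128_000     # e.g. GPT-4 Turbo
prompt_budget = 2_000        # system + user prompt (assumption)
answer_budget = 2_000        # reserved for the completion (assumption)
chunk_size = 500             # tokens per chunk

available = context_window - prompt_budget - answer_budget
max_chunks = div(available, chunk_size)
# => 248 chunks of 500 tokens fit in the remaining budget
```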
Ensure retrieved chunks fit within the LLM context window.
Custom Chunkers
Implement the Arcana.Chunker behaviour for custom logic.
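Here is a skeleton of a paragraph-based chunker. The `@behaviour` reference and the `chunk/2` callback name are assumptions about Arcana’s interface; check lib/arcana/chunker.ex for the real callback signature and required chunk fields before using this:

```elixir
defmodule MyApp.ParagraphChunker do
  # @behaviour Arcana.Chunker  # uncomment in your app; the callback name
  # and return shape below are assumptions, not the verified interface.

  # Split on blank lines so each paragraph becomes one chunk.
  def chunk(text, _opts) do
    text
    |> String.split(~r/\n{2,}/)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
    |> Enum.with_index()
    |> Enum.map(fn {para, i} -> %{text: para, index: i} end)
  end
end

MyApp.ParagraphChunker.chunk("One.\n\nTwo.", [])
# => [%{text: "One.", index: 0}, %{text: "Two.", index: 1}]
```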
- Semantic Chunker
- Sliding Window Chunker
- Heading-Based Chunker
Split by topic changes using embeddings.
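A sketch of the semantic-chunking idea: compare consecutive sentence embeddings with cosine similarity and start a new chunk when similarity drops below a threshold. The embedder here is a stub; a real implementation would call your embedding model:

```elixir
defmodule SemanticSplit do
  # Group consecutive sentences; open a new group when the cosine similarity
  # between neighbouring sentence embeddings falls below `threshold`.
  def split(sentences, embed_fn, threshold \\ 0.7) do
    sentences
    |> Enum.map(&{&1, embed_fn.(&1)})
    |> Enum.reduce([], fn
      pair, [] ->
        [[pair]]

      {_sent, emb} = pair, [[{_, prev_emb} | _] = current | rest] ->
        if cosine(emb, prev_emb) >= threshold do
          [[pair | current] | rest]
        else
          [[pair], current | rest]
        end
    end)
    |> Enum.reverse()
    |> Enum.map(fn group ->
      group |> Enum.reverse() |> Enum.map(&elem(&1, 0)) |> Enum.join(" ")
    end)
  end

  defp cosine(a, b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    norm = fn v -> :math.sqrt(Enum.sum(Enum.map(v, &(&1 * &1)))) end
    dot / (norm.(a) * norm.(b))
  end
end

# Stub embedder: "cats..." sentences map to one direction, everything else
# to an orthogonal one, so the topic change forces a split.
embed = fn
  "cats" <> _ -> [1.0, 0.0]
  _ -> [0.0, 1.0]
end

SemanticSplit.split(["cats purr", "cats nap", "rockets fly"], embed)
# => ["cats purr cats nap", "rockets fly"]
```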
Real-World Examples
- Documentation
- API Reference
- Research Papers
- Customer Support
Optimization Tips
Measure Retrieval Quality
Common Pitfalls
Best Practices Summary
Use Format Hints
Always specify the format for structured content.
Token-Based Sizing
Use tokens (not characters) for LLM compatibility.
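Real token counts come from the model’s tokenizer; as a rough rule of thumb (an assumption of about 4 characters per token for English prose):

```elixir
# Crude heuristic: ~4 characters per token for English text.
# Use the model's actual tokenizer when precision matters.
estimate_tokens = fn text -> div(String.length(text), 4) end

estimate_tokens.("Chunking splits documents into smaller segments.")
# => 12
```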
Test with Real Queries
Evaluate chunking with actual search queries.
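One way to evaluate: pair real user queries with the document each should retrieve and measure the hit rate. The `search` function below is a stub standing in for your actual retrieval call; the queries and document ids are made up for illustration:

```elixir
# Each case pairs a real user query with the doc id it should retrieve.
cases = [
  {"how do I reset my password?", "auth-guide"},
  {"what ports does the service use?", "network-guide"}
]

# Stub retrieval -- replace with your actual search call.
search = fn
  "how do I reset my password?" -> ["auth-guide", "faq"]
  _ -> ["network-guide"]
end

hits = Enum.count(cases, fn {query, expected} -> expected in search.(query) end)
hit_rate = hits / length(cases)
# => 1.0
```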
Monitor Statistics
Track chunk size distribution and adjust.
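A quick way to eyeball the distribution of chunk sizes (character lengths here for simplicity; substitute token counts in practice):

```elixir
chunk_texts = ["short", "a medium sized chunk", "a considerably longer chunk of text"]
lengths = Enum.map(chunk_texts, &String.length/1)

stats = %{
  count: length(lengths),
  min: Enum.min(lengths),
  max: Enum.max(lengths),
  mean: Enum.sum(lengths) / length(lengths)
}
# => %{count: 3, min: 5, max: 35, mean: 20.0}
```

If the min is tiny or the max far exceeds your target size, revisit your chunk size and format settings.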
Next Steps
RAG Pipeline
See how chunking fits in the complete RAG workflow
Embeddings
Learn how chunks are converted to vector embeddings
Search Modes
Understand how chunked content is searched
Evaluation
Measure and optimize your chunking strategy