Arcana splits documents into smaller chunks before embedding. This improves retrieval accuracy by creating focused, semantically coherent segments.
## Quick Start

```elixir
# config/config.exs

# Default chunker (450 tokens, 50-token overlap)
config :arcana, chunker: :default

# Custom chunk sizes
config :arcana, chunker: {:default, chunk_size: 512, chunk_overlap: 100}

# Custom chunker module
config :arcana, chunker: MyApp.SemanticChunker
```
## Default Chunker

The default chunker uses the `text_chunker` library to split text into overlapping windows.
### Basic Configuration

```elixir
# config/config.exs

# Use defaults (450 tokens, 50-token overlap)
config :arcana, chunker: :default

# Customize chunk size and overlap
config :arcana, chunker: {
  :default,
  chunk_size: 512,
  chunk_overlap: 100
}
```
### Options

| Option | Default | Description |
| --- | --- | --- |
| `:chunk_size` | `450` | Maximum tokens per chunk |
| `:chunk_overlap` | `50` | Overlapping tokens between chunks |
| `:format` | `:plaintext` | Text format (`:plaintext`, `:markdown`, or `:elixir`) |
| `:size_unit` | `:tokens` | Measurement unit (`:tokens` or `:characters`) |
### Examples

#### Token-Based

Best for most use cases:

```elixir
config :arcana, chunker: {
  :default,
  chunk_size: 450,    # ~300-400 words
  chunk_overlap: 50,  # ~30-40 words of overlap
  size_unit: :tokens
}
```

Tokens approximate words (1 token ≈ 0.75 words for English).

#### Character-Based

For precise byte limits:

```elixir
config :arcana, chunker: {
  :default,
  chunk_size: 2000,    # 2000 characters
  chunk_overlap: 200,  # 200-character overlap
  size_unit: :characters
}
```

#### Markdown-Aware

Respects markdown structure:

```elixir
config :arcana, chunker: {
  :default,
  chunk_size: 512,
  chunk_overlap: 50,
  format: :markdown  # Preserves headings and lists
}
```

The markdown format attempts to split at semantic boundaries (headings, paragraphs).

#### Code-Aware

For Elixir code:

```elixir
config :arcana, chunker: {
  :default,
  chunk_size: 300,
  chunk_overlap: 30,
  format: :elixir  # Respects function boundaries
}
```
## Chunk Size Guidelines

Choose chunk size based on your use case:

**Small chunks (200-300 tokens)**

- Best for: precise answers, FAQ-style content, short definitions
- Trade-offs: may lack context; more chunks to process

**Medium chunks (400-600 tokens)**

- Best for: general-purpose retrieval, articles and blogs, documentation
- Trade-offs: good balance of context and precision

**Large chunks (800-1000 tokens)**

- Best for: rich context, complex explanations, long-form content
- Trade-offs: may dilute relevance; slower embedding

**Custom strategy**

- Best for: domain-specific needs, semantic boundaries, multi-document synthesis
- Trade-offs: requires a custom implementation
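As a rough sanity check when choosing a size, you can estimate how many chunks a document will produce from its token count, chunk size, and overlap. The `ChunkEstimate` module below is purely illustrative and not part of Arcana:

```elixir
defmodule ChunkEstimate do
  @moduledoc """
  Rough chunk-count estimate for an overlapping window:
  each chunk after the first advances by (chunk_size - chunk_overlap) tokens.
  """

  def estimate_chunks(total_tokens, chunk_size, chunk_overlap)
      when total_tokens > 0 and chunk_size > chunk_overlap do
    stride = chunk_size - chunk_overlap
    max(1, ceil((total_tokens - chunk_overlap) / stride))
  end
end

# A ~10,000-token document:
ChunkEstimate.estimate_chunks(10_000, 300, 30)    # small chunks  -> 37
ChunkEstimate.estimate_chunks(10_000, 500, 50)    # medium chunks -> 23
ChunkEstimate.estimate_chunks(10_000, 1000, 100)  # large chunks  -> 11
```

Smaller chunks mean more embeddings to compute and store, which is the "more chunks to process" trade-off above in concrete numbers.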
## Chunk Overlap

Overlap ensures important information at chunk boundaries isn't lost:

```elixir
# Example with overlap
chunk_size: 100, chunk_overlap: 20

# Chunk 1: [word1...word100]
# Chunk 2: [word81...word180]   # last 20 words of Chunk 1 repeated
# Chunk 3: [word161...word260]
```
Recommended overlap: 10-15% of chunk size
Too much overlap wastes storage and embedding compute. Too little risks losing context at boundaries.
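Following the 10-15% rule, you can derive the overlap from the chunk size rather than hard-coding both values. This is a sketch for `config/config.exs`; the 12% factor is an arbitrary midpoint, not an Arcana default:

```elixir
# config/config.exs
chunk_size = 450
chunk_overlap = round(chunk_size * 0.12)  # 54 tokens, within the 10-15% band

config :arcana, chunker: {:default, chunk_size: chunk_size, chunk_overlap: chunk_overlap}
```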
## Custom Chunker

Implement the `Arcana.Chunker` behaviour for custom splitting logic.
### Semantic Chunker Example

```elixir
defmodule MyApp.SemanticChunker do
  @behaviour Arcana.Chunker

  @impl true
  def chunk(text, opts) do
    # Split on double newlines (paragraph boundaries)
    text
    |> String.split(~r/\n\n+/)
    |> Enum.reject(&blank?/1)
    |> combine_small_paragraphs(opts)
    |> Enum.with_index()
    |> Enum.map(fn {text, index} ->
      %{
        text: text,
        chunk_index: index,
        token_count: estimate_tokens(text)
      }
    end)
  end

  defp combine_small_paragraphs(paragraphs, opts) do
    min_size = Keyword.get(opts, :min_chunk_size, 100)

    paragraphs
    |> Enum.chunk_while(
      "",
      fn para, acc ->
        combined = if acc == "", do: para, else: acc <> "\n\n" <> para

        if estimate_tokens(combined) >= min_size do
          {:cont, combined, ""}
        else
          {:cont, combined}
        end
      end,
      fn acc -> {:cont, acc, ""} end
    )
    |> Enum.reject(&blank?/1)
  end

  defp estimate_tokens(text) do
    # Rough estimate: ~4 characters per token
    max(1, div(String.length(text), 4))
  end

  defp blank?(str), do: String.trim(str) == ""
end
```
### Configuration

```elixir
# config/config.exs
config :arcana, chunker: MyApp.SemanticChunker

# With options
config :arcana, chunker: {MyApp.SemanticChunker, min_chunk_size: 150}
```
Your chunker must return a list of maps with these required keys:
```elixir
[
  %{
    text: "chunk content",  # Required: the chunk text
    chunk_index: 0,         # Required: zero-based index
    token_count: 120        # Required: estimated token count
  },
  # ... more chunks
]
```
You can include additional keys in the chunk map. They’ll be stored in the chunk’s metadata.
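For instance, a chunker could attach the source heading and format of each chunk. The `:section` and `:source_format` keys below are hypothetical examples of such metadata, not keys Arcana requires:

```elixir
%{
  text: "chunk content",
  chunk_index: 0,
  token_count: 120,
  # Extra keys (illustrative) are stored as chunk metadata:
  section: "Installation",
  source_format: :markdown
}
```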
## Function-Based Chunker

For simple cases, provide a function directly:

```elixir
config :arcana, chunker: fn text, _opts ->
  # Split on sentence boundaries
  text
  |> String.split(~r/[.!?]+\s+/)
  |> Enum.with_index()
  |> Enum.map(fn {sentence, idx} ->
    %{
      text: sentence,
      chunk_index: idx,
      token_count: div(String.length(sentence), 4)
    }
  end)
end
```
## Per-Call Override

Override the global chunker for specific ingestions:

```elixir
# Use a custom chunker for this document
Arcana.ingest(text,
  repo: MyApp.Repo,
  chunker: MyApp.SemanticChunker
)

# Override with custom options
Arcana.ingest(text,
  repo: MyApp.Repo,
  chunker: {:default, chunk_size: 1000, chunk_overlap: 200}
)

# Use a function
Arcana.ingest(text,
  repo: MyApp.Repo,
  chunker: fn text, _opts ->
    # Custom chunking logic
    [...]
  end
)
```
## Advanced Chunker Patterns

### Markdown-Aware Chunker

Split on headings while respecting chunk size:

```elixir
defmodule MyApp.MarkdownChunker do
  @behaviour Arcana.Chunker

  @impl true
  def chunk(text, opts) do
    max_size = Keyword.get(opts, :chunk_size, 450)

    # Split on markdown headings
    text
    |> String.split(~r/\n(?=#+\s)/)
    |> Enum.flat_map(&maybe_split_large_section(&1, max_size))
    |> Enum.with_index()
    |> Enum.map(fn {text, idx} ->
      %{
        text: String.trim(text),
        chunk_index: idx,
        token_count: estimate_tokens(text)
      }
    end)
  end

  defp maybe_split_large_section(section, max_size) do
    if estimate_tokens(section) <= max_size do
      [section]
    else
      # Section too large, split by paragraphs
      section
      |> String.split(~r/\n\n+/)
      |> chunk_paragraphs(max_size)
    end
  end

  # Greedily combine paragraphs, emitting a chunk whenever
  # adding the next paragraph would exceed max_size tokens
  defp chunk_paragraphs(paragraphs, max_size) do
    Enum.chunk_while(
      paragraphs,
      "",
      fn para, acc ->
        combined = if acc == "", do: para, else: acc <> "\n\n" <> para

        if estimate_tokens(combined) > max_size and acc != "" do
          {:cont, acc, para}
        else
          {:cont, combined}
        end
      end,
      fn
        "" -> {:cont, ""}
        acc -> {:cont, acc, ""}
      end
    )
  end

  defp estimate_tokens(text), do: max(1, div(String.length(text), 4))
end
```
### Sliding Window Chunker

Create overlapping chunks with precise control:

```elixir
defmodule MyApp.SlidingWindowChunker do
  @behaviour Arcana.Chunker

  @impl true
  def chunk(text, opts) do
    window_size = Keyword.get(opts, :chunk_size, 450)
    stride = Keyword.get(opts, :stride, 400)

    text
    |> tokenize()
    # Leftover [] keeps the shorter final window instead of
    # discarding the trailing tokens of the document
    |> Enum.chunk_every(window_size, stride, [])
    |> Enum.with_index()
    |> Enum.map(fn {chunk_tokens, idx} ->
      %{
        text: Enum.join(chunk_tokens, " "),
        chunk_index: idx,
        token_count: length(chunk_tokens)
      }
    end)
  end

  defp tokenize(text) do
    # Simple whitespace tokenization
    String.split(text, ~r/\s+/)
  end
end
```
## Testing Chunkers

Test your chunker implementation:

```elixir
defmodule MyApp.SemanticChunkerTest do
  use ExUnit.Case

  test "chunks text into semantic segments" do
    text = """
    First paragraph with some content.

    Second paragraph with more content.

    Third paragraph.
    """

    # min_chunk_size: 1 keeps each paragraph as its own chunk
    chunks = MyApp.SemanticChunker.chunk(text, min_chunk_size: 1)

    assert length(chunks) == 3
    assert Enum.at(chunks, 0).chunk_index == 0
    assert Enum.at(chunks, 0).text =~ "First paragraph"
  end

  test "respects minimum chunk size" do
    text = "Short.\n\nAlso short.\n\nCombined they're long enough."

    chunks = MyApp.SemanticChunker.chunk(text, min_chunk_size: 50)

    # Short paragraphs should be combined
    assert length(chunks) < 3
  end
end
```
## Best Practices

- **Match content type** - use `:markdown` format for docs, `:plaintext` for articles
- **Test with real data** - chunk sizes that work in theory may not work in practice
- **Preserve context** - use 10-15% overlap to avoid losing information at boundaries
- **Consider embedding model limits** - most models max out at 512 tokens
- **Monitor chunk distribution** - aim for consistent chunk sizes
- **Use semantic boundaries** - split at paragraphs and sections when possible
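To monitor chunk distribution in practice, run your chunker over a representative document and inspect the spread of token counts. This sketch uses the `MyApp.SemanticChunker` module from earlier and assumes `sample_text` holds a sample document:

```elixir
chunks = MyApp.SemanticChunker.chunk(sample_text, [])
counts = Enum.map(chunks, & &1.token_count)

stats = %{
  chunks: length(counts),
  min: Enum.min(counts),
  max: Enum.max(counts),
  mean: Enum.sum(counts) / length(counts)
}

IO.inspect(stats, label: "chunk token distribution")
```

A large gap between `min` and `max` usually means small fragments or oversized sections are slipping through, which is a signal to tune `:min_chunk_size` or `:chunk_size`.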
## Troubleshooting

**Chunks are too small**

Increase `chunk_size` or implement logic to combine small chunks:

```elixir
config :arcana, chunker: {:default, chunk_size: 600}
```

**Important info split across chunks**

Increase `chunk_overlap`:

```elixir
config :arcana, chunker: {:default, chunk_overlap: 100}
```

**Search results lack context**

Use larger chunks or implement a custom chunker that respects semantic boundaries:

```elixir
config :arcana, chunker: {:default, chunk_size: 800}
```

**Too many chunks generated**

Increase chunk size or reduce overlap:

```elixir
config :arcana, chunker: {
  :default,
  chunk_size: 600,
  chunk_overlap: 30
}
```
## Next Steps

- **Embeddings** - configure embedding providers
- **Vector Stores** - choose a storage backend