Why Chunking Matters

Chunking is the process of splitting documents into smaller segments before embedding. It’s crucial for RAG because:

Context Windows

LLMs have finite context windows (e.g., GPT-4o: 128K tokens). Entire documents often don't fit.

Relevance

Smaller chunks = more precise retrieval. “Page 47, paragraph 3” is more useful than “entire manual”.

Embedding Quality

Embeddings capture meaning better for focused segments than for entire documents.

Search Precision

Retrieve exactly the relevant sections without pulling in irrelevant surrounding text.

How Chunking Works in Arcana

From lib/arcana/ingest.ex:56:
# 1. Configure chunker
chunker_config = Arcana.Config.resolve_chunker(opts)
# => {Arcana.Chunker.Default, [chunk_size: 450, chunk_overlap: 50]}

# 2. Split text into chunks
chunks = Chunker.chunk(chunker_config, text, chunk_opts)
# => [
#   %{text: "...", chunk_index: 0, token_count: 342},
#   %{text: "...", chunk_index: 1, token_count: 385},
#   %{text: "...", chunk_index: 2, token_count: 421}
# ]

# 3. Each chunk is embedded and stored separately
Enum.each(chunks, fn chunk ->
  {:ok, embedding} = Embedder.embed(embedder, chunk.text)
  repo.insert!(%Chunk{text: chunk.text, embedding: embedding, ...})
end)

Chunk Structure

Every chunk returned by a chunker must include:
%{
  text: "The actual chunk content...",  # required
  chunk_index: 0,                        # required, 0-based position
  token_count: 342                       # required, estimated tokens
}
From lib/arcana/chunker.ex:49-58.
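To make the contract concrete, here is a minimal sketch of a chunker that produces maps in this shape by splitting on blank lines. The module name and the ~4 characters-per-token estimate are illustrative assumptions, not Arcana's actual tokenizer:

```elixir
defmodule MyApp.ParagraphChunker do
  @behaviour Arcana.Chunker

  @impl true
  def chunk(text, _opts) do
    text
    |> String.split(~r/\n\s*\n/, trim: true)
    |> Enum.with_index()
    |> Enum.map(fn {paragraph, index} ->
      %{
        text: paragraph,
        chunk_index: index,       # 0-based position
        token_count: estimate_tokens(paragraph)
      }
    end)
  end

  # Rough heuristic: ~4 characters per token for English text
  defp estimate_tokens(text), do: div(String.length(text), 4)
end
```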

Default Chunker Configuration

Arcana uses the text_chunker library with smart defaults:
Defaults (from lib/arcana/chunker/default.ex:26-29):
@default_chunk_size 450       # tokens
@default_chunk_overlap 50     # tokens
@default_format :plaintext
@default_size_unit :tokens
Configure globally:
# config/config.exs
config :arcana, 
  chunker: {:default, 
    chunk_size: 512,
    chunk_overlap: 100,
    format: :markdown,
    size_unit: :tokens
  }
Or per-ingestion:
{:ok, document} = Arcana.ingest(text,
  repo: MyApp.Repo,
  chunk_size: 600,
  chunk_overlap: 75,
  format: :markdown
)

Format-Aware Chunking

The default chunker preserves document structure:
Respects headings and sections:
text = """
# Introduction
Elixir is a functional language.

## Key Features
- Concurrency via processes
- Pattern matching

## OTP Framework
Built on Erlang's OTP...
"""

{:ok, document} = Arcana.ingest(text,
  repo: MyApp.Repo,
  format: :markdown,  # Chunker preserves section boundaries
  chunk_size: 300
)

# Result chunks:
# 1. "# Introduction\nElixir is a functional language."
# 2. "## Key Features\n- Concurrency via processes\n- Pattern matching"
# 3. "## OTP Framework\nBuilt on Erlang's OTP..."
Benefits:
  • Keeps related content together
  • Preserves hierarchical structure
  • Better semantic coherence
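The section-boundary behavior above can be approximated with a small sketch that splits markdown wherever a heading line begins. This is illustrative only (the module name is hypothetical); the real text_chunker library additionally enforces chunk_size and chunk_overlap:

```elixir
defmodule MyApp.HeadingSplitter do
  # Split markdown at heading boundaries so each section
  # (heading plus its body) becomes one segment.
  def split(markdown) do
    markdown
    |> String.split(~r/\n(?=#+\s)/, trim: true)
    |> Enum.map(&String.trim/1)
  end
end
```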

Chunking Best Practices

Chunk Size Guidelines

Small chunks (roughly 250-350 tokens) are best for:
  • Precise fact retrieval
  • Question answering
  • FAQ documents
  • API references
Example:
# API documentation search
config :arcana, 
  chunker: {:default, 
    chunk_size: 300,
    chunk_overlap: 50
  }

# Each function gets its own chunk:
# Chunk 1: "Arcana.search/2 - Searches for chunks..."
# Chunk 2: "Arcana.ingest/2 - Ingests text content..."
Pros:
  • ✅ High precision
  • ✅ Fast embedding generation
  • ✅ Fits more context chunks in LLM window
Cons:
  • ❌ May split related concepts
  • ❌ More chunks to search through
  • ❌ Less context per chunk

Overlap Recommendations

# Rule of thumb: 10-15% of chunk size

# Small chunks
chunk_size: 300
chunk_overlap: 30   # 10%

# Medium chunks (default)
chunk_size: 450
chunk_overlap: 50   # 11%

# Large chunks
chunk_size: 1000
chunk_overlap: 150  # 15%

# Too little overlap
chunk_overlap: 10   # ❌ Concepts at boundaries get split

# Too much overlap
chunk_overlap: 300  # ❌ Redundant chunks, slower search
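The rule of thumb above can be captured in a small helper (hypothetical, not part of Arcana's API) that leans toward 10% overlap for smaller chunks and 15% for larger ones:

```elixir
defmodule MyApp.ChunkTuning do
  # Approximate the 10-15% rule: larger chunks get
  # proportionally more overlap to protect boundaries.
  def recommended_overlap(chunk_size) when chunk_size >= 800 do
    round(chunk_size * 0.15)
  end

  def recommended_overlap(chunk_size) do
    round(chunk_size * 0.10)
  end
end
```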

Context Window Calculation

Ensure chunks fit in LLM context:
# GPT-4o: 128K token context window
# Leave room for:
# - System prompt: ~500 tokens
# - User question: ~100 tokens
# - Response: ~1000 tokens
# Available for context: ~126,400 tokens

chunk_size = 450
max_chunks = 126_400 / chunk_size
# => ~280 chunks fit (theoretical)

# But you should use far fewer for cost/quality
typical_chunks = 5    # Simple questions
complex_chunks = 15   # Complex questions

typical_tokens = typical_chunks * chunk_size  # 2,250 tokens
complex_tokens = complex_chunks * chunk_size  # 6,750 tokens

# Both easily fit in context window

Custom Chunkers

Implement the Arcana.Chunker behaviour for custom logic:
Split by topic changes using embeddings:
defmodule MyApp.SemanticChunker do
  @behaviour Arcana.Chunker
  
  @impl true
  def chunk(text, opts) do
    # 1. Split into sentences
    sentences = String.split(text, ~r/[.!?]\s+/)
    
    # 2. Embed each sentence
    embeddings = Enum.map(sentences, &embed_sentence/1)
    
    # 3. Find topic boundaries (low similarity = new topic)
    boundaries = find_semantic_boundaries(embeddings, threshold: 0.6)
    
    # 4. Group sentences into chunks at boundaries
    boundaries
    |> group_sentences(sentences)
    |> Enum.with_index()
    |> Enum.map(fn {chunk_text, index} ->
      %{
        text: chunk_text,
        chunk_index: index,
        token_count: estimate_tokens(chunk_text)
      }
    end)
  end
  
  defp find_semantic_boundaries(embeddings, opts) do
    threshold = opts[:threshold]
    
    embeddings
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.with_index()
    |> Enum.filter(fn {[emb1, emb2], _idx} ->
      cosine_similarity(emb1, emb2) < threshold
    end)
    |> Enum.map(fn {_, idx} -> idx end)
  end
  # embed_sentence/1, group_sentences/2, cosine_similarity/2,
  # and estimate_tokens/1 are omitted for brevity
end

# config/config.exs
config :arcana, chunker: MyApp.SemanticChunker

Real-World Examples

# Phoenix documentation ingestion
defmodule MyApp.Docs.Ingest do
  def ingest_phoenix_docs do
    docs_path = "deps/phoenix/guides/"
    
    docs_path
    |> File.ls!()
    |> Enum.filter(&String.ends_with?(&1, ".md"))
    |> Enum.each(fn file ->
      path = Path.join(docs_path, file)
      
      {:ok, document} = Arcana.ingest_file(path,
        repo: MyApp.Repo,
        collection: "phoenix-docs",
        format: :markdown,     # Preserve structure
        chunk_size: 500,       # Medium chunks
        chunk_overlap: 75,     # 15% overlap
        metadata: %{
          source: "Phoenix Guides",
          file: file
        }
      )
      
      IO.puts("Ingested #{file}: #{document.chunk_count} chunks")
    end)
  end
end

# Result: ~200 files → ~3,000 chunks
# Chunk size: 500 tokens = good balance for technical docs
# Each heading section stays together

Optimization Tips

1. Start with Defaults

Use 450 tokens / 50 overlap initially:
# Good starting point for most use cases
config :arcana, 
  chunker: {:default, 
    chunk_size: 450,
    chunk_overlap: 50
  }
2. Measure Retrieval Quality

Use evaluation metrics to test different sizes:
# Test different chunk sizes
[300, 450, 600, 900]
|> Enum.each(fn size ->
  # Re-ingest with new size
  # Run test queries
  # Measure MRR, Recall, Precision
  metrics = Arcana.Evaluation.run(test_cases, 
    chunk_size: size
  )
  
  IO.inspect({size, metrics})
end)
See Evaluation Guide.
3. Adjust Per Content Type

Different content needs different chunking:
defmodule MyApp.Ingest do
  def ingest(content, type) do
    chunk_config = chunk_config_for_type(type)
    
    Arcana.ingest(content,
      repo: MyApp.Repo,
      chunker: chunk_config
    )
  end
  
  defp chunk_config_for_type(:api_docs) do
    {:default, chunk_size: 250, chunk_overlap: 25}
  end
  
  defp chunk_config_for_type(:guide) do
    {:default, chunk_size: 500, chunk_overlap: 75}
  end
  
  defp chunk_config_for_type(:paper) do
    {:default, chunk_size: 1000, chunk_overlap: 150}
  end
end
4. Monitor Chunk Statistics

Track chunk distribution:
defmodule MyApp.ChunkStats do
  import Ecto.Query
  
  def analyze(repo) do
    query = from c in Arcana.Chunk,
      select: %{
        avg_tokens: avg(c.token_count),
        min_tokens: min(c.token_count),
        max_tokens: max(c.token_count),
        total_chunks: count(c.id)
      }
    
    stats = repo.one(query)
    
    IO.inspect(stats)
    # %{
    #   avg_tokens: 412.5,
    #   min_tokens: 89,
    #   max_tokens: 501,
    #   total_chunks: 3421
    # }
  end
end
Ideal distribution: Most chunks near target size, few outliers.

Common Pitfalls

Avoid these mistakes:
  1. Chunks too small (under 150 tokens)
    • Missing context
    • Related concepts split
    • Too many chunks to search
  2. Chunks too large (>1500 tokens)
    • Low precision (too much irrelevant content)
    • Fewer chunks fit in LLM context
    • Slower embedding
  3. No overlap
    • Boundary concepts split
    • Lower retrieval quality
  4. Too much overlap (>25%)
    • Redundant chunks
    • Slower search
    • Wasted storage
  5. Ignoring format
    • Code split mid-function
    • Markdown structure lost
    • Sections fragmented
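These pitfalls can be encoded as a quick sanity check run before ingestion. The module below is a hypothetical sketch, not an Arcana API; an empty list means the configuration looks sane:

```elixir
defmodule MyApp.ChunkConfigCheck do
  # Return a list of warnings for a chunking configuration,
  # based on the common pitfalls above.
  def warnings(chunk_size, chunk_overlap) do
    []
    |> maybe_add(chunk_size < 150, "chunks under 150 tokens lose context")
    |> maybe_add(chunk_size > 1500, "chunks over 1500 tokens hurt precision")
    |> maybe_add(chunk_overlap == 0, "no overlap splits boundary concepts")
    |> maybe_add(chunk_overlap > chunk_size * 0.25, "overlap above 25% wastes storage")
  end

  defp maybe_add(warnings, true, message), do: warnings ++ [message]
  defp maybe_add(warnings, false, _message), do: warnings
end
```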

Best Practices Summary

Use Format Hints

Always specify format for structured content:
format: :markdown  # or :elixir, :python

Token-Based Sizing

Use tokens (not characters) for LLM compatibility:
size_unit: :tokens  # default

Test with Real Queries

Evaluate chunking with actual search queries:
Arcana.Evaluation.run(test_cases)

Monitor Statistics

Track chunk size distribution and adjust:
MyApp.ChunkStats.analyze(repo)

Next Steps

RAG Pipeline

See how chunking fits in the complete RAG workflow

Embeddings

Learn how chunks are converted to vector embeddings

Search Modes

Understand how chunked content is searched

Evaluation

Measure and optimize your chunking strategy
