Arcana splits documents into smaller chunks before embedding. This improves retrieval accuracy by creating focused, semantically coherent segments.

Quick Start

# config/config.exs

# Default chunker (450 tokens, 50 token overlap)
config :arcana, chunker: :default

# Custom chunk sizes
config :arcana, chunker: {:default, chunk_size: 512, chunk_overlap: 100}

# Custom chunker module
config :arcana, chunker: MyApp.SemanticChunker

Default Chunker

The default chunker uses the text_chunker library to split text with overlapping windows.

Basic Configuration

# config/config.exs

# Use defaults (450 tokens, 50 token overlap)
config :arcana, chunker: :default

# Customize chunk size and overlap
config :arcana, chunker: {
  :default,
  chunk_size: 512,
  chunk_overlap: 100
}

Options

Option           Default      Description
:chunk_size      450          Maximum tokens per chunk
:chunk_overlap   50           Overlapping tokens between chunks
:format          :plaintext   Text format (:plaintext, :markdown, :elixir)
:size_unit       :tokens      Measurement unit (:tokens, :characters)

Examples

Best for most use cases:
config :arcana, chunker: {
  :default,
  chunk_size: 450,      # ~300-400 words
  chunk_overlap: 50,    # ~30-40 words overlap
  size_unit: :tokens
}
Tokens approximate words (1 token ≈ 0.75 words for English).

Chunk Size Guidelines

Choose chunk size based on your use case:

Small Chunks (200-300)

Best for:
  • Precise answers
  • FAQ-style content
  • Short definitions
Trade-offs:
  • May lack context
  • More chunks to process

Medium Chunks (400-600)

Best for:
  • General purpose
  • Articles and blogs
  • Documentation
Trade-offs:
  • Good balance of context and precision

Large Chunks (800-1000)

Best for:
  • Rich context needed
  • Complex explanations
  • Long-form content
Trade-offs:
  • May dilute relevance
  • Slower embedding

Custom Strategy

Best for:
  • Domain-specific needs
  • Semantic boundaries
  • Multi-document synthesis
Trade-offs:
  • Requires custom implementation
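As a rough starting point, the bands above map onto configs like the following (the exact numbers are illustrative, not tuned recommendations):

```elixir
# Small chunks: precise, FAQ-style retrieval
config :arcana, chunker: {:default, chunk_size: 250, chunk_overlap: 30}

# Medium chunks: general-purpose default
config :arcana, chunker: {:default, chunk_size: 450, chunk_overlap: 50}

# Large chunks: long-form content with rich context
config :arcana, chunker: {:default, chunk_size: 900, chunk_overlap: 120}
```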

Chunk Overlap

Overlap ensures important information at chunk boundaries isn’t lost:
# Example with overlap
chunk_size: 100, chunk_overlap: 20

# Chunk 1: [word1...word100]
# Chunk 2:        [word81...word180]  # Last 20 words of Chunk 1 repeated
# Chunk 3:                  [word161...word260]
Recommended overlap: 10-15% of chunk size
Too much overlap wastes storage and embedding compute. Too little risks losing context at boundaries.
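Applying the 10-15% rule: with a chunk size of 600, overlap should land between 60 and 90 tokens, for example:

```elixir
# 600 * 0.10 = 60, 600 * 0.15 = 90 → pick a value in between
config :arcana, chunker: {:default, chunk_size: 600, chunk_overlap: 75}
```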

Custom Chunker

Implement the Arcana.Chunker behaviour for custom splitting logic.

Semantic Chunker Example

defmodule MyApp.SemanticChunker do
  @behaviour Arcana.Chunker

  @impl true
  def chunk(text, opts) do
    # Split on double newlines (paragraph boundaries)
    text
    |> String.split(~r/\n\n+/)
    |> Enum.reject(&blank?/1)
    |> combine_small_paragraphs(opts)
    |> Enum.with_index()
    |> Enum.map(fn {text, index} ->
      %{
        text: text,
        chunk_index: index,
        token_count: estimate_tokens(text)
      }
    end)
  end

  defp combine_small_paragraphs(paragraphs, opts) do
    min_size = Keyword.get(opts, :min_chunk_size, 100)
    
    paragraphs
    |> Enum.chunk_while(
      "",
      fn para, acc ->
        combined = if acc == "", do: para, else: acc <> "\n\n" <> para
        
        if estimate_tokens(combined) >= min_size do
          {:cont, combined, ""}
        else
          {:cont, combined}
        end
      end,
      fn acc -> {:cont, acc, ""} end
    )
    |> Enum.reject(&blank?/1)
  end

  defp estimate_tokens(text) do
    # Rough estimate: ~4 chars per token
    max(1, div(String.length(text), 4))
  end

  defp blank?(str), do: String.trim(str) == ""
end
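Assuming the module above is compiled, calling it directly shows the chunk maps it produces (`min_chunk_size: 1` keeps each paragraph as its own chunk):

```elixir
MyApp.SemanticChunker.chunk("First para.\n\nSecond para.", min_chunk_size: 1)
# → [
#   %{text: "First para.", chunk_index: 0, token_count: 2},
#   %{text: "Second para.", chunk_index: 1, token_count: 3}
# ]
```

The token counts come from the module's rough 4-characters-per-token estimate.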

Configuration

# config/config.exs
config :arcana, chunker: MyApp.SemanticChunker

# With options
config :arcana, chunker: {MyApp.SemanticChunker, min_chunk_size: 150}

Chunk Format

Your chunker must return a list of maps with these required keys:
[
  %{
    text: "chunk content",      # Required: the chunk text
    chunk_index: 0,              # Required: zero-based index
    token_count: 120             # Required: estimated tokens
  },
  # ... more chunks
]
You can include additional keys in the chunk map. They’ll be stored in the chunk’s metadata.
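For example, a chunker could attach a (hypothetical) :heading key alongside the required ones; anything beyond the three required keys is stored as chunk metadata:

```elixir
%{
  text: "Deploying with releases requires...",
  chunk_index: 3,                  # required
  token_count: 95,                 # required
  heading: "Deployment"            # extra key, stored in the chunk's metadata
}
```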

Function-Based Chunker

For simple cases, provide a function directly:
config :arcana, chunker: fn text, _opts ->
  # Split on sentence boundaries
  text
  |> String.split(~r/[.!?]+\s+/)
  |> Enum.reject(&(String.trim(&1) == ""))
  |> Enum.with_index()
  |> Enum.map(fn {sentence, idx} ->
    %{
      text: sentence,
      chunk_index: idx,
      token_count: max(1, div(String.length(sentence), 4))
    }
  end)
end

Per-Call Override

Override the global chunker for specific ingestions:
# Use custom chunker for this document
Arcana.ingest(text,
  repo: MyApp.Repo,
  chunker: MyApp.SemanticChunker
)

# Override with custom options
Arcana.ingest(text,
  repo: MyApp.Repo,
  chunker: {:default, chunk_size: 1000, chunk_overlap: 200}
)

# Use a function
Arcana.ingest(text,
  repo: MyApp.Repo,
  chunker: fn text, _opts ->
    # Custom chunking logic
    [...]
  end
)

Advanced Chunker Patterns

Markdown-Aware Chunker

Split on headings while respecting chunk size:
defmodule MyApp.MarkdownChunker do
  @behaviour Arcana.Chunker

  @impl true
  def chunk(text, opts) do
    max_size = Keyword.get(opts, :chunk_size, 450)
    
    # Split on markdown headings
    text
    |> String.split(~r/\n(?=#+\s)/)
    |> Enum.flat_map(&maybe_split_large_section(&1, max_size))
    |> Enum.with_index()
    |> Enum.map(fn {text, idx} ->
      %{
        text: String.trim(text),
        chunk_index: idx,
        token_count: estimate_tokens(text)
      }
    end)
  end

  defp maybe_split_large_section(section, max_size) do
    tokens = estimate_tokens(section)
    
    if tokens <= max_size do
      [section]
    else
      # Section too large, split by paragraphs
      section
      |> String.split(~r/\n\n+/)
      |> chunk_paragraphs(max_size)
    end
  end

  defp chunk_paragraphs(paragraphs, max_size) do
    # Combine consecutive paragraphs until adding one more would exceed max_size
    Enum.chunk_while(
      paragraphs,
      "",
      fn para, acc ->
        combined = if acc == "", do: para, else: acc <> "\n\n" <> para

        if acc != "" and estimate_tokens(combined) > max_size do
          # Emit the accumulated text; start a new chunk with this paragraph
          {:cont, acc, para}
        else
          # A single oversized paragraph is kept whole rather than split mid-text
          {:cont, combined}
        end
      end,
      fn
        "" -> {:cont, ""}
        acc -> {:cont, acc, ""}
      end
    )
  end

  defp estimate_tokens(text), do: max(1, div(String.length(text), 4))
end

Sliding Window Chunker

Create overlapping chunks with precise control:
defmodule MyApp.SlidingWindowChunker do
  @behaviour Arcana.Chunker

  @impl true
  def chunk(text, opts) do
    window_size = Keyword.get(opts, :chunk_size, 450)
    stride = Keyword.get(opts, :stride, 400)
    
    tokens = tokenize(text)
    
    tokens
    |> Enum.chunk_every(window_size, stride, [])  # [] keeps the trailing partial chunk; :discard would drop the end of the document
    |> Enum.with_index()
    |> Enum.map(fn {chunk_tokens, idx} ->
      %{
        text: Enum.join(chunk_tokens, " "),
        chunk_index: idx,
        token_count: length(chunk_tokens)
      }
    end)
  end

  defp tokenize(text) do
    # Simple whitespace tokenization
    String.split(text, ~r/\s+/)
  end
end
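The window/stride mechanics are just Enum.chunk_every/4: with a toy token list, a window of 4 and a stride of 3 overlap by one element per step, and the [] leftover keeps the trailing partial chunk:

```elixir
tokens = Enum.to_list(1..10)

Enum.chunk_every(tokens, 4, 3, [])
# → [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10], [10]]

# With :discard as the leftover, the trailing [10] would be silently dropped.
```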

Testing Chunkers

Test your chunker implementation:
defmodule MyApp.SemanticChunkerTest do
  use ExUnit.Case

  test "chunks text into semantic segments" do
    text = """
    First paragraph with some content.

    Second paragraph with more content.

    Third paragraph.
    """

    # min_chunk_size: 1 keeps each paragraph separate; the default (100)
    # would combine these short paragraphs into a single chunk
    chunks = MyApp.SemanticChunker.chunk(text, min_chunk_size: 1)

    assert length(chunks) == 3
    assert Enum.at(chunks, 0).chunk_index == 0
    assert Enum.at(chunks, 0).text =~ "First paragraph"
  end

  test "respects minimum chunk size" do
    text = "Short.\n\nAlso short.\n\nCombined they're long enough."
    
    chunks = MyApp.SemanticChunker.chunk(text, min_chunk_size: 50)

    # Should combine short paragraphs
    assert length(chunks) < 3
  end
end

Best Practices

  1. Match content type - Use markdown format for docs, plaintext for articles
  2. Test with real data - Chunk sizes that work in theory may not work in practice
  3. Preserve context - Use 10-15% overlap to avoid losing information
  4. Consider embedding model limits - Most models max out at 512 tokens
  5. Monitor chunk distribution - Aim for consistent chunk sizes
  6. Use semantic boundaries - Split at paragraphs/sections when possible

Troubleshooting

Chunks are too small or fragmented
Increase chunk_size or implement logic to combine small chunks:
config :arcana, chunker: {:default, chunk_size: 600}

Context is lost at chunk boundaries
Increase chunk_overlap:
config :arcana, chunker: {:default, chunk_overlap: 100}

Chunks cut off mid-topic
Use larger chunks or implement a custom chunker that respects semantic boundaries:
config :arcana, chunker: {:default, chunk_size: 800}

Too many chunks, high storage or embedding cost
Increase chunk size or reduce overlap:
config :arcana, chunker: {
  :default,
  chunk_size: 600,
  chunk_overlap: 30
}

Next Steps

Embeddings

Configure embedding providers

Vector Stores

Choose a storage backend
