Chunking
Chunking is the process of splitting documents into smaller, semantically meaningful pieces. Proper chunking is critical for RAG performance: chunks that are too large lose precision, while chunks that are too small lose context.

Chunking Strategies

Mastra provides multiple strategies optimized for different content types:

Recursive

Hierarchically splits text using multiple separators. Best for general content.

Markdown

Preserves markdown structure and headers. Ideal for documentation.

HTML

Respects HTML structure and sections. Use for web content.

Semantic

Groups semantically related content. Best for narrative text.

Code

Language-aware splitting. Preserves code structure.

JSON

Recursive JSON splitting. Handles nested structures.

Recursive Chunking (Default)

Recursive chunking splits text using a hierarchy of separators:
import { MDocument } from '@mastra/rag';

const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 1000,       // Maximum characters per chunk
  overlap: 100,        // Characters of overlap between chunks
  separatorPosition: 'end' // Where to place separator ('start' or 'end')
});

Options:
  • maxSize (number, default 1000): Maximum chunk size in characters
  • overlap (number, default 200): Number of characters to overlap between chunks
  • separators (string[], default ['\n\n', '\n', ' ', '']): Custom separator hierarchy
  • separatorPosition ('start' | 'end', default 'end'): Where to place the separator in chunks
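Conceptually, the recursive strategy tries the coarsest separator first and re-splits any piece that still exceeds maxSize with the next separator in the hierarchy. A minimal sketch of the idea (illustrative only, not Mastra's implementation, which also merges small pieces and applies overlap):

```typescript
// Illustrative sketch of recursive splitting: try separators coarse-to-fine,
// re-splitting any piece that still exceeds maxSize. Not Mastra's source code.
function recursiveSplit(text: string, separators: string[], maxSize: number): string[] {
  if (text.length <= maxSize || separators.length === 0) return [text];
  const [sep, ...rest] = separators;
  const parts = sep === '' ? text.split('') : text.split(sep);
  return parts.flatMap((p) => (p.length > maxSize ? recursiveSplit(p, rest, maxSize) : [p]));
}

// 'aa bb' exceeds maxSize 4, so it is re-split on the next separator (' ')
recursiveSplit('aa bb\n\ncc', ['\n\n', ' ', ''], 4); // → ['aa', 'bb', 'cc']
```

This is why the order of the separators array matters: paragraph breaks are tried before line breaks, which are tried before spaces.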

Language-Aware Chunking

For code files, specify the language for syntax-aware chunking:
const doc = MDocument.fromText(typescriptCode);

const chunks = await doc.chunk({
  strategy: 'recursive',
  language: 'typescript',
  maxSize: 1500,
  overlap: 150
});
Supported languages include 'typescript' (alias 'js') and many others; see the Language enum for the full list.

Markdown Chunking

Preserve markdown structure and hierarchy:
const doc = MDocument.fromMarkdown(content);

const chunks = await doc.chunk({
  strategy: 'markdown',
  maxSize: 1000,
  overlap: 100,
  headers: [
    ['#', 'h1'],
    ['##', 'h2'],
    ['###', 'h3']
  ],
  returnEachLine: false,
  stripHeaders: false
});

Options:
  • headers ([string, string][]): Header patterns to preserve, as [markdown_prefix, header_name] pairs
  • returnEachLine (boolean, default false): Return each line as a separate chunk
  • stripHeaders (boolean, default false): Remove header markers from chunk text
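Under the hood, header-aware splitting walks the document line by line, starting a new chunk at each configured header and carrying the header hierarchy into metadata. A simplified sketch of the idea (not Mastra's implementation):

```typescript
type MdChunk = { text: string; metadata: Record<string, string> };

// Simplified sketch of header-aware markdown splitting (not Mastra's source).
// Starts a new chunk at each configured header and records the header
// hierarchy in chunk metadata.
function splitByHeaders(md: string, headers: [string, string][]): MdChunk[] {
  const chunks: MdChunk[] = [];
  let meta: Record<string, string> = {};
  let lines: string[] = [];
  const flush = () => {
    if (lines.length) chunks.push({ text: lines.join('\n').trim(), metadata: { ...meta } });
    lines = [];
  };
  for (const line of md.split('\n')) {
    const match = headers.find(([prefix]) => line.startsWith(prefix + ' '));
    if (match) {
      flush();
      // A new header resets any deeper header levels below it
      for (const [, name] of headers.slice(headers.indexOf(match) + 1)) delete meta[name];
      meta[match[1]] = line.slice(match[0].length + 1).trim();
    }
    lines.push(line);
  }
  flush();
  return chunks;
}
```

Each chunk ends up tagged with the headers it falls under, which is what makes header metadata useful as a retrieval filter later.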

Header Metadata

Markdown chunking adds header hierarchy to metadata:
const chunks = await doc.chunk({
  strategy: 'markdown',
  headers: [['#', 'h1'], ['##', 'h2']]
});

// Chunk metadata includes headers
console.log(chunks[0].metadata);
/*
{
  h1: "Getting Started",
  h2: "Installation",
  startIndex: 0
}
*/

HTML Chunking

Split HTML by semantic sections:
const doc = MDocument.fromHTML(htmlContent);

// By headers
const chunks = await doc.chunk({
  strategy: 'html',
  headers: [
    ['h1', 'Header 1'],
    ['h2', 'Header 2']
  ],
  maxSize: 1000
});

// By sections
const sectionChunks = await doc.chunk({
  strategy: 'html',
  sections: [
    ['article', 'Article'],
    ['section', 'Section'],
    ['div', 'Division']
  ],
  maxSize: 1000
});

Semantic Markdown Chunking

Group semantically related content:
const doc = MDocument.fromMarkdown(content);

const chunks = await doc.chunk({
  strategy: 'semantic-markdown',
  maxSize: 800,
  overlap: 80,
  joinThreshold: 0.5, // Semantic similarity threshold
  modelName: 'gpt-4', // For token counting
  encodingName: 'cl100k_base'
});

Options:
  • joinThreshold (number, default 0.5): Semantic similarity threshold (0-1) for joining chunks
  • modelName (TiktokenModel): Model for token counting
  • encodingName (TiktokenEncoding): Encoding for token counting

JSON Chunking

Handle nested JSON structures:
const doc = MDocument.fromJSON(jsonString);

const chunks = await doc.chunk({
  strategy: 'json',
  maxSize: 2000,
  minSize: 500,
  ensureAscii: false,
  convertLists: true
});

Options:
  • maxSize (number, required): Maximum chunk size (required for JSON)
  • minSize (number): Minimum chunk size before splitting
  • ensureAscii (boolean, default false): Escape non-ASCII characters
  • convertLists (boolean, default false): Convert lists to separate chunks
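The idea behind recursive JSON splitting: if a serialized value exceeds maxSize, descend into its keys and split each sub-value instead, keeping the key path for context. An illustrative sketch (not Mastra's implementation):

```typescript
// Illustrative sketch of recursive JSON splitting (not Mastra's source):
// values that serialize under maxSize become one chunk; larger objects are
// split key-by-key, recording the key path for context.
function splitJson(value: unknown, maxSize: number, path = ''): Array<{ path: string; text: string }> {
  const text = JSON.stringify(value);
  if (text.length <= maxSize || typeof value !== 'object' || value === null) {
    return [{ path, text }];
  }
  return Object.entries(value as Record<string, unknown>).flatMap(([key, v]) =>
    splitJson(v, maxSize, path ? `${path}.${key}` : key),
  );
}
```

Descending by key rather than by character keeps every chunk as valid JSON, so nested structure is never cut mid-object.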

Token-Based Chunking

Chunk by token count instead of characters:
const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'token',
  maxSize: 512,  // Max tokens
  overlap: 50,   // Token overlap
  modelName: 'gpt-4',
  encodingName: 'cl100k_base'
});
Token-based chunking is useful when you need precise token counts for embedding models with token limits.

Sentence Chunking

Split by sentences while respecting size limits:
const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'sentence',
  maxSize: 1000,
  minSize: 200,
  targetSize: 500,
  sentenceEnders: ['.', '!', '?'],
  fallbackToWords: true,
  fallbackToCharacters: true
});

Options:
  • maxSize (number, required): Maximum chunk size
  • minSize (number): Minimum chunk size
  • targetSize (number): Target chunk size to aim for
  • sentenceEnders (string[]): Characters that end sentences
  • fallbackToWords (boolean): Fall back to word splitting if sentences are too long
  • fallbackToCharacters (boolean): Fall back to character splitting if words are too long
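Conceptually, sentence chunking splits the text on sentence enders and then greedily packs whole sentences into chunks up to maxSize. A rough sketch of the idea (illustrative only, without the minSize/targetSize and fallback handling):

```typescript
// Illustrative sketch of sentence chunking (not Mastra's source):
// split on sentence enders, then greedily pack sentences up to maxSize.
function packSentences(text: string, maxSize: number): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = '';
  for (const s of sentences) {
    if (current && (current + s).length > maxSize) {
      chunks.push(current.trim());
      current = s;
    } else {
      current += s;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

packSentences('One. Two. Three.', 10); // → ['One. Two.', 'Three.']
```

Because boundaries always fall between sentences, no chunk ends mid-sentence, which is the strategy's main advantage over plain character splitting.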

Character Chunking

Simple splitting by separator:
const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'character',
  maxSize: 1000,
  separator: '\n\n',
  isSeparatorRegex: false
});

Chunk Overlap

Overlap maintains context between chunks:
const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 1000,
  overlap: 150 // 15% overlap
});

// Chunks have overlapping content
// Chunk 1: characters 0-1000
// Chunk 2: characters 850-1850 (150 char overlap)
// Chunk 3: characters 1700-2700 (150 char overlap)
Recommended overlap: 10-20% of chunk size. For example, with 1000 char chunks, use 100-200 overlap.
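The boundary arithmetic above can be checked directly: each chunk starts maxSize - overlap characters after the previous one. A small helper for visualizing this (illustrative, not part of Mastra's API):

```typescript
// Compute [start, end) character boundaries for fixed-size chunks with
// overlap. Illustrates the arithmetic only; real chunkers also respect
// separators and sentence boundaries.
function chunkBoundaries(totalLength: number, maxSize: number, overlap: number): Array<[number, number]> {
  if (overlap >= maxSize) throw new Error('overlap must be smaller than maxSize');
  const step = maxSize - overlap; // each chunk starts this far after the last
  const bounds: Array<[number, number]> = [];
  for (let start = 0; ; start += step) {
    bounds.push([start, Math.min(start + maxSize, totalLength)]);
    if (start + maxSize >= totalLength) break;
  }
  return bounds;
}

chunkBoundaries(2700, 1000, 150); // → [[0, 1000], [850, 1850], [1700, 2700]]
```

Note the guard: overlap must stay below maxSize, or the window would never advance.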

Metadata Extraction

Extract metadata during chunking:
const doc = MDocument.fromMarkdown(content);

const chunks = await doc.chunk({
  strategy: 'markdown',
  maxSize: 1000,
  extract: {
    title: true,
    summary: {
      model: 'openai/gpt-4o-mini',
      maxTokens: 100
    },
    keywords: {
      model: 'openai/gpt-4o-mini',
      maxKeywords: 5
    },
    questions: {
      model: 'openai/gpt-4o-mini',
      maxQuestions: 3
    }
  }
});

// Chunks have enriched metadata
console.log(chunks[0].metadata.title);
console.log(chunks[0].metadata.summary);
console.log(chunks[0].metadata.keywords);
console.log(chunks[0].metadata.questions);
Metadata extraction runs after chunking, enriching each chunk with AI-generated context.

Custom Length Functions

Use custom length calculations:
import { countTokens } from './tokenizer'; // your own token-counting helper

const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 500,
  lengthFunction: (text) => countTokens(text)
});
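In the example above, ./tokenizer stands in for whatever counting logic you prefer; any (text: string) => number works as a lengthFunction. A self-contained fallback is a simple word count, which is only a rough proxy for tokens, so size your maxSize accordingly:

```typescript
// Rough proxy for token counting: whitespace-separated words.
// For accurate counts, use a tokenizer matched to your embedding model.
const wordCount = (text: string): number =>
  text.trim().split(/\s+/).filter(Boolean).length;

wordCount('hello world foo'); // → 3
```

With this as the lengthFunction, maxSize: 500 means roughly 500 words per chunk rather than 500 characters.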

Complete Chunking Pipeline

Put it all together:
import { MDocument } from '@mastra/rag';
import { readFile } from 'fs/promises';

// 1. Load document
const content = await readFile('docs/guide.md', 'utf-8');
const doc = MDocument.fromMarkdown(content, {
  source: 'docs/guide.md',
  category: 'documentation',
  version: '1.0'
});

// 2. Chunk with metadata extraction
const chunks = await doc.chunk({
  strategy: 'markdown',
  maxSize: 1000,
  overlap: 100,
  headers: [
    ['#', 'h1'],
    ['##', 'h2'],
    ['###', 'h3']
  ],
  extract: {
    title: true,
    summary: { model: 'openai/gpt-4o-mini' },
    keywords: { maxKeywords: 5 }
  }
});

console.log(`Created ${chunks.length} chunks`);

// 3. Access chunk data
chunks.forEach((chunk, i) => {
  console.log(`\nChunk ${i + 1}:`);
  console.log(`Title: ${chunk.metadata.title}`);
  console.log(`Summary: ${chunk.metadata.summary}`);
  console.log(`Keywords: ${chunk.metadata.keywords?.join(', ')}`);
  console.log(`Text length: ${chunk.text.length} chars`);
});

Choosing Chunk Size

Optimal chunk size depends on your use case.

Smaller chunks
Best for:
  • Precise information retrieval
  • Question answering
  • Fact extraction
Pros:
  • High precision
  • Lower token usage
Cons:
  • May lose context
  • More chunks to manage

Larger chunks invert this trade-off: they preserve more context per chunk but retrieve less precisely and consume more tokens.

Best Practices

Match Content Type

Use markdown chunking for docs, code chunking for code, semantic for narratives.

Test Different Sizes

Experiment with 500, 1000, and 1500 character chunks to find optimal size.

Use 10-20% Overlap

Overlap maintains context across chunk boundaries without excessive duplication.

Extract Metadata

Add titles, summaries, and keywords to improve retrieval accuracy.

Troubleshooting

Chunks too small or fragmented:
  • Increase maxSize
  • Reduce number of separators
  • Use character or token chunking for uniform sizes

Chunks too large:
  • Decrease maxSize
  • Add more separators to hierarchy
  • Use sentence chunking with maxSize limit

Lost context between chunks:
  • Increase overlap (try 20% of chunk size)
  • Use larger chunks
  • Extract summaries for each chunk

Poor retrieval quality:
  • Experiment with different chunk sizes
  • Add metadata extraction
  • Try semantic or markdown chunking
  • Verify separator hierarchy matches content structure

Next Steps

Ingestion

Learn about document loading and preprocessing

Retrieval

Implement semantic search and reranking

RAG Overview

Return to RAG overview
