Chunking
Chunking is the process of splitting documents into smaller, semantically meaningful pieces. Proper chunking is critical for RAG performance: chunks that are too large lose precision, while chunks that are too small lose context.

Chunking Strategies

Mastra provides multiple strategies optimized for different content types:

Recursive

Hierarchically splits text using multiple separators. Best for general content.

Markdown

Preserves markdown structure and headers. Ideal for documentation.

HTML

Respects HTML structure and sections. Use for web content.

Semantic

Groups semantically related content. Best for narrative text.

Code

Language-aware splitting. Preserves code structure.

JSON

Recursive JSON splitting. Handles nested structures.

Recursive Chunking (Default)

Recursive chunking splits text using a hierarchy of separators:
import { MDocument } from '@mastra/rag';

const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 1000,       // Maximum characters per chunk
  overlap: 100,        // Characters of overlap between chunks
  separatorPosition: 'end' // Where to place separator ('start' or 'end')
});

Options:
  • maxSize (number, default 1000): Maximum chunk size in characters
  • overlap (number, default 200): Number of characters to overlap between chunks
  • separators (string[], default ['\n\n', '\n', ' ', '']): Custom separator hierarchy
  • separatorPosition ('start' | 'end', default 'end'): Where to place the separator in chunks
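Conceptually, the recursive strategy tries the coarsest separator first and re-splits any piece that still exceeds maxSize with the next separator in the hierarchy. A minimal sketch of the idea (illustrative only, not Mastra's implementation, which also merges small pieces and applies overlap):

```typescript
// Illustrative sketch of recursive splitting: try separators coarse-to-fine,
// re-splitting any piece that still exceeds maxSize. Not Mastra's source code.
function recursiveSplit(text: string, separators: string[], maxSize: number): string[] {
  if (text.length <= maxSize || separators.length === 0) return [text];
  const [sep, ...rest] = separators;
  const parts = sep === '' ? text.split('') : text.split(sep);
  return parts.flatMap((p) => (p.length > maxSize ? recursiveSplit(p, rest, maxSize) : [p]));
}

// 'aa bb' exceeds maxSize 4, so it is re-split on the next separator (' ')
recursiveSplit('aa bb\n\ncc', ['\n\n', ' ', ''], 4); // → ['aa', 'bb', 'cc']
```

This is why the order of the separators array matters: paragraph breaks are tried before line breaks, which are tried before spaces.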

Language-Aware Chunking

For code files, specify the language for syntax-aware chunking:
const doc = MDocument.fromText(typescriptCode);

const chunks = await doc.chunk({
  strategy: 'recursive',
  language: 'typescript',
  maxSize: 1500,
  overlap: 150
});
Supported languages include 'typescript' (alias 'js') and many others; see the Language enum for the full list.

Markdown Chunking

Preserve markdown structure and hierarchy:
const doc = MDocument.fromMarkdown(content);

const chunks = await doc.chunk({
  strategy: 'markdown',
  maxSize: 1000,
  overlap: 100,
  headers: [
    ['#', 'h1'],
    ['##', 'h2'],
    ['###', 'h3']
  ],
  returnEachLine: false,
  stripHeaders: false
});

Options:
  • headers ([string, string][]): Header patterns to preserve, as [markdown_prefix, header_name] pairs
  • returnEachLine (boolean, default false): Return each line as a separate chunk
  • stripHeaders (boolean, default false): Remove header markers from chunk text
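Under the hood, header-aware splitting walks the document line by line, starting a new chunk at each configured header and carrying the header hierarchy into metadata. A simplified sketch of the idea (not Mastra's implementation):

```typescript
type MdChunk = { text: string; metadata: Record<string, string> };

// Simplified sketch of header-aware markdown splitting (not Mastra's source).
// Starts a new chunk at each configured header and records the header
// hierarchy in chunk metadata.
function splitByHeaders(md: string, headers: [string, string][]): MdChunk[] {
  const chunks: MdChunk[] = [];
  let meta: Record<string, string> = {};
  let lines: string[] = [];
  const flush = () => {
    if (lines.length) chunks.push({ text: lines.join('\n').trim(), metadata: { ...meta } });
    lines = [];
  };
  for (const line of md.split('\n')) {
    const match = headers.find(([prefix]) => line.startsWith(prefix + ' '));
    if (match) {
      flush();
      // A new header resets any deeper header levels below it
      for (const [, name] of headers.slice(headers.indexOf(match) + 1)) delete meta[name];
      meta[match[1]] = line.slice(match[0].length + 1).trim();
    }
    lines.push(line);
  }
  flush();
  return chunks;
}
```

Each chunk ends up tagged with the headers it falls under, which is what makes header metadata useful as a retrieval filter later.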

Header Metadata

Markdown chunking adds header hierarchy to metadata:
const chunks = await doc.chunk({
  strategy: 'markdown',
  headers: [['#', 'h1'], ['##', 'h2']]
});

// Chunk metadata includes headers
console.log(chunks[0].metadata);
/*
{
  h1: "Getting Started",
  h2: "Installation",
  startIndex: 0
}
*/

HTML Chunking

Split HTML by semantic sections:
const doc = MDocument.fromHTML(htmlContent);

// By headers
const chunks = await doc.chunk({
  strategy: 'html',
  headers: [
    ['h1', 'Header 1'],
    ['h2', 'Header 2']
  ],
  maxSize: 1000
});

// By sections
const sectionChunks = await doc.chunk({
  strategy: 'html',
  sections: [
    ['article', 'Article'],
    ['section', 'Section'],
    ['div', 'Division']
  ],
  maxSize: 1000
});

Semantic Markdown Chunking

Group semantically related content:
const doc = MDocument.fromMarkdown(content);

const chunks = await doc.chunk({
  strategy: 'semantic-markdown',
  maxSize: 800,
  overlap: 80,
  joinThreshold: 0.5, // Semantic similarity threshold
  modelName: 'gpt-4', // For token counting
  encodingName: 'cl100k_base'
});

Options:
  • joinThreshold (number, default 0.5): Semantic similarity threshold (0-1) for joining chunks
  • modelName (TiktokenModel): Model for token counting
  • encodingName (TiktokenEncoding): Encoding for token counting

JSON Chunking

Handle nested JSON structures:
const doc = MDocument.fromJSON(jsonString);

const chunks = await doc.chunk({
  strategy: 'json',
  maxSize: 2000,
  minSize: 500,
  ensureAscii: false,
  convertLists: true
});

Options:
  • maxSize (number, required): Maximum chunk size (required for JSON)
  • minSize (number): Minimum chunk size before splitting
  • ensureAscii (boolean, default false): Escape non-ASCII characters
  • convertLists (boolean, default false): Convert lists to separate chunks
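The idea behind recursive JSON splitting: if a serialized value exceeds maxSize, descend into its keys and split each sub-value instead, keeping the key path for context. An illustrative sketch (not Mastra's implementation):

```typescript
// Illustrative sketch of recursive JSON splitting (not Mastra's source):
// values that serialize under maxSize become one chunk; larger objects are
// split key-by-key, recording the key path for context.
function splitJson(value: unknown, maxSize: number, path = ''): Array<{ path: string; text: string }> {
  const text = JSON.stringify(value);
  if (text.length <= maxSize || typeof value !== 'object' || value === null) {
    return [{ path, text }];
  }
  return Object.entries(value as Record<string, unknown>).flatMap(([key, v]) =>
    splitJson(v, maxSize, path ? `${path}.${key}` : key),
  );
}
```

Descending by key rather than by character keeps every chunk as valid JSON, so nested structure is never cut mid-object.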

Token-Based Chunking

Chunk by token count instead of characters:
const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'token',
  maxSize: 512,  // Max tokens
  overlap: 50,   // Token overlap
  modelName: 'gpt-4',
  encodingName: 'cl100k_base'
});
Token-based chunking is useful when you need precise token counts for embedding models with token limits.

Sentence Chunking

Split by sentences while respecting size limits:
const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'sentence',
  maxSize: 1000,
  minSize: 200,
  targetSize: 500,
  sentenceEnders: ['.', '!', '?'],
  fallbackToWords: true,
  fallbackToCharacters: true
});

Options:
  • maxSize (number, required): Maximum chunk size
  • minSize (number): Minimum chunk size
  • targetSize (number): Target chunk size to aim for
  • sentenceEnders (string[]): Characters that end sentences
  • fallbackToWords (boolean): Fall back to word splitting if sentences are too long
  • fallbackToCharacters (boolean): Fall back to character splitting if words are too long
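Conceptually, sentence chunking splits the text on sentence enders and then greedily packs whole sentences into chunks up to maxSize. A rough sketch of the idea (illustrative only, without the minSize/targetSize and fallback handling):

```typescript
// Illustrative sketch of sentence chunking (not Mastra's source):
// split on sentence enders, then greedily pack sentences up to maxSize.
function packSentences(text: string, maxSize: number): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = '';
  for (const s of sentences) {
    if (current && (current + s).length > maxSize) {
      chunks.push(current.trim());
      current = s;
    } else {
      current += s;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

packSentences('One. Two. Three.', 10); // → ['One. Two.', 'Three.']
```

Because boundaries always fall between sentences, no chunk ends mid-sentence, which is the strategy's main advantage over plain character splitting.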

Character Chunking

Simple splitting by separator:
const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'character',
  maxSize: 1000,
  separator: '\n\n',
  isSeparatorRegex: false
});

Chunk Overlap

Overlap maintains context between chunks:
const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 1000,
  overlap: 150 // 15% overlap
});

// Chunks have overlapping content
// Chunk 1: characters 0-1000
// Chunk 2: characters 850-1850 (150 char overlap)
// Chunk 3: characters 1700-2700 (150 char overlap)
Recommended overlap: 10-20% of chunk size. For example, with 1000 char chunks, use 100-200 overlap.
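The boundary arithmetic above can be checked directly: each chunk starts maxSize - overlap characters after the previous one. A small helper for visualizing this (illustrative, not part of Mastra's API):

```typescript
// Compute [start, end) character boundaries for fixed-size chunks with
// overlap. Illustrates the arithmetic only; real chunkers also respect
// separators and sentence boundaries.
function chunkBoundaries(totalLength: number, maxSize: number, overlap: number): Array<[number, number]> {
  if (overlap >= maxSize) throw new Error('overlap must be smaller than maxSize');
  const step = maxSize - overlap; // each chunk starts this far after the last
  const bounds: Array<[number, number]> = [];
  for (let start = 0; ; start += step) {
    bounds.push([start, Math.min(start + maxSize, totalLength)]);
    if (start + maxSize >= totalLength) break;
  }
  return bounds;
}

chunkBoundaries(2700, 1000, 150); // → [[0, 1000], [850, 1850], [1700, 2700]]
```

Note the guard: overlap must stay below maxSize, or the window would never advance.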

Metadata Extraction

Extract metadata during chunking:
const doc = MDocument.fromMarkdown(content);

const chunks = await doc.chunk({
  strategy: 'markdown',
  maxSize: 1000,
  extract: {
    title: true,
    summary: {
      model: 'openai/gpt-4o-mini',
      maxTokens: 100
    },
    keywords: {
      model: 'openai/gpt-4o-mini',
      maxKeywords: 5
    },
    questions: {
      model: 'openai/gpt-4o-mini',
      maxQuestions: 3
    }
  }
});

// Chunks have enriched metadata
console.log(chunks[0].metadata.title);
console.log(chunks[0].metadata.summary);
console.log(chunks[0].metadata.keywords);
console.log(chunks[0].metadata.questions);
Metadata extraction runs after chunking, enriching each chunk with AI-generated context.

Custom Length Functions

Use custom length calculations:
import { countTokens } from './tokenizer'; // your own token-counting helper

const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 500,
  lengthFunction: (text) => countTokens(text)
});
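In the example above, ./tokenizer stands in for whatever counting logic you prefer; any (text: string) => number works as a lengthFunction. A self-contained fallback is a simple word count, which is only a rough proxy for tokens, so size your maxSize accordingly:

```typescript
// Rough proxy for token counting: whitespace-separated words.
// For accurate counts, use a tokenizer matched to your embedding model.
const wordCount = (text: string): number =>
  text.trim().split(/\s+/).filter(Boolean).length;

wordCount('hello world foo'); // → 3
```

With this as the lengthFunction, maxSize: 500 means roughly 500 words per chunk rather than 500 characters.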

Complete Chunking Pipeline

Put it all together:
import { MDocument } from '@mastra/rag';
import { readFile } from 'fs/promises';

// 1. Load document
const content = await readFile('docs/guide.md', 'utf-8');
const doc = MDocument.fromMarkdown(content, {
  source: 'docs/guide.md',
  category: 'documentation',
  version: '1.0'
});

// 2. Chunk with metadata extraction
const chunks = await doc.chunk({
  strategy: 'markdown',
  maxSize: 1000,
  overlap: 100,
  headers: [
    ['#', 'h1'],
    ['##', 'h2'],
    ['###', 'h3']
  ],
  extract: {
    title: true,
    summary: { model: 'openai/gpt-4o-mini' },
    keywords: { maxKeywords: 5 }
  }
});

console.log(`Created ${chunks.length} chunks`);

// 3. Access chunk data
chunks.forEach((chunk, i) => {
  console.log(`\nChunk ${i + 1}:`);
  console.log(`Title: ${chunk.metadata.title}`);
  console.log(`Summary: ${chunk.metadata.summary}`);
  console.log(`Keywords: ${chunk.metadata.keywords?.join(', ')}`);
  console.log(`Text length: ${chunk.text.length} chars`);
});

Choosing Chunk Size

Optimal chunk size depends on your use case.

Smaller chunks
Best for:
  • Precise information retrieval
  • Question answering
  • Fact extraction
Pros:
  • High precision
  • Lower token usage
Cons:
  • May lose context
  • More chunks to manage

Larger chunks invert this trade-off: they preserve more context per chunk but retrieve less precisely and consume more tokens.

Best Practices

Match Content Type

Use markdown chunking for docs, code chunking for code, semantic for narratives.

Test Different Sizes

Experiment with 500, 1000, and 1500 character chunks to find optimal size.

Use 10-20% Overlap

Overlap maintains context across chunk boundaries without excessive duplication.

Extract Metadata

Add titles, summaries, and keywords to improve retrieval accuracy.

Troubleshooting

Chunks too small or fragmented:
  • Increase maxSize
  • Reduce number of separators
  • Use character or token chunking for uniform sizes

Chunks too large:
  • Decrease maxSize
  • Add more separators to hierarchy
  • Use sentence chunking with maxSize limit

Lost context between chunks:
  • Increase overlap (try 20% of chunk size)
  • Use larger chunks
  • Extract summaries for each chunk

Poor retrieval quality:
  • Experiment with different chunk sizes
  • Add metadata extraction
  • Try semantic or markdown chunking
  • Verify separator hierarchy matches content structure

Next Steps

Ingestion

Learn about document loading and preprocessing

Retrieval

Implement semantic search and reranking

RAG Overview

Return to RAG overview
