Chunking is the process of splitting documents into smaller, semantically meaningful pieces. Proper chunking is critical for RAG performance: chunks that are too large lose precision, while chunks that are too small lose context.
Chunking Strategies
Mastra provides multiple strategies optimized for different content types:
- Recursive: Hierarchically splits text using multiple separators. Best for general content.
- Markdown: Preserves markdown structure and headers. Ideal for documentation.
- HTML: Respects HTML structure and sections. Use for web content.
- Semantic: Groups semantically related content. Best for narrative text.
- Code: Language-aware splitting. Preserves code structure.
- JSON: Recursive JSON splitting. Handles nested structures.
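Which strategy to use often follows from the file type. As a purely hypothetical helper (not part of the Mastra API), document routing might look like this:

```typescript
// Hypothetical helper: map a filename to a chunking strategy.
// The strategy names come from the list above; the routing logic
// itself is an assumption, not part of @mastra/rag.
type ChunkStrategy =
  | 'recursive'
  | 'markdown'
  | 'html'
  | 'semantic-markdown'
  | 'json'
  | 'token'
  | 'sentence'
  | 'character';

function strategyFor(filename: string): ChunkStrategy {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  if (ext === 'md' || ext === 'mdx') return 'markdown';
  if (ext === 'html' || ext === 'htm') return 'html';
  if (ext === 'json') return 'json';
  // Code files use splitting with the `language` option set
  if (['ts', 'js', 'py', 'go', 'java', 'rs', 'cpp'].includes(ext)) return 'recursive';
  return 'recursive'; // sensible default for general text
}
```

For code files you would also pass the `language` option alongside the strategy (see Language-Aware Chunking below).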
Recursive Chunking (Default)
Recursive chunking splits text using a hierarchy of separators:
```typescript
import { MDocument } from '@mastra/rag';

const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 1000,           // Maximum characters per chunk
  overlap: 100,            // Characters of overlap between chunks
  separatorPosition: 'end' // Where to place separator ('start' or 'end')
});
```
Options:

- `maxSize` (number): Maximum chunk size in characters
- `overlap` (number): Number of characters to overlap between chunks
- `separators` (string[]): Custom separator hierarchy. Defaults to `['\n\n', '\n', ' ', '']`
- `separatorPosition` (`'start' | 'end'`, default `'end'`): Where to place the separator in chunks
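To build intuition for how a separator hierarchy behaves, here is a simplified sketch of recursive splitting. It is illustrative only and not Mastra's actual implementation (which also handles overlap and separator placement):

```typescript
// Simplified recursive splitter: try the coarsest separator first,
// and fall back to finer separators only for pieces that are still
// too large. Overlap handling is omitted for clarity.
function recursiveSplit(
  text: string,
  separators: string[] = ['\n\n', '\n', ' ', ''],
  maxSize = 1000
): string[] {
  if (text.length <= maxSize) return [text];
  const [sep = '', ...rest] = separators;
  const parts = sep === '' ? [...text] : text.split(sep);
  const chunks: string[] = [];
  let current = '';
  const flush = (piece: string) => {
    if (!piece) return;
    // Recurse with finer separators if a piece is still too large
    chunks.push(...(piece.length > maxSize ? recursiveSplit(piece, rest, maxSize) : [piece]));
  };
  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length > maxSize && current) {
      flush(current);
      current = part;
    } else {
      current = candidate;
    }
  }
  flush(current);
  return chunks;
}
```

With `maxSize: 10`, the text `'aaaa bbbb cccc dddd'` falls through the paragraph and newline separators and ends up split on spaces into two chunks.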
Language-Aware Chunking
For code files, specify the language for syntax-aware chunking:
```typescript
const doc = MDocument.fromText(typescriptCode);

const chunks = await doc.chunk({
  strategy: 'recursive',
  language: 'typescript',
  maxSize: 1500,
  overlap: 150
});
```
Supported languages include:

- TypeScript/JavaScript (`'typescript'` or `'js'`)
- Python
- Go
- Java
- Rust
- C++

See the Language enum for the full list.
Markdown Chunking
Preserve markdown structure and hierarchy:
```typescript
const doc = MDocument.fromMarkdown(content);

const chunks = await doc.chunk({
  strategy: 'markdown',
  maxSize: 1000,
  overlap: 100,
  headers: [
    ['#', 'h1'],
    ['##', 'h2'],
    ['###', 'h3']
  ],
  returnEachLine: false,
  stripHeaders: false
});
```
Options:

- `headers`: Header patterns to preserve, as `[markdown_prefix, header_name]` pairs
- `returnEachLine` (boolean): Return each line as a separate chunk
- `stripHeaders` (boolean): Remove header markers from chunk text
Markdown chunking adds header hierarchy to metadata:
```typescript
const chunks = await doc.chunk({
  strategy: 'markdown',
  headers: [['#', 'h1'], ['##', 'h2']]
});

// Chunk metadata includes headers
console.log(chunks[0].metadata);
/*
{
  h1: "Getting Started",
  h2: "Installation",
  startIndex: 0
}
*/
```
HTML Chunking
Split HTML by semantic sections:
```typescript
const doc = MDocument.fromHTML(htmlContent);

// By headers
const chunks = await doc.chunk({
  strategy: 'html',
  headers: [
    ['h1', 'Header 1'],
    ['h2', 'Header 2']
  ],
  maxSize: 1000
});
```

```typescript
// By sections
const chunks = await doc.chunk({
  strategy: 'html',
  sections: [
    ['article', 'Article'],
    ['section', 'Section'],
    ['div', 'Division']
  ],
  maxSize: 1000
});
```
Semantic Markdown Chunking
Group semantically related content:
```typescript
const doc = MDocument.fromMarkdown(content);

const chunks = await doc.chunk({
  strategy: 'semantic-markdown',
  maxSize: 800,
  overlap: 80,
  joinThreshold: 0.5,  // Semantic similarity threshold
  modelName: 'gpt-4',  // For token counting
  encodingName: 'cl100k_base'
});
```
Options:

- `joinThreshold` (number): Semantic similarity threshold (0-1) for joining chunks
- `encodingName`: Encoding for token counting
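Conceptually, `joinThreshold` governs a similarity test between adjacent sections: join them when their embeddings are similar enough. The sketch below illustrates the decision with cosine similarity; the strategy's actual internals may differ:

```typescript
// Cosine similarity between two embedding vectors (assumed same length)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Join two adjacent sections when their similarity clears the threshold
const shouldJoin = (a: number[], b: number[], joinThreshold = 0.5): boolean =>
  cosineSimilarity(a, b) >= joinThreshold;
```

A higher `joinThreshold` yields smaller, more tightly focused chunks; a lower one merges more aggressively.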
JSON Chunking
Handle nested JSON structures:
```typescript
const doc = MDocument.fromJSON(jsonString);

const chunks = await doc.chunk({
  strategy: 'json',
  maxSize: 2000,
  minSize: 500,
  ensureAscii: false,
  convertLists: true
});
```
Options:

- `maxSize` (number): Maximum chunk size (required for JSON)
- `minSize` (number): Minimum chunk size before splitting
- `ensureAscii` (boolean): Escape non-ASCII characters
- `convertLists` (boolean): Convert lists to separate chunks
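The idea behind recursive JSON splitting can be sketched as descending into any subtree whose serialized form exceeds `maxSize`. This is an illustration of the concept, not Mastra's algorithm (which also honors `minSize` and other options):

```typescript
// Recursively descend into objects/arrays whose serialized form
// exceeds maxSize, emitting leaf subtrees with their JSON paths.
function splitJson(
  value: unknown,
  maxSize: number,
  path = '$'
): Array<{ path: string; text: string }> {
  const text = JSON.stringify(value);
  if (text.length <= maxSize || typeof value !== 'object' || value === null) {
    return [{ path, text }];
  }
  const entries = Array.isArray(value)
    ? value.map((v, i) => [String(i), v] as const)
    : Object.entries(value as Record<string, unknown>);
  return entries.flatMap(([key, v]) => splitJson(v, maxSize, `${path}.${key}`));
}
```

Keeping the path with each chunk preserves where in the original structure the value came from, which is useful retrieval metadata.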
Token-Based Chunking
Chunk by token count instead of characters:
```typescript
const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'token',
  maxSize: 512,  // Max tokens
  overlap: 50,   // Token overlap
  modelName: 'gpt-4',
  encodingName: 'cl100k_base'
});
```
Token-based chunking is useful when you need precise token counts for embedding models with token limits.
Sentence Chunking
Split by sentences while respecting size limits:
```typescript
const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'sentence',
  maxSize: 1000,
  minSize: 200,
  targetSize: 500,
  sentenceEnders: ['.', '!', '?'],
  fallbackToWords: true,
  fallbackToCharacters: true
});
```
Options:

- `targetSize` (number): Target chunk size to aim for
- `sentenceEnders` (string[]): Characters that end sentences
- `fallbackToWords` (boolean): Fall back to word splitting if sentences are too long
- `fallbackToCharacters` (boolean): Fall back to character splitting if words are too long
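A rough sketch of the sentence-first idea: split on sentence enders, then pack sentences into chunks without exceeding `maxSize`. This simplification omits `minSize`, `targetSize`, and the word/character fallbacks:

```typescript
// Split text into sentences, then greedily pack them into chunks
// that stay under maxSize. Fallback splitting is omitted for brevity.
function sentenceChunks(
  text: string,
  maxSize: number,
  sentenceEnders: string[] = ['.', '!', '?']
): string[] {
  // Split after any sentence-ending character followed by whitespace
  const splitter = new RegExp(`(?<=[${sentenceEnders.map((e) => '\\' + e).join('')}])\\s+`);
  const sentences = text.split(splitter);
  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxSize && current) {
      chunks.push(current);
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The advantage over character splitting is that chunk boundaries never fall mid-sentence, which keeps each chunk readable on its own.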
Character Chunking
Simple splitting by separator:
```typescript
const doc = MDocument.fromText(content);

const chunks = await doc.chunk({
  strategy: 'character',
  maxSize: 1000,
  separator: '\n\n',
  isSeparatorRegex: false
});
```
Chunk Overlap
Overlap maintains context between chunks:
```typescript
const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 1000,
  overlap: 150  // 15% overlap
});

// Chunks have overlapping content:
// Chunk 1: characters 0-1000
// Chunk 2: characters 850-1850 (150 char overlap)
// Chunk 3: characters 1700-2700 (150 char overlap)
```
Recommended overlap: 10-20% of chunk size. For example, with 1000-character chunks, use an overlap of 100-200.
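The arithmetic behind those ranges: each chunk starts `maxSize - overlap` characters after the previous one. A quick sketch of the offset calculation:

```typescript
// Compute [start, end] character ranges for overlapping chunks.
function chunkOffsets(
  totalLength: number,
  maxSize: number,
  overlap: number
): Array<[number, number]> {
  const step = maxSize - overlap;
  if (step <= 0) throw new Error('overlap must be smaller than maxSize');
  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < totalLength; start += step) {
    ranges.push([start, Math.min(start + maxSize, totalLength)]);
    if (start + maxSize >= totalLength) break; // last chunk reached the end
  }
  return ranges;
}
```

With `maxSize: 1000` and `overlap: 150`, a 2700-character document yields exactly the three ranges shown above.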
Metadata Extraction
Extract metadata during chunking:
```typescript
const doc = MDocument.fromMarkdown(content);

const chunks = await doc.chunk({
  strategy: 'markdown',
  maxSize: 1000,
  extract: {
    title: true,
    summary: {
      model: 'openai/gpt-4o-mini',
      maxTokens: 100
    },
    keywords: {
      model: 'openai/gpt-4o-mini',
      maxKeywords: 5
    },
    questions: {
      model: 'openai/gpt-4o-mini',
      maxQuestions: 3
    }
  }
});

// Chunks have enriched metadata
console.log(chunks[0].metadata.title);
console.log(chunks[0].metadata.summary);
console.log(chunks[0].metadata.keywords);
console.log(chunks[0].metadata.questions);
```
Metadata extraction runs after chunking, enriching each chunk with AI-generated context.
Custom Length Functions
Use custom length calculations:
```typescript
import { countTokens } from './tokenizer';

const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 500,
  lengthFunction: (text) => countTokens(text)
});
```
Complete Chunking Pipeline
Put it all together:
```typescript
import { MDocument } from '@mastra/rag';
import { readFile } from 'fs/promises';

// 1. Load document
const content = await readFile('docs/guide.md', 'utf-8');
const doc = MDocument.fromMarkdown(content, {
  source: 'docs/guide.md',
  category: 'documentation',
  version: '1.0'
});

// 2. Chunk with metadata extraction
const chunks = await doc.chunk({
  strategy: 'markdown',
  maxSize: 1000,
  overlap: 100,
  headers: [
    ['#', 'h1'],
    ['##', 'h2'],
    ['###', 'h3']
  ],
  extract: {
    title: true,
    summary: { model: 'openai/gpt-4o-mini' },
    keywords: { maxKeywords: 5 }
  }
});

console.log(`Created ${chunks.length} chunks`);

// 3. Access chunk data
chunks.forEach((chunk, i) => {
  console.log(`\nChunk ${i + 1}:`);
  console.log(`Title: ${chunk.metadata.title}`);
  console.log(`Summary: ${chunk.metadata.summary}`);
  console.log(`Keywords: ${chunk.metadata.keywords?.join(', ')}`);
  console.log(`Text length: ${chunk.text.length} chars`);
});
```
Choosing Chunk Size
Optimal chunk size depends on your use case:
Small (200-500 characters)

Best for:
- Precise information retrieval
- Question answering
- Fact extraction

Pros:
- High precision
- Lower token usage

Cons:
- May lose context
- More chunks to manage

Medium (500-1000 characters)

Best for:
- General documentation
- Tutorial content
- Product descriptions

Pros:
- Good balance
- Maintains context

Cons:
- Few; this is the recommended starting point

Large (1000-2000 characters)

Best for:
- Long-form content
- Narrative text
- Research papers

Pros:
- Rich context
- Fewer chunks

Cons:
- Lower precision
- Higher token usage
Best Practices
- Match content type: Use markdown chunking for docs, code chunking for code, and semantic chunking for narratives.
- Test different sizes: Experiment with 500, 1000, and 1500 character chunks to find the optimal size.
- Use 10-20% overlap: Overlap maintains context across chunk boundaries without excessive duplication.
- Extract metadata: Add titles, summaries, and keywords to improve retrieval accuracy.
Troubleshooting
Chunks too small
- Increase `maxSize`
- Reduce the number of separators
- Use character or token chunking for uniform sizes

Chunks too large
- Decrease `maxSize`
- Add more separators to the hierarchy
- Use sentence chunking with a `maxSize` limit

Lost context between chunks
- Increase `overlap` (try 20% of chunk size)
- Use larger chunks
- Extract summaries for each chunk

Poor retrieval quality
- Experiment with different chunk sizes
- Add metadata extraction
- Try semantic or markdown chunking
- Verify the separator hierarchy matches your content structure
Next Steps
- Ingestion: Learn about document loading and preprocessing
- Retrieval: Implement semantic search and reranking
- RAG Overview: Return to the RAG overview