Document ingestion is the first step in building a RAG system. Mastra’s MDocument class provides a unified interface for loading documents from various formats and preparing them for chunking and embedding.
Creating Documents
Mastra supports multiple document formats through static factory methods:
```typescript
import { MDocument } from '@mastra/rag';

const doc = MDocument.fromText(
  'This is plain text content',
  { source: 'document.txt', category: 'general' }
);
```
Document Structure
The MDocument class manages documents and their metadata:
```typescript
class MDocument {
  // Create from various formats
  static fromText(text: string, metadata?: Record<string, any>): MDocument
  static fromHTML(html: string, metadata?: Record<string, any>): MDocument
  static fromMarkdown(markdown: string, metadata?: Record<string, any>): MDocument
  static fromJSON(json: string, metadata?: Record<string, any>): MDocument

  // Chunk the document
  async chunk(params?: ChunkParams): Promise<Chunk[]>

  // Extract metadata
  async extractMetadata(params: ExtractParams): Promise<MDocument>

  // Access document data
  getDocs(): Chunk[]
  getText(): string[]
  getMetadata(): Record<string, any>[]
}
```
Loading from Files
Load documents from the filesystem:
```typescript
import { readFile } from 'fs/promises';
import { MDocument } from '@mastra/rag';

// Load text file
const textContent = await readFile('docs/intro.txt', 'utf-8');
const textDoc = MDocument.fromText(textContent, {
  source: 'intro.txt',
  category: 'documentation'
});

// Load markdown file
const mdContent = await readFile('docs/README.md', 'utf-8');
const mdDoc = MDocument.fromMarkdown(mdContent, {
  source: 'README.md',
  category: 'documentation'
});

// Load JSON file
const jsonContent = await readFile('data/config.json', 'utf-8');
const jsonDoc = MDocument.fromJSON(jsonContent, {
  source: 'config.json',
  category: 'configuration'
});
```
Batch Document Processing
Process multiple documents efficiently:
```typescript
import { glob } from 'glob';
import { readFile } from 'fs/promises';
import { MDocument } from '@mastra/rag';

async function ingestDocuments(pattern: string) {
  const files = await glob(pattern);
  const documents: MDocument[] = [];

  for (const file of files) {
    const content = await readFile(file, 'utf-8');

    let doc: MDocument;
    if (file.endsWith('.md')) {
      doc = MDocument.fromMarkdown(content, { source: file, type: 'markdown' });
    } else if (file.endsWith('.html')) {
      doc = MDocument.fromHTML(content, { source: file, type: 'html' });
    } else {
      doc = MDocument.fromText(content, { source: file, type: 'text' });
    }

    documents.push(doc);
  }

  return documents;
}

// Usage
const docs = await ingestDocuments('docs/**/*.md');
console.log(`Loaded ${docs.length} documents`);
```
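The extension dispatch in the loop above can be pulled out into a small pure helper, which keeps the format decision testable on its own. This is a sketch under the assumption that only Markdown, HTML, and plain text need distinguishing; `detectFormat` is a hypothetical name, not part of Mastra's API:

```typescript
// Map a file path to the MDocument factory it should use.
// Purely string-based; does not touch the filesystem.
function detectFormat(file: string): 'markdown' | 'html' | 'text' {
  const lower = file.toLowerCase();
  if (lower.endsWith('.md') || lower.endsWith('.markdown')) return 'markdown';
  if (lower.endsWith('.html') || lower.endsWith('.htm')) return 'html';
  return 'text';
}
```

Lower-casing first makes the check robust to extensions like `.MD` or `.Html` that appear in real file trees.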
Metadata Extraction
Enrich documents with AI-generated metadata before chunking:
```typescript
const doc = MDocument.fromMarkdown(content);

// Extract metadata
await doc.extractMetadata({
  title: true,
  summary: {
    model: 'openai/gpt-4o-mini',
    maxTokens: 100
  },
  keywords: {
    model: 'openai/gpt-4o-mini',
    maxKeywords: 5
  },
  questions: {
    model: 'openai/gpt-4o-mini',
    maxQuestions: 3
  }
});

// Chunk with extracted metadata
const chunks = await doc.chunk({
  strategy: 'recursive',
  maxSize: 500
});

// Metadata is preserved in chunks
console.log(chunks[0].metadata.title);
console.log(chunks[0].metadata.summary);
console.log(chunks[0].metadata.keywords);
```
Metadata extraction happens before chunking and is applied to all resulting chunks.
Extract structured data using a schema:
```typescript
import { z } from 'zod';

const doc = MDocument.fromText(content);

const schema = z.object({
  product: z.string(),
  price: z.number(),
  category: z.string(),
  inStock: z.boolean()
});

await doc.extractMetadata({
  schema: {
    schema,
    model: 'openai/gpt-4o'
  }
});

const chunks = await doc.chunk({ strategy: 'recursive' });

// Structured data in metadata
console.log(chunks[0].metadata.product);
console.log(chunks[0].metadata.price);
```
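Model-extracted metadata should still be treated as untrusted input downstream. Independent of the zod schema above, the same shape can be checked with a plain TypeScript type guard before the values are used — a minimal sketch (a hypothetical helper, not part of Mastra):

```typescript
type ProductMetadata = {
  product: string;
  price: number;
  category: string;
  inStock: boolean;
};

// Narrow an unknown metadata record to ProductMetadata.
function isProductMetadata(value: unknown): value is ProductMetadata {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.product === 'string' &&
    typeof v.price === 'number' &&
    typeof v.category === 'string' &&
    typeof v.inStock === 'boolean'
  );
}
```

Inside a guarded branch, TypeScript then knows `chunks[0].metadata` has the expected fields without any casting.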
Web Scraping
Ingest documents from web pages:
```typescript
import { MDocument } from '@mastra/rag';

async function scrapeWebPage(url: string) {
  const response = await fetch(url);
  const html = await response.text();

  const doc = MDocument.fromHTML(html, {
    source: url,
    scrapedAt: new Date().toISOString(),
    url
  });

  return doc;
}

// Usage
const doc = await scrapeWebPage('https://docs.example.com/intro');
const chunks = await doc.chunk({
  strategy: 'html',
  sections: [
    ['article', 'content'],
    ['section', 'section']
  ]
});
```
Custom Document Types
Handle custom document formats:
```typescript
import { readFile } from 'fs/promises';
import { MDocument } from '@mastra/rag';

// Custom parser
function parseCustomFormat(content: string) {
  // Your parsing logic
  return {
    text: content,
    metadata: {
      format: 'custom',
      sections: []
    }
  };
}

// Create document
const rawContent = await readFile('doc.custom', 'utf-8');
const parsed = parseCustomFormat(rawContent);
const doc = MDocument.fromText(parsed.text, parsed.metadata);
```
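To make the parse-then-`fromText` pattern concrete, here is one possible parser for an invented format where `key: value` header lines precede a blank line and the body follows. The format and the helper name are purely illustrative:

```typescript
// Parse "key: value" header lines, then treat the rest as body text.
function parseKeyValueFormat(content: string) {
  const [head, ...rest] = content.split('\n\n');
  const metadata: Record<string, string> = { format: 'custom' };

  for (const line of head.split('\n')) {
    const idx = line.indexOf(':');
    if (idx > 0) {
      metadata[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    }
  }

  return { text: rest.join('\n\n'), metadata };
}
```

The returned `{ text, metadata }` pair plugs directly into `MDocument.fromText` as in the snippet above.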
Metadata Conventions
Standard metadata fields for documents:
```typescript
type DocumentMetadata = {
  // Required
  source: string;        // File path or URL

  // Recommended
  category?: string;     // Document category
  type?: string;         // Document type (text, code, etc.)
  version?: string;      // Document version
  author?: string;       // Author name
  createdAt?: string;    // ISO timestamp
  updatedAt?: string;    // ISO timestamp

  // Optional
  language?: string;     // Content language
  tags?: string[];       // Tags for filtering
  url?: string;          // Original URL
  title?: string;        // Document title
  description?: string;  // Brief description

  // Custom fields
  [key: string]: any;    // Additional metadata
};
```
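A lightweight check against these conventions can catch bad records before they reach the vector store. The sketch below enforces only the required `source` field and, when present, parseable timestamps; `validateMetadata` is an assumption for illustration, not a Mastra export:

```typescript
// Returns a list of problems; an empty array means the metadata passes.
function validateMetadata(meta: Record<string, any>): string[] {
  const problems: string[] = [];

  if (typeof meta.source !== 'string' || meta.source.length === 0) {
    problems.push('source is required');
  }

  for (const field of ['createdAt', 'updatedAt']) {
    if (meta[field] !== undefined && Number.isNaN(Date.parse(meta[field]))) {
      problems.push(`${field} is not a parseable timestamp`);
    }
  }

  return problems;
}
```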
Complete Ingestion Pipeline
Put it all together:
```typescript
import { MDocument } from '@mastra/rag';
import { PgVector } from '@mastra/vector-pg';
import { openai } from '@ai-sdk/openai';
import { glob } from 'glob';
import { readFile } from 'fs/promises';

const vectorStore = new PgVector({
  connectionString: process.env.DATABASE_URL
});
const embedder = openai.embedding('text-embedding-3-small');

async function ingestPipeline() {
  // 1. Load documents
  const files = await glob('docs/**/*.md');

  for (const file of files) {
    console.log(`Processing ${file}...`);

    // 2. Create document
    const content = await readFile(file, 'utf-8');
    const doc = MDocument.fromMarkdown(content, {
      source: file,
      category: 'documentation',
      ingestedAt: new Date().toISOString()
    });

    // 3. Extract metadata
    await doc.extractMetadata({
      title: true,
      summary: { model: 'openai/gpt-4o-mini' },
      keywords: { maxKeywords: 5 }
    });

    // 4. Chunk document
    const chunks = await doc.chunk({
      strategy: 'markdown',
      maxSize: 1000,
      overlap: 100
    });
    console.log(`  Created ${chunks.length} chunks`);

    // 5. Create embeddings
    for (const chunk of chunks) {
      const result = await embedder.doEmbed({
        values: [chunk.text]
      });

      // 6. Store in vector DB
      await vectorStore.upsert({
        indexName: 'documentation',
        vectors: result.embeddings,
        ids: [chunk.id],
        metadata: [{
          text: chunk.text,
          ...chunk.metadata
        }]
      });
    }
  }

  console.log('Ingestion complete!');
}

// Run pipeline
await ingestPipeline();
```
Error Handling
Handle errors gracefully during ingestion:
```typescript
import { glob } from 'glob';
import { readFile } from 'fs/promises';
import { MDocument } from '@mastra/rag';

async function safeIngest(file: string) {
  try {
    const content = await readFile(file, 'utf-8');
    const doc = MDocument.fromMarkdown(content, { source: file });
    const chunks = await doc.chunk({ strategy: 'recursive', maxSize: 500 });
    return { success: true, chunks };
  } catch (error) {
    console.error(`Failed to process ${file}:`, error);
    return { success: false, error };
  }
}

const files = await glob('docs/**/*.md');
const results = await Promise.all(files.map(file => safeIngest(file)));

// safeIngest catches its own errors, so every promise fulfills;
// check the returned success flag rather than the promise state
const successful = results.filter(r => r.success);
const failed = results.filter(r => !r.success);
console.log(`Processed: ${successful.length} succeeded, ${failed.length} failed`);
```
Best Practices
Add Rich Metadata: Include source, category, timestamps, and custom fields for filtering and debugging.
Extract Before Chunking: Run metadata extraction on full documents before chunking for better context.
Batch Processing: Process documents in batches to optimize embedding API calls and reduce costs.
Handle Errors: Implement error handling and logging to track ingestion failures.
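The batching advice applies directly to the embedding step of the pipeline above: instead of one API call per chunk, group chunk texts and embed each group in a single call. A generic batching helper makes this easy (a sketch; `batchItems` is not a Mastra utility):

```typescript
// Split an array into consecutive groups of at most `size` items.
function batchItems<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch of chunk texts can then be passed as the `values` array of a single `doEmbed` call, cutting the request count roughly by the batch size.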
Next Steps
Chunking: Learn about document chunking strategies.
Retrieval: Implement semantic search and reranking.
RAG Overview: Return to the RAG overview.