Installation
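The splitters below are commonly installed as a standalone package (an assumption about your setup; older LangChain.js releases export the same classes from `langchain/text_splitter` inside the main `langchain` package):

```shell
npm install @langchain/textsplitters @langchain/core
```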
Core Concepts
All text splitters extend the `TextSplitter` base class and implement the `splitText()` method. They support:
- Chunk Size: Maximum size of each chunk (default: 1000)
- Chunk Overlap: Number of characters to overlap between chunks (default: 200)
- Length Function: Custom function to measure text length
- Separator Preservation: Keep or remove separators when splitting
Available Splitters
CharacterTextSplitter
Splits text based on a single separator character or string.

Options:
- `separator` (string): The string to split on (default: `"\n\n"`)
- `chunkSize` (number): Maximum chunk size
- `chunkOverlap` (number): Overlap between chunks
- `keepSeparator` (boolean): Whether to keep the separator in chunks
RecursiveCharacterTextSplitter
The most versatile splitter: it tries multiple separators in order, recursively splitting until chunks are small enough. Supported languages for `fromLanguage()`:
- `"cpp"` - C++
- `"go"` - Go
- `"java"` - Java
- `"js"` - JavaScript/TypeScript
- `"php"` - PHP
- `"proto"` - Protocol Buffers
- `"python"` - Python
- `"rst"` - reStructuredText
- `"ruby"` - Ruby
- `"rust"` - Rust
- `"scala"` - Scala
- `"swift"` - Swift
- `"markdown"` - Markdown
- `"latex"` - LaTeX
- `"html"` - HTML
- `"sol"` - Solidity
TokenTextSplitter
Splits text based on token count using tiktoken encoding.

Options:
- `encodingName` (TiktokenEncoding): Tokenizer to use (default: `"gpt2"`)
- `allowedSpecial` (`"all"` | string[]): Special tokens to allow
- `disallowedSpecial` (`"all"` | string[]): Special tokens to disallow
MarkdownTextSplitter
Specialized splitter for Markdown documents that respects document structure. It splits along:
- Headings (`##` through `######`)
- Code blocks
- Horizontal rules
- Paragraphs
LatexTextSplitter
Specialized splitter for LaTeX documents. It splits along:
- Sections and subsections
- Environments (enumerate, itemize, etc.)
- Math environments
Working with Documents
Splitting Documents
Text splitters can work directly with `Document` objects.

Creating Documents from Text
You can also create documents directly from text arrays.

Chunk Headers
Add headers and overlap indicators to chunks.

Custom Length Functions
Provide a custom function to measure text length.

Line Number Tracking
Text splitters automatically track line numbers in chunk metadata. This is useful for:
- Code analysis and debugging
- Citation and reference tracking
- Maintaining document structure
Best Practices
Choosing Chunk Size
- Small chunks (200-500): Better for precise retrieval, more overhead
- Medium chunks (500-1000): Good balance for most use cases
- Large chunks (1000-2000): More context, less precise retrieval
Choosing Overlap
- Use 10-20% of chunk size as overlap
- Larger overlap helps preserve context across boundaries
- Too much overlap increases storage and processing costs
Choosing a Splitter
- RecursiveCharacterTextSplitter: Default choice for most text
- MarkdownTextSplitter: Use for Markdown documents to preserve structure
- TokenTextSplitter: When token count matters (e.g., for LLM input)
- CharacterTextSplitter: Simple use cases with clear separators
- Language-specific: Use `.fromLanguage()` for code files
Document Transformers
All text splitters extend `BaseDocumentTransformer`, so they can be used in document transformation pipelines.
Common Patterns
RAG Pipeline
Code Analysis
API Reference
TextSplitter (Base Class)
Properties:
- `chunkSize: number` - Maximum size of each chunk
- `chunkOverlap: number` - Number of characters to overlap
- `keepSeparator: boolean` - Whether to keep separators in output
- `lengthFunction: (text: string) => number | Promise<number>` - Function to measure text length
Methods:
- `splitText(text: string): Promise<string[]>` - Split text into chunks
- `splitDocuments(documents: Document[], options?: ChunkHeaderOptions): Promise<Document[]>` - Split documents
- `createDocuments(texts: string[], metadatas?: Record<string, any>[], options?: ChunkHeaderOptions): Promise<Document[]>` - Create documents from texts
- `transformDocuments(documents: Document[], options?: ChunkHeaderOptions): Promise<Document[]>` - Transform documents (alias for splitDocuments)
TextSplitterChunkHeaderOptions
Related
- Document Loaders - Load documents before splitting
- Vector Stores - Store split documents for retrieval
- Retrievers - Retrieve relevant document chunks
- Documents - Document object structure
