Split documents into chunks with language-aware parsing
The TextSplitter class intelligently splits documents into smaller chunks while respecting programming language syntax. It uses language-specific parsers when available and falls back to generic text splitting for unknown file types.
Returns a list of chunked Document objects, each preserving the metadata of its original document.
The method automatically detects file extensions from document metadata and applies the appropriate language-specific splitter. For files without recognized extensions, it falls back to generic recursive character splitting.
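The extension-based dispatch described above can be sketched as a simple lookup table. This is an illustrative sketch only, not the actual TextSplitter internals: the `EXTENSION_TO_LANGUAGE` mapping and `detect_language` helper are hypothetical names chosen for this example.

```python
import os

# Hypothetical mapping from file extension to language key; the real
# TextSplitter may support a different set of languages.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".java": "java",
    ".go": "go",
}

def detect_language(source_path: str):
    """Return a language key for a recognized extension, else None.

    A None result would trigger the generic recursive character
    splitting fallback described above.
    """
    _, ext = os.path.splitext(source_path)
    return EXTENSION_TO_LANGUAGE.get(ext.lower())

print(detect_language("src/app/main.py"))   # recognized: "python"
print(detect_language("notes/README.txt"))  # unrecognized: None
```

A source path like this would typically come from the `source` key of a document's metadata.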
```python
from src.rag.text_splitter import TextSplitter

# Assume 'docs' is a list of Document objects from GitHubCodeBaseLoader
splitter = TextSplitter(
    documents=docs,
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents_into_chunks()
print(f"Split {len(docs)} documents into {len(chunks)} chunks")
```
```python
# Larger chunks for more context (e.g., for embeddings with larger context windows)
splitter = TextSplitter(
    documents=docs,
    chunk_size=2000,
    chunk_overlap=400,
)

# Smaller chunks for more granular retrieval
splitter = TextSplitter(
    documents=docs,
    chunk_size=500,
    chunk_overlap=100,
)
```