A chunker is a Docling abstraction that takes a DoclingDocument and returns a stream of chunks. Each chunk captures a portion of the document as text, accompanied by metadata. Chunkers enable:
- Flexibility: Customize chunking strategies for specific use cases
- Out-of-the-box utility: Built-in implementations for common patterns
- Framework integration: Easy integration with LlamaIndex, LangChain, etc. (see the sketch after this list)
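As an illustration of the integration point, a chunk stream can be adapted to LangChain's Document type in a few lines. This is a minimal sketch: it assumes a `chunker` and a conversion `result` set up as in the usage examples further below, and the metadata keys are illustrative choices, not a fixed schema.

```python
# Hedged sketch: wrapping Docling chunks as LangChain Documents.
# Assumes `chunker` and `result` are set up as in the usage examples below.
from langchain_core.documents import Document

lc_docs = [
    Document(
        page_content=chunker.contextualize(chunk),         # metadata-enriched text
        metadata={"headings": chunk.meta.headings or []},  # illustrative key choice
    )
    for chunk in chunker.chunk(result.document)
]
```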
All chunkers implement the BaseChunker base class:
```python
from typing import Iterator

from docling_core.transforms.chunker.base import BaseChunk
from docling_core.types.doc import DoclingDocument


class BaseChunker:
    """Simplified view of the interface in docling_core.transforms.chunker.base."""

    def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]:
        """Return chunks for the provided document."""
        ...

    def contextualize(self, chunk: BaseChunk) -> str:
        """Return a metadata-enriched serialization of the chunk.

        Typically used to feed an embedding model or generation model.
        """
        ...
```
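Because the interface is small, custom strategies are easy to layer on top of the built-in chunkers. The sketch below delegates to HierarchicalChunker and drops very short chunks; MinLengthChunker is a hypothetical name, not part of Docling, and the code assumes chunkers remain pydantic models so the threshold can be declared as a config field.

```python
from typing import Iterator

from docling.chunking import HierarchicalChunker
from docling_core.transforms.chunker.base import BaseChunk
from docling_core.types.doc import DoclingDocument


class MinLengthChunker(HierarchicalChunker):
    """Hypothetical example: filter out chunks shorter than `min_chars`."""

    min_chars: int = 20  # assumed pydantic config field

    def chunk(self, dl_doc: DoclingDocument, **kwargs) -> Iterator[BaseChunk]:
        for chunk in super().chunk(dl_doc, **kwargs):
            if len(chunk.text) >= self.min_chars:
                yield chunk
```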
## HierarchicalChunker

Purpose: Create one chunk per document element using document structure.

Implementation: Uses the hierarchical structure from DoclingDocument to create chunks.

Features:
- One chunk per document element (paragraph, table, etc.)
- Preserves hierarchy through metadata
- Optionally merges list items into single chunks
- Attaches headings and captions to chunks
Usage:
```python
from docling.chunking import HierarchicalChunker
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

chunker = HierarchicalChunker(
    merge_list_items=True  # Merge list items into single chunks (default: True)
)

for chunk in chunker.chunk(result.document):
    print(f"Text: {chunk.text}")
    print(f"Headings: {chunk.meta.headings}")
    # Page numbers live on the provenance of the underlying document items,
    # not directly on the chunk metadata
    prov = chunk.meta.doc_items[0].prov
    print(f"Page: {prov[0].page_no if prov else 'n/a'}")
    print("-" * 80)
```
Metadata included (see the access sketch after this list):
- Document headings (section hierarchy)
- Table and figure captions
- Page numbers
- Hierarchical path in document structure
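For concreteness, here is one way to read these fields off a chunk. The field names follow docling-core's DocMeta; the page-number lookup via doc_items provenance is an assumption that may need adjusting to your installed version.

```python
# Hedged sketch: inspecting chunk metadata (field names per docling-core's DocMeta)
for chunk in chunker.chunk(result.document):
    meta = chunk.meta
    print("Headings:", meta.headings or [])  # section hierarchy
    print("Captions:", meta.captions or [])  # table/figure captions
    print("Items:", [item.self_ref for item in meta.doc_items])  # paths in the document tree
    pages = {prov.page_no for item in meta.doc_items for prov in item.prov}
    print("Pages:", sorted(pages))
```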
Best for:

- Preserving document structure
- Fine-grained retrieval
- When document elements naturally form semantic units
## HybridChunker

Purpose: Tokenization-aware chunking with hierarchical refinement.

Implementation: Builds on HierarchicalChunker and applies token-based splitting and merging.

Features:
- Starts from hierarchical chunks
- Splits oversized chunks based on token count
- Merges undersized successive chunks that share the same headings/captions
- Keeps chunks within the configured maximum token budget
- Supports both HuggingFace and OpenAI tokenizers
Usage:
```python
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from transformers import AutoTokenizer

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Using a HuggingFace tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

chunker = HybridChunker(
    tokenizer=tokenizer,
    max_tokens=512,    # Maximum tokens per chunk
    merge_peers=True,  # Merge undersized successive chunks with the same metadata
)

for chunk in chunker.chunk(result.document):
    # Get contextualized text (with metadata)
    context_text = chunker.contextualize(chunk)
    # Use for embedding; `embed_model` stands for any embedding model
    # (e.g., a SentenceTransformer) and is not defined in this snippet
    embedding = embed_model.encode(context_text)
```
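The OpenAI side of the tokenizer support goes through a tiktoken-backed wrapper. The import path below matches recent docling-core releases; treat it as an assumption and verify against your installed version.

```python
# Hedged sketch: HybridChunker with an OpenAI (tiktoken) tokenizer.
import tiktoken
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

openai_tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("gpt-4o-mini"),  # tiktoken encoding
    max_tokens=512,  # token budget per chunk
)
chunker = HybridChunker(tokenizer=openai_tokenizer)
```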
Chunks can also be filtered on their metadata, for example to skip particular sections:

```python
chunker = HybridChunker(tokenizer=tokenizer, max_tokens=512)

# Skip chunks from specific sections
for chunk in chunker.chunk(doc):
    if "Introduction" in (chunk.meta.headings or []):
        continue  # Skip introduction chunks
    process(chunk)
```
Overriding contextualize controls how metadata is serialized into the chunk text:

```python
class CustomHybridChunker(HybridChunker):
    def contextualize(self, chunk: BaseChunk) -> str:
        # Custom contextualization with limited heading depth
        headings = (chunk.meta.headings or [])[:2]  # Only top 2 levels
        context_parts = [f"## {h}" for h in headings]
        context_parts.append(chunk.text)
        return "\n".join(context_parts)
```
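The subclass drops in wherever a stock HybridChunker would be used; a quick check of its output (assuming the converter setup from the earlier examples) might look like:

```python
chunker = CustomHybridChunker(tokenizer=tokenizer, max_tokens=512)
for chunk in chunker.chunk(result.document):
    print(chunker.contextualize(chunk))  # at most two heading lines per chunk
    print("-" * 80)
```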
For complex workflows, combine chunking with custom serialization:
```python
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

serializer = MarkdownDocSerializer(doc=doc)
ser_result = serializer.serialize()
markdown = ser_result.text

# Now apply text-based chunking to the Markdown string,
# or use Docling chunkers for structure-aware chunking:
chunker = HybridChunker(tokenizer=tokenizer, max_tokens=512)
for chunk in chunker.chunk(doc):
    # Chunks preserve structure from the DoclingDocument
    process(chunk)
```
Batch processing across multiple documents follows the same pattern:

```python
converter = DocumentConverter()
chunker = HybridChunker(tokenizer=tokenizer, max_tokens=512)

all_chunks = []
for doc_path in doc_paths:  # doc_paths: your list of input files
    result = converter.convert(doc_path)
    all_chunks.extend(chunker.chunk(result.document))

# Batch-embed all chunks (`embed_model` is your embedding model)
embeddings = embed_model.encode([chunker.contextualize(c) for c in all_chunks])
```