Learn how to use HybridChunker for tokenization-aware chunking and customize document serialization.
Part 1: Hybrid Chunking
Overview
Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking. This ensures chunks:
Respect token limits for embedding models
Preserve document structure (headings, sections)
Merge undersized chunks when possible
Include contextual metadata
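These refinements can be pictured with a toy sketch, using whitespace word counts as a stand-in for real tokenization (nothing below is a Docling API; `merge_undersized` is an illustrative name):

```python
def merge_undersized(chunks, max_tokens, count=lambda t: len(t.split())):
    """Greedily merge adjacent (heading, text) chunks that share a heading
    while the combined size stays within max_tokens (toy model of peer merging)."""
    merged = []
    for heading, text in chunks:
        if (merged and merged[-1][0] == heading
                and count(merged[-1][1] + " " + text) <= max_tokens):
            # Same heading and still under budget: fold into the previous chunk
            merged[-1] = (heading, merged[-1][1] + " " + text)
        else:
            merged.append((heading, text))
    return merged

print(merge_undersized(
    [("Intro", "Short sentence."), ("Intro", "Another short one.")],
    max_tokens=10,
))
# → [('Intro', 'Short sentence. Another short one.')]
```

Chunks under different headings stay separate even when merging them would fit the budget, which is what keeps document structure intact.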
Installation
```shell
pip install docling transformers
```
Basic Chunking
```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

# Convert document
doc = DocumentConverter().convert("document.pdf").document

# Create chunker with defaults
chunker = HybridChunker()
chunk_iter = chunker.chunk(dl_doc=doc)

# Iterate chunks
for chunk in chunk_iter:
    print(f"Text: {chunk.text[:100]}...")

    # Get context-enriched text for embedding
    enriched_text = chunker.contextualize(chunk=chunk)
    print(f"Enriched: {enriched_text[:100]}...")
```
The contextualize() method prepends section headings as context; use this enriched text, not the raw chunk text, when computing embeddings.
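Conceptually, contextualization just joins the chunk's heading path with its body text. A toy stand-in (not the actual Docling implementation):

```python
def contextualize_toy(headings, chunk_text):
    # Prepend the heading path to the chunk body, mirroring what
    # HybridChunker's contextualize() conceptually produces
    return "\n".join(list(headings) + [chunk_text])

print(contextualize_toy(
    ["Part 1: Hybrid Chunking", "Overview"],
    "Chunks respect token limits.",
))
```

The heading lines give the embedding model the section context that a bare chunk would lack.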
Choose Tokenizer
Select HuggingFace or OpenAI tokenizer matching your embedding model.
Set Token Limits
Configure max tokens to match your embedding model’s context window.
Create Chunker
Instantiate HybridChunker with tokenizer configuration.
HuggingFace Tokenizer
```python
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer
from docling.chunking import HybridChunker

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 64

tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=MAX_TOKENS,
)

chunker = HybridChunker(
    tokenizer=tokenizer,
    merge_peers=True,  # Merge undersized peer chunks
)
chunks = list(chunker.chunk(dl_doc=doc))
```
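For OpenAI embedding models, docling-core provides a tiktoken-backed tokenizer (installed via the docling-core[chunking-openai] extra listed under Requirements). A configuration sketch; the model name and token limit here are illustrative choices, not recommendations:

```python
import tiktoken
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("text-embedding-3-small"),
    max_tokens=8191,  # illustrative: context window of the embedding model
)
chunker = HybridChunker(tokenizer=tokenizer)
```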
Inspect Token Counts
```python
for i, chunk in enumerate(chunks):
    txt_tokens = tokenizer.count_tokens(chunk.text)
    ser_tokens = tokenizer.count_tokens(chunker.contextualize(chunk=chunk))
    print(f"Chunk {i}:")
    print(f"  Raw text: {txt_tokens} tokens")
    print(f"  Contextualized: {ser_tokens} tokens")
    print(f"  Text: {chunk.text[:100]}...")
```
Chunking Behavior
HybridChunker refines chunks in several ways:
Fits within limits : chunks are sized to respect max_tokens
Splits oversized items : a single item exceeding the token limit is split into smaller chunks
Merges peers : undersized successive chunks with matching headings are merged when merge_peers is True
Preserves structure : splits and merges follow the document hierarchy rather than arbitrary character offsets
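The splitting side can be sketched in the same toy style as before: pack whole sentences into pieces that stay under the budget, never breaking mid-sentence (word counts again stand in for real tokens; `split_oversized` is an illustrative name, not a Docling API):

```python
def split_oversized(text, max_tokens, count=lambda t: len(t.split())):
    """Toy splitter: pack whole sentences into pieces of at most
    max_tokens words, never breaking inside a sentence."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    pieces, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and count(candidate) > max_tokens:
            # Adding this sentence would blow the budget: start a new piece
            pieces.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces

print(split_oversized(
    "One two three. Four five. Six seven eight nine.", max_tokens=5
))
# → ['One two three. Four five.', 'Six seven eight nine.']
```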
Part 2: Custom Serialization
Overview
Docling provides serializers to convert documents to different formats. You can:
Use built-in serializers (Markdown, HTML)
Configure serializer parameters
Create custom serializers
Customize component serialization (e.g., tables)
Basic Serialization
```python
from docling.document_converter import DocumentConverter
from docling_core.transforms.serializer.html import HTMLDocSerializer
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

# Convert document
converter = DocumentConverter()
doc = converter.convert("https://arxiv.org/pdf/2311.18481").document

# Serialize to HTML
html_serializer = HTMLDocSerializer(doc=doc)
html_result = html_serializer.serialize()
print(html_result.text[:500])

# Serialize to Markdown
md_serializer = MarkdownDocSerializer(doc=doc)
md_result = md_serializer.serialize()
print(md_result.text[:500])
```
Customize serializer behavior:
```python
from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer
from docling_core.transforms.serializer.markdown import (
    MarkdownDocSerializer,
    MarkdownParams,
)

serializer = MarkdownDocSerializer(
    doc=doc,
    table_serializer=TripletTableSerializer(),  # Use triplet format for tables
    params=MarkdownParams(
        image_placeholder="<!-- custom image -->",
        # Additional parameters...
    ),
)
result = serializer.serialize()
print(result.text[:500])
```
Table Serialization Options
Markdown Tables (Default)

```python
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

serializer = MarkdownDocSerializer(doc=doc)
result = serializer.serialize()
# Output:
# | Column1 | Column2 |
# |---------|---------|
# | Value1  | Value2  |
```

Triplet Format

Pass table_serializer=TripletTableSerializer() (as in the example above) to render each table cell as a row–column–value statement instead of a Markdown grid.
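The idea behind triplet serialization can be shown with a few lines of plain Python; the exact output format of Docling's TripletTableSerializer may differ, and `table_to_triplets` is an illustrative name:

```python
def table_to_triplets(headers, rows):
    """Toy triplet serialization: one 'row label, column = value'
    statement per cell, instead of a Markdown grid."""
    lines = []
    for row in rows:
        label = row[0]  # first column acts as the row label
        for header, value in zip(headers[1:], row[1:]):
            lines.append(f"{label}, {header} = {value}")
    return ". ".join(lines)

print(table_to_triplets(["Name", "Rank"], [["Apple", "1"], ["IBM", "2"]]))
# → Apple, Rank = 1. IBM, Rank = 2
```

Each statement is a small self-contained sentence, which tends to embed better than a row of a Markdown grid whose meaning depends on a distant header line.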
Custom Serializer Example
Create a custom picture serializer by subclassing the built-in Markdown one. The sketch below folds captions and picture-description annotations into the output; the annotation handling assumes descriptions were generated during conversion (e.g. by a vision model):

```python
from typing import Any

from docling_core.transforms.serializer.base import BaseDocSerializer, SerializationResult
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import (
    MarkdownDocSerializer,
    MarkdownPictureSerializer,
)
from docling_core.types.doc.document import (
    DoclingDocument,
    PictureDescriptionData,
    PictureItem,
)

class CustomPictureSerializer(MarkdownPictureSerializer):
    """Include picture captions and descriptions in serialization."""

    def serialize(
        self, *, item: PictureItem, doc_serializer: BaseDocSerializer,
        doc: DoclingDocument, **kwargs: Any,
    ) -> SerializationResult:
        parts = []
        # Add caption if available
        caption = item.caption_text(doc)
        if caption:
            parts.append(f"Caption: {caption}")
        # Add any description annotations if available
        for annotation in item.annotations:
            if isinstance(annotation, PictureDescriptionData):
                parts.append(f"Description: {annotation.text}")
        return create_ser_result(text="\n".join(parts), span_source=item)

# Use the custom serializer
serializer = MarkdownDocSerializer(
    doc=doc,
    picture_serializer=CustomPictureSerializer(),
)
result = serializer.serialize()
```
Complete Example
```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from docling_core.transforms.serializer.markdown import (
    MarkdownDocSerializer,
    MarkdownParams,
)
from transformers import AutoTokenizer

# Convert document
converter = DocumentConverter()
doc = converter.convert("document.pdf").document

# Configure chunker
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=256,
)
chunker = HybridChunker(tokenizer=tokenizer)
chunks = list(chunker.chunk(dl_doc=doc))
print(f"Created {len(chunks)} chunks")

# Process chunks
for i, chunk in enumerate(chunks[:3]):
    enriched = chunker.contextualize(chunk=chunk)
    tokens = tokenizer.count_tokens(enriched)
    print(f"\nChunk {i} ({tokens} tokens):")
    print(enriched[:200])

# Custom serialization
serializer = MarkdownDocSerializer(
    doc=doc,
    params=MarkdownParams(
        image_placeholder="[Image]",
        strict_text=False,
    ),
)
result = serializer.serialize()
print(f"\nSerialized to {len(result.text)} characters")
```
Chunking Parameters
tokenizer : Tokenizer for counting tokens (HuggingFace or OpenAI)
max_tokens : Maximum tokens per chunk
merge_peers : Whether to merge undersized adjacent chunks (default: True)
Serialization Parameters
table_serializer : Custom table serialization strategy
picture_serializer : Custom picture serialization strategy
params : Format-specific parameters (image placeholders, text mode, etc.)
Requirements
```shell
# For HuggingFace tokenizers
pip install docling transformers

# For OpenAI tokenizers
pip install "docling-core[chunking-openai]"

# For rich console output (optional)
pip install rich
```
HybridChunker may trigger a “token indices sequence length” warning from transformers. This is a false alarm and can be safely ignored.
Best Practices
Match tokenizers : Use the same tokenizer for chunking and embedding
Set appropriate limits : Match max_tokens to your embedding model’s context window
Use contextualize() : Always use chunker.contextualize(chunk) for embeddings
Choose serializers wisely : Triplet format may improve vector representation for tables
Test chunk sizes : Verify chunks fit within your pipeline’s constraints
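The last point can be automated with a tiny check over the token counts gathered earlier (`check_budget` is an illustrative helper, not part of Docling):

```python
def check_budget(token_counts, max_tokens):
    """Return indices of chunks whose contextualized token count
    exceeds the embedding model's budget."""
    return [i for i, n in enumerate(token_counts) if n > max_tokens]

# Counts as produced by tokenizer.count_tokens(chunker.contextualize(chunk=chunk))
print(check_budget([120, 300, 64], max_tokens=256))
# → [1]
```

An empty result means every chunk fits; any surviving indices point at items the chunker could not split further, which may need handling downstream (e.g. truncation by the embedding client).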