Learn how to use HybridChunker for tokenization-aware chunking and customize document serialization.

Part 1: Hybrid Chunking

Overview

Hybrid chunking applies tokenization-aware refinements on top of document-based hierarchical chunking. This ensures chunks:
  • Respect token limits for embedding models
  • Preserve document structure (headings, sections)
  • Merge undersized chunks when possible
  • Include contextual metadata

Installation

pip install docling transformers

Basic Chunking

hybrid_chunking.ipynb
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

# Convert document
doc = DocumentConverter().convert("document.pdf").document

# Create chunker with defaults
chunker = HybridChunker()
chunk_iter = chunker.chunk(dl_doc=doc)

# Iterate chunks
for chunk in chunk_iter:
    print(f"Text: {chunk.text[:100]}...")
    
    # Get context-enriched text for embedding
    enriched_text = chunker.contextualize(chunk=chunk)
    print(f"Enriched: {enriched_text[:100]}...")
The contextualize() method adds section headings as context - use this text for embeddings.

Configure Tokenization

  1. Choose tokenizer: Select a HuggingFace or OpenAI tokenizer that matches your embedding model.
  2. Set token limits: Configure max_tokens to match your embedding model’s context window.
  3. Create chunker: Instantiate HybridChunker with the tokenizer configuration.
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer
from docling.chunking import HybridChunker

EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 64

tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=MAX_TOKENS,
)

chunker = HybridChunker(
    tokenizer=tokenizer,
    merge_peers=True,  # Merge undersized peer chunks
)

chunks = list(chunker.chunk(dl_doc=doc))

Inspect Token Counts

for i, chunk in enumerate(chunks):
    txt_tokens = tokenizer.count_tokens(chunk.text)
    ser_tokens = tokenizer.count_tokens(chunker.contextualize(chunk))
    
    print(f"Chunk {i}:")
    print(f"  Raw text: {txt_tokens} tokens")
    print(f"  Contextualized: {ser_tokens} tokens")
    print(f"  Text: {chunk.text[:100]}...")

Chunking Behavior

HybridChunker refines hierarchical chunks in a tokenization-aware way:
  • Respects token limits: chunks are sized against max_tokens using the configured tokenizer
  • Splits oversized chunks: content exceeding the token limit is split further
  • Merges peers: undersized successive chunks with matching headings are merged when merge_peers=True (see the sketch below)
  • Preserves structure: splitting and merging follow the document hierarchy, and headings and captions are kept as chunk metadata
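
The effect of peer merging can be checked by chunking the same document with merging enabled and disabled. A minimal sketch, reusing tokenizer, doc, and MAX_TOKENS from the snippets above:
from docling.chunking import HybridChunker

merging_chunker = HybridChunker(tokenizer=tokenizer, merge_peers=True)
plain_chunker = HybridChunker(tokenizer=tokenizer, merge_peers=False)

merged = list(merging_chunker.chunk(dl_doc=doc))
unmerged = list(plain_chunker.chunk(dl_doc=doc))

# With merging enabled, undersized neighbors that share headings collapse
# into fewer, larger chunks, so the first count is typically lower.
print(f"merge_peers=True:  {len(merged)} chunks")
print(f"merge_peers=False: {len(unmerged)} chunks")

# The largest contextualized chunk should stay within the token budget.
max_ctx = max(tokenizer.count_tokens(merging_chunker.contextualize(chunk)) for chunk in merged)
print(f"Largest contextualized chunk: {max_ctx} tokens (budget: {MAX_TOKENS})")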

Part 2: Custom Serialization

Overview

Docling provides serializers to convert documents to different formats. You can:
  • Use built-in serializers (Markdown, HTML)
  • Configure serializer parameters
  • Create custom serializers
  • Customize component serialization (e.g., tables)

Basic Serialization

serialization.ipynb
from docling.document_converter import DocumentConverter
from docling_core.transforms.serializer.html import HTMLDocSerializer
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

# Convert document
converter = DocumentConverter()
doc = converter.convert("https://arxiv.org/pdf/2311.18481").document

# Serialize to HTML
html_serializer = HTMLDocSerializer(doc=doc)
html_result = html_serializer.serialize()
print(html_result.text[:500])

# Serialize to Markdown
md_serializer = MarkdownDocSerializer(doc=doc)
md_result = md_serializer.serialize()
print(md_result.text[:500])

Configure Serializers

Customize serializer behavior:
from docling_core.transforms.chunker.hierarchical_chunker import TripletTableSerializer
from docling_core.transforms.serializer.markdown import (
    MarkdownDocSerializer,
    MarkdownParams
)

serializer = MarkdownDocSerializer(
    doc=doc,
    table_serializer=TripletTableSerializer(),  # Use triplet format for tables
    params=MarkdownParams(
        image_placeholder="<!-- custom image -->",
        # Additional parameters...
    ),
)

result = serializer.serialize()
print(result.text[:500])

Table Serialization Options

By default, MarkdownDocSerializer renders tables as standard Markdown tables:
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

serializer = MarkdownDocSerializer(doc=doc)
result = serializer.serialize()

# Output:
# | Column1 | Column2 |
# |---------|---------|
# | Value1  | Value2  |

Custom Serializer Example

Create a custom component serializer by subclassing the built-in one, e.g. to include picture captions and descriptions in the output:
from typing import Any

from docling_core.transforms.serializer.base import BaseDocSerializer, SerializationResult
from docling_core.transforms.serializer.common import create_ser_result
from docling_core.transforms.serializer.markdown import (
    MarkdownDocSerializer,
    MarkdownPictureSerializer,
)
from docling_core.types.doc import DoclingDocument, PictureItem
from docling_core.types.doc.document import PictureDescriptionData

class CustomPictureSerializer(MarkdownPictureSerializer):
    """Include picture captions and descriptions in serialization."""

    def serialize(
        self,
        *,
        item: PictureItem,
        doc_serializer: BaseDocSerializer,
        doc: DoclingDocument,
        **kwargs: Any,
    ) -> SerializationResult:
        parts: list[str] = []

        # Emit an image reference when the picture carries an embedded or linked image
        if item.image is not None:
            parts.append(f"![Picture]({item.image.uri})")
        else:
            parts.append("<!-- image -->")

        # Add caption if available
        caption = item.caption_text(doc)
        if caption:
            parts.append(f"Caption: {caption}")

        # Add description annotations if available (e.g. from a picture description model)
        for annotation in item.annotations:
            if isinstance(annotation, PictureDescriptionData):
                parts.append(f"Description: {annotation.text}")

        return create_ser_result(text="\n".join(parts), span_source=item)

# Use custom serializer
serializer = MarkdownDocSerializer(
    doc=doc,
    picture_serializer=CustomPictureSerializer(),
)
result = serializer.serialize()

Complete Example

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from docling_core.transforms.serializer.markdown import (
    MarkdownDocSerializer,
    MarkdownParams
)
from transformers import AutoTokenizer

# Convert document
converter = DocumentConverter()
doc = converter.convert("document.pdf").document

# Configure chunker
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=256,
)

chunker = HybridChunker(tokenizer=tokenizer)
chunks = list(chunker.chunk(dl_doc=doc))

print(f"Created {len(chunks)} chunks")

# Process chunks
for i, chunk in enumerate(chunks[:3]):
    enriched = chunker.contextualize(chunk)
    tokens = tokenizer.count_tokens(enriched)
    print(f"\nChunk {i} ({tokens} tokens):")
    print(enriched[:200])

# Custom serialization
serializer = MarkdownDocSerializer(
    doc=doc,
    params=MarkdownParams(
        image_placeholder="[Image]",
        strict_text=False,
    ),
)

result = serializer.serialize()
print(f"\nSerialized to {len(result.text)} characters")

Chunking Parameters

  • tokenizer: Tokenizer wrapper used for counting tokens (HuggingFace or OpenAI; an OpenAI sketch follows below)
  • max_tokens: Maximum tokens per chunk, configured on the tokenizer wrapper as shown above
  • merge_peers: Whether to merge undersized adjacent chunks (default: True)
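
For OpenAI-style embedding models, docling-core ships an OpenAI tokenizer wrapper (installed via the docling-core[chunking-openai] extra listed under Requirements). A minimal sketch, assuming tiktoken is available and text-embedding-3-large is the target model:
import tiktoken
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

# Wrap the tiktoken encoding that matches the embedding model
tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("text-embedding-3-large"),
    max_tokens=8191,  # context window of text-embedding-3-large
)

chunker = HybridChunker(tokenizer=tokenizer)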

Serialization Parameters

  • table_serializer: Custom table serialization strategy
  • picture_serializer: Custom picture serialization strategy
  • params: Format-specific parameters (image placeholders, text mode, etc.)

Requirements

# For HuggingFace tokenizers
pip install docling transformers

# For OpenAI tokenizers
pip install docling-core[chunking-openai]

# For rich console output (optional)
pip install rich
HybridChunker may trigger a “token indices sequence length” warning from transformers. This is a false alarm and can be safely ignored.

Best Practices

  1. Match tokenizers: Use the same tokenizer for chunking and embedding
  2. Set appropriate limits: Match max_tokens to your embedding model’s context window
  3. Use contextualize(): Always embed the output of chunker.contextualize(chunk) rather than the raw chunk text (see the sketch after this list)
  4. Choose serializers wisely: Triplet format may improve vector representation for tables
  5. Test chunk sizes: Verify chunks fit within your pipeline’s constraints
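
Putting practices 1 and 3 together, here is a minimal embedding sketch. It assumes the sentence-transformers package is installed and reuses chunker and chunks from the Complete Example above:
from sentence_transformers import SentenceTransformer

# Same model as the chunker's tokenizer, so token counts line up with the model's limits
embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Embed the contextualized text, not the raw chunk text
texts = [chunker.contextualize(chunk) for chunk in chunks]
embeddings = embed_model.encode(texts)
print(embeddings.shape)  # (number of chunks, embedding dimension)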
