Introduction

A document serializer (or simply a serializer) is a Docling abstraction that takes a DoclingDocument and produces a textual representation in a specific format. Docling defines serializers at the document level and for individual document components:
  • Document serializer: Complete document output
  • Text serializer: Text item formatting
  • Table serializer: Table formatting (Markdown, HTML, etc.)
  • Picture serializer: Image representation
  • List serializer: List formatting
  • Inline serializer: Inline elements (bold, italic, links)

Serializer Architecture

Base Classes

Docling defines a hierarchy of serializer base classes:
  • BaseDocSerializer: Document-level serialization
  • BaseTextSerializer: Text item serialization
  • BaseTableSerializer: Table serialization
  • BasePictureSerializer: Picture serialization
  • BaseListSerializer: List serialization
  • BaseInlineSerializer: Inline element serialization
Source: docling-core serializer base classes
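
The split between a document-level serializer and component serializers can be sketched as plain delegation. This is a minimal, self-contained illustration and not Docling's implementation: Text, Table, and every serializer class below are hypothetical stand-ins that only mirror the roles listed above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Stand-in item types for illustration only; not Docling's real classes.
@dataclass
class Text:
    text: str


@dataclass
class Table:
    rows: list[list[str]]


class TextSerializer(ABC):
    @abstractmethod
    def serialize(self, item: Text) -> str: ...


class TableSerializer(ABC):
    @abstractmethod
    def serialize(self, item: Table) -> str: ...


class PlainTextSerializer(TextSerializer):
    def serialize(self, item: Text) -> str:
        return item.text


class PipeTableSerializer(TableSerializer):
    def serialize(self, item: Table) -> str:
        # One line per row, cells separated by " | "
        return "\n".join(" | ".join(row) for row in item.rows)


class DocSerializer:
    """Document-level serializer that hands each item to a component serializer."""

    def __init__(self, text_serializer: TextSerializer, table_serializer: TableSerializer):
        self.text_serializer = text_serializer
        self.table_serializer = table_serializer

    def serialize(self, items: list) -> str:
        parts = []
        for item in items:
            if isinstance(item, Text):
                parts.append(self.text_serializer.serialize(item))
            elif isinstance(item, Table):
                parts.append(self.table_serializer.serialize(item))
        return "\n\n".join(parts)


doc_items = [Text("Quarterly results"), Table([["Q1", "10"], ["Q2", "12"]])]
serialized = DocSerializer(PlainTextSerializer(), PipeTableSerializer()).serialize(doc_items)
print(serialized)
```

Swapping PipeTableSerializer for another TableSerializer changes only how tables are rendered; the document-level loop stays the same.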

Key Method

Every document serializer implements one primary method, serialize():
from abc import ABC, abstractmethod

class BaseDocSerializer(ABC):
    @abstractmethod
    def serialize(self, **kwargs) -> tuple[str, dict]:
        """Serialize the document.
        
        Returns:
            tuple: (serialized_text, metadata)
                - serialized_text: The formatted output
                - metadata: Information about which components were serialized
        """
        pass
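
A concrete subclass then only has to honor that contract. The sketch below assumes the signature exactly as shown above and uses a stand-in base class (DocSerializerContract) so it runs without Docling installed; UppercaseSerializer and its separator option are hypothetical.

```python
from abc import ABC, abstractmethod


class DocSerializerContract(ABC):
    """Stand-in mirroring the serialize() signature shown above."""

    @abstractmethod
    def serialize(self, **kwargs) -> tuple[str, dict]: ...


class UppercaseSerializer(DocSerializerContract):
    def __init__(self, paragraphs: list[str]):
        self.paragraphs = paragraphs

    def serialize(self, **kwargs) -> tuple[str, dict]:
        # "separator" is a hypothetical option read from kwargs
        separator = kwargs.get("separator", "\n")
        text = separator.join(p.upper() for p in self.paragraphs)
        return text, {"elements": len(self.paragraphs)}


text, metadata = UppercaseSerializer(["alpha", "beta"]).serialize(separator=" / ")
print(text)
print(metadata)
```

Because serialize() is abstract, instantiating the base class directly raises TypeError, which catches subclasses that forget to implement it.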

Serializer Provider

A BaseSerializerProvider abstracts the serialization strategy from the document instance, allowing flexible serializer selection:
class BaseSerializerProvider:
    def get_serializer(self, format: str) -> BaseDocSerializer:
        """Get appropriate serializer for the requested format."""
        pass
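
One way such a provider could be realized is a small registry that maps a format name to a serializer factory. This is a hedged sketch, not the docling-core implementation: RegistrySerializerProvider, its register() method, and the two toy serializers are assumptions made for illustration.

```python
from typing import Callable


# Toy serializers standing in for real document serializers.
class UpperSerializer:
    def __init__(self, doc: str):
        self.doc = doc

    def serialize(self) -> str:
        return self.doc.upper()


class BracketSerializer:
    def __init__(self, doc: str):
        self.doc = doc

    def serialize(self) -> str:
        return f"[{self.doc}]"


class RegistrySerializerProvider:
    """Hypothetical provider: picks a serializer factory by format name."""

    def __init__(self) -> None:
        self._factories: dict[str, Callable[[str], object]] = {}

    def register(self, fmt: str, factory: Callable[[str], object]) -> None:
        self._factories[fmt] = factory

    def get_serializer(self, fmt: str, doc: str):
        try:
            factory = self._factories[fmt]
        except KeyError:
            raise ValueError(f"No serializer registered for format {fmt!r}") from None
        return factory(doc)


provider = RegistrySerializerProvider()
provider.register("upper", UpperSerializer)
provider.register("bracket", BracketSerializer)

print(provider.get_serializer("upper", "hello").serialize())
print(provider.get_serializer("bracket", "hello").serialize())
```

The caller never names a serializer class, so the strategy can be changed by registering a different factory under the same format name.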

Built-in Serializers

Docling provides predefined serializers for common formats:

Markdown Serializer

Class: MarkdownDocSerializer
Features:
  • Converts document structure to Markdown syntax
  • Preserves headings, lists, tables, and links
  • Handles images as references or inline data
  • Supports custom formatting options
Usage via export method:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Shorthand using export method
markdown = result.document.export_to_markdown()
print(markdown)
Direct usage:
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

serializer = MarkdownDocSerializer(doc=result.document)
markdown_text, metadata = serializer.serialize()

print(f"Serialized {len(metadata['elements'])} elements")
print(markdown_text)
Output example:
# Document Title

## Section 1

This is a paragraph with **bold** and *italic* text.

- List item 1
- List item 2

| Column A | Column B |
|----------|----------|
| Cell 1   | Cell 2   |

HTML Serializer

Class: HTMLDocSerializer
Features:
  • Semantic HTML5 output
  • Preserves document structure with appropriate tags
  • CSS-friendly class names
  • Table structure preserved
Usage via export method:
html = result.document.export_to_html()
print(html)
Direct usage:
from docling_core.transforms.serializer.html import HTMLDocSerializer

serializer = HTMLDocSerializer(doc=result.document)
html_text, metadata = serializer.serialize()
Output example:
<article>
  <h1>Document Title</h1>
  <section>
    <h2>Section 1</h2>
    <p>This is a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
    <ul>
      <li>List item 1</li>
      <li>List item 2</li>
    </ul>
    <table>
      <tr><th>Column A</th><th>Column B</th></tr>
      <tr><td>Cell 1</td><td>Cell 2</td></tr>
    </table>
  </section>
</article>

DocTags Serializer

Class: DocTagsDocSerializer
Features:
  • Structured format with bounding box information
  • Used for training vision-language models
  • Preserves spatial layout information
Usage:
doctags = result.document.export_to_doctags()
Output example:
<title bbox="[50, 100, 300, 150]">Document Title</title>
<paragraph bbox="[50, 200, 400, 250]">This is content.</paragraph>

DoclingDocument Export Methods

The DoclingDocument class provides convenient export shortcuts:

export_to_markdown()

markdown: str = doc.export_to_markdown(
    image_placeholder="<image>",  # Placeholder text emitted in place of images
    # Additional options...
)

export_to_html()

html: str = doc.export_to_html(
    # HTML-specific options...
)

export_to_dict()

data: dict = doc.export_to_dict()
# Returns a dictionary representation of the document structure

export_to_doctags()

doctags: str = doc.export_to_doctags()

JSON via export_to_dict()

import json

# Serialize the dictionary representation to a JSON string
json_str = json.dumps(doc.export_to_dict(), indent=2)
These export methods are convenience wrappers that internally instantiate and use the corresponding serializers. For advanced use cases requiring custom configuration, use serializers directly.
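
The wrapper relationship described above can be sketched with two toy classes. MiniDocument and MiniMarkdownSerializer are hypothetical names used only to show the pattern; the heading_prefix option is likewise an assumption, standing in for any configuration that the shortcut does not expose.

```python
class MiniMarkdownSerializer:
    """Toy serializer with one configurable option."""

    def __init__(self, doc: "MiniDocument", heading_prefix: str = "#"):
        self.doc = doc
        self.heading_prefix = heading_prefix

    def serialize(self) -> str:
        lines = [f"{self.heading_prefix} {self.doc.title}", ""]
        lines.extend(self.doc.paragraphs)
        return "\n".join(lines)


class MiniDocument:
    def __init__(self, title: str, paragraphs: list[str]):
        self.title = title
        self.paragraphs = paragraphs

    def export_to_markdown(self) -> str:
        # Convenience wrapper: build the serializer with defaults and run it
        return MiniMarkdownSerializer(self).serialize()


doc = MiniDocument("Report", ["First paragraph."])

shortcut = doc.export_to_markdown()
custom = MiniMarkdownSerializer(doc, heading_prefix="##").serialize()
print(shortcut)
print(custom)
```

The shortcut covers the default case; constructing the serializer directly is only needed when an option has to change.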

Custom Serializers

You can create custom serializers for specialized output formats:

Document Serializer Example

from docling_core.transforms.serializer.base import BaseDocSerializer
from docling_core.types.doc import DoclingDocument, TextItem, TableItem

class CustomDocSerializer(BaseDocSerializer):
    def __init__(self, doc: DoclingDocument):
        self.doc = doc
    
    def serialize(self, **kwargs) -> tuple[str, dict]:
        output = []
        metadata = {"elements": []}
        
        for item, level in self.doc.iterate_items():
            if isinstance(item, TextItem):
                # Custom text formatting
                indent = "  " * level
                output.append(f"{indent}{item.text}")
                metadata["elements"].append(item.label)
            
            elif isinstance(item, TableItem):
                # Custom table formatting
                output.append("[TABLE]")
                metadata["elements"].append("table")
        
        return "\n".join(output), metadata

# Usage
serializer = CustomDocSerializer(doc=result.document)
custom_output, meta = serializer.serialize()
print(custom_output)

Component Serializer Example

import csv
import io

from docling_core.transforms.serializer.base import (
    BaseDocSerializer,
    BaseTableSerializer,
)
from docling_core.types.doc import DoclingDocument, TableItem

class CSVTableSerializer(BaseTableSerializer):
    def serialize_table(self, table: TableItem, **kwargs) -> str:
        # csv.writer quotes cells that contain commas, quotes, or newlines
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        for row in table.data.grid:
            writer.writerow([cell.text for cell in row])
        return buffer.getvalue().rstrip("\n")

# Usage with custom document serializer
class CSVDocSerializer(BaseDocSerializer):
    def __init__(self, doc: DoclingDocument):
        self.doc = doc
        self.table_serializer = CSVTableSerializer()

    def serialize(self, **kwargs) -> tuple[str, dict]:
        output = []
        for table in self.doc.tables:
            output.append(self.table_serializer.serialize_table(table))
            output.append("")  # Blank line between tables

        return "\n".join(output), {"table_count": len(self.doc.tables)}
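
When emitting CSV, cell text that itself contains a comma needs quoting, otherwise the row gains an extra column on re-import. The standalone sketch below uses only Python's csv module on plain lists (no Docling types) to show the difference; rows_to_csv is a hypothetical helper name.

```python
import csv
import io


def rows_to_csv(rows: list[list[str]]) -> str:
    """Write rows with csv.writer so special characters are quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


row = ["Acme, Inc.", "42"]

naive = ",".join(row)          # comma inside the first cell is not protected
quoted = rows_to_csv([row])    # first cell is wrapped in double quotes

naive_parsed = next(csv.reader([naive]))
quoted_parsed = next(csv.reader(io.StringIO(quoted)))

print(naive_parsed)
print(quoted_parsed)
```

Parsing the naive string back yields three cells, while the csv.writer output round-trips to the original two.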

Serialization with Options

Many serializers accept configuration options:
from docling_core.transforms.serializer.markdown import MarkdownDocSerializer

serializer = MarkdownDocSerializer(
    doc=result.document,
    # Options (if supported)
)

markdown, metadata = serializer.serialize(
    image_mode="placeholder",  # How to handle images
    max_depth=5,                # Maximum heading depth
    # Additional parameters
)

Metadata from Serialization

Serializers return metadata about the serialization process:
markdown, metadata = serializer.serialize()

print(f"Elements serialized: {metadata.get('elements', [])}")
print(f"Tables included: {metadata.get('table_count', 0)}")
print(f"Images included: {metadata.get('image_count', 0)}")
This metadata is useful for:
  • Tracking which document components were included
  • Debugging serialization issues
  • Generating statistics about the output
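
As an illustration of the statistics use case, the sketch below counts serialized elements by label. It assumes the metadata shape used in the examples on this page (a dict with an "elements" list); the summarize helper and the sample labels are hypothetical.

```python
from collections import Counter


def summarize(metadata: dict) -> dict:
    """Count serialized elements per label, tolerating a missing key."""
    labels = [str(label) for label in metadata.get("elements", [])]
    return {"total": len(labels), "by_label": dict(Counter(labels))}


sample_metadata = {"elements": ["title", "paragraph", "paragraph", "table"]}
summary = summarize(sample_metadata)
print(summary)
```

Using metadata.get with a default keeps the helper usable with serializers that report no element list at all.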

Integration with Export Workflows

Save to File

import json
from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("input.pdf")

# Export to Markdown file
markdown = result.document.export_to_markdown()
Path("output.md").write_text(markdown, encoding="utf-8")

# Export to HTML file
html = result.document.export_to_html()
Path("output.html").write_text(html, encoding="utf-8")

# Export to JSON file
json_data = result.document.export_to_dict()
Path("output.json").write_text(json.dumps(json_data, indent=2), encoding="utf-8")

Batch Export

from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
input_files = ["report.pdf", "slides.pdf"]  # Hypothetical input paths

for input_file in input_files:
    result = converter.convert(input_file)

    # Export to multiple formats
    output_base = Path(input_file).stem

    Path(f"{output_base}.md").write_text(
        result.document.export_to_markdown(), encoding="utf-8"
    )

    Path(f"{output_base}.html").write_text(
        result.document.export_to_html(), encoding="utf-8"
    )
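
In a batch run, one unreadable input should not abort the remaining files. The sketch below shows one way to isolate failures and collect them for reporting. It takes the conversion step as a callable so it runs without Docling; batch_export and fake_convert are hypothetical names, and in practice the callable would wrap converter.convert(...) followed by export_to_markdown().

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def batch_export(
    input_files: Iterable[str],
    convert_to_markdown: Callable[[str], str],
    output_dir: Path,
) -> dict[str, str]:
    """Write one .md file per input; return {input: error message} for failures."""
    output_dir.mkdir(parents=True, exist_ok=True)
    failures: dict[str, str] = {}
    for input_file in input_files:
        try:
            markdown = convert_to_markdown(input_file)
        except Exception as exc:  # keep going; report the failure afterwards
            failures[str(input_file)] = str(exc)
            continue
        target = output_dir / f"{Path(input_file).stem}.md"
        target.write_text(markdown, encoding="utf-8")
    return failures


def fake_convert(path: str) -> str:
    # Stand-in for the real conversion step
    if path.endswith("bad.pdf"):
        raise ValueError("unreadable")
    return f"# {Path(path).stem}"


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "exports"
    failures = batch_export(["a.pdf", "bad.pdf"], fake_convert, out_dir)
    written = sorted(p.name for p in out_dir.iterdir())

print(failures)
print(written)
```

Writing into a dedicated output directory also avoids scattering exports across the current working directory.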

Advanced Serialization Techniques

Conditional Element Inclusion

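from docling_core.transforms.serializer.base import BaseDocSerializer
from docling_core.types.doc import DoclingDocument, TableItem, TextItem

class SelectiveMarkdownSerializer(BaseDocSerializer):
    def __init__(self, doc: DoclingDocument, include_tables: bool = True):
        self.doc = doc
        self.include_tables = include_tables

    def _format_table(self, table: TableItem) -> str:
        # Reuse the table item's own Markdown export
        return table.export_to_markdown(doc=self.doc)

    def serialize(self, **kwargs) -> tuple[str, dict]:
        output = []

        for item, level in self.doc.iterate_items():
            if isinstance(item, TextItem):
                output.append(item.text)
            elif isinstance(item, TableItem) and self.include_tables:
                # Include table
                output.append(self._format_table(item))

        return "\n\n".join(output), {}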

Hierarchical Formatting

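from docling_core.transforms.serializer.base import BaseDocSerializer
from docling_core.types.doc import DocItemLabel, DoclingDocument, TextItem

HEADING_LABELS = {DocItemLabel.TITLE, DocItemLabel.SECTION_HEADER}

class IndentedSerializer(BaseDocSerializer):
    def __init__(self, doc: DoclingDocument):
        self.doc = doc

    def serialize(self, **kwargs) -> tuple[str, dict]:
        output = []

        for item, level in self.doc.iterate_items():
            if isinstance(item, TextItem):
                indent = "  " * level
                # Headings get "#" markers; all other text becomes a list entry
                prefix = "#" * (level + 1) if item.label in HEADING_LABELS else "-"
                output.append(f"{indent}{prefix} {item.text}")

        return "\n".join(output), {}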
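
To make the resulting layout concrete, the same indent-and-prefix rule can be applied to plain tuples. This is a standalone sketch: format_outline and the (text, level, is_heading) tuple shape are illustrative assumptions, not Docling types.

```python
def format_outline(items: list[tuple[str, int, bool]]) -> str:
    """Apply the indent/prefix rule to (text, level, is_heading) tuples."""
    lines = []
    for text, level, is_heading in items:
        indent = "  " * level
        prefix = "#" * (level + 1) if is_heading else "-"
        lines.append(f"{indent}{prefix} {text}")
    return "\n".join(lines)


outline = format_outline(
    [
        ("Report", 0, True),
        ("Scope", 1, True),
        ("First point", 2, False),
    ]
)
print(outline)
```

Each nesting level adds two spaces of indentation, and a heading at level n carries n + 1 "#" characters.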

Performance Considerations

Large Documents

For very large documents, consider streaming output:
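from pathlib import Path

class StreamingSerializer(BaseDocSerializer):
    def __init__(self, doc: DoclingDocument):
        self.doc = doc

    def _format_item(self, item, level: int) -> str:
        return "  " * level + getattr(item, "text", "")

    def serialize(self, **kwargs) -> tuple[str, dict]:
        lines = (self._format_item(i, lvl) for i, lvl in self.doc.iterate_items())
        return "\n".join(lines), {}

    def serialize_to_file(self, output_path: Path, **kwargs) -> None:
        with output_path.open("w", encoding="utf-8") as f:
            for item, level in self.doc.iterate_items():
                # Write incrementally instead of building full string
                f.write(self._format_item(item, level) + "\n")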
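
The same idea can be expressed with a generator that yields one formatted line at a time, so the writer works with any text stream (a file, a socket wrapper, or an in-memory buffer). The sketch below is self-contained; iter_lines, stream_to, and the (text, level) tuples are hypothetical stand-ins for iterating over document items.

```python
import io
from typing import Iterable, Iterator, TextIO


def iter_lines(items: Iterable[tuple[str, int]]) -> Iterator[str]:
    """Yield one formatted line per (text, level) pair, lazily."""
    for text, level in items:
        yield "  " * level + text


def stream_to(handle: TextIO, items: Iterable[tuple[str, int]]) -> int:
    """Write lines as they are produced; return the number of lines written."""
    count = 0
    for line in iter_lines(items):
        handle.write(line + "\n")
        count += 1
    return count


buffer = io.StringIO()
written = stream_to(buffer, [("Title", 0), ("Section", 1), ("Body text", 2)])
print(written)
print(buffer.getvalue(), end="")
```

Only one line is held in memory at a time, which is what keeps the approach suitable for very large documents.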

Caching Serialized Output

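class CachedSerializer(BaseDocSerializer):
    def __init__(self, doc: DoclingDocument):
        self.doc = doc
        self._cache: dict = {}

    def serialize(self, **kwargs) -> tuple[str, dict]:
        # Cache per instance, keyed on the (hashable) keyword arguments
        key = tuple(sorted(kwargs.items()))
        if key not in self._cache:
            self._cache[key] = self._do_serialize(**kwargs)
        return self._cache[key]

    def _do_serialize(self, **kwargs) -> tuple[str, dict]:
        # The expensive serialization work goes here
        raise NotImplementedError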

Use Cases

  • Use Markdown or HTML serializers to generate previews for document management systems.
  • Export to plain text for indexing in search engines while preserving document metadata.
  • Serialize to Markdown for feeding into language models, optionally using chunking first.
  • Export to HTML for direct publishing to web platforms or CMSs.
  • Convert between document formats by combining Docling conversion with serialization.

Examples

For detailed examples of serialization in action, see:

  • DoclingDocument: Learn about the document representation being serialized
  • Chunking: Chunk documents before or after serialization
  • docling-core Serializers: Explore serializer base class definitions
  • Examples: See serialization in practical examples
