
Overview

Once you’ve converted a document to a DoclingDocument, Docling offers multiple export formats:
  • Markdown: Human-readable text with formatting
  • HTML: Rich HTML with embedded or linked images
  • JSON: Structured data for programmatic access
  • DocTags: Structured text format for downstream NLP
  • Plain Text: Unformatted text content
  • YAML: Human-readable structured data
Every export is generated from the same DoclingDocument. How much structure, metadata, and image content survives depends on the format, as summarized under Export Comparison.

Quick Export

Basic export, shown here for Markdown (HTML, JSON, and DocTags have analogous export_to_* and save_as_* methods, covered below):
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to Markdown string
markdown = result.document.export_to_markdown()
print(markdown)

# Save to file
result.document.save_as_markdown("output.md")
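
When one script writes several formats, the save call can be picked from the output file's suffix. The sketch below is a minimal illustration of that idea: `save_by_suffix` and `SAVE_METHODS` are hypothetical names, not part of Docling, and the only assumption about the document is that it exposes the `save_as_markdown`, `save_as_html`, and `save_as_json` methods used in this guide.

```python
from pathlib import Path

# Suffix -> name of the save method used in this guide.
# This mapping is an illustration, not a Docling feature.
SAVE_METHODS = {
    ".md": "save_as_markdown",
    ".html": "save_as_html",
    ".json": "save_as_json",
}


def save_by_suffix(document, path):
    """Save `document` with the method that matches the suffix of `path`.

    `document` is expected to be a DoclingDocument (e.g. result.document);
    any object exposing the same save_as_* methods works as well.
    Returns the name of the method that was called.
    """
    path = Path(path)
    try:
        method_name = SAVE_METHODS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported export suffix: {path.suffix!r}") from None
    getattr(document, method_name)(path)
    return method_name
```

With a converted document this would be called as `save_by_suffix(result.document, "output.md")`. An unknown suffix raises `ValueError` rather than silently writing nothing.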

Markdown Export

Markdown is the most common export format, ideal for RAG, documentation, and human reading.

Basic Markdown

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Standard Markdown with structure
markdown = result.document.export_to_markdown()
print(markdown)
Example output:
# Document Title

## Section 1

This is a paragraph with **bold** and *italic* text.

### Subsection 1.1

- Bullet point 1
- Bullet point 2

| Header 1 | Header 2 |
|----------|----------|
| Cell 1   | Cell 2   |
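
Because exported Markdown is often fed into RAG pipelines, a common follow-up step is to split it into sections. The sketch below shows one minimal way to do that with the standard library. `split_by_headings` is a hypothetical helper, not part of Docling, and it assumes every line starting with `#` is a heading, so `#` lines inside code blocks would be misread.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Split Markdown into (heading, body) chunks, one per heading.

    Text before the first heading is returned with heading None.
    """
    chunks = []
    heading, body = None, []

    def flush():
        # Keep a chunk if it has a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

A call such as `split_by_headings(result.document.export_to_markdown())` yields one chunk per section, which can then be embedded or indexed separately.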

Plain Text (No Formatting)

# Remove all formatting for pure text
text = result.document.export_to_markdown(strict_text=True)
print(text)
Example output:
Document Title

Section 1

This is a paragraph with bold and italic text.

Subsection 1.1

Bullet point 1
Bullet point 2

Header 1 Header 2
Cell 1 Cell 2

Image Handling

Control how images are included:
from docling_core.types.doc import ImageRefMode

markdown = result.document.export_to_markdown(
    image_mode=ImageRefMode.PLACEHOLDER
)
Output:
<!-- image -->
Each image is replaced by a placeholder (by default an HTML comment); the image data itself is not included.

Save with Options

from docling_core.types.doc import ImageRefMode

result.document.save_as_markdown(
    "output.md",
    image_mode=ImageRefMode.EMBEDDED,
    strict_text=False,
)

HTML Export

HTML export creates rich, formatted output with embedded or linked images.

Basic HTML

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to HTML
html = result.document.export_to_html()
print(html)
Example output:
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Document Title</title>
</head>
<body>
    <h1>Document Title</h1>
    <h2>Section 1</h2>
    <p>This is a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
    <table>
        <tr><th>Header 1</th><th>Header 2</th></tr>
        <tr><td>Cell 1</td><td>Cell 2</td></tr>
    </table>
</body>
</html>

HTML with Embedded Images

from docling_core.types.doc import ImageRefMode

# Embed images as base64 data URLs
html = result.document.export_to_html(
    image_mode=ImageRefMode.EMBEDDED
)

# Save to file
result.document.save_as_html(
    "output.html",
    image_mode=ImageRefMode.EMBEDDED
)
Images are embedded directly in HTML, creating a standalone file.
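
One way to confirm that an exported HTML file really is standalone is to check that no `<img>` tag points outside the file. The sketch below uses only the standard library and assumes embedded images appear as `data:` URIs in the `src` attribute. `external_image_sources` is a hypothetical helper, not a Docling function.

```python
from html.parser import HTMLParser


class _ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            self.sources.append(dict(attrs).get("src") or "")


def external_image_sources(html):
    """Return the <img> sources that are not data: URIs."""
    collector = _ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return [src for src in collector.sources if not src.startswith("data:")]
```

An empty result for `external_image_sources(html)` suggests the file can be moved or shared without accompanying image files.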

HTML with Page Images

from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode

# Generate page images during conversion
pipeline_options = PdfPipelineOptions(
    generate_page_images=True,
    images_scale=2.0,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("document.pdf")

# Export HTML with embedded page images
result.document.save_as_html(
    "output.html",
    image_mode=ImageRefMode.EMBEDDED
)

JSON Export

JSON export provides structured, machine-readable document data.

Basic JSON

import json
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to dict
data = result.document.export_to_dict()

# Pretty-print JSON
print(json.dumps(data, indent=2))

# Save to file
with open("output.json", "w") as f:
    json.dump(data, f, indent=2)

# Or use helper
result.document.save_as_json("output.json")

JSON Structure

A simplified illustration of the exported structure; real output contains more fields, and the exact layout depends on the docling-core version:

{
  "schema_name": "DoclingDocument",
  "version": "1.0.0",
  "name": "document.pdf",
  "metadata": {
    "pages": 10,
    "format": "PDF"
  },
  "pages": [
    {
      "page_no": 1,
      "size": {"width": 612.0, "height": 792.0}
    }
  ],
  "furniture": [
    {
      "self_ref": "#/texts/0",
      "type": "subtitle-level-1",
      "text": "Document Title"
    }
  ],
  "body": [
    {
      "self_ref": "#/texts/1",
      "type": "paragraph",
      "text": "This is a paragraph."
    }
  ]
}
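
Values such as `"#/texts/1"` are references to a location inside the exported document. The sketch below resolves such a reference against a plain dict, assuming the path segments after `#/` name dict keys or list indices starting from the document root. `resolve_ref` is a hypothetical helper, and JSON Pointer escape sequences (`~0`, `~1`) are not handled.

```python
def resolve_ref(data, ref):
    """Follow a local reference such as "#/texts/1" inside an exported dict."""
    if not ref.startswith("#/"):
        raise ValueError(f"Not a local reference: {ref!r}")
    node = data
    for part in ref[2:].split("/"):
        # Lists are indexed by integer position, dicts by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node
```

Given a dict with a top-level `texts` list, `resolve_ref(data, "#/texts/1")` returns the second entry of that list. A missing key or index raises the usual `KeyError` or `IndexError`.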

JSON with Images

from docling_core.types.doc import ImageRefMode

# image_mode is an option of save_as_json; export_to_dict() does not
# accept it in recent docling-core releases (verify against your version).

# Embed images as base64 data in the JSON file
result.document.save_as_json(
    "output.json",
    image_mode=ImageRefMode.EMBEDDED
)

# Or write placeholders instead of image data
result.document.save_as_json(
    "output_placeholders.json",
    image_mode=ImageRefMode.PLACEHOLDER
)

DocTags Export

DocTags is a structured text format designed for NLP pipelines:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to DocTags
doctags = result.document.export_to_doctags()
print(doctags)

# Save to file
result.document.save_as_doctags("output.doctags.txt")
Example output:
<title>Document Title</title>
<section-header>Section 1</section-header>
<paragraph>This is a paragraph with bold and italic text.</paragraph>
<subsection-header>Subsection 1.1</subsection-header>
<list-item>Bullet point 1</list-item>
<list-item>Bullet point 2</list-item>
<table>
  <row>
    <cell>Header 1</cell>
    <cell>Header 2</cell>
  </row>
  <row>
    <cell>Cell 1</cell>
    <cell>Cell 2</cell>
  </row>
</table>
DocTags format is ideal for:
  • Named entity recognition (NER)
  • Document classification
  • Information extraction
  • Custom NLP pipelines
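
For quick experiments, tagged lines can be turned into `(tag, text)` pairs with a regular expression. The sketch below handles only the simplified single-line shape shown in the example output above. Real DocTags output may carry additional tokens and nesting that this does not cover, and `parse_simple_doctags` is a hypothetical helper, not a Docling API.

```python
import re

# Matches a single-line element such as <title>Document Title</title>.
SIMPLE_ELEMENT = re.compile(r"^\s*<([a-z-]+)>(.*)</\1>\s*$")


def parse_simple_doctags(doctags):
    """Return (tag, text) pairs for single-line elements; skip other lines."""
    pairs = []
    for line in doctags.splitlines():
        match = SIMPLE_ELEMENT.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs
```

Container lines such as `<table>` or `<row>` are skipped, so table cells come back as flat `("cell", ...)` pairs without row information.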

YAML Export

YAML offers the same structured data as JSON in a more human-readable form. This example serializes the exported dict with PyYAML:
import yaml
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to dict, then to YAML
data = result.document.export_to_dict()
yaml_str = yaml.safe_dump(data, default_flow_style=False)

print(yaml_str)

# Save to file
with open("output.yaml", "w") as f:
    yaml.safe_dump(data, f, default_flow_style=False)
Example output:
schema_name: DoclingDocument
version: 1.0.0
name: document.pdf
metadata:
  pages: 10
  format: PDF
pages:
  - page_no: 1
    size:
      width: 612.0
      height: 792.0
body:
  - self_ref: '#/texts/1'
    type: paragraph
    text: This is a paragraph.

Batch Export

Export multiple documents to various formats:
import json
from pathlib import Path
from docling_core.types.doc import ImageRefMode
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import ConversionStatus

input_files = list(Path("documents/").glob("*.pdf"))
output_dir = Path("output/")
output_dir.mkdir(parents=True, exist_ok=True)

converter = DocumentConverter()

for result in converter.convert_all(input_files, raises_on_error=False):
    if result.status == ConversionStatus.SUCCESS:
        doc_filename = result.input.file.stem
        
        # Export to multiple formats
        result.document.save_as_markdown(
            output_dir / f"{doc_filename}.md",
            image_mode=ImageRefMode.PLACEHOLDER,
        )
        result.document.save_as_html(
            output_dir / f"{doc_filename}.html",
            image_mode=ImageRefMode.EMBEDDED,
        )
        result.document.save_as_json(
            output_dir / f"{doc_filename}.json",
            image_mode=ImageRefMode.PLACEHOLDER,
        )
        result.document.save_as_doctags(
            output_dir / f"{doc_filename}.doctags.txt"
        )
        
        print(f"Exported: {doc_filename}")
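
The loop above silently skips documents whose status is not SUCCESS. One way to keep track of them is to wrap the result iterator in a small generator that tallies statuses as results pass through. `track_statuses` below is a hypothetical helper. It assumes only that each result has a `status` attribute, and it reads the enum member's `name` when one is available.

```python
from collections import Counter


def track_statuses(results, counts):
    """Yield each result unchanged while tallying its status name in `counts`."""
    for result in results:
        # ConversionStatus is an enum; fall back to str() for other objects.
        counts[getattr(result.status, "name", str(result.status))] += 1
        yield result


# Usage sketch (requires a converter and input_files as defined above):
#
#   counts = Counter()
#   for result in track_statuses(
#       converter.convert_all(input_files, raises_on_error=False), counts
#   ):
#       ...  # export as before
#   print(dict(counts))
```

Because the wrapper is a generator, results are still processed one at a time and no extra documents are held in memory.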

Multimodal Export (Parquet)

Export page images, text, and metadata to Parquet for machine learning:
import pandas as pd
from pathlib import Path
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.utils.export import generate_multimodal_pages

# Generate page images during conversion
pipeline_options = PdfPipelineOptions(
    generate_page_images=True,
    images_scale=2.0,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("document.pdf")

rows = []
for (
    content_text,
    content_md,
    content_dt,
    page_cells,
    page_segments,
    page,
) in generate_multimodal_pages(result):
    rows.append(
        {
            "document": result.input.file.name,
            "page_num": page.page_no + 1,  # page_no is 0-based on these page objects; store a 1-based number
            "image": {
                "width": page.image.width,
                "height": page.image.height,
                "bytes": page.image.tobytes(),
            },
            "text": content_text,
            "markdown": content_md,
            "doctags": content_dt,
            "cells": page_cells,
            "segments": page_segments,
        }
    )

# Export to Parquet
df = pd.json_normalize(rows)
df.to_parquet("output.parquet")

print(f"Exported {len(rows)} pages to output.parquet")
Parquet export is useful for:
  • Training multimodal ML models
  • Building document datasets
  • Efficient storage of page images + text
  • Integration with data science workflows
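
Storing `page.image.tobytes()` means storing uncompressed pixel data, which is why this export grows quickly. The arithmetic below estimates the size per page under two assumptions: the rendered image measures page points multiplied by `images_scale` in each dimension, and it holds one byte per channel. `raw_page_image_bytes` is a hypothetical helper.

```python
def raw_page_image_bytes(width_pt, height_pt, scale, channels=3):
    """Estimate the uncompressed size of one rendered page image in bytes.

    Assumes the image measures (points * scale) pixels per side and stores
    one byte per channel (for example 3 channels for RGB).
    """
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px * height_px * channels


# US Letter page (612 x 792 pt) rendered with images_scale=2.0:
print(raw_page_image_bytes(612.0, 792.0, 2.0))  # 5816448 bytes, about 5.5 MiB
```

At that rate a 10-page document carries roughly 55 MiB of raw pixels before Parquet compression, so lowering `images_scale` is the most direct way to shrink the dataset.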

Custom Export Pipeline

Access document structure programmatically:
from docling.document_converter import DocumentConverter
from docling_core.types.doc import (
    TextItem,
    TableItem,
    PictureItem,
    SectionHeaderItem,
)

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# Iterate through all items in reading order
for item, level in doc.iterate_items():
    # self_ref looks like "#/tables/0"; keep only the index for filenames
    index = item.self_ref.split("/")[-1]
    if isinstance(item, SectionHeaderItem):
        print(f"Header (level {level}): {item.text}")
    elif isinstance(item, TextItem):
        print(f"Text: {item.text[:50]}...")
    elif isinstance(item, TableItem):
        # Row and column counts live on the table's data object
        print(f"Table: {item.data.num_rows} rows, {item.data.num_cols} cols")
        # Export table to CSV, pandas, etc.
        df = item.export_to_dataframe()
        df.to_csv(f"table_{index}.csv")
    elif isinstance(item, PictureItem):
        print(f"Picture: {item.self_ref}")
        # get_image() returns a PIL image, or None if no image data was kept
        image = item.get_image(doc)
        if image is not None:
            image.save(f"picture_{index}.png")
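
The same iteration can feed a custom exporter. As a small example, the sketch below renders collected headings as an indented outline. `format_outline` is a hypothetical helper that works on plain `(text, level)` pairs, so it does not depend on Docling itself.

```python
def format_outline(headings, indent="  "):
    """Render (text, level) pairs as an indented outline, one heading per line."""
    if not headings:
        return ""
    # Indent relative to the shallowest level that actually occurs.
    base = min(level for _, level in headings)
    return "\n".join(
        f"{indent * (level - base)}{text}" for text, level in headings
    )


# Usage sketch, assuming SectionHeaderItem exposes a `level` attribute
# (as in current docling-core):
#
#   headings = [
#       (item.text, item.level)
#       for item, _ in doc.iterate_items()
#       if isinstance(item, SectionHeaderItem)
#   ]
#   print(format_outline(headings))
```

The usage sketch reads the heading's own `level` rather than the depth yielded by `iterate_items()`, since the latter describes position in the document tree and may be the same for all headings.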

Export Comparison

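| Format     | Use Case                             | Images          | Structure       | Size     |
|------------|--------------------------------------|-----------------|-----------------|----------|
| Markdown   | RAG, documentation, human reading    | Embedded/Linked | Basic           | Small    |
| HTML       | Web display, rich previews           | Embedded/Linked | Rich            | Medium   |
| JSON       | API integration, programmatic access | Embedded/Linked | Full            | Medium   |
| DocTags    | NLP pipelines, text analysis         | No              | Semantic        | Small    |
| Plain Text | Search indexing, simple RAG          | No              | None            | Smallest |
| YAML       | Configuration, human editing         | Embedded/Linked | Full            | Medium   |
| Parquet    | ML datasets, analytics               | Raw bytes       | Full + metadata | Large    |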

Best Practices

1. Choose the right format for your use case

  • RAG/Search: Markdown or Plain Text
  • Web display: HTML with embedded images
  • API integration: JSON
  • NLP pipelines: DocTags
  • ML training: Parquet

2. Consider image handling

  • Standalone files: Use ImageRefMode.EMBEDDED
  • Separate image files: Use ImageRefMode.REFERENCED and save images separately
  • Text-only: Use ImageRefMode.PLACEHOLDER or strict_text=True

3. Optimize for file size

  • Use strict_text=True for smallest Markdown
  • Use ImageRefMode.PLACEHOLDER to exclude image data
  • Use JSON over YAML for large datasets (more compact)

4. Preserve structure

  • Use JSON or YAML for full document structure
  • Use DocTags for semantic structure only
  • Use Markdown for human-readable structure
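
The file-size guidance above can be checked against your own data with a few lines of standard-library code. The sketch below measures how much indentation adds to a JSON export and computes how much base64 embedding inflates binary image data. Both function names are hypothetical helpers, not Docling APIs.

```python
import json


def json_sizes(data):
    """Return (compact, indented) byte sizes of `data` serialized as JSON."""
    compact = json.dumps(data, separators=(",", ":"))
    indented = json.dumps(data, indent=2)
    return len(compact.encode("utf-8")), len(indented.encode("utf-8"))


def base64_size(num_bytes):
    """Size in bytes of `num_bytes` of binary data after base64 encoding."""
    # Base64 emits 4 output characters for every 3 input bytes, padded.
    return 4 * ((num_bytes + 2) // 3)


# A 3,000,000-byte image grows by one third when embedded as base64:
print(base64_size(3_000_000))  # 4000000
```

Passing the result of `result.document.export_to_dict()` to `json_sizes` shows how much a pretty-printed file costs compared with a compact one for a given document.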

Next Steps

  • Basic Conversion: Learn the fundamentals of document conversion
  • Batch Processing: Export large document collections efficiently
  • LangChain Integration: Use exports in RAG pipelines with LangChain
  • LlamaIndex Integration: Build search indexes with LlamaIndex
