HTML Backend

Overview

The HTML backend (HTMLDocumentBackend) parses HTML documents and web pages and converts them directly to the DoclingDocument format. It preserves document structure and formatting, and it handles complex HTML layouts, including tables, lists, and embedded images.

Features

Semantic structure preservation - Headings, paragraphs, lists, tables
Rich formatting support - Bold, italic, underline, strikethrough, code
Hyperlink preservation - Internal and external links
Table extraction - Complex tables with merged cells and rich content
Image handling - Embedded images with remote/local fetching
List hierarchy - Nested lists with proper indentation
Code blocks - Monospace code and pre-formatted text
Furniture detection - Automatic header/footer/title handling

Usage

Basic Conversion

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("page.html")

doc = result.document
print(doc.export_to_markdown())

With Backend Options

from pathlib import Path

from docling.datamodel.backend_options import HTMLBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, HTMLFormatOption

backend_options = HTMLBackendOptions(
    fetch_images=True,
    source_uri=Path("index.html"),
    add_title=True,
    infer_furniture=True,
)

# format_options is keyed by the input format, not by the option class
converter = DocumentConverter(
    format_options={
        InputFormat.HTML: HTMLFormatOption(
            backend_options=backend_options
        )
    }
)

result = converter.convert("page.html")

HTMLBackendOptions

Configuration options for HTML parsing.

Parameters

kind

Literal['html']

Default: "html"

Backend type identifier. Always set to "html" for HTML backends.

fetch_images

bool

Default: False

Whether the backend should access remote or local resources to parse images in an HTML document.

Enable when:

You want to include embedded images
Processing web pages with external images
Images are needed for final output

options = HTMLBackendOptions(fetch_images=True)

source_uri

AnyUrl | PurePath | None

Default: None

The URI from which the HTML document originates. If provided, the backend uses it to resolve relative paths in the document.

Required for:

Resolving relative image paths
Resolving relative hyperlinks
Remote resource fetching

from pathlib import Path

from pydantic import AnyUrl

from docling.datamodel.backend_options import HTMLBackendOptions

# For local files
options = HTMLBackendOptions(
    source_uri=Path("/path/to/index.html")
)

# For web pages
options = HTMLBackendOptions(
    source_uri=AnyUrl("https://example.com/page.html")
)

add_title

bool

Default: True

Add the content of the HTML <title> tag to the DoclingDocument as furniture-layer content (metadata).

infer_furniture

bool

Default: True

Treat all content before the first heading as furniture.

Automatically marked as furniture:

Content before first <h1>-<h6>
Headers and footers
Navigation elements (when detected)
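
The "everything before the first heading" rule can be illustrated with a small standalone sketch. It is not the backend's implementation: split_furniture and the (tag, text) tuples are hypothetical names used only for illustration, and the no-heading fallback is an assumption of this sketch.

```python
import re

# Matches the tag names h1 through h6.
HEADING = re.compile(r"^h[1-6]$")


def split_furniture(blocks):
    """Split (tag, text) pairs into (furniture, body) at the first heading."""
    for index, (tag, _) in enumerate(blocks):
        if HEADING.match(tag):
            return list(blocks[:index]), list(blocks[index:])
    # Assumption of this sketch: with no heading at all, keep everything as body.
    return [], list(blocks)


blocks = [
    ("nav", "Home | Docs"),
    ("p", "Cookie notice"),
    ("h1", "Guide"),
    ("p", "First paragraph"),
]
furniture, body = split_furniture(blocks)
print("Furniture:", [text for _, text in furniture])
print("Body:", [text for _, text in body])
```

If a page has meaningful introductory text before its first heading, this rule would push that text into the furniture layer, which is the situation infer_furniture=False is meant to avoid.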

enable_remote_fetch

bool

Default: False

Enable fetching of remote resources referenced in the HTML.

enable_local_fetch

bool

Default: False

Enable fetching of local resources referenced in the HTML.

Supported Elements

Headings

H1-H6 Hierarchy

HTML headings map to document structure:

<h1> → Title or top-level heading (Level 0)
<h2>-<h6> → Headings Level 1-5

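from docling_core.types.doc.document import SectionHeaderItem, TitleItem

for item, _ in doc.iterate_items():
    if isinstance(item, TitleItem):
        # The document title carries no level attribute
        print(f"H1: {item.text}")
    elif isinstance(item, SectionHeaderItem):
        print(f"H{item.level + 1}: {item.text}")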

Automatic hierarchy:

Skipped levels create invisible section groups
Maintains proper nesting even with irregular markup
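
As a rough illustration of the tag-to-level mapping above, the following standalone sketch collects headings with the standard library's html.parser and assigns each one the level "heading number minus one". HeadingCollector is a hypothetical helper written for this example and is not part of Docling.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs, where level = heading number - 1."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently being read
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1]) - 1
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level + 1}":
            self.headings.append((self._level, "".join(self._parts).strip()))
            self._level = None


collector = HeadingCollector()
# The sample skips <h3> on purpose to show an irregular hierarchy.
collector.feed("<h1>Guide</h1><h2>Setup</h2><h4>Details</h4>")
collector.close()
for level, text in collector.headings:
    print("  " * level + f"level {level}: {text}")
```

The jump from level 1 to level 3 in the sample is the kind of gap that the backend, according to the list above, bridges with invisible section groups.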

Text and Formatting

Inline Formatting

Supported HTML tags:

<b>, <strong> → Bold
<i>, <em>, <var> → Italic
<u>, <ins> → Underline
<s>, <del> → Strikethrough
<sub> → Subscript
<sup> → Superscript
<code>, <kbd>, <samp> → Code formatting
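
The mapping above can be written down as a lookup table. The sketch below is standalone and uses only the standard library; TAG_TO_STYLE and StyleTracker are hypothetical names for this illustration, not Docling APIs. It tracks which styles are open while walking the markup and records the styles active for each text run.

```python
from html.parser import HTMLParser

# Tag-to-style table mirroring the list of supported tags.
TAG_TO_STYLE = {
    "b": "bold", "strong": "bold",
    "i": "italic", "em": "italic", "var": "italic",
    "u": "underline", "ins": "underline",
    "s": "strikethrough", "del": "strikethrough",
    "sub": "subscript", "sup": "superscript",
    "code": "code", "kbd": "code", "samp": "code",
}


class StyleTracker(HTMLParser):
    """Record (text, sorted list of active styles) for each text run."""

    def __init__(self):
        super().__init__()
        self.active = []  # styles of the currently open formatting tags
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_STYLE:
            self.active.append(TAG_TO_STYLE[tag])

    def handle_endtag(self, tag):
        style = TAG_TO_STYLE.get(tag)
        if style in self.active:
            # Close the most recently opened tag with this style.
            index = len(self.active) - 1 - self.active[::-1].index(style)
            del self.active[index]

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((text, sorted(set(self.active))))


tracker = StyleTracker()
tracker.feed("<p>Plain <strong>bold <em>both</em></strong> <kbd>Ctrl</kbd></p>")
tracker.close()
for text, styles in tracker.runs:
    print(f"{text!r}: {styles}")
```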

from docling_core.types.doc.document import TextItem

for item, _ in doc.iterate_items():
    if isinstance(item, TextItem) and item.formatting:
        print(f"{item.text}:")
        print(f"  Bold: {item.formatting.bold}")
        print(f"  Italic: {item.formatting.italic}")

Hyperlinks

<a href="..."> tags preserve links:

from docling_core.types.doc.document import TextItem

for item, _ in doc.iterate_items():
    if isinstance(item, TextItem) and item.hyperlink:
        print(f"Link: {item.text} → {item.hyperlink}")

Features:

Relative URL resolution (requires source_uri)
Protocol-relative URLs (//example.com)
Fragment identifiers (#section)

Code Blocks

<pre> and <code> elements:

from docling_core.types.doc.document import CodeItem

for item, _ in doc.iterate_items():
    if isinstance(item, CodeItem):
        print(f"Code: {item.text}")

Handling:

<pre> → Preserved formatting and whitespace
<code> → Inline code or code blocks
Nested formatting preserved

Lists

Complete list structure preservation:

<ul>
  <li>Item 1
    <ul>
      <li>Nested item</li>
    </ul>
  </li>
  <li>Item 2</li>
</ul>

<ol start="5">
  <li>Numbered item 5</li>
  <li>Numbered item 6</li>
</ol>

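from docling_core.types.doc.document import ListItem

# iterate_items() yields (item, level) pairs; the level gives the nesting depth
for item, level in doc.iterate_items():
    if isinstance(item, ListItem):
        indent = "  " * level
        marker = item.marker if item.enumerated else "•"
        print(f"{indent}{marker} {item.text}")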

Features:

Ordered (<ol>) and unordered (<ul>) lists
Custom start numbers (<ol start="5">)
Nested lists with proper hierarchy
Inline formatting in list items
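
To make the nesting and start-number behavior concrete, here is a standalone sketch that walks the sample markup above with the standard library's html.parser. ListWalker is a hypothetical helper for this illustration and does not reflect how the backend is implemented. Text that follows a nested list inside the same <li> is ignored for brevity.

```python
from html.parser import HTMLParser


class ListWalker(HTMLParser):
    """Record (depth, marker, text) per <li>, honoring <ol start="N">."""

    def __init__(self):
        super().__init__()
        self.stack = []       # one [kind, next_number] entry per open list
        self.items = []
        self._pending = None  # (depth, marker, text parts) of the open <li>

    def _flush(self):
        if self._pending is not None:
            depth, marker, parts = self._pending
            text = " ".join("".join(parts).split())
            if text:
                self.items.append((depth, marker, text))
            self._pending = None

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush()  # a nested list ends the parent item's own text
            start = int(dict(attrs).get("start") or 1)
            self.stack.append([tag, start])
        elif tag == "li" and self.stack:
            self._flush()
            kind, number = self.stack[-1]
            marker = f"{number}." if kind == "ol" else "•"
            self.stack[-1][1] += 1
            self._pending = (len(self.stack) - 1, marker, [])

    def handle_data(self, data):
        if self._pending is not None:
            self._pending[2].append(data)

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush()
        elif tag in ("ul", "ol") and self.stack:
            self._flush()
            self.stack.pop()


SAMPLE = """
<ul>
  <li>Item 1
    <ul>
      <li>Nested item</li>
    </ul>
  </li>
  <li>Item 2</li>
</ul>
<ol start="5">
  <li>Numbered item 5</li>
  <li>Numbered item 6</li>
</ol>
"""

walker = ListWalker()
walker.feed(SAMPLE)
walker.close()
for depth, marker, text in walker.items:
    print("  " * depth + f"{marker} {text}")
```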

Tables

Advanced table extraction with rich content:

from docling_core.types.doc.document import RichTableCell, TableItem

for table, _ in doc.iterate_items():
    if isinstance(table, TableItem):
        print(f"Table: {table.data.num_rows} x {table.data.num_cols}")

        for cell in table.data.table_cells:
            if isinstance(cell, RichTableCell):
                print(f"Rich cell: {cell.text}")
                # Cell contains nested items (text, images, etc.)
            else:
                print(f"Simple cell: {cell.text}")

Supported:

Row and column headers (<th>)
Merged cells (rowspan, colspan)
Rich cell content (formatted text, images, nested elements)
Simple cells (plain text)
Header detection (<thead>, first row)
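
Merged cells are easiest to reason about once they are expanded onto a rectangular grid. The following standalone sketch shows one common way to do that. The function expand_spans and its (text, rowspan, colspan) input format are assumptions made for this example; they are not the backend's internal representation.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid.

    Every grid position covered by a merged cell repeats that cell's text.
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    num_rows = max(r for r, _ in grid) + 1
    num_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(num_cols)]
            for r in range(num_rows)]


rows = [
    [("Region", 2, 1), ("Sales", 1, 2)],  # rowspan=2 and colspan=2 headers
    [("Q1", 1, 1), ("Q2", 1, 1)],
    [("North", 1, 1), ("10", 1, 1), ("12", 1, 1)],
]
for line in expand_spans(rows):
    print(line)
```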

Rich Table Cells

Cells with complex content become RichTableCell:

<table>
  <tr>
    <td>
      <strong>Bold text</strong>
      <img src="icon.png" />
      <a href="link.html">Link</a>
    </td>
  </tr>
</table>

The cell content is parsed recursively and grouped:

from docling_core.types.doc.document import RichTableCell, TableItem

for table, _ in doc.iterate_items():
    if isinstance(table, TableItem):
        for cell in table.data.table_cells:
            if isinstance(cell, RichTableCell) and cell.ref:
                # Access cell's nested items
                group = cell.ref.resolve(doc)
                for child_ref in group.children:
                    item = child_ref.resolve(doc)
                    print(f"  Cell contains: {item.label}")

Images

Image handling with remote/local fetching:

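from docling_core.types.doc.document import PictureItem

picture_count = 0
for item, _ in doc.iterate_items():
    if isinstance(item, PictureItem):
        # caption_text() resolves the caption references against the document
        print(f"Image: {item.caption_text(doc)}")

        if item.image:
            img = item.image.pil_image
            # self_ref looks like "#/pictures/0", so use a counter for the filename
            img.save(f"extracted_{picture_count}.png")
        picture_count += 1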

Supported:

<img> tags
<figure> with <figcaption>
Alt text as caption fallback
Remote images (with fetch_images=True)
Local images (with source_uri and path resolution)
Base64 data URLs (data:image/png;base64,...)
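
Base64 data URLs carry the image bytes inline, so no fetching is needed to read them. The sketch below shows how such a URL can be decoded with the standard library. The helper decode_data_url is hypothetical and only illustrates the URL format; it says nothing about how the backend handles these URLs internally.

```python
import base64


def decode_data_url(url):
    """Return (mime_type, raw_bytes) for a base64-encoded data: URL."""
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    header, separator, payload = url[len("data:"):].partition(",")
    if not separator or not header.endswith(";base64"):
        raise ValueError("only base64 payloads are handled in this sketch")
    mime_type = header.split(";")[0]
    return mime_type, base64.b64decode(payload)


# Build a tiny data URL from the 8-byte PNG file signature.
png_signature = b"\x89PNG\r\n\x1a\n"
url = "data:image/png;base64," + base64.b64encode(png_signature).decode("ascii")

mime_type, raw = decode_data_url(url)
print(mime_type, raw == png_signature)
```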

Content Layers

Automatic content layer detection:

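from docling_core.types.doc.document import ContentLayer

# iterate_items() yields only body content unless other layers are requested
layers = {ContentLayer.BODY, ContentLayer.FURNITURE}
for item, _ in doc.iterate_items(included_content_layers=layers):
    text = getattr(item, "text", "")  # tables and pictures have no text field
    if item.content_layer == ContentLayer.FURNITURE:
        print(f"Furniture: {text}")
    elif item.content_layer == ContentLayer.BODY:
        print(f"Body: {text}")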

Furniture (metadata):

HTML <title> element
Content before first heading (if infer_furniture=True)
<footer> elements

Body (main content):

All content after first heading
Explicit body content

HTML Cleanup

Automatic cleanup and normalization:

Remove unwanted elements

<script> and <noscript> tags
<style> tags
Hidden elements (hidden attribute)

Fix invalid structure

Block elements inside <p> tags
Nested paragraph correction
Proper flow content handling

Normalize whitespace

<br> tags → newlines
Multiple spaces collapsed
Line breaks preserved in <pre>
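
A few of the cleanup rules above can be sketched with the standard library alone: skipping script, noscript, and style content, turning <br> into a newline, and collapsing whitespace everywhere except inside <pre>. TextCleaner is a hypothetical class for this illustration and is not the backend's code.

```python
import re
from html.parser import HTMLParser


class TextCleaner(HTMLParser):
    """Collect text while applying a few of the cleanup rules listed above."""

    SKIPPED = {"script", "noscript", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._pre_depth = 0   # > 0 while inside <pre>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "pre":
            self._pre_depth += 1
        elif tag == "br" and not self._skip_depth:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "pre" and self._pre_depth:
            self._pre_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if not self._pre_depth:
            # Collapse runs of whitespace outside preformatted text.
            data = re.sub(r"\s+", " ", data)
        self.parts.append(data)


cleaner = TextCleaner()
cleaner.feed(
    "<style>p { color: red }</style>"
    "<p>Hello   world<br>again</p>"
    "<pre>a\n  b</pre>"
)
cleaner.close()
print(repr(cleaner.parts))
```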

URL Resolution

When source_uri is provided:

from pydantic import AnyUrl

from docling.datamodel.backend_options import HTMLBackendOptions

options = HTMLBackendOptions(
    source_uri=AnyUrl("https://example.com/docs/page.html"),
    fetch_images=True,
)

Resolves:

Relative paths: ../images/pic.png → https://example.com/images/pic.png
Absolute paths: /static/img.png → https://example.com/static/img.png
Protocol-relative: //cdn.example.com/img.png → https://cdn.example.com/img.png
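
These three cases follow standard URL resolution rules, which can be reproduced with urllib.parse.urljoin from the Python standard library. The sketch below only demonstrates those rules for the example source_uri; whether the backend uses urljoin internally is not stated here.

```python
from urllib.parse import urljoin

source_uri = "https://example.com/docs/page.html"
references = [
    "../images/pic.png",          # relative path
    "/static/img.png",            # absolute path
    "//cdn.example.com/img.png",  # protocol-relative URL
]

# Resolve every reference against the URI of the originating document.
resolved = {ref: urljoin(source_uri, ref) for ref in references}
for ref, url in resolved.items():
    print(f"{ref} -> {url}")
```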

Advanced Features

Special Blocks

Details and Summary

<details> and <summary> elements:

<details>
  <summary>Click to expand</summary>
  Hidden content here
</details>

Creates section groups with summary as heading.

Addresses

<address> elements converted to text items.

Figures

<figure> with <figcaption> properly linked:

<figure>
  <img src="chart.png" alt="Chart" />
  <figcaption>Figure 1: Sales Data</figcaption>
</figure>

Caption becomes CAPTION label item.

Performance

Speed: Fast declarative parsing, with no layout models involved
Memory: Low to moderate, depending on the size of the HTML
Remote fetching: Conversion can slow down when a page references many images

# Process multiple HTML files
import concurrent.futures

from docling.document_converter import DocumentConverter

def convert_html(file_path):
    # Each call builds its own converter, so no state is shared across threads
    converter = DocumentConverter()
    return converter.convert(file_path)

files = ["page1.html", "page2.html", "page3.html"]
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(executor.map(convert_html, files))

Limitations

Known Limitations:

JavaScript: Dynamic content not executed
CSS: Styling not parsed (except inline semantic tags)
SVG: SVG graphics may not render
Forms: Form elements structure not preserved
iframes: Embedded frames not processed
Canvas: Canvas content not extracted

Troubleshooting

Missing images

Solution: Enable image fetching and set source URI

options = HTMLBackendOptions(
    fetch_images=True,
    source_uri=Path("index.html")
)

Broken relative links

Cause: Missing source_uri

Solution: Provide source URI for resolution

Incorrect structure

Possible causes:

Invalid HTML markup
Missing closing tags
Block elements in inline context

Note: The backend attempts to fix common issues automatically.

Too much furniture content

Solution: Disable furniture inference

options = HTMLBackendOptions(infer_furniture=False)

Use Cases

Documentation Sites

Convert HTML documentation to structured format

Web Archival

Preserve web page content in structured format

Content Migration

Extract content from HTML for migration to other systems

Web Scraping

Structure web content for analysis or indexing
