Overview
The HTML backend (HTMLDocumentBackend) parses HTML documents and web pages, converting them directly to DoclingDocument format. It preserves document structure and formatting, and handles complex HTML layouts including tables, lists, and embedded images.
Features
- Semantic structure preservation - Headings, paragraphs, lists, tables
- Rich formatting support - Bold, italic, underline, strikethrough, code
- Hyperlink preservation - Internal and external links
- Table extraction - Complex tables with merged cells and rich content
- Image handling - Embedded images with remote/local fetching
- List hierarchy - Nested lists with proper indentation
- Code blocks - Monospace code and pre-formatted text
- Furniture detection - Automatic header/footer/title handling
Usage
Basic Conversion
With Backend Options
HTMLBackendOptions
Configuration options for HTML parsing.
Parameters
Backend type identifier. Always set to "html" for HTML backends.
Whether the backend should access remote or local resources to parse images in an HTML document.
Enable when:
- You want to include embedded images
- Processing web pages with external images
- Images are needed for final output
The URI that originates the HTML document. If provided, the backend will use it to resolve relative paths in the HTML document.
Required for:
- Resolving relative image paths
- Resolving relative hyperlinks
- Remote resource fetching
Add the HTML <title> tag as furniture in the DoclingDocument. The title is added as furniture-layer content (metadata).
Infer all the content before the first header as furniture.
Automatically marks as furniture:
- Content before first <h1>-<h6>
- Headers and footers
- Navigation elements (when detected)
Enable fetching of remote resources referenced in the HTML.
Enable fetching of local resources referenced in the HTML.
Supported Elements
Headings
H1-H6 Hierarchy
HTML headings map to document structure:
Automatic hierarchy:
- <h1> → Title or top-level heading (Level 0)
- <h2>-<h6> → Headings Level 1-5
- Skipped levels create invisible section groups
- Maintains proper nesting even with irregular markup
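The mapping above can be sketched in a few lines of plain Python. This is only an illustration of the rule as stated here, not the backend's actual code; the helper names heading_level and skipped_levels are hypothetical.

```python
import re

# Hypothetical helpers illustrating the heading rule described above.
_HEADING_RE = re.compile(r"^h([1-6])$", re.IGNORECASE)


def heading_level(tag: str) -> int:
    """Map an HTML heading tag name to its level: <h1> -> 0, <h2>-<h6> -> 1-5."""
    match = _HEADING_RE.match(tag)
    if match is None:
        raise ValueError(f"not a heading tag: {tag!r}")
    return int(match.group(1)) - 1


def skipped_levels(previous_level: int, new_level: int) -> int:
    """Count the levels jumped over when nesting deeper, e.g. <h2> followed by <h4>."""
    return max(0, new_level - previous_level - 1)


print(heading_level("h1"))   # 0
print(heading_level("h4"))   # 3
print(skipped_levels(1, 3))  # 1 level skipped between <h2> and <h4>
```

Each skipped level is where, per the list above, an invisible section group would be needed to keep the nesting consistent.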
Text and Formatting
Inline Formatting
Supported HTML tags:
- <b>, <strong> → Bold
- <i>, <em>, <var> → Italic
- <u>, <ins> → Underline
- <s>, <del> → Strikethrough
- <sub> → Subscript
- <sup> → Superscript
- <code>, <kbd>, <samp> → Code formatting
Hyperlinks
<a href="..."> tags preserve links:
- Relative URL resolution (requires source_uri)
- Protocol-relative URLs (//example.com)
- Fragment identifiers (#section)
Code Blocks
<pre> and <code> elements:
- <pre> → Preserved formatting and whitespace
- <code> → Inline code or code blocks
- Nested formatting preserved
Lists
Complete list structure preservation:
- Ordered (<ol>) and unordered (<ul>) lists
- Custom start numbers (<ol start="5">)
- Nested lists with proper hierarchy
- Inline formatting in list items
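To make the list features above concrete, here is a minimal sketch built on Python's standard html.parser. It is not the backend's implementation; the ListOutline class is a hypothetical name, and the sketch assumes well-formed markup with explicit closing tags and a numeric start attribute.

```python
from html.parser import HTMLParser


class ListOutline(HTMLParser):
    """Hypothetical helper: collect [depth, marker, text] for each list item."""

    def __init__(self):
        super().__init__()
        self.items = []   # one [depth, marker, text] entry per <li>
        self._lists = []  # stack of [kind, next_number] for open <ol>/<ul>
        self._open = []   # stack of indices into self.items for open <li>

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            # Honour <ol start="N">; default numbering starts at 1.
            start = dict(attrs).get("start") or "1"
            self._lists.append([tag, int(start)])
        elif tag == "li" and self._lists:
            kind, number = self._lists[-1]
            marker = f"{number}." if kind == "ol" else "-"
            self._lists[-1][1] += 1
            self.items.append([len(self._lists) - 1, marker, ""])
            self._open.append(len(self.items) - 1)

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self._lists:
            self._lists.pop()
        elif tag == "li" and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text inside inline tags such as <b> still belongs to the open item.
        if self._open and data.strip():
            item = self.items[self._open[-1]]
            item[2] = (item[2] + " " + data.strip()).strip()


outline = ListOutline()
outline.feed('<ol start="5"><li>five<ul><li>nested <b>bold</b></li></ul></li><li>six</li></ol>')
for depth, marker, text in outline.items:
    print("  " * depth + f"{marker} {text}")
```

The sketch flattens inline formatting to plain text; the backend described here keeps that formatting.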
Tables
Advanced table extraction with rich content:
- Row and column headers (<th>)
- Merged cells (rowspan, colspan)
- Rich cell content (formatted text, images, nested elements)
- Simple cells (plain text)
- Header detection (<thead>, first row)
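Merged cells are the part of table extraction that most often causes confusion, so a small sketch may help. The function below is a hypothetical illustration, not Docling code. It expands rows of (text, rowspan, colspan) tuples into a rectangular grid, which is the general idea behind resolving rowspan and colspan.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every slot the span covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


rows = [
    [("Name", 2, 1), ("Score", 1, 2)],  # "Name" spans 2 rows, "Score" spans 2 columns
    [("Math", 1, 1), ("Art", 1, 1)],
    [("Ada", 1, 1), ("9", 1, 1), ("7", 1, 1)],
]
for line in expand_spans(rows):
    print(line)
# ['Name', 'Score', 'Score']
# ['Name', 'Math', 'Art']
# ['Ada', '9', '7']
```

Repeating the text in every covered slot is a simplification chosen for readability. A real table model would normally record the span once along with its extent.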
Rich Table Cells
Cells with complex content become RichTableCell.
Images
Image handling with remote/local fetching:
- <img> tags
- <figure> with <figcaption>
- Alt text as caption fallback
- Remote images (with fetch_images=True)
- Local images (with source_uri and path resolution)
- Base64 data URLs (data:image/png;base64,...)
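Base64 data URLs carry the image bytes inline, so nothing needs to be fetched for them. As a rough sketch of what decoding one involves, using only the standard library (decode_data_url is a hypothetical helper, not part of the backend's API):

```python
import base64


def decode_data_url(url: str):
    """Split a base64 data URL into (media_type, raw_bytes)."""
    if not url.startswith("data:"):
        raise ValueError("not a data URL")
    header, sep, payload = url[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URL has no comma separator")
    params = header.split(";")
    if "base64" not in params[1:]:
        raise ValueError("only base64-encoded data URLs are handled here")
    # An empty media type defaults to text/plain for data URLs.
    media_type = params[0] or "text/plain"
    return media_type, base64.b64decode(payload)


# Round-trip the 8-byte PNG signature as a stand-in for real image data.
signature = b"\x89PNG\r\n\x1a\n"
url = "data:image/png;base64," + base64.b64encode(signature).decode("ascii")
media_type, raw = decode_data_url(url)
print(media_type, len(raw))  # image/png 8
```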
Content Layers
Automatic content layer detection:
Furniture:
- HTML <title> element
- Content before first heading (if infer_furniture=True)
- <footer> elements
Body:
- All content after first heading
- Explicit body content
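The layer rules above can be read as a simple pass over the document in order. The sketch below works on a flat list of (tag, text) blocks, which is a deliberate simplification of a real HTML tree. The assign_layers function is hypothetical, and how the backend treats a page with no headings at all is not covered here.

```python
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def assign_layers(blocks, infer_furniture=True):
    """Label (tag, text) blocks as 'furniture' or 'body', in document order."""
    seen_heading = False
    labelled = []
    for tag, text in blocks:
        if tag in HEADING_TAGS:
            seen_heading = True
        if tag in ("title", "footer"):
            # <title> and <footer> are treated as furniture wherever they appear.
            layer = "furniture"
        elif infer_furniture and not seen_heading:
            # Everything before the first heading is inferred to be furniture.
            layer = "furniture"
        else:
            layer = "body"
        labelled.append((layer, tag, text))
    return labelled


blocks = [
    ("title", "My page"),
    ("p", "Home | About"),
    ("h1", "Introduction"),
    ("p", "Hello"),
    ("footer", "Contact us"),
]
for layer, tag, text in assign_layers(blocks):
    print(f"{layer:9} <{tag}> {text}")
```

Passing infer_furniture=False in this sketch moves the navigation paragraph into the body, which mirrors the fix suggested under Troubleshooting for too much furniture content.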
HTML Cleanup
Automatic cleanup and normalization:
Remove unwanted elements
- <script> and <noscript> tags
- <style> tags
- Hidden elements (hidden attribute)
Fix invalid structure
- Block elements inside <p> tags
- Nested paragraph correction
- Proper flow content handling
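The "remove unwanted elements" step can be illustrated with the standard html.parser module. This is a sketch of the idea only, not the backend's code. VisibleTextExtractor is a hypothetical class, and it assumes non-void elements are explicitly closed so that depth counting stays balanced.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "noscript", "style"}
# Void elements never have an end tag, so they must not affect depth counting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect text outside <script>, <noscript>, <style> and hidden elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        hidden = any(name == "hidden" for name, _ in attrs)
        if self._skip_depth or tag in SKIPPED_TAGS or hidden:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


parser = VisibleTextExtractor()
parser.feed(
    "<p>Hello</p><script>var x = 1;</script><style>p{}</style>"
    "<div hidden><span>secret</span><br></div>"
    "<noscript>no js</noscript><p>World</p>"
)
print(parser.chunks)  # ['Hello', 'World']
```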
URL Resolution
When source_uri is provided:
- Relative paths: ../images/pic.png → https://example.com/images/pic.png
- Absolute paths: /static/img.png → https://example.com/static/img.png
- Protocol-relative: //cdn.example.com/img.png → https://cdn.example.com/img.png
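These three cases follow standard URL reference resolution, which Python's urllib.parse.urljoin implements. The sketch below reproduces the examples above, assuming a source URI of https://example.com/docs/page.html. That base is an assumption chosen so the relative-path example works out, and nothing here claims the backend calls urljoin internally.

```python
from urllib.parse import urljoin

# Assumed source URI for illustration; the examples above do not state one.
SOURCE_URI = "https://example.com/docs/page.html"


def resolve(ref: str, source_uri: str = SOURCE_URI) -> str:
    """Resolve a reference found in the HTML against the document's source URI."""
    return urljoin(source_uri, ref)


print(resolve("../images/pic.png"))          # https://example.com/images/pic.png
print(resolve("/static/img.png"))            # https://example.com/static/img.png
print(resolve("//cdn.example.com/img.png"))  # https://cdn.example.com/img.png
print(resolve("#section"))                   # https://example.com/docs/page.html#section
```

Without a base, a reference such as ../images/pic.png cannot be turned into a fetchable address, which is why missing images and broken relative links both trace back to a missing source_uri.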
Advanced Features
Special Blocks
Details and Summary
<details> and <summary> elements are handled as special blocks.
Addresses
<address> elements converted to text items.
Figures
<figure> with <figcaption> properly linked: the caption becomes a CAPTION label item.
Performance
- Speed: Fast declarative parsing
- Memory: Low to moderate (depends on HTML size)
- Remote fetching: Can slow conversion when a document references many images
Limitations
Troubleshooting
Missing images
Solution: Enable image fetching and set source URI
Broken relative links
Cause: Missing source_uri
Solution: Provide source URI for resolution
Incorrect structure
Possible causes:
- Invalid HTML markup
- Missing closing tags
- Block elements in inline context
Too much furniture content
Solution: Disable furniture inference
Use Cases
Documentation Sites
Convert HTML documentation to structured format
Web Archival
Preserve web page content in structured format
Content Migration
Extract content from HTML for migration to other systems
Web Scraping
Structure web content for analysis or indexing
See Also
- Backends Overview - Backend architecture
- Export Formats - Export to Markdown and other formats
- DocumentConverter - Main conversion API