DoclingDocument

Overview

DoclingDocument is the core data structure in Docling that represents a converted document. It provides a rich, structured representation of the document content with support for various document elements (text, tables, figures, etc.) and multiple export formats. This class is part of the docling-core library and is exposed through Docling’s API.

Class Definition

# DoclingDocument is defined in docling-core
from docling_core.types.doc import DoclingDocument

Key Concepts

Document Items

A DoclingDocument is composed of items, where each item represents a structural element of the document:

TextItem - Paragraphs, headings, captions
TableItem - Tables with structure and content
PictureItem - Images and figures
SectionHeaderItem - Section headings with hierarchy
ListItem - List items (ordered/unordered)
DocItem - Base class that the item types above derive from
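
Because the item types form a class hierarchy, code that handles items usually checks the most specific type first. The following is a minimal sketch of that pattern. The classes are hypothetical stand-ins written for this example, not the real docling-core classes, and the fields `level` and `num_rows` are assumptions made for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-ins that mirror the item types listed above.
# They are NOT the real docling-core classes.
@dataclass
class DocItem:
    label: str


@dataclass
class TextItem(DocItem):
    text: str = ""


@dataclass
class SectionHeaderItem(TextItem):
    level: int = 1  # assumed field: heading depth


@dataclass
class TableItem(DocItem):
    num_rows: int = 0  # assumed field: row count


def describe(item: DocItem) -> str:
    """Return a one-line summary, checking the most specific type first."""
    if isinstance(item, SectionHeaderItem):
        return f"{'#' * item.level} {item.text}"
    if isinstance(item, TextItem):
        return item.text
    if isinstance(item, TableItem):
        return f"[table with {item.num_rows} rows]"
    return f"[{item.label}]"


items = [
    SectionHeaderItem(label="section_header", text="Results", level=2),
    TextItem(label="paragraph", text="Accuracy improved."),
    TableItem(label="table", num_rows=3),
    DocItem(label="picture"),
]
summaries = [describe(item) for item in items]
print(summaries)
```

Checking `SectionHeaderItem` before `TextItem` matters in this sketch: a section header is also a text item, so the reverse order would never reach the heading branch.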

Item Labels

Each item has a label drawn from the DocItemLabel enum:

TITLE - Document title
SECTION_HEADER - Section heading
PARAGRAPH - Regular paragraph
TEXT - Plain text
LIST_ITEM - List item
TABLE - Table
PICTURE - Image or figure
CAPTION - Caption for figures/tables
FORMULA - Mathematical formula
CODE - Code block
PAGE_HEADER / PAGE_FOOTER - Headers and footers
FOOTNOTE - Footnote
And more…
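
Labels are convenient keys for grouping or filtering items. The sketch below shows the idea with a small string-valued enum. `Label` is a hypothetical stand-in for DocItemLabel, and its member values are assumptions; check the real enum before relying on specific strings.

```python
from collections import defaultdict
from enum import Enum


class Label(str, Enum):
    """Hypothetical stand-in for DocItemLabel; member values are assumed."""

    TITLE = "title"
    SECTION_HEADER = "section_header"
    PARAGRAPH = "paragraph"
    TABLE = "table"


# (label, text) pairs standing in for real document items
items = [
    (Label.TITLE, "Annual Report"),
    (Label.SECTION_HEADER, "Summary"),
    (Label.PARAGRAPH, "Revenue grew."),
    (Label.PARAGRAPH, "Costs fell."),
]

# Group the texts by their label
by_label = defaultdict(list)
for label, text in items:
    by_label[label].append(text)

print(by_label[Label.PARAGRAPH])

# A str-based enum member compares equal to its plain string value,
# and the enum can be looked up from that string.
print(Label.TABLE == "table")
print(Label("section_header") is Label.SECTION_HEADER)
```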

Basic Attributes

name

str

The document name or identifier.

texts

list

List of text items in the document.

tables

list

List of table items in the document.

pictures

list

List of picture items in the document.

Core Methods

`iterate_items()`

Iterate over all items in the document in order.

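# iterate_items() yields (item, level) pairs, so unpack both values
for item, level in doc.iterate_items():
    # Not every item carries text (tables and pictures do not), so fall back to ""
    print(f"{item.label}: {getattr(item, 'text', '')}")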

yields

tuple[NodeItem, int]

Pairs of (item, level) in document order, where level is the item's nesting depth in the document tree.
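
Document order here means a depth-first walk of the document tree. The following is a minimal sketch of such a traversal over a hypothetical `Node` class; it is not the library's implementation, and the exact level numbering that docling-core reports may differ from the zero-based depth used here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node; not the real docling-core NodeItem."""

    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def iterate(node: Node, level: int = 0) -> Iterator[Tuple[Node, int]]:
    """Yield each descendant of node with its depth, in pre-order."""
    for child in node.children:
        yield child, level
        yield from iterate(child, level + 1)


root = Node(
    "body",
    children=[
        Node("section_header", "Intro", children=[Node("paragraph", "Hello")]),
        Node("table"),
    ],
)

pairs = [(node.label, level) for node, level in iterate(root)]
print(pairs)
```

A parent is always yielded before its children, which is why a section header appears before the paragraph nested under it.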

`export_to_markdown()`

Export the document to Markdown format.

markdown_content = doc.export_to_markdown()
# Write as UTF-8 so non-ASCII characters survive on every platform
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)

return

str

The document content in Markdown format.

`export_to_html()`

Export the document to HTML format.

html_content = doc.export_to_html()
# Write as UTF-8 so non-ASCII characters survive on every platform
with open("output.html", "w", encoding="utf-8") as f:
    f.write(html_content)

return

str

The document content in HTML format.

`export_to_dict()`

Export the document to a dictionary representation.

import json

json_dict = doc.export_to_dict()
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(json_dict, f, indent=2)

return

dict

The document structure as a dictionary, suitable for JSON serialization.
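
To illustrate what "suitable for JSON serialization" buys you, here is a small round-trip sketch: a dictionary is written to a JSON file and read back unchanged. The dictionary is a hand-written stand-in for the result of `export_to_dict()`, and its keys are assumptions for illustration rather than the real schema.

```python
import json
import tempfile
from pathlib import Path

# Illustrative stand-in for the result of doc.export_to_dict();
# the keys here are assumptions, not the real schema.
doc_dict = {
    "name": "report",
    "texts": [{"label": "paragraph", "text": "Café sales rose 5%."}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "report.json"
    # ensure_ascii=False keeps non-ASCII characters readable in the file
    path.write_text(
        json.dumps(doc_dict, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    restored = json.loads(path.read_text(encoding="utf-8"))

# The round trip preserves the structure exactly
print(restored == doc_dict)
```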

`save_as_markdown()`

Save the document directly to a Markdown file.

doc.save_as_markdown("output.md")

filename

Union[str, Path]

required

Path where the Markdown file will be saved.

`save_as_html()`

Save the document directly to an HTML file.

doc.save_as_html("output.html")

filename

Union[str, Path]

required

Path where the HTML file will be saved.
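
This page does not state whether the save methods create missing parent directories, so a defensive wrapper can create them first. The sketch below assumes nothing about the library: `save_with_parents` is a hypothetical helper, and `fake_save_as_markdown` stands in for a bound method such as `doc.save_as_markdown`.

```python
import tempfile
from pathlib import Path
from typing import Callable, Union


def save_with_parents(
    save: Callable[[Path], None], filename: Union[str, Path]
) -> Path:
    """Create the target's parent directories, then delegate to save."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    save(path)
    return path


# Stand-in for doc.save_as_markdown; a real call would pass that bound method.
def fake_save_as_markdown(path: Path) -> None:
    path.write_text("# Title\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out" / "nested" / "doc.md"
    written = save_with_parents(fake_save_as_markdown, target)
    content = written.read_text(encoding="utf-8")

print(content)
```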

Usage Examples

Basic Document Access

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

print(f"Document name: {doc.name}")
print(f"Number of texts: {len(doc.texts)}")
print(f"Number of tables: {len(doc.tables)}")
print(f"Number of pictures: {len(doc.pictures)}")

Iterating Through Items

from docling_core.types.doc import DocItemLabel

doc = result.document

# iterate_items() yields (item, level) pairs; the level is not needed here
for item, _level in doc.iterate_items():
    if item.label == DocItemLabel.SECTION_HEADER:
        print(f"\n## {item.text}")
    # Body text may be labeled either PARAGRAPH or TEXT
    elif item.label in (DocItemLabel.PARAGRAPH, DocItemLabel.TEXT):
        print(item.text)
    elif item.label == DocItemLabel.TABLE:
        print("[Table found]")

Filtering by Item Type

# Get all section headers (iterate_items() yields (item, level) pairs)
headers = [
    item for item, _level in doc.iterate_items()
    if item.label == DocItemLabel.SECTION_HEADER
]

for header in headers:
    print(f"Section: {header.text}")

# Get all tables
tables = [
    item for item, _level in doc.iterate_items()
    if item.label == DocItemLabel.TABLE
]

print(f"Found {len(tables)} tables")

Exporting to Different Formats

import json
from pathlib import Path

doc = result.document
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

# Export to Markdown
markdown = doc.export_to_markdown()
(output_dir / "document.md").write_text(markdown, encoding="utf-8")

# Export to HTML
html = doc.export_to_html()
(output_dir / "document.html").write_text(html, encoding="utf-8")

# Export to JSON
json_dict = doc.export_to_dict()
(output_dir / "document.json").write_text(
    json.dumps(json_dict, indent=2), encoding="utf-8"
)

Working with Tables

from docling_core.types.doc import TableItem

doc = result.document

# Access tables directly
for i, table in enumerate(doc.tables, 1):
    print(f"\nTable {i}:")
    # table.data holds the structured cell grid and its dimensions
    print(f"Rows: {table.data.num_rows}, columns: {table.data.num_cols}")

# Or iterate to find tables
for item, _level in doc.iterate_items():
    if isinstance(item, TableItem):
        # caption_text() resolves the caption references against the document
        print(f"Found table: {item.caption_text(doc) or 'No caption'}")

Working with Text Content

from docling_core.types.doc import SectionHeaderItem

doc = result.document

# Build a table of contents from the section headers
toc = []
for item, _level in doc.iterate_items():
    if isinstance(item, SectionHeaderItem):
        toc.append(item.text)

print("Table of Contents:")
for i, heading in enumerate(toc, 1):
    print(f"{i}. {heading}")

Extracting Plain Text

# Get all text content
all_text = []
for item, _level in doc.iterate_items():
    # Tables and pictures have no text attribute, so default to ""
    text = getattr(item, "text", "")
    if text:
        all_text.append(text)

full_text = "\n\n".join(all_text)
print(full_text)

Searching Document Content

import re

def search_document(doc, pattern):
    """Search for a regex pattern in the document's text items."""
    matches = []
    for item, _level in doc.iterate_items():
        # Tables and pictures have no text attribute, so default to ""
        text = getattr(item, "text", "")
        if text and re.search(pattern, text, re.IGNORECASE):
            matches.append({
                'label': item.label,
                'text': text,
                'item': item
            })
    return matches

# Search for a term
hits = search_document(doc, r'artificial intelligence')
for hit in hits:
    print(f"{hit['label']}: {hit['text'][:100]}...")

Document Statistics

from collections import Counter

def get_document_stats(doc):
    """Get statistics about document structure."""
    label_counts = Counter()
    total_text_length = 0
    
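    # iterate_items() yields (item, level) pairs
    for item, _level in doc.iterate_items():
        label_counts[item.label] += 1
        # Tables and pictures have no text attribute, so default to ""
        text = getattr(item, "text", "")
        if text:
            total_text_length += len(text)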
    
    return {
        'label_counts': dict(label_counts),
        'total_items': sum(label_counts.values()),
        'total_text_length': total_text_length
    }

stats = get_document_stats(doc)
print(f"Total items: {stats['total_items']}")
print(f"Total text length: {stats['total_text_length']} characters")
print("\nItem type distribution:")
for label, count in stats['label_counts'].items():
    print(f"  {label}: {count}")
