DoclingDocument

With Docling v2, we introduced a unified document representation format called DoclingDocument. It is defined as a Pydantic datatype that can express several features common to documents.

Key Features

The DoclingDocument format supports:

Rich content types: Text, tables, pictures, key-value pairs, and more
Document hierarchy: Sections, groups, and nested structures
Content classification: Main body vs. furniture (headers, footers)
Layout information: Bounding boxes for all items when available
Provenance tracking: Source page and position for each element

The Pydantic types are defined in the docling_core.types.doc module. See the source code for complete details.

Document Structure

A DoclingDocument exposes top-level fields organized into two categories:

Content Items

These fields store the actual content extracted from the document:

texts: All items with text representation (paragraphs, headings, equations, etc.). Base class: TextItem
tables: All tables, type TableItem. Can carry structure annotations
pictures: All pictures, type PictureItem. Can carry structure annotations
key_value_items: All key-value pairs extracted from forms or structured data

All content items inherit from DocItem and can reference parents and children through JSON pointers.

Content Structure

These fields define the document’s hierarchical organization:

body: Root node of the tree structure for the main document body
furniture: Root node for items that don’t belong in the body (headers, footers, page numbers)
groups: Container items that don’t represent content themselves but organize other items (lists, chapters)

Structure fields store NodeItem instances that reference children and parents through JSON pointers.

The reading order of the document is encapsulated through the body tree and the order of children in each node. This preserves the semantic flow of the original document.
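The traversal described above can be sketched with plain dictionaries. This is a minimal illustration, not docling-core code: the hand-written `doc_dict` and the key names `self_ref`, `children`, and `$ref` are assumptions about what a serialized document looks like, and `resolve` and `walk` are hypothetical helpers.

```python
from typing import Any, Dict, Iterator, Tuple

# Hand-written stand-in for a serialized document (assumed layout).
doc_dict: Dict[str, Any] = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/texts/0"}, {"$ref": "#/groups/0"}],
    },
    "groups": [
        {
            "self_ref": "#/groups/0",
            "children": [{"$ref": "#/texts/1"}, {"$ref": "#/texts/2"}],
        }
    ],
    "texts": [
        {"self_ref": "#/texts/0", "text": "Swimming", "children": []},
        {"self_ref": "#/texts/1", "text": "Freestyle", "children": []},
        {"self_ref": "#/texts/2", "text": "Backstroke", "children": []},
    ],
}


def resolve(doc: Dict[str, Any], ref: str) -> Dict[str, Any]:
    """Follow a pointer such as '#/texts/1' into the top-level fields."""
    node: Any = doc
    for part in ref[2:].split("/"):  # drop the leading '#/'
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def walk(
    doc: Dict[str, Any], node: Dict[str, Any], level: int = 0
) -> Iterator[Tuple[Dict[str, Any], int]]:
    """Yield (item, level) depth-first, visiting children in stored order."""
    for child in node["children"]:
        item = resolve(doc, child["$ref"])
        yield item, level
        yield from walk(doc, item, level + 1)


order = [(item["self_ref"], level) for item, level in walk(doc_dict, doc_dict["body"])]
print(order)
```

The flat `texts` and `groups` lists say nothing about order on their own. In this sketch the sequence comes entirely from the `children` arrays reached from `body`.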

Construction APIs

Docling provides several ways to create DoclingDocument instances:

From Conversion

The most common method is through DocumentConverter:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document  # DoclingDocument instance

From Scratch

You can build documents programmatically using the construction APIs:

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.labels import DocItemLabel

doc = DoclingDocument(name="my-document")

# Add a title. add_text() builds the item from a label and a string,
# attaches it under the given parent, and returns the new item.
title = doc.add_text(label=DocItemLabel.TITLE, text="Document Title", parent=doc.body)

# Add a paragraph
para = doc.add_text(label=DocItemLabel.PARAGRAPH, text="This is content.", parent=doc.body)

From DocTags

DoclingDocument can be loaded from DocTags format (used by VLM pipelines):

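from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# doctags_list and image_list are parallel lists prepared by the caller:
# the DocTags output for each page and the matching page image.
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs(
    doctags_list, image_list
)
doc = DoclingDocument.load_from_doctags(doctag_document=doctags_doc)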

Document Hierarchy

Basic Nesting

Items can be nested to represent document structure. For example, all content on the first page might be nested under a title item (figure: doc_hierarchy_1).

In this example, all items on the first page are children of the title item at #/texts/1.

Grouping

Complex structures like lists use group items to organize content (figure: doc_hierarchy_2).

Here, items under the heading “Let’s swim” (#/texts/5) include both text items and groups containing list elements. Group items are stored in the top-level groups field.

Accessing Content

Iterating Items

The iterate_items() method traverses the document structure:

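from docling_core.types.doc import TableItem, TextItem

for item, level in doc.iterate_items():
    if isinstance(item, TextItem):
        print(f"{'  ' * level}{item.label}: {item.text}")
    elif isinstance(item, TableItem):
        # data.grid is a list of rows, so report the dimensions instead of len(grid)
        print(f"{'  ' * level}Table: {item.data.num_rows} rows x {item.data.num_cols} columns")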

Filtering by Type

Access specific content types directly:

# All text items
for text_item in doc.texts:
    print(text_item.text)

# All tables
for table in doc.tables:
    print(f"Table: {table.data.num_rows}x{table.data.num_cols}")

# All pictures (prov can be empty, so check before indexing)
for picture in doc.pictures:
    if picture.prov:
        print(f"Picture at {picture.prov[0].bbox}")

Provenance Information

Each DocItem can include provenance data showing its source location:

for item in doc.texts:
    for prov in item.prov:
        print(f"Page {prov.page_no}: {prov.bbox}")
        print(f"Character span: {prov.charspan}")

Provenance information includes:

page_no: Source page number
bbox: Bounding box coordinates on the page
charspan: Character offset range in the original text
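One common use of these fields is grouping content by source page. The sketch below uses small stand-in dataclasses (`Prov` and `Item` are hypothetical, not the docling-core types) to show two cases worth handling: an item with several provenance entries appears under every page it touches, and an item with no provenance is skipped.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Prov:
    """Hypothetical stand-in for one provenance entry."""
    page_no: int
    charspan: Tuple[int, int]


@dataclass
class Item:
    """Hypothetical stand-in for a text item."""
    text: str
    prov: List[Prov] = field(default_factory=list)


def texts_by_page(items: List[Item]) -> Dict[int, List[str]]:
    """Map each page number to the texts that have provenance on that page."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for item in items:
        for prov in item.prov:  # an empty list contributes nothing
            pages[prov.page_no].append(item.text)
    return dict(pages)


items = [
    Item("Introduction", [Prov(1, (0, 12))]),
    Item("Spans two pages", [Prov(1, (0, 5)), Prov(2, (5, 15))]),
    Item("Added later, no source location"),
]
print(texts_by_page(items))
```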

Layout Information

When available, layout details are preserved:

for item in doc.texts:
    if item.prov:
        bbox = item.prov[0].bbox
        print(f"Position: ({bbox.l}, {bbox.t}, {bbox.r}, {bbox.b})")
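The four numbers are only meaningful relative to a coordinate origin. PDF coordinates are usually measured from the bottom-left corner of the page, while images are measured from the top-left. The sketch below shows the conversion between the two with a hypothetical `Box` dataclass. It assumes the input uses a bottom-left origin and does not use the docling-core bounding box type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical box: left, top, right, bottom."""
    l: float
    t: float
    r: float
    b: float


def bottom_left_to_top_left(box: Box, page_height: float) -> Box:
    """Flip the y axis so distances are measured down from the top edge."""
    return Box(l=box.l, t=page_height - box.t, r=box.r, b=page_height - box.b)


# y grows upward here, so the top edge (720) is larger than the bottom edge (700).
pdf_box = Box(l=72.0, t=720.0, r=300.0, b=700.0)
image_box = bottom_left_to_top_left(pdf_box, page_height=792.0)
print(image_box)  # Box(l=72.0, t=72.0, r=300.0, b=92.0)
```

The page height needed for the conversion is the kind of value stored in the page metadata described in the next section.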

Page Metadata

Page-level information is stored in the pages dictionary:

for page_no, page_item in doc.pages.items():
    print(f"Page {page_no}: {page_item.size.width}x{page_item.size.height}")
    if page_item.image:
        # Page image is available
        pass

Export Methods

DoclingDocument provides built-in export to common formats:

Markdown Export

markdown = doc.export_to_markdown()
print(markdown)

HTML Export

html = doc.export_to_html()

Dictionary Export

data = doc.export_to_dict()
# Returns a dictionary representation of the document

DocTags Export

doctags = doc.export_to_doctags()

These export methods are convenience wrappers that internally use serializers. For more control over serialization, see Serialization Concepts.

Working with Tables

TableItem instances include structured data:

for table in doc.tables:
    # Access table grid
    for row in table.data.grid:
        for cell in row:
            print(cell.text, end="\t")
        print()  # New row
    
    # Table dimensions
    print(f"{table.data.num_rows} rows x {table.data.num_cols} columns")
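Once the cell texts are extracted, the grid is an ordinary list of rows and can be handed to standard tooling. As a minimal sketch, the hypothetical helper `grid_to_csv` below writes such a grid as CSV. The sample `cell_texts` is made up; with a real table it could be built as `[[cell.text for cell in row] for row in table.data.grid]`.

```python
import csv
import io
from typing import List


def grid_to_csv(grid: List[List[str]]) -> str:
    """Serialize a row-major grid of cell texts as CSV."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(grid)
    return buffer.getvalue()


cell_texts = [
    ["Stroke", "Distance (m)"],
    ["Freestyle", "100"],
    ["Medley, individual", "200"],  # the comma forces quoting
]
print(grid_to_csv(cell_texts))
```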

Concatenating Documents

Multiple DoclingDocument instances can be merged:

from docling_core.types.doc import DoclingDocument

combined = DoclingDocument.concatenate(docs=[doc1, doc2, doc3])

This preserves hierarchy and updates page numbers accordingly.
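To make "updates page numbers accordingly" concrete, the toy sketch below shifts the page numbers of each later document by the page count of everything before it. `ToyDoc` and `merge_with_page_offsets` are hypothetical names, and the sketch illustrates the idea only. It is not how concatenate() is implemented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical document: a page count plus (page_no, text) pairs."""
    num_pages: int
    items: List[Tuple[int, str]]


def merge_with_page_offsets(docs: List[ToyDoc]) -> List[Tuple[int, str]]:
    """Concatenate items, renumbering pages so they stay unique and ordered."""
    merged: List[Tuple[int, str]] = []
    offset = 0
    for doc in docs:
        for page_no, text in doc.items:
            merged.append((page_no + offset, text))
        offset += doc.num_pages
    return merged


first = ToyDoc(num_pages=2, items=[(1, "Intro"), (2, "Details")])
second = ToyDoc(num_pages=1, items=[(1, "Appendix")])
print(merge_with_page_offsets([first, second]))
```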

Advanced Usage

Content Layers

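Each item belongs to a content layer, such as the main body or the furniture. Pass included_content_layers to iterate_items() to choose which layers are traversed:

from docling_core.types.doc import ContentLayer

for item, level in doc.iterate_items(
    included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}
):
    # Process body items together with headers, footers, and other furniture
    pass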

Traversal Options

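MAX_DEPTH = 3  # iterate_items() yields each item's nesting level, so depth is limited in the loop

for item, level in doc.iterate_items(
    with_groups=True,        # Include group items
    traverse_pictures=True,  # Descend into pictures
):
    if level > MAX_DEPTH:
        continue
    print(level, item.self_ref)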

JSON Pointers

Docling uses JSON pointers for internal references:

# Example: "#/texts/5" refers to the 6th text item (0-indexed)
# Example: "#/groups/2" refers to the 3rd group item

These pointers enable efficient parent-child relationships without circular references.
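A short sketch of what such a pointer encodes, and why string links avoid circular references. `parse_pointer` is a hypothetical helper, and the `heading`/`group` dictionaries are made-up data whose key names (`self_ref`, `parent`, `children`, `$ref`) are assumptions about the serialized layout.

```python
import json
from typing import Optional, Tuple


def parse_pointer(ref: str) -> Tuple[str, Optional[int]]:
    """Split '#/texts/5' into ('texts', 5); '#/body' becomes ('body', None)."""
    if not ref.startswith("#/"):
        raise ValueError(f"not a document-internal pointer: {ref!r}")
    field_name, _, index = ref[2:].partition("/")
    return field_name, int(index) if index else None


# Parent and child refer to each other only through pointer strings.
heading = {"self_ref": "#/texts/5", "children": [{"$ref": "#/groups/2"}]}
group = {"self_ref": "#/groups/2", "parent": {"$ref": "#/texts/5"}}

print(parse_pointer(group["parent"]["$ref"]))  # ('texts', 5)
print(parse_pointer("#/body"))                 # ('body', None)

# With direct object references, this mutual link would be a cycle that
# json.dumps rejects. With strings it serializes without special handling.
serialized = json.dumps({"heading": heading, "group": group})
```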

Best Practices

Preserve provenance when modifying documents

When creating or modifying items, maintain provenance information for traceability back to source pages.

Use iterate_items() for traversal

The iterate_items() method respects document hierarchy and reading order. Avoid directly iterating doc.texts or other lists when structure matters.

Check item types before accessing attributes

Use isinstance() checks when processing mixed content types to safely access type-specific attributes.

Leverage export methods for common formats

Use built-in export methods (export_to_markdown(), etc.) for standard use cases before implementing custom serialization.

Related Topics

Serialization

Learn how to serialize DoclingDocument to various formats

Chunking

Chunk documents for RAG and LLM applications

Architecture

Understand how DoclingDocument fits in Docling’s architecture

docling-core Source

Explore the complete type definitions
