DoclingDocument is defined as a Pydantic datatype that can express several features common to documents.
Key Features
The DoclingDocument format supports:
- Rich content types: Text, tables, pictures, key-value pairs, and more
- Document hierarchy: Sections, groups, and nested structures
- Content classification: Main body vs. furniture (headers, footers)
- Layout information: Bounding boxes for all items when available
- Provenance tracking: Source page and position for each element
The definition of the Pydantic types is implemented in the module docling_core.types.doc. See the source code definitions for complete details.
Document Structure
A DoclingDocument exposes top-level fields organized into two categories:
Content Items
These fields store the actual content extracted from the document:
- texts: All items with a text representation (paragraphs, headings, equations, etc.). Base class: TextItem
- tables: All tables, type TableItem. Can carry structure annotations
- pictures: All pictures, type PictureItem. Can carry structure annotations
- key_value_items: All key-value pairs extracted from forms or structured data
All items in these fields inherit from DocItem and can reference parents and children through JSON pointers.
Content Structure
These fields define the document’s hierarchical organization:
- body: Root node of the tree structure for the main document body
- furniture: Root node for items that don’t belong in the body (headers, footers, page numbers)
- groups: Container items that don’t represent content themselves but organize other items (lists, chapters)
These fields hold NodeItem instances that reference children and parents through JSON pointers.
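The pointer-based links described above can be followed with a few lines of code. The sketch below works on a hand-written dictionary that imitates the serialized form of a document. The `$ref` and `self_ref` key names, the sample data, and the `resolve_pointer` helper are assumptions made for illustration; they are not part of the docling-core API.

```python
def resolve_pointer(doc: dict, ref: str):
    """Resolve a pointer such as '#/texts/1' against a plain-dict document."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported pointer: {ref!r}")
    node = doc
    for token in ref[2:].split("/"):
        # Lists are addressed by decimal index, dictionaries by key.
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


# Hand-written stand-in for a serialized document (illustrative data only).
doc = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/texts/0"}, {"$ref": "#/tables/0"}],
    },
    "texts": [
        {"self_ref": "#/texts/0", "parent": {"$ref": "#/body"}, "text": "Introduction"},
    ],
    "tables": [
        {"self_ref": "#/tables/0", "parent": {"$ref": "#/body"}},
    ],
}

# Follow a child link from the body, then the parent link back up.
first_child = resolve_pointer(doc, doc["body"]["children"][0]["$ref"])
parent = resolve_pointer(doc, first_child["parent"]["$ref"])
```

Because every link in this model is a pointer string, the same helper serves both parent and child lookups.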
The reading order of the document is encapsulated through the body tree and the order of children in each node. This preserves the semantic flow of the original document.
Construction APIs
Docling provides several ways to create DoclingDocument instances:
From Conversion
The most common method is through DocumentConverter:
From Scratch
You can build documents programmatically using the construction APIs:
From DocTags
DoclingDocument can be loaded from DocTags format (used by VLM pipelines):
Document Hierarchy
Basic Nesting
Items can be nested to represent document structure. For example, all content on the first page might be nested under a title item:
In this example, all items on the first page are children of the title item at #/texts/1.
Grouping
Complex structures like lists use group items to organize content:
Here, items under the heading “Let’s swim” (#/texts/5) include both text items and groups containing list elements. Group items are stored in the top-level groups field.
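Nesting through groups can also be read from the bottom up: starting at a list element, the parent pointers lead through its group to the heading and finally to the body. The sketch below shows this on a hand-written dictionary with its own small indices. The key names (`$ref`, `self_ref`, `parent`, `children`), the labels, and both helper functions are illustrative assumptions rather than docling-core calls.

```python
def resolve(doc, ref):
    """Follow a pointer such as '#/groups/0' into a plain-dict document."""
    node = doc
    for token in ref[2:].split("/"):
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def ancestors(doc, item):
    """Return the self_ref of every ancestor of `item`, nearest first."""
    chain = []
    while item.get("parent"):
        item = resolve(doc, item["parent"]["$ref"])
        chain.append(item["self_ref"])
    return chain


# Illustrative data: a heading that owns one paragraph and one list group.
doc = {
    "body": {"self_ref": "#/body", "parent": None,
             "children": [{"$ref": "#/texts/0"}]},
    "groups": [
        {"self_ref": "#/groups/0", "label": "list",
         "parent": {"$ref": "#/texts/0"},
         "children": [{"$ref": "#/texts/2"}]},
    ],
    "texts": [
        {"self_ref": "#/texts/0", "label": "section_header", "text": "A heading",
         "parent": {"$ref": "#/body"},
         "children": [{"$ref": "#/texts/1"}, {"$ref": "#/groups/0"}]},
        {"self_ref": "#/texts/1", "label": "text", "text": "A paragraph",
         "parent": {"$ref": "#/texts/0"}, "children": []},
        {"self_ref": "#/texts/2", "label": "list_item", "text": "A list element",
         "parent": {"$ref": "#/groups/0"}, "children": []},
    ],
}

list_item_path = ancestors(doc, doc["texts"][2])
paragraph_path = ancestors(doc, doc["texts"][1])
```

The list element reaches the heading only through its group, while the paragraph is a direct child of the heading.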
Accessing Content
Iterating Items
The iterate_items() method traverses the document structure:
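To make the idea of a reading-order traversal concrete, here is a minimal sketch of a depth-first walk over a hand-written dictionary that imitates a serialized document. It is not the implementation of iterate_items(); the `iter_reading_order` helper, the `with_groups` switch as modelled here, the key names, and the choice to start levels at 0 are all assumptions made for illustration.

```python
def resolve(doc, ref):
    """Follow a pointer such as '#/texts/1' into a plain-dict document."""
    node = doc
    for token in ref[2:].split("/"):
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def iter_reading_order(doc, node=None, level=0, with_groups=False):
    """Yield (item, level) pairs depth-first, visiting children in stored order."""
    node = doc["body"] if node is None else node
    for child_ref in node.get("children", []):
        child = resolve(doc, child_ref["$ref"])
        is_group = child["self_ref"].startswith("#/groups/")
        # Groups only organize other items, so they are skipped unless requested.
        if with_groups or not is_group:
            yield child, level
        yield from iter_reading_order(doc, child, level + 1, with_groups)


# Illustrative data: a title with one paragraph, followed by a list group.
doc = {
    "body": {"self_ref": "#/body",
             "children": [{"$ref": "#/texts/0"}, {"$ref": "#/groups/0"}]},
    "groups": [
        {"self_ref": "#/groups/0", "children": [{"$ref": "#/texts/2"}]},
    ],
    "texts": [
        {"self_ref": "#/texts/0", "text": "Title",
         "children": [{"$ref": "#/texts/1"}]},
        {"self_ref": "#/texts/1", "text": "Intro paragraph", "children": []},
        {"self_ref": "#/texts/2", "text": "List item", "children": []},
    ],
}

content_only = [(item["self_ref"], level) for item, level in iter_reading_order(doc)]
with_containers = [
    (item["self_ref"], level)
    for item, level in iter_reading_order(doc, with_groups=True)
]
```

The walk follows the tree rooted at body instead of the flat texts list, so the order it produces depends only on how children are arranged under each node.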
Filtering by Type
Access specific content types directly:
Provenance Information
Each DocItem can include provenance data showing its source location:
- page_no: Source page number
- bbox: Bounding box coordinates on the page
- charspan: Character offset range in the original text
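As a small illustration of how these three fields can be used, the sketch below groups text snippets by source page. It runs on a hand-written dictionary; the `prov` list, the `l`/`t`/`r`/`b` box keys, the sample values, and the `texts_by_page` helper are assumptions about the serialized shape, not confirmed API.

```python
from collections import defaultdict


def texts_by_page(doc):
    """Group text snippets by the page number recorded in their provenance."""
    pages = defaultdict(list)
    for item in doc["texts"]:
        # An item may carry zero, one, or several provenance entries.
        for prov in item.get("prov", []):
            pages[prov["page_no"]].append(item["text"])
    return dict(pages)


# Illustrative data: two extracted items and one item without provenance.
doc = {
    "texts": [
        {"text": "Abstract",
         "prov": [{"page_no": 1,
                   "bbox": {"l": 72.0, "t": 700.0, "r": 300.0, "b": 680.0},
                   "charspan": [0, 8]}]},
        {"text": "Conclusion",
         "prov": [{"page_no": 2,
                   "bbox": {"l": 72.0, "t": 500.0, "r": 320.0, "b": 480.0},
                   "charspan": [0, 10]}]},
        {"text": "Added by hand", "prov": []},
    ]
}

grouped = texts_by_page(doc)

# The charspan of the first item covers the whole snippet.
start, end = doc["texts"][0]["prov"][0]["charspan"]
span_length = end - start
```

Items without provenance simply do not appear in the result, which is one reason to keep provenance intact when editing a document.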
Layout Information
When available, layout details are preserved:
Page Metadata
Page-level information is stored in the pages dictionary:
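One use of page metadata is to express a bounding box relative to the page it sits on. The sketch below assumes a pages dictionary keyed by integer page number, with each entry holding a `size` of `width` and `height`. Those names, the sample numbers, and the `normalized_bbox` helper are illustrative assumptions.

```python
def normalized_bbox(doc, prov):
    """Express a bounding box as fractions of its page's width and height."""
    size = doc["pages"][prov["page_no"]]["size"]
    box = prov["bbox"]
    return {
        "l": box["l"] / size["width"],
        "r": box["r"] / size["width"],
        "t": box["t"] / size["height"],
        "b": box["b"] / size["height"],
    }


# Illustrative data: one page and one provenance entry located on it.
doc = {"pages": {1: {"page_no": 1, "size": {"width": 600.0, "height": 800.0}}}}
prov = {"page_no": 1, "bbox": {"l": 60.0, "t": 400.0, "r": 300.0, "b": 200.0}}

fractions = normalized_bbox(doc, prov)
```

Fractions of the page size stay comparable across pages of different dimensions, which raw coordinates do not.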
Export Methods
DoclingDocument provides built-in export to common formats:
Markdown Export
HTML Export
Dictionary Export
DocTags Export
Working with Tables
TableItem instances include structured data:
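To show what structured table data can look like, the sketch below rebuilds a row-by-column grid from a flat list of cells. It assumes each cell records its text plus start and end offsets for rows and columns, with the end offset exclusive. The field names, the sample table, and the `cells_to_grid` helper are assumptions for illustration, not a confirmed description of TableItem.

```python
def cells_to_grid(table_data):
    """Rebuild a row-by-column grid of strings from a flat list of cells."""
    grid = [
        ["" for _ in range(table_data["num_cols"])]
        for _ in range(table_data["num_rows"])
    ]
    for cell in table_data["table_cells"]:
        # A spanning cell repeats its text in every grid position it covers.
        for row in range(cell["start_row_offset_idx"], cell["end_row_offset_idx"]):
            for col in range(cell["start_col_offset_idx"], cell["end_col_offset_idx"]):
                grid[row][col] = cell["text"]
    return grid


# Illustrative data: a header spanning two columns above two value cells.
table_data = {
    "num_rows": 2,
    "num_cols": 2,
    "table_cells": [
        {"text": "Totals",
         "start_row_offset_idx": 0, "end_row_offset_idx": 1,
         "start_col_offset_idx": 0, "end_col_offset_idx": 2},
        {"text": "3",
         "start_row_offset_idx": 1, "end_row_offset_idx": 2,
         "start_col_offset_idx": 0, "end_col_offset_idx": 1},
        {"text": "4",
         "start_row_offset_idx": 1, "end_row_offset_idx": 2,
         "start_col_offset_idx": 1, "end_col_offset_idx": 2},
    ],
}

grid = cells_to_grid(table_data)
```

Repeating the text of a spanning cell keeps the grid rectangular, which is convenient when handing it to tabular tools.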
Concatenating Documents
Multiple DoclingDocument instances can be merged:
Advanced Usage
Content Layers
Items can be organized into different content layers:
Traversal Options
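Restricting a traversal to chosen content layers is one way to separate body text from furniture such as headers and footers. The sketch below filters a hand-written list of text items by a `content_layer` field. The field name, the layer values `"body"` and `"furniture"`, the default of `"body"`, and the `texts_in_layers` helper are assumptions made for illustration.

```python
def texts_in_layers(doc, layers=("body",)):
    """Return the text of items whose content layer is in `layers`."""
    return [
        item["text"]
        for item in doc["texts"]
        # Items without an explicit layer are treated as body content here.
        if item.get("content_layer", "body") in layers
    ]


# Illustrative data: a running header, a paragraph, and a page number.
doc = {
    "texts": [
        {"text": "Running header", "content_layer": "furniture"},
        {"text": "First paragraph", "content_layer": "body"},
        {"text": "Page 1", "content_layer": "furniture"},
        {"text": "Second paragraph"},
    ]
}

body_only = texts_in_layers(doc)
furniture_only = texts_in_layers(doc, layers=("furniture",))
everything = texts_in_layers(doc, layers=("body", "furniture"))
```

Defaulting to the body layer keeps headers, footers, and page numbers out of downstream text unless they are asked for explicitly.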
JSON Pointers
Docling uses JSON pointers for internal references:
Best Practices
Preserve provenance when modifying documents
When creating or modifying items, maintain provenance information for traceability back to source pages.
Use iterate_items() for traversal
The iterate_items() method respects document hierarchy and reading order. Avoid directly iterating doc.texts or other lists when structure matters.
Check item types before accessing attributes
Use isinstance() checks when processing mixed content types to safely access type-specific attributes.
Leverage export methods for common formats
Use built-in export methods (export_to_markdown(), etc.) for standard use cases before implementing custom serialization.
Related Topics
Serialization
Learn how to serialize DoclingDocument to various formats
Chunking
Chunk documents for RAG and LLM applications
Architecture
Understand how DoclingDocument fits in Docling’s architecture
docling-core Source
Explore the complete type definitions