Overview
DoclingDocument is the core data structure in Docling that represents a converted document. It provides a rich, structured representation of the document content with support for various document elements (text, tables, figures, etc.) and multiple export formats.
This class is part of the docling-core library and is exposed through Docling’s API.
Class Definition
from docling_core.types.doc import DoclingDocument
Key Concepts
Document Items
A DoclingDocument is composed of items, where each item represents a structural element of the document:
- TextItem - Paragraphs, headings, captions
- TableItem - Tables with structure and content
- PictureItem - Images and figures
- SectionHeaderItem - Section headings with hierarchy
- ListItem - List items (ordered/unordered)
- DocItem - Generic document items
Item Labels
Each item has a label from the DocItemLabel enum:
- TITLE - Document title
- SECTION_HEADER - Section heading
- PARAGRAPH - Regular paragraph
- TEXT - Plain text
- LIST_ITEM - List item
- TABLE - Table
- PICTURE - Image or figure
- CAPTION - Caption for figures/tables
- FORMULA - Mathematical formula
- CODE - Code block
- PAGE_HEADER / PAGE_FOOTER - Headers and footers
- FOOTNOTE - Footnote
- And more…
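Grouping items by label is a common first step when working with a converted document. The following is a minimal sketch of that idea, not the library's API: `Label` is a hypothetical stand-in for a few DocItemLabel members, and the items are plain `(label, text)` pairs rather than real document items.

```python
from collections import defaultdict
from enum import Enum


class Label(str, Enum):
    """Hypothetical stand-in for a few DocItemLabel members."""
    TITLE = "title"
    SECTION_HEADER = "section_header"
    TEXT = "text"
    TABLE = "table"


def group_by_label(pairs):
    """Map each label to the texts carrying it, preserving input order."""
    groups = defaultdict(list)
    for label, text in pairs:
        groups[label].append(text)
    return dict(groups)


sample = [
    (Label.TITLE, "Annual Report"),
    (Label.SECTION_HEADER, "Introduction"),
    (Label.TEXT, "First paragraph."),
    (Label.TEXT, "Second paragraph."),
]
grouped = group_by_label(sample)
print(grouped[Label.TEXT])
```

Labels that never occur are simply absent from the result, so a lookup such as `grouped.get(Label.TABLE, [])` is the safe way to read it.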
Basic Attributes
- name - The document name or identifier.
- texts - List of text items in the document.
- tables - List of table items in the document.
- pictures - List of picture items in the document.
Core Methods
iterate_items()
Iterate over all items in the document in order.
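# iterate_items() yields (item, level) pairs, so unpack both values
for item, level in doc.iterate_items():
    # Not every item type carries text (tables, for example), hence getattr
    print(f"{item.label}: {getattr(item, 'text', '')}")
Returns: Document items, each paired with its nesting level, in document order.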
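To make "document order" concrete, the sketch below walks a small tree depth-first and yields each node together with its nesting level, which is the same shape as the `(item, level)` pairs that iterate_items() produces in docling-core. `Node` and `walk` are hypothetical stand-ins written for this illustration; they are not part of Docling.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a document item with nested children."""
    text: str
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, level: int = 0) -> Iterator[Tuple[Node, int]]:
    """Yield (node, level) pairs depth-first, parents before children."""
    yield node, level
    for child in node.children:
        yield from walk(child, level + 1)


body = Node("body", [
    Node("Introduction", [Node("First paragraph.")]),
    Node("Methods"),
])
for node, level in walk(body):
    # Indent each entry by its depth to show the hierarchy
    print("  " * level + node.text)
```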
export_to_markdown()
Export the document to Markdown format.
markdown_content = doc.export_to_markdown()
with open("output.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)
Returns: The document content in Markdown format.
export_to_html()
Export the document to HTML format.
html_content = doc.export_to_html()
with open("output.html", "w", encoding="utf-8") as f:
    f.write(html_content)
Returns: The document content in HTML format.
export_to_dict()
Export the document to a dictionary representation.
import json

json_dict = doc.export_to_dict()
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(json_dict, f, indent=2)
Returns: The document structure as a dictionary, suitable for JSON serialization.
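Because the exported dictionary is plain JSON-compatible data, it should survive a write and reload unchanged. The sketch below checks that round trip with a small hand-written dictionary standing in for the result of `doc.export_to_dict()`; the keys shown are illustrative assumptions, not the actual export schema.

```python
import json
import tempfile
from pathlib import Path


def save_and_reload(data: dict, path: Path) -> dict:
    """Write a dictionary as UTF-8 JSON, then read it back."""
    path.write_text(
        json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical stand-in for the result of doc.export_to_dict()
exported = {"name": "report", "texts": [{"label": "title", "text": "Résumé"}]}

with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_and_reload(exported, Path(tmp) / "output.json")

print(reloaded == exported)
```

Passing `ensure_ascii=False` together with an explicit UTF-8 encoding keeps non-ASCII text readable in the file instead of escaping it.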
save_as_markdown()
Save the document directly to a Markdown file.
doc.save_as_markdown("output.md")
Parameter: the path where the Markdown file will be saved.
save_as_html()
Save the document directly to an HTML file.
doc.save_as_html("output.html")
Parameter: the path where the HTML file will be saved.
Usage Examples
Basic Document Access
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document
print(f"Document name: {doc.name}")
print(f"Number of texts: {len(doc.texts)}")
print(f"Number of tables: {len(doc.tables)}")
print(f"Number of pictures: {len(doc.pictures)}")
Iterating Through Items
from docling_core.types.doc import DocItemLabel

doc = result.document
# Iterate all items; iterate_items() yields (item, level) pairs
for item, level in doc.iterate_items():
    if item.label == DocItemLabel.SECTION_HEADER:
        print(f"\n## {item.text}")
    # Body text may be labelled PARAGRAPH or TEXT
    elif item.label in (DocItemLabel.PARAGRAPH, DocItemLabel.TEXT):
        print(item.text)
    elif item.label == DocItemLabel.TABLE:
        print("[Table found]")
Filtering by Item Type
# Get all section headers (the nesting level is not needed here)
headers = [
    item for item, _ in doc.iterate_items()
    if item.label == DocItemLabel.SECTION_HEADER
]
for header in headers:
    print(f"Section: {header.text}")
# Get all tables
tables = [
    item for item, _ in doc.iterate_items()
    if item.label == DocItemLabel.TABLE
]
print(f"Found {len(tables)} tables")
Exporting to Multiple Formats
import json
from pathlib import Path

doc = result.document
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)
# Export to Markdown
markdown = doc.export_to_markdown()
(output_dir / "document.md").write_text(markdown, encoding="utf-8")
# Export to HTML
html = doc.export_to_html()
(output_dir / "document.html").write_text(html, encoding="utf-8")
# Export to JSON
json_dict = doc.export_to_dict()
(output_dir / "document.json").write_text(
    json.dumps(json_dict, indent=2), encoding="utf-8"
)
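When several formats are written side by side, the repeated path handling can be folded into one helper. The sketch below is one possible arrangement under stated assumptions: `write_exports` is a hypothetical helper, not a Docling function, and the placeholder strings stand in for the output of `doc.export_to_markdown()` and `doc.export_to_html()`.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def write_exports(output_dir: Path, stem: str, exports: Dict[str, str]) -> List[Path]:
    """Write each exported string to <output_dir>/<stem>.<extension>."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for extension, content in exports.items():
        target = output_dir / f"{stem}.{extension}"
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written


# Placeholder strings stand in for the real export results
with tempfile.TemporaryDirectory() as tmp:
    paths = write_exports(Path(tmp) / "output", "document", {
        "md": "# Title\n",
        "html": "<h1>Title</h1>\n",
    })
    names = [p.name for p in paths]
    contents = [p.read_text(encoding="utf-8") for p in paths]

print(names)
```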
Working with Tables
from docling_core.types.doc import TableItem

doc = result.document
# Access tables directly
for i, table in enumerate(doc.tables, 1):
    print(f"\nTable {i}:")
    # Tables have structured data; the row count is stored on table.data
    if hasattr(table, 'data'):
        print(f"Rows: {table.data.num_rows}")
# Or iterate to find tables
for item, _ in doc.iterate_items():
    if isinstance(item, TableItem):
        # Captions are stored as references, so resolve them against the document
        print(f"Found table: {item.caption_text(doc) or 'No caption'}")
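To illustrate what "structured data" means for a table, the sketch below treats a table as a rectangular grid of cell strings and renders it as Markdown. The nested list is a simplified stand-in chosen for this example, and `grid_to_markdown` is a hypothetical helper; the real TableItem stores richer cell objects.

```python
from typing import List


def grid_to_markdown(grid: List[List[str]]) -> str:
    """Render a rectangular grid as a Markdown table, first row as header."""
    header, *rows = grid
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines.extend("| " + " | ".join(row) + " |" for row in rows)
    return "\n".join(lines)


grid = [["Name", "Value"], ["alpha", "1"], ["beta", "2"]]
print(f"Rows: {len(grid)}, columns: {len(grid[0])}")
print(grid_to_markdown(grid))
```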
Working with Text Content
from docling_core.types.doc import SectionHeaderItem

doc = result.document
# Build a table of contents
toc = []
for item, _ in doc.iterate_items():
    if isinstance(item, SectionHeaderItem):
        toc.append(item.text)
print("Table of Contents:")
for i, heading in enumerate(toc, 1):
    print(f"{i}. {heading}")
# Get all text content
all_text = []
for item, _ in doc.iterate_items():
    # Items without a text attribute (tables, pictures) are skipped
    text = getattr(item, "text", None)
    if text:
        all_text.append(text)
full_text = "\n\n".join(all_text)
print(full_text)
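A flat list of headings loses the section hierarchy. If each heading's level is available, a nested outline can be numbered as in the sketch below. It works on plain `(level, title)` pairs and assumes levels start at 1; `number_headings` is a hypothetical helper written for this illustration.

```python
from typing import Iterable, List, Tuple


def number_headings(headings: Iterable[Tuple[int, str]]) -> List[str]:
    """Turn (level, title) pairs into outline entries such as '1.2 Title'."""
    counters: List[int] = []
    entries = []
    for level, title in headings:
        # Grow the counter list for deeper levels, trim it for shallower ones
        while len(counters) < level:
            counters.append(0)
        del counters[level:]
        counters[level - 1] += 1
        number = ".".join(str(c) for c in counters)
        entries.append(f"{number} {title}")
    return entries


outline = number_headings([
    (1, "Introduction"),
    (2, "Background"),
    (2, "Scope"),
    (1, "Methods"),
    (2, "Data"),
])
for entry in outline:
    print(entry)
```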
Searching Document Content
import re

def search_document(doc, pattern):
    """Search for a pattern in the document."""
    results = []
    for item, _ in doc.iterate_items():
        # Items without a text attribute cannot match
        text = getattr(item, "text", None)
        if text and re.search(pattern, text, re.IGNORECASE):
            results.append({
                'label': item.label,
                'text': text,
                'item': item
            })
    return results

# Search for a term (named "matches" so the conversion result is not shadowed)
matches = search_document(doc, r'artificial intelligence')
for match in matches:
    print(f"{match['label']}: {match['text'][:100]}...")
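Printing the first 100 characters of a matching item can hide the match itself when it occurs later in the text. A possible refinement, sketched here on plain `(label, text)` pairs instead of real document items, reports every match with a window of surrounding context. `search_texts` and its `context` parameter are hypothetical names introduced for this example.

```python
import re
from typing import Iterable, List, Tuple


def search_texts(
    pairs: Iterable[Tuple[str, str]], pattern: str, context: int = 20
) -> List[dict]:
    """Return every case-insensitive match with surrounding context."""
    compiled = re.compile(pattern, re.IGNORECASE)
    hits = []
    for label, text in pairs:
        for match in compiled.finditer(text):
            # Clamp the context window to the bounds of the text
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            hits.append({
                "label": label,
                "match": match.group(0),
                "snippet": text[start:end],
            })
    return hits


sample = [
    ("section_header", "Artificial Intelligence in Practice"),
    ("text", "Recent work on artificial intelligence covers many fields."),
    ("text", "No match here."),
]
hits = search_texts(sample, r"artificial intelligence", context=10)
for hit in hits:
    print(f"{hit['label']}: ...{hit['snippet']}...")
```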
Document Statistics
from collections import Counter

def get_document_stats(doc):
    """Get statistics about document structure."""
    label_counts = Counter()
    total_text_length = 0
    for item, _ in doc.iterate_items():
        label_counts[item.label] += 1
        # Only items that carry text contribute to the length
        text = getattr(item, "text", None)
        if text:
            total_text_length += len(text)
    return {
        'label_counts': dict(label_counts),
        'total_items': sum(label_counts.values()),
        'total_text_length': total_text_length
    }

stats = get_document_stats(doc)
print(f"Total items: {stats['total_items']}")
print(f"Total text length: {stats['total_text_length']} characters")
print("\nItem type distribution:")
for label, count in stats['label_counts'].items():
    print(f"  {label}: {count}")
See Also