Information Extraction

Extract structured information from documents using Vision-Language Models with schema-based extraction.

Overview

Docling provides AI-powered information extraction to convert unstructured documents into structured data. This example demonstrates:

Defining data schemas with dictionaries or Pydantic models
Extracting data organized by page
Using default values and examples
Validating extracted data with Pydantic

The extraction API is currently in beta and may change without prior notice.

Installation

pip install "docling[vlm]"

Basic Setup

from docling.datamodel.base_models import InputFormat
from docling.document_extractor import DocumentExtractor

extractor = DocumentExtractor(
    allowed_formats=[InputFormat.IMAGE, InputFormat.PDF]
)

Extract with String Template

The simplest approach uses a JSON string template:

file_path = "https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg"

result = extractor.extract(
    source=file_path,
    template='{"bill_no": "string", "total": "float"}',
)

print(result.pages)
# Output:
# [ExtractedPageData(
#     page_no=1,
#     extracted_data={'bill_no': '3139', 'total': 3949.75},
#     raw_text='{"bill_no": "3139", "total": 3949.75}',
#     errors=[]
# )]
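
Hand-written JSON strings are easy to get wrong (a missing quote or a trailing comma). As a minimal sketch using only the Python standard library, the template string can be generated from a dictionary with `json.dumps`, and a `raw_text` value like the one in the output above can be parsed back with `json.loads`. The helper name `build_template` is hypothetical and not part of Docling.

```python
import json


def build_template(fields: dict) -> str:
    """Serialize a field-name -> type-name mapping into a JSON string template."""
    return json.dumps(fields)


# Same fields as the string template shown above
template = build_template({"bill_no": "string", "total": "float"})
print(template)

# raw_text copied from the sample output above
raw_text = '{"bill_no": "3139", "total": 3949.75}'
parsed = json.loads(raw_text)
print(parsed["bill_no"], parsed["total"])
```

The generated string can then be passed as the `template` argument in the same way as the literal above.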

Extract with Dict Template

For better Python integration, use a dictionary:

result = extractor.extract(
    source=file_path,
    template={
        "bill_no": "string",
        "total": "float",
    },
)

print(result.pages[0].extracted_data)
# {'bill_no': '3139', 'total': 3949.75}

Extract with Pydantic Model

1. Define Pydantic Model: create a model with field types, defaults, and examples.
2. Use Model as Template: pass the model class or an instance to the extractor.
3. Validate Results: load the extracted data into Pydantic for validation.

from typing import Optional
from pydantic import BaseModel, Field

class Invoice(BaseModel):
    bill_no: str = Field(
        examples=["A123", "5414"]
    )
    total: float = Field(
        default=10,
        examples=[20]
    )
    tax_id: Optional[str] = Field(
        default=None,
        examples=["1234567890"]
    )

# Use model class
result = extractor.extract(
    source=file_path,
    template=Invoice,
)

print(result.pages[0].extracted_data)
# {'bill_no': '3139', 'total': 3949.75, 'tax_id': None}

Override Default Values

You can provide a model instance with context-specific defaults:

result = extractor.extract(
    source=file_path,
    template=Invoice(
        bill_no="41",
        total=100,
        tax_id="42",
    ),
)

print(result.pages[0].extracted_data)
# If tax_id is not found in the document, the result uses '42' instead of None
# {'bill_no': '3139', 'total': 3949.75, 'tax_id': '42'}

Hierarchical Pydantic Models

Define nested structures for complex documents:

class Contact(BaseModel):
    name: Optional[str] = Field(default=None, examples=["Smith"])
    address: str = Field(default="123 Main St", examples=["456 Elm St"])
    postal_code: str = Field(default="12345", examples=["67890"])
    city: str = Field(default="Anytown", examples=["Othertown"])
    country: Optional[str] = Field(default=None, examples=["Canada"])

class ExtendedInvoice(BaseModel):
    bill_no: str = Field(examples=["A123", "5414"])
    total: float = Field(default=10, examples=[20])
    garden_work_hours: int = Field(default=1, examples=[2])
    sender: Contact = Field(default=Contact(), examples=[Contact()])
    receiver: Contact = Field(default=Contact(), examples=[Contact()])

result = extractor.extract(
    source=file_path,
    template=ExtendedInvoice,
)

print(result.pages[0].extracted_data)
# {
#     'bill_no': '3139',
#     'total': 3949.75,
#     'garden_work_hours': 28,
#     'sender': {
#         'name': 'Robert Schneider',
#         'address': 'Rue du Lac 1268',
#         'postal_code': '2501',
#         'city': 'Biel',
#         'country': 'Switzerland'
#     },
#     'receiver': {
#         'name': 'Pia Rutschmann',
#         'address': 'Marktgasse 28',
#         'postal_code': '9400',
#         'city': 'Rorschach',
#         'country': 'Switzerland'
#     }
# }
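
Nested results like the one above are plain dictionaries, so they can be post-processed with ordinary Python. The following is a rough sketch, not a Docling feature: the hypothetical helper `flatten` turns nested keys into dotted names, which can be convenient when writing results to a flat format such as CSV. The sample data is a shortened copy of the output above.

```python
def flatten(data: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in data.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested structures such as 'sender' and 'receiver'
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Shortened copy of the extracted_data shown above
extracted = {
    "bill_no": "3139",
    "total": 3949.75,
    "sender": {"name": "Robert Schneider", "city": "Biel"},
}
flat = flatten(extracted)
print(flat)
```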

Validate and Load Data

Use Pydantic to validate extracted data:

invoice = ExtendedInvoice.model_validate(
    result.pages[0].extracted_data
)

print(invoice)
# ExtendedInvoice(
#     bill_no='3139',
#     total=3949.75,
#     garden_work_hours=28,
#     sender=Contact(...),
#     receiver=Contact(...)
# )

# Access typed fields
print(f"Invoice #{invoice.bill_no} from {invoice.sender.name}")
print(f"Total: ${invoice.total}")
print(f"Sender: {invoice.sender.name}, {invoice.sender.city}")
print(f"Receiver: {invoice.receiver.name}, {invoice.receiver.city}")
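
Model output does not always match the schema, and `model_validate` raises `ValidationError` when it does not. The sketch below assumes Pydantic v2 and reuses the `Invoice` model from earlier on this page. The wrapper `validate_page` is a hypothetical helper, and the second input is deliberately malformed for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123", "5414"])
    total: float = Field(default=10, examples=[20])
    tax_id: Optional[str] = Field(default=None, examples=["1234567890"])


def validate_page(data: dict) -> Optional[Invoice]:
    """Return a validated Invoice, or None if the data does not fit the schema."""
    try:
        return Invoice.model_validate(data)
    except ValidationError as exc:
        # Each error entry reports the offending field location and a message
        for err in exc.errors():
            print(f"{err['loc']}: {err['msg']}")
        return None


good = validate_page({"bill_no": "3139", "total": 3949.75, "tax_id": None})
bad = validate_page({"bill_no": "3139", "total": "not a number"})
print(good)
print(bad)
```

Returning `None` keeps the calling loop simple. Code that must not silently skip pages could re-raise the exception instead.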

Complete Example

from typing import Optional
from pydantic import BaseModel, Field
from docling.datamodel.base_models import InputFormat
from docling.document_extractor import DocumentExtractor

# Define schema
class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123"])
    total: float = Field(default=0.0, examples=[100.0])
    date: Optional[str] = Field(default=None, examples=["2024-01-01"])
    vendor: Optional[str] = Field(default=None, examples=["ACME Corp"])

# Create extractor
extractor = DocumentExtractor(
    allowed_formats=[InputFormat.IMAGE, InputFormat.PDF]
)

# Extract from document
file_path = "invoice.jpg"
result = extractor.extract(
    source=file_path,
    template=Invoice,
)

# Validate and use
for page_data in result.pages:
    invoice = Invoice.model_validate(page_data.extracted_data)
    print(f"Invoice {invoice.bill_no}: ${invoice.total}")
    print(f"Vendor: {invoice.vendor}")
    print(f"Date: {invoice.date}")
    
    # Handle errors
    if page_data.errors:
        print(f"Errors: {page_data.errors}")

Result Structure

Each entry in result.pages is an ExtractedPageData object with the following fields:

class ExtractedPageData:
    page_no: int                    # Page number
    extracted_data: dict            # Extracted structured data
    raw_text: str                   # Raw model output
    errors: list                    # Any extraction errors
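
Because every page carries its own `errors` list, a multi-page result can be split into usable and failed pages before further processing. The sketch below uses a stand-in dataclass, `PageStub`, with the same four fields as the structure above. It is not a Docling class, and the error message is made up. The hypothetical helper `split_pages` relies only on attribute access, so under the assumption that real page objects expose these fields as shown, it should accept `result.pages` as well.

```python
from dataclasses import dataclass, field


@dataclass
class PageStub:
    """Stand-in mirroring the four fields listed above (not a Docling class)."""
    page_no: int
    extracted_data: dict
    raw_text: str = ""
    errors: list = field(default_factory=list)


def split_pages(pages):
    """Return (data_by_page, errors_by_page), keyed by page number."""
    data_by_page, errors_by_page = {}, {}
    for page in pages:
        if page.errors:
            errors_by_page[page.page_no] = list(page.errors)
        else:
            data_by_page[page.page_no] = page.extracted_data
    return data_by_page, errors_by_page


pages = [
    PageStub(1, {"bill_no": "3139", "total": 3949.75}),
    PageStub(2, {}, errors=["made-up error message"]),
]
data_by_page, errors_by_page = split_pages(pages)
print(data_by_page)
print(errors_by_page)
```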

Template Types

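The template argument accepts a JSON string, a dictionary, a Pydantic model class, or a Pydantic model instance, as shown in the sections above. As a JSON string:

template = '{"field1": "string", "field2": "int"}'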

Best Practices

Use Pydantic models for complex schemas and validation
Provide examples in Field definitions to guide extraction
Set sensible defaults for optional or missing fields
Use hierarchical models for nested document structures
Validate results with Pydantic before using extracted data

Supported Input Formats

PDF: Both native and scanned PDFs
Images: PNG, JPG, TIFF

Only PDF and image formats are supported for extraction.
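
When processing a folder of mixed files, it can help to filter out unsupported inputs before calling the extractor. This is a minimal sketch using only the standard library. The suffix set is an assumption derived from the formats listed above, not a list taken from Docling, and `is_supported` is a hypothetical helper.

```python
from pathlib import Path

# Assumed suffixes for the formats listed above; adjust if Docling accepts others
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def is_supported(path: str) -> bool:
    """Check the file suffix case-insensitively against the assumed set."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


candidates = ["invoice.jpg", "scan.PDF", "notes.docx", "photo.tiff"]
supported = [p for p in candidates if is_supported(p)]
print(supported)
```

A suffix check is only a first filter, since it does not inspect file contents.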

Requirements

Python 3.9+
docling[vlm] for VLM model support
Network access to download model weights on first use
