Extract structured information from documents using Vision-Language Models with schema-based extraction.
Overview
Docling provides AI-powered information extraction to convert unstructured documents into structured data. This example demonstrates:
Defining data schemas with dictionaries or Pydantic models
Extracting data organized by page
Using default values and examples
Validating extracted data with Pydantic
The extraction API is currently in beta and may change without prior notice.
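Installation
Install Docling with the VLM extra, which provides the VLM model support listed under Requirements:
pip install "docling[vlm]"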
Basic Setup
from docling.datamodel.base_models import InputFormat
from docling.document_extractor import DocumentExtractor

extractor = DocumentExtractor(
    allowed_formats=[InputFormat.IMAGE, InputFormat.PDF]
)
The simplest approach uses a JSON string template:
file_path = "https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg"

result = extractor.extract(
    source=file_path,
    template='{"bill_no": "string", "total": "float"}',
)
print(result.pages)
# Output:
# [ExtractedPageData(
#     page_no=1,
#     extracted_data={'bill_no': '3139', 'total': 3949.75},
#     raw_text='{"bill_no": "3139", "total": 3949.75}',
#     errors=[]
# )]
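Because the string template is plain JSON, one way to avoid quoting mistakes is to generate it with `json.dumps` instead of typing it by hand. The sketch below only builds the string; it reuses the field names from the template above and does not call Docling.

```python
import json

# Field names and type labels copied from the template above.
fields = {"bill_no": "string", "total": "float"}

# json.dumps produces valid JSON (double quotes, proper escaping).
template = json.dumps(fields)
print(template)

# Round-trip check: the string parses back to the same mapping.
assert json.loads(template) == fields
```

The resulting string can be passed as `template` exactly like the hand-written one.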
For better Python integration, use a dictionary:
result = extractor.extract(
    source=file_path,
    template={
        "bill_no": "string",
        "total": "float",
    },
)
print(result.pages[0].extracted_data)
# {'bill_no': '3139', 'total': 3949.75}
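With a dict template, the template keys can double as a checklist for the result. The following is a minimal sketch of such a check; `missing_fields` is a hypothetical helper, not part of Docling, and the `partial` page is made up for illustration.

```python
def missing_fields(template: dict, extracted: dict) -> list:
    """Return template keys that are absent or None in the extracted data."""
    return [key for key in template if extracted.get(key) is None]


template = {"bill_no": "string", "total": "float"}

# Values taken from the output shown above.
complete = {"bill_no": "3139", "total": 3949.75}
# Hypothetical page where no total was found.
partial = {"bill_no": "3139"}

print(missing_fields(template, complete))  # []
print(missing_fields(template, partial))   # ['total']
```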
Define Pydantic Model
Create a model with field types, defaults, and examples.
Use Model as Template
Pass the model class or instance to the extractor.
Validate Results
Load extracted data into Pydantic for validation.
from typing import Optional

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    bill_no: str = Field(
        examples=["A123", "5414"]
    )
    total: float = Field(
        default=10,
        examples=[20]
    )
    tax_id: Optional[str] = Field(
        default=None,
        examples=["1234567890"]
    )


# Use model class
result = extractor.extract(
    source=file_path,
    template=Invoice,
)
print(result.pages[0].extracted_data)
# {'bill_no': '3139', 'total': 3949.75, 'tax_id': None}
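How Docling turns a model class into instructions for the VLM is not described here. As a rough illustration of what metadata the class carries, the sketch below uses Pydantic's own `model_json_schema()` (Pydantic v2 assumed) to list each field's default and examples; it does not call Docling.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123", "5414"])
    total: float = Field(default=10, examples=[20])
    tax_id: Optional[str] = Field(default=None, examples=["1234567890"])


# The JSON schema exposes field names, defaults, and examples.
schema = Invoice.model_json_schema()
for name, spec in schema["properties"].items():
    print(name, "| default:", spec.get("default"), "| examples:", spec.get("examples"))

# Fields without a default are reported as required.
print("required:", schema.get("required"))
```

Here `bill_no` is the only required field, because it is the only one declared without a default.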
Override Default Values
You can provide a model instance with context-specific defaults:
result = extractor.extract(
    source=file_path,
    template=Invoice(
        bill_no="41",
        total=100,
        tax_id="42",
    ),
)
print(result.pages[0].extracted_data)
# If tax_id not found in document, uses '42' instead of None
# {'bill_no': '3139', 'total': 3949.75, 'tax_id': '42'}
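To see which fallback values an instance template carries before handing it to the extractor, it can be dumped with Pydantic's `model_dump()`. This sketch assumes Pydantic v2 and says nothing about how Docling consumes the instance internally.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123", "5414"])
    total: float = Field(default=10, examples=[20])
    tax_id: Optional[str] = Field(default=None, examples=["1234567890"])


# Instance with context-specific values, as in the example above.
instance = Invoice(bill_no="41", total=100, tax_id="42")

# model_dump() returns the instance's values as a plain dict.
overrides = instance.model_dump()
print(overrides)
```

Note that `total=100` is coerced to a float because the field is declared as `float`.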
Hierarchical Pydantic Models
Define nested structures for complex documents:
class Contact(BaseModel):
    name: Optional[str] = Field(default=None, examples=["Smith"])
    address: str = Field(default="123 Main St", examples=["456 Elm St"])
    postal_code: str = Field(default="12345", examples=["67890"])
    city: str = Field(default="Anytown", examples=["Othertown"])
    country: Optional[str] = Field(default=None, examples=["Canada"])


class ExtendedInvoice(BaseModel):
    bill_no: str = Field(examples=["A123", "5414"])
    total: float = Field(default=10, examples=[20])
    garden_work_hours: int = Field(default=1, examples=[2])
    sender: Contact = Field(default=Contact(), examples=[Contact()])
    receiver: Contact = Field(default=Contact(), examples=[Contact()])


result = extractor.extract(
    source=file_path,
    template=ExtendedInvoice,
)
print(result.pages[0].extracted_data)
# {
#     'bill_no': '3139',
#     'total': 3949.75,
#     'garden_work_hours': 28,
#     'sender': {
#         'name': 'Robert Schneider',
#         'address': 'Rue du Lac 1268',
#         'postal_code': '2501',
#         'city': 'Biel',
#         'country': 'Switzerland'
#     },
#     'receiver': {
#         'name': 'Pia Rutschmann',
#         'address': 'Marktgasse 28',
#         'postal_code': '9400',
#         'city': 'Rorschach',
#         'country': 'Switzerland'
#     }
# }
Validate and Load Data
Use Pydantic to validate extracted data:
invoice = ExtendedInvoice.model_validate(
    result.pages[0].extracted_data
)
print(invoice)
# ExtendedInvoice(
#     bill_no='3139',
#     total=3949.75,
#     garden_work_hours=28,
#     sender=Contact(...),
#     receiver=Contact(...)
# )

# Access typed fields
print(f"Invoice #{invoice.bill_no} from {invoice.sender.name}")
print(f"Total: ${invoice.total}")
print(f"Sender: {invoice.sender.name}, {invoice.sender.city}")
print(f"Receiver: {invoice.receiver.name}, {invoice.receiver.city}")
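Model output is not guaranteed to match the schema, so validation can fail. A minimal sketch of handling that case with Pydantic's `ValidationError` follows; the malformed `bad_data` dict is hypothetical and stands in for a page's `extracted_data`.

```python
from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123", "5414"])
    total: float = Field(default=10, examples=[20])


# Hypothetical malformed output: bill_no is missing, total is not numeric.
bad_data = {"total": "not a number"}

failed = []
try:
    Invoice.model_validate(bad_data)
except ValidationError as exc:
    # Each error records the location of the offending field.
    failed = sorted(err["loc"][0] for err in exc.errors())
    print("Validation failed for:", failed)
```

Catching the error per page lets the remaining pages be processed instead of aborting the whole run.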
Complete Example
from typing import Optional

from pydantic import BaseModel, Field

from docling.datamodel.base_models import InputFormat
from docling.document_extractor import DocumentExtractor


# Define schema
class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123"])
    total: float = Field(default=0.0, examples=[100.0])
    date: Optional[str] = Field(default=None, examples=["2024-01-01"])
    vendor: Optional[str] = Field(default=None, examples=["ACME Corp"])


# Create extractor
extractor = DocumentExtractor(
    allowed_formats=[InputFormat.IMAGE, InputFormat.PDF]
)

# Extract from document
file_path = "invoice.jpg"
result = extractor.extract(
    source=file_path,
    template=Invoice,
)

# Validate and use
for page_data in result.pages:
    invoice = Invoice.model_validate(page_data.extracted_data)
    print(f"Invoice {invoice.bill_no}: ${invoice.total}")
    print(f"Vendor: {invoice.vendor}")
    print(f"Date: {invoice.date}")

    # Handle errors
    if page_data.errors:
        print(f"Errors: {page_data.errors}")
Result Structure
Each entry in result.pages is an ExtractedPageData with the following fields:
class ExtractedPageData:
    page_no: int          # Page number
    extracted_data: dict  # Extracted structured data
    raw_text: str         # Raw model output
    errors: list          # Any extraction errors
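Since results are organized by page, a common first step is to separate pages that extracted cleanly from pages that reported errors. The sketch below uses a stand-in dataclass, `PageResult`, that mirrors the four fields listed above; it is not Docling's class, and the second page is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PageResult:
    """Stand-in with the four fields listed above (not Docling's class)."""
    page_no: int
    extracted_data: dict
    raw_text: str
    errors: list = field(default_factory=list)


def split_pages(pages):
    """Return (clean, failed) lists based on each page's errors field."""
    clean = [p for p in pages if not p.errors]
    failed = [p for p in pages if p.errors]
    return clean, failed


pages = [
    PageResult(1, {"bill_no": "3139", "total": 3949.75},
               '{"bill_no": "3139", "total": 3949.75}'),
    PageResult(2, {}, "", errors=["hypothetical parse failure"]),
]

clean, failed = split_pages(pages)
print([p.page_no for p in clean], [p.page_no for p in failed])  # [1] [2]
```

The same function should work on real page objects, provided they expose an `errors` attribute as described above.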
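Template Types
The template argument accepts four forms: a string template, a dict template, a Pydantic class, or a Pydantic instance. A string template is a JSON object that maps field names to type names:
template = '{"field1": "string", "field2": "int"}'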
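For comparison, the four template forms used throughout this example can be written side by side. This sketch only constructs the template values (Pydantic v2 assumed); the `extract` call is left as a comment because it needs an extractor configured as in Basic Setup.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    bill_no: str = Field(examples=["A123"])
    total: float = Field(default=0.0, examples=[100.0])
    tax_id: Optional[str] = Field(default=None)


templates = {
    "string": '{"bill_no": "string", "total": "float"}',
    "dict": {"bill_no": "string", "total": "float"},
    "pydantic class": Invoice,
    "pydantic instance": Invoice(bill_no="41", total=100, tax_id="42"),
}

for label, template in templates.items():
    print(f"{label}: {type(template).__name__}")
    # Each form would be passed the same way:
    # result = extractor.extract(source=file_path, template=template)
```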
Best Practices
Use Pydantic models for complex schemas and validation
Provide examples in Field definitions to guide extraction
Set sensible defaults for optional or missing fields
Use hierarchical models for nested document structures
Validate results with Pydantic before using extracted data
Supported Formats
PDF: Both native and scanned PDFs
Images: PNG, JPG, TIFF
Only PDF and image formats are supported for extraction.
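When processing a mixed folder or a list of URLs, it can help to skip unsupported inputs before calling the extractor. The following is a heuristic sketch based on file suffixes only; `looks_supported` is a hypothetical helper, and the suffix set (including `.jpeg` and `.tif`) is an assumption derived from the formats listed above, not something Docling exposes.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed mapping from the listed formats to common file suffixes.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def looks_supported(source: str) -> bool:
    """Guess from the suffix whether a path or URL is a PDF or an image."""
    # For URLs this strips the query string; plain paths pass through.
    path = urlparse(source).path
    return PurePosixPath(path).suffix.lower() in SUPPORTED_SUFFIXES


sources = [
    "https://upload.wikimedia.org/wikipedia/commons/9/9f/Swiss_QR-Bill_example.jpg",
    "invoice.jpg",
    "report.docx",
]
for source in sources:
    print(source, "->", looks_supported(source))
```

A suffix check cannot detect mislabeled files, so errors raised by the extractor itself still need handling.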
Requirements
Python 3.9+
docling[vlm] for VLM model support
Network access to download model weights on first use