Overview
Docling’s information extraction feature enables you to extract structured data from documents using a template-based approach. This is particularly useful for:
Extracting specific fields from forms and invoices
Parsing structured documents with known schemas
Converting semi-structured documents into JSON or Pydantic models
Building document understanding pipelines
The extraction API uses vision-language models (VLMs) to understand document content and extract information according to your specified template.
The extraction API is currently experimental and may change without prior notice. Only PDF and image formats are supported.
Quick Start
from docling.document_extractor import DocumentExtractor
from pydantic import BaseModel

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor_name: str

extractor = DocumentExtractor()
result = extractor.extract(
    source="invoice.pdf",
    template=Invoice,
)
print(result.pages[0].extracted_data)
Initialization
The DocumentExtractor class provides the main interface for extraction:
from docling.document_extractor import DocumentExtractor
from docling.datamodel.base_models import InputFormat
extractor = DocumentExtractor(
    allowed_formats=[InputFormat.PDF, InputFormat.IMAGE]
)
Parameters:
allowed_formats: List of input formats to process (default: all formats)
extraction_format_options: Per-format configuration (see Configuration )
Single Document Extraction
Extract from a single document using the extract() method:
result = extractor.extract(
    source="document.pdf",
    template=YourTemplate,
    raises_on_error=True,
    max_num_pages=10,
    page_range=[1, 5],
)
Source: docling/document_extractor.py:126
Parameters:
source: Path (str/Path) or DocumentStream to the document
template: Extraction template (string, dict, Pydantic model, or BaseModel class)
headers: Optional HTTP headers for remote documents
raises_on_error: Whether to raise exceptions on errors (default: True)
max_num_pages: Maximum number of pages to process
max_file_size: Maximum file size in bytes
page_range: Specific page range to extract
Returns: ExtractionResult object with extracted data
Batch Extraction
Process multiple documents efficiently with the extract_all() method, which yields one result per document:
sources = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]

for result in extractor.extract_all(
    source=sources,
    template=YourTemplate,
):
    print(f"Extracted from {result.input.file.name}")
    for page in result.pages:
        print(page.extracted_data)
Source: docling/document_extractor.py:147
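When post-processing a batch, it is often convenient to flatten per-page results into one record per page. A minimal sketch of such a helper (hypothetical, not part of Docling's API — it operates on plain (filename, pages) pairs, so with real results you would pass `(result.input.file.name, [p.extracted_data for p in result.pages])` for each ExtractionResult):

```python
from typing import Any, Dict, Iterable, List, Tuple

def flatten_batch(
    results: Iterable[Tuple[str, List[Dict[str, Any]]]],
) -> List[Dict[str, Any]]:
    """Flatten (filename, per-page data) pairs into one record per page.

    Pages whose extracted_data is None still produce a record, so you can
    see which file/page combinations came back empty.
    """
    records = []
    for name, pages in results:
        # Page numbers in Docling are 1-indexed, so enumerate from 1.
        for page_no, data in enumerate(pages, start=1):
            records.append({"file": name, "page": page_no, **(data or {})})
    return records
```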
Using Pydantic Models
Define structured schemas using Pydantic:
from pydantic import BaseModel, Field
from typing import List, Optional

class ContactInfo(BaseModel):
    email: str = Field(description="Email address")
    phone: Optional[str] = Field(description="Phone number")

class Resume(BaseModel):
    name: str
    title: str
    contact: ContactInfo
    skills: List[str]
    experience_years: int

result = extractor.extract(
    source="resume.pdf",
    template=Resume,
)
The VLM uses field names, types, and descriptions to guide extraction.
Using Dictionaries
For simpler use cases, use dictionary templates:
template = {
    "product_name": "string",
    "price": "float",
    "in_stock": "boolean",
    "categories": "list of strings",
}

result = extractor.extract(
    source="product.pdf",
    template=template,
)
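The type names in a dict template are free-text hints to the model, so nothing guarantees the output actually matches them. A minimal post-hoc check might look like the following (a hypothetical helper, not part of Docling's API; the recognized type names mirror the ones used in the example above):

```python
from typing import Any, Dict, List, Tuple

# Maps the informal type hints used in dict templates to Python types.
_TYPE_CHECKS = {
    "string": str,
    "float": (int, float),
    "boolean": bool,
    "list of strings": list,
}

def check_against_template(
    data: Dict[str, Any], template: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (key, problem) pairs for fields that are missing from the
    extracted data or whose value does not match the declared type."""
    problems = []
    for key, type_name in template.items():
        if key not in data:
            problems.append((key, "missing"))
        elif type_name in _TYPE_CHECKS and not isinstance(
            data[key], _TYPE_CHECKS[type_name]
        ):
            problems.append((key, f"expected {type_name}"))
    return problems
```

For Pydantic templates this is unnecessary — validating with the model class (as shown in the sections above and below) performs stricter checking.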
Using String Prompts
Provide natural language instructions:
template = """
Extract the following information:
- Meeting title
- Date and time
- Attendees (as a list)
- Action items (as a list)
"""
result = extractor.extract(
source = "meeting_notes.pdf" ,
template = template
)
Extraction Results
The ExtractionResult object contains:
Source: docling/datamodel/extraction.py:25
class ExtractionResult:
    input: InputDocument            # Input document metadata
    status: ConversionStatus        # SUCCESS, FAILURE, PARTIAL_SUCCESS
    errors: List[ErrorItem]         # Any errors encountered
    pages: List[ExtractedPageData]  # Extracted data per page
Page-Level Data
Each page contains:
Source: docling/datamodel/extraction.py:11
class ExtractedPageData:
    page_no: int                              # 1-indexed page number
    extracted_data: Optional[Dict[str, Any]]  # Structured data
    raw_text: Optional[str]                   # Raw extracted text
    errors: List[str]                         # Page-specific errors
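Since errors are tracked per page, it can be useful to summarize which pages failed. A small sketch (illustrative only — it operates on plain dicts mirroring the page_no and errors fields above, rather than real ExtractedPageData objects):

```python
from typing import Any, Dict, List

def collect_page_errors(pages: List[Dict[str, Any]]) -> Dict[int, List[str]]:
    """Map page numbers to their error lists, skipping clean pages."""
    return {p["page_no"]: p["errors"] for p in pages if p["errors"]}
```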
Accessing Results
from docling.datamodel.base_models import ConversionStatus

result = extractor.extract(source="doc.pdf", template=MyTemplate)

# Check status
if result.status == ConversionStatus.SUCCESS:
    # Access first page data
    page_data = result.pages[0].extracted_data

    # Validate against Pydantic model
    validated = MyTemplate(**page_data)
    print(validated.model_dump_json(indent=2))

# Handle errors
for error in result.errors:
    print(f"Error in {error.component_type}: {error.error_message}")
Configuration
Custom Pipeline Options
Configure the extraction pipeline per format:
from docling.document_extractor import (
    DocumentExtractor,
    ExtractionFormatOption,
)
from docling.pipeline.extraction_vlm_pipeline import ExtractionVlmPipeline
from docling.datamodel.pipeline_options import PipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

pipeline_options = PipelineOptions()
# Configure VLM settings here

extractor = DocumentExtractor(
    extraction_format_options={
        InputFormat.PDF: ExtractionFormatOption(
            pipeline_cls=ExtractionVlmPipeline,
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,
        )
    }
)
Source: docling/document_extractor.py:48
Batch Processing Settings
Control concurrency and batch sizes:
from docling.datamodel.settings import settings

# Adjust batch size for extraction
settings.perf.doc_batch_size = 10
settings.perf.doc_batch_concurrency = 4
Best Practices
Use descriptive field names that match document terminology
Add Field descriptions for complex extractions
Use appropriate types (str, int, float, bool, List, Optional)
Keep templates focused on specific document types
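For dict templates, the practices above can even be checked mechanically. A sketch of such a template "linter" (entirely hypothetical, not part of Docling; the allowed type hints are simply the ones this guide's examples use):

```python
import re
from typing import Dict, List

ALLOWED_TYPES = {"string", "int", "float", "boolean", "list of strings"}
_SNAKE = re.compile(r"^[a-z][a-z0-9_]*$")

def lint_dict_template(template: Dict[str, str]) -> List[str]:
    """Flag fields whose names are not lowercase snake_case or whose
    type hints are not among the ones used in this guide."""
    issues = []
    for name, type_hint in template.items():
        if not _SNAKE.match(name):
            issues.append(f"{name}: use lowercase snake_case")
        if type_hint not in ALLOWED_TYPES:
            issues.append(f"{name}: unrecognized type hint '{type_hint}'")
    return issues
```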
Handle extraction errors defensively:

from docling.exceptions import ConversionError

try:
    result = extractor.extract(
        source="document.pdf",
        template=MyTemplate,
        raises_on_error=True,
    )
except ConversionError as e:
    print(f"Extraction failed: {e}")
    # Fallback logic
# Validate extracted data
from pydantic import ValidationError

try:
    validated = MyTemplate(**result.pages[0].extracted_data)
except ValidationError as e:
    print("Validation errors:")
    for error in e.errors():
        print(f"  {error['loc']}: {error['msg']}")
Examples
Invoice Extraction
from pydantic import BaseModel, Field
from typing import List, Optional

from docling.document_extractor import DocumentExtractor

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str = Field(description="Invoice ID or number")
    invoice_date: str = Field(description="Invoice issue date")
    due_date: Optional[str] = Field(description="Payment due date")
    vendor_name: str
    vendor_address: Optional[str]
    customer_name: str
    line_items: List[LineItem]
    subtotal: float
    tax: float
    total_amount: float

extractor = DocumentExtractor()
result = extractor.extract(
    source="invoice.pdf",
    template=Invoice,
)

invoice_data = Invoice(**result.pages[0].extracted_data)
print(f"Invoice #{invoice_data.invoice_number}")
print(f"Total: ${invoice_data.total_amount}")
Form Processing
class ApplicationForm(BaseModel):
    applicant_name: str
    date_of_birth: str
    ssn: Optional[str] = Field(description="Social Security Number")
    address: str
    employment_status: str
    annual_income: float
    signature_present: bool = Field(
        description="Whether signature is present on form"
    )

result = extractor.extract(
    source="application.pdf",
    template=ApplicationForm,
)
Multi-Page Document
# Extract from each page separately
result = extractor.extract(
source = "multi_page_report.pdf" ,
template = SectionTemplate
)
for page in result.pages:
print ( f " \n Page { page.page_no } :" )
print (page.extracted_data)
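When the same template is applied to every page, you often want a single document-level view afterwards. A sketch of one way to merge per-page dicts (a hypothetical post-processing step, not part of Docling's API — pass in `[p.extracted_data for p in result.pages]`):

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional

def merge_pages(
    page_data: List[Optional[Dict[str, Any]]],
) -> Dict[str, List[Any]]:
    """Combine per-page extraction results into one mapping from field
    name to the list of values seen across pages (empty pages skipped)."""
    merged: Dict[str, List[Any]] = defaultdict(list)
    for data in page_data:
        for key, value in (data or {}).items():
            merged[key].append(value)
    return dict(merged)
```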
Limitations
Currently experimental API - subject to change
Only PDF and image formats supported
Extraction quality depends on VLM model capabilities
Complex tables may require specialized table extraction
Performance varies with document complexity and template structure
See Also
VLM Pipeline: learn about vision-language model pipelines
Model Catalog: available VLM models for extraction
GPU Acceleration: speed up extraction with GPU support
Pipeline Options: configure extraction pipelines