Overview
PDFs are the most complex document format Docling handles. This guide covers PDF-specific features:
Table structure extraction: Detect and reconstruct complex tables
OCR integration: Extract text from scanned pages
Layout analysis: Understand document structure with deep learning models
Code and formula enrichment: Specialized extraction for technical content
PDF backends: Choose between different parsing engines
Table Structure Extraction
Docling uses the TableFormer model to detect tables and reconstruct their structure:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    TableFormerMode,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions(
    do_table_structure=True,  # Enable table extraction
)
# Configure table extraction mode
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True,  # Match cells to PDF content
    mode=TableFormerMode.ACCURATE,  # Use accurate mode (slower but better)
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("document_with_tables.pdf")
Accurate (Recommended)
from docling.datamodel.pipeline_options import TableFormerMode
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
Use when:
Tables have complex layouts (merged cells, nested headers)
Accuracy is more important than speed
Processing production documents
Trade-offs:
Slower processing (~2-3x vs. FAST mode)
Higher quality results
Fast
from docling.datamodel.pipeline_options import TableFormerMode
pipeline_options.table_structure_options.mode = TableFormerMode.FAST
Use when:
Tables have simple, regular structures
Processing large batches where speed matters
Previewing or prototyping
Trade-offs:
Faster processing
May miss complex table structures
Cell Matching
Control how table cells are matched to PDF content:
from docling.datamodel.pipeline_options import TableStructureOptions
# Match cells to PDF text (default, recommended)
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True  # Use text from PDF
)
# Use the model's predicted cells instead
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=False  # Use model predictions
)
If extracted tables have columns erroneously merged, try setting do_cell_matching=False to use the model’s predicted cell boundaries instead of matching to PDF cells.
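To make the idea of "matching cells to PDF content" more concrete, here is a small conceptual sketch. It is not Docling's actual algorithm: the `Box` class and `match_words_to_cells` function are hypothetical names, and the rule used (assign each PDF text span to the predicted cell that contains its center point) is just one simple way such matching could work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def match_words_to_cells(words, cells):
    """Assign each (text, Box) span to the first predicted cell containing its center."""
    matched = {index: [] for index in range(len(cells))}
    for text, box in words:
        center_x = (box.left + box.right) / 2
        center_y = (box.top + box.bottom) / 2
        for index, cell in enumerate(cells):
            if cell.contains(center_x, center_y):
                matched[index].append(text)
                break  # each span goes to at most one cell
    return {index: " ".join(texts) for index, texts in matched.items()}


# Two predicted cells side by side, and three text spans from the PDF layer.
cells = [Box(0, 0, 50, 20), Box(50, 0, 100, 20)]
words = [
    ("Total", Box(5, 5, 30, 15)),
    ("42", Box(60, 5, 75, 15)),
    ("USD", Box(78, 5, 95, 15)),
]
print(match_words_to_cells(words, cells))  # {0: 'Total', 1: '42 USD'}
```

In this toy model, a single PDF text span that straddles two predicted columns can only land in one of them, which is roughly the kind of situation where trying do_cell_matching=False may help.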
Layout Analysis
Docling uses deep learning models to understand document structure:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    LayoutOptions,
)
from docling.datamodel.layout_model_specs import (
    DOCLING_LAYOUT_HERON,  # Default, balanced
    DOCLING_LAYOUT_EGRET_LARGE,  # Higher accuracy
    DOCLING_LAYOUT_EGRET_XLARGE,  # Highest accuracy
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions()
pipeline_options.layout_options = LayoutOptions(
    model_spec=DOCLING_LAYOUT_HERON,  # Choose layout model
    create_orphan_clusters=True,  # Group isolated elements
    keep_empty_clusters=False,  # Remove empty regions
    skip_cell_assignment=False,  # Assign cells to tables
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
Available Layout Models
| Model | Accuracy | Speed | Memory | Use Case |
| --- | --- | --- | --- | --- |
| DOCLING_LAYOUT_HERON | Good | Fast | Low | Default, general purpose |
| DOCLING_LAYOUT_EGRET_LARGE | Better | Slower | Medium | Complex layouts |
| DOCLING_LAYOUT_EGRET_XLARGE | Best | Slowest | High | Maximum accuracy needed |
PDF Backend Selection
Choose between different PDF parsing backends:
Docling Parse (Default)
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=DoclingParseDocumentBackend
        )
    }
)
Features:
Advanced layout analysis
Best table detection
Complex document handling
Recommended for most use cases
PyPDFium2
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend
        )
    }
)
Features:
Fast text extraction
Lower memory usage
Good for simple PDFs with embedded text
Use when speed is critical
Force Backend Text
Bypass layout model predictions and use native PDF text:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions(
    force_backend_text=True  # Use PDF's embedded text directly
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
force_backend_text=True is useful for PDFs with reliable programmatic text layers. It’s faster but bypasses layout model benefits like reading order correction.
Code Enrichment
Extract and enhance code blocks with language detection:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    CodeFormulaVlmOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions(
    do_code_enrichment=True,  # Enable code extraction
)
# Use default code/formula model
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
# Or use a specific preset
pipeline_options.code_formula_options = CodeFormulaVlmOptions.from_preset(
    "codeformulav2"
)
After conversion, iterate over the document to read the detected code blocks:
from docling_core.types.doc import CodeItem
result = converter.convert("technical_paper.pdf")
for item, level in result.document.iterate_items():
    if isinstance(item, CodeItem):
        print(f"Language: {item.code_language}")
        print(f"Code: {item.text}")
        print("---")
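When a document contains many code blocks, a per-language summary can be more useful than printing each one. The sketch below is a hypothetical helper (`count_code_languages` is not part of Docling) that works on any objects exposing a `code_language` attribute; the sample items are stand-ins built with `SimpleNamespace` so the snippet runs without converting a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_code_languages(items):
    """Count items per detected language, skipping items without a code_language."""
    counts = Counter()
    for item in items:
        language = getattr(item, "code_language", None)
        if language is None:
            continue
        # Accept either plain strings or enum-like values with a .value attribute.
        counts[str(getattr(language, "value", language))] += 1
    return counts


# Stand-in items; with Docling you would pass the items yielded by iterate_items().
sample_items = [
    SimpleNamespace(code_language="Python", text="print('hi')"),
    SimpleNamespace(code_language="Python", text="x = 1"),
    SimpleNamespace(code_language="Bash", text="ls -la"),
    SimpleNamespace(text="a paragraph, not code"),
]
print(count_code_languages(sample_items).most_common())
# [('Python', 2), ('Bash', 1)]
```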
Formula Enrichment
Extract mathematical formulas and convert them to LaTeX:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    CodeFormulaVlmOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions(
    do_formula_enrichment=True,  # Enable formula extraction
)
# Configure formula extraction model
pipeline_options.code_formula_options = CodeFormulaVlmOptions.from_preset(
    "codeformulav2"
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("math_paper.pdf")
# Formulas are automatically rendered in HTML export
html = result.document.export_to_html()
# HTML includes MathML rendering of formulas
Page Image Generation
Render PDF pages as images:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions(
    generate_page_images=True,  # Render pages as PNG
    images_scale=2.0,  # Scale factor (2.0 = 144 DPI)
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("document.pdf")
# Access page images (pages is a dict keyed by page number)
for page_no, page in result.document.pages.items():
    if page.image:
        page.image.pil_image.save(f"page_{page_no}.png")
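The relationship between images_scale and resolution is simple arithmetic: PDF page coordinates default to 72 units per inch, so a scale of 2.0 corresponds to 144 DPI. The helpers below are hypothetical (not part of Docling) and assume the rendered pixel size is the page size in points multiplied by the scale.

```python
PDF_POINTS_PER_INCH = 72  # default PDF user-space unit: 1/72 inch


def scale_to_dpi(images_scale: float) -> float:
    """Convert a scale factor into the equivalent dots per inch."""
    return images_scale * PDF_POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, images_scale: float) -> tuple[int, int]:
    """Estimate the pixel size of a page rendered at the given scale."""
    return round(width_pt * images_scale), round(height_pt * images_scale)


print(scale_to_dpi(2.0))             # 144.0
print(rendered_size(612, 792, 2.0))  # (1224, 1584) for a US Letter page
```

Doubling the scale quadruples the pixel count, so memory use and file size grow much faster than the scale factor itself.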
Extract embedded images from PDFs:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import PictureItem
pipeline_options = PdfPipelineOptions(
    generate_picture_images=True,  # Extract embedded images
    images_scale=2.0,  # Image resolution scale
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("document.pdf")
# Save extracted pictures, numbering them in document order
picture_counter = 0
for item, level in result.document.iterate_items():
    if isinstance(item, PictureItem):
        image = item.get_image(result.document)  # PIL image, or None if unavailable
        if image is not None:
            picture_counter += 1
            image.save(f"picture_{picture_counter}.png")
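If you want filenames derived from an item's self_ref rather than a counter, note that these references are path-like strings (assumed here to look like "#/pictures/0"), so they contain characters that are unsafe in filenames. The following is a minimal sketch of a sanitizer; `ref_to_filename` is a hypothetical helper, not a Docling function.

```python
import re


def ref_to_filename(self_ref: str, extension: str = "png") -> str:
    """Turn a reference such as '#/pictures/0' into a filesystem-safe filename."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", self_ref).strip("_")
    return f"{stem or 'item'}.{extension}"


print(ref_to_filename("#/pictures/0"))  # pictures_0.png
```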
Encrypted PDFs
Handle password-protected PDFs:
from pydantic import SecretStr
from docling.datamodel.backend_options import PdfBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
backend_options = PdfBackendOptions(
    password=SecretStr("your-password-here")
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(backend_options=backend_options)
    }
)
result = converter.convert("encrypted.pdf")
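Hard-coding a password in source is rarely desirable outside of a quick test. One common alternative is to read it from the environment. The sketch below assumes an environment variable named DOCLING_PDF_PASSWORD, which is a made-up name for illustration, and `load_pdf_password` is a hypothetical helper.

```python
import os


def load_pdf_password(env=None, variable: str = "DOCLING_PDF_PASSWORD") -> str:
    """Read the PDF password from the environment, failing loudly if it is missing."""
    env = os.environ if env is None else env
    password = env.get(variable)
    if not password:
        raise RuntimeError(f"Set {variable} before converting encrypted PDFs")
    return password


# Demonstration with an explicit mapping instead of the real environment.
print(len(load_pdf_password({"DOCLING_PDF_PASSWORD": "s3cret"})))  # 6
```

The returned string can then be wrapped as SecretStr(load_pdf_password()) when building the backend options, so the value never appears in the source file.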
Batch Size Configuration
Optimize batch processing for PDFs:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions(
    # Batch sizes for different stages
    ocr_batch_size=4,  # Pages per OCR batch
    layout_batch_size=8,  # Pages per layout batch
    table_batch_size=4,  # Pages per table extraction batch
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
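To illustrate what "pages per batch" means in practice, here is a small standalone sketch of how a list of pages is split into batches of a given size. It is a conceptual illustration only and does not reproduce Docling's internal scheduling; `batched` is a hypothetical helper.

```python
from itertools import islice


def batched(pages, batch_size: int):
    """Yield consecutive lists of at most batch_size pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(pages)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# A 10-page document with a batch size of 4 is processed in three batches.
print([len(batch) for batch in batched(range(1, 11), 4)])  # [4, 4, 2]
```

Larger batches mean fewer model invocations but more memory held at once, which is why the best value depends on the hardware.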
Complete Configuration Example
Putting it all together:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    TableFormerMode,
    LayoutOptions,
    EasyOcrOptions,
)
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_HERON
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.document_converter import DocumentConverter, PdfFormatOption
# Configure all PDF processing options
pipeline_options = PdfPipelineOptions(
    # Core features
    do_ocr=True,
    do_table_structure=True,
    do_code_enrichment=True,
    do_formula_enrichment=True,
    # Image generation
    generate_page_images=True,
    generate_picture_images=True,
    images_scale=2.0,
    # Performance
    document_timeout=120.0,
    ocr_batch_size=4,
    layout_batch_size=8,
)
# Configure table extraction
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True,
    mode=TableFormerMode.ACCURATE,
)
# Configure layout analysis
pipeline_options.layout_options = LayoutOptions(
    model_spec=DOCLING_LAYOUT_HERON,
    create_orphan_clusters=True,
)
# Configure OCR
pipeline_options.ocr_options = EasyOcrOptions(
    lang=["en", "fr", "de"],
    confidence_threshold=0.5,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=DoclingParseDocumentBackend,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("complex_document.pdf")
Disable unnecessary features
Only enable what you need:
pipeline_options = PdfPipelineOptions(
    do_ocr=False,  # Disable for PDFs with text layer
    do_table_structure=True,  # Only if extracting tables
    do_code_enrichment=False,  # Only for technical docs
    do_formula_enrichment=False,  # Only for scientific docs
)
Use appropriate table mode
Balance speed vs. accuracy:
# Fast mode for simple tables
pipeline_options.table_structure_options.mode = TableFormerMode.FAST
# Accurate mode for complex tables
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
Optimize batch sizes
Adjust based on your hardware:
# For GPU systems
pipeline_options.ocr_batch_size = 8
pipeline_options.layout_batch_size = 16
# For CPU systems
pipeline_options.ocr_batch_size = 2
pipeline_options.layout_batch_size = 4
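The GPU and CPU values above can be kept in one place and applied programmatically. This is a minimal sketch using the numbers from this guide as starting points; `BATCH_PRESETS`, `batch_settings`, and `apply_settings` are hypothetical names, and the `SimpleNamespace` object stands in for a real options object so the snippet runs on its own.

```python
from types import SimpleNamespace

# Starting points taken from this guide; tune them for your own hardware.
BATCH_PRESETS = {
    "gpu": {"ocr_batch_size": 8, "layout_batch_size": 16},
    "cpu": {"ocr_batch_size": 2, "layout_batch_size": 4},
}


def batch_settings(has_gpu: bool) -> dict:
    """Return a copy of the preset matching the available hardware."""
    return dict(BATCH_PRESETS["gpu" if has_gpu else "cpu"])


def apply_settings(options, settings: dict):
    """Set each batch size as an attribute on the given options object."""
    for name, value in settings.items():
        setattr(options, name, value)
    return options


# Stand-in for a pipeline options object.
options = apply_settings(SimpleNamespace(), batch_settings(has_gpu=False))
print(options.ocr_batch_size, options.layout_batch_size)  # 2 4
```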
Set timeouts
Prevent runaway conversions:
pipeline_options.document_timeout = 120.0  # 2 minutes max
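Conceptually, a document timeout is a time budget checked between units of work: once the budget is exceeded, processing stops and whatever was finished so far is returned. The sketch below illustrates that idea in plain Python. It is not how Docling implements document_timeout; `process_with_timeout` and the injectable `clock` parameter are assumptions made so the example is deterministic.

```python
import time


def process_with_timeout(pages, process_page, timeout_seconds, clock=time.monotonic):
    """Process pages in order, stopping early once the time budget is exceeded."""
    started = clock()
    results = []
    for page in pages:
        if clock() - started > timeout_seconds:
            return results, "partial"
        results.append(process_page(page))
    return results, "complete"


# A fake clock simulates a slow third page without actually waiting.
ticks = iter([0, 1, 2, 200])
outcome = process_with_timeout(
    pages=[1, 2, 3, 4],
    process_page=lambda page: page * 2,
    timeout_seconds=120.0,
    clock=lambda: next(ticks),
)
print(outcome)  # ([2, 4], 'partial')
```

When a timeout is set, it is worth checking the status of the conversion result before using the output, since a timed-out document may contain only some of its pages.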
Next Steps
OCR Configuration: Configure OCR engines for scanned PDFs
VLM Models: Use vision-language models for PDF understanding
Batch Processing: Optimize PDF batch processing workflows
Export Formats: Export PDFs to Markdown, HTML, JSON, and more