Overview

PDFs are the most complex document format Docling handles. This guide covers PDF-specific features:
  • Table structure extraction: Detect and reconstruct complex tables
  • OCR integration: Extract text from scanned pages
  • Layout analysis: Understand document structure with deep learning models
  • Code and formula enrichment: Specialized extraction for technical content
  • PDF backends: Choose between different parsing engines

Table Structure Extraction

Docling uses the TableFormer model to reconstruct the row and column structure of the tables found during layout analysis:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    TableFormerMode,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_table_structure=True,  # Enable table extraction
)

# Configure table extraction mode
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True,      # Match cells to PDF content
    mode=TableFormerMode.ACCURATE,  # Use accurate mode (slower but better)
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("document_with_tables.pdf")

Table Extraction Modes
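
The two modes used in this guide are TableFormerMode.ACCURATE (slower, better on complex tables) and TableFormerMode.FAST (quicker, suited to simple tables). A minimal sketch of switching between them follows. The helper names choose_table_mode and build_table_options and the complex_tables flag are illustrative assumptions, not part of Docling.

```python
def choose_table_mode(complex_tables: bool) -> str:
    """Return the name of the TableFormerMode member to use (hypothetical helper)."""
    return "ACCURATE" if complex_tables else "FAST"


def build_table_options(complex_tables: bool):
    """Build TableStructureOptions for the chosen mode (hypothetical helper)."""
    # Imported lazily so choose_table_mode() works without Docling installed.
    from docling.datamodel.pipeline_options import (
        TableFormerMode,
        TableStructureOptions,
    )

    # Look the enum member up by name, e.g. TableFormerMode["FAST"].
    mode = TableFormerMode[choose_table_mode(complex_tables)]
    return TableStructureOptions(mode=mode, do_cell_matching=True)
```

Assign the result of build_table_options(...) to pipeline_options.table_structure_options before creating the converter.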

Cell Matching

Control how table cells are matched to PDF content:
from docling.datamodel.pipeline_options import TableStructureOptions

# Match cells to PDF text (default, recommended)
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True  # Use text from PDF
)

# Use predicted cell content from model
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=False  # Use model predictions
)
If extracted tables have columns erroneously merged, try setting do_cell_matching=False to use the model’s predicted cell boundaries instead of matching to PDF cells.

Layout Analysis

Docling uses deep learning models to understand document structure:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    LayoutOptions,
)
from docling.datamodel.layout_model_specs import (
    DOCLING_LAYOUT_HERON,      # Default, balanced
    DOCLING_LAYOUT_EGRET_LARGE,    # Higher accuracy
    DOCLING_LAYOUT_EGRET_XLARGE,   # Highest accuracy
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()

pipeline_options.layout_options = LayoutOptions(
    model_spec=DOCLING_LAYOUT_HERON,  # Choose layout model
    create_orphan_clusters=True,      # Group isolated elements
    keep_empty_clusters=False,        # Remove empty regions
    skip_cell_assignment=False,       # Assign cells to tables
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

Available Layout Models

| Model | Accuracy | Speed | Memory | Use Case |
|---|---|---|---|---|
| DOCLING_LAYOUT_HERON | Good | Fast | Low | Default, general purpose |
| DOCLING_LAYOUT_EGRET_LARGE | Better | Slower | Medium | Complex layouts |
| DOCLING_LAYOUT_EGRET_XLARGE | Best | Slowest | High | Maximum accuracy needed |

PDF Backend Selection

Choose between different PDF parsing backends:
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=DoclingParseDocumentBackend
        )
    }
)
Features of DoclingParseDocumentBackend:
  • Advanced layout analysis
  • Best table detection
  • Complex document handling
  • Recommended for most use cases

Force Backend Text

Bypass layout model predictions and use native PDF text:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    force_backend_text=True  # Use PDF's embedded text directly
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
force_backend_text=True is useful for PDFs with reliable programmatic text layers. It’s faster but bypasses layout model benefits like reading order correction.

Code Enrichment

Extract and enhance code blocks with language detection:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    CodeFormulaVlmOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_code_enrichment=True,  # Enable code extraction
)

# Optional: select a specific preset before building the converter.
# Skip this assignment to keep the default code/formula model.
pipeline_options.code_formula_options = CodeFormulaVlmOptions.from_preset(
    "codeformulav2"
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

Using Code Extraction Results

from docling_core.types.doc import CodeItem

result = converter.convert("technical_paper.pdf")

for item, level in result.document.iterate_items():
    if isinstance(item, CodeItem):
        print(f"Language: {item.code_language}")
        print(f"Code: {item.text}")
        print("---")
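
When a document mixes several languages, it can help to bucket the extracted snippets before writing them out. The sketch below groups items by their code_language value. The helper name group_code_by_language is hypothetical, and the stand-in objects only mimic the two attributes used above (code_language and text).

```python
from collections import defaultdict
from types import SimpleNamespace


def group_code_by_language(items):
    """Group code snippets by detected language.

    `items` is any iterable of objects exposing `code_language` and `text`,
    such as the CodeItem instances yielded by iterate_items().
    """
    grouped = defaultdict(list)
    for item in items:
        lang = item.code_language
        # code_language may be an enum or a plain string; normalize to a string key.
        key = str(getattr(lang, "value", lang))
        grouped[key].append(item.text)
    return dict(grouped)


# Stand-in objects so the sketch runs without converting a real PDF.
samples = [
    SimpleNamespace(code_language="Python", text="print('hi')"),
    SimpleNamespace(code_language="C", text="int main(void) { return 0; }"),
    SimpleNamespace(code_language="Python", text="x = 1"),
]
by_language = group_code_by_language(samples)
```

With a real conversion, pass a generator such as (item for item, _ in result.document.iterate_items() if isinstance(item, CodeItem)).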

Formula Enrichment

Extract mathematical formulas and convert to LaTeX:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    CodeFormulaVlmOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_formula_enrichment=True,  # Enable formula extraction
)

# Configure formula extraction model
pipeline_options.code_formula_options = CodeFormulaVlmOptions.from_preset(
    "codeformulav2"
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("math_paper.pdf")

# Formulas are automatically rendered in HTML export
html = result.document.export_to_html()
# HTML includes MathML rendering of formulas
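
To work with the LaTeX strings directly rather than through the HTML export, you can filter the document items by label. This is a sketch under two assumptions that are not stated in this guide: that formula items carry the label value "formula", and that the enriched LaTeX is stored in item.text. The helper name collect_formulas is hypothetical.

```python
from types import SimpleNamespace


def collect_formulas(items_with_levels):
    """Return the text of every item labelled as a formula.

    Expects (item, level) pairs as yielded by iterate_items(). Each item is
    assumed to expose `label` and `text`; the "formula" label value is an
    assumption about Docling's labels.
    """
    formulas = []
    for item, _level in items_with_levels:
        label = getattr(item, "label", None)
        # The label may be an enum or a plain string; compare on its string value.
        if str(getattr(label, "value", label)) == "formula":
            formulas.append(item.text)
    return formulas


# Stand-in items so the sketch runs without a converted document.
fake_items = [
    (SimpleNamespace(label="text", text="Consider the identity"), 1),
    (SimpleNamespace(label="formula", text=r"e^{i\pi} + 1 = 0"), 1),
]
latex_strings = collect_formulas(fake_items)
```

With a real conversion, call collect_formulas(result.document.iterate_items()).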

Page Image Generation

Render PDF pages as images:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    generate_page_images=True,  # Render pages as PNG
    images_scale=2.0,           # Scale factor (2.0 = 144 DPI)
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("document.pdf")

# Access page images (document.pages maps page numbers to page objects)
for page_no, page in result.document.pages.items():
    if page.image:
        page.image.pil_image.save(f"page_{page_no}.png")
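
The images_scale comment above implies a 72 DPI baseline (one pixel per PDF point at scale 1.0). The arithmetic can be sketched as follows, assuming rendering simply multiplies page dimensions in points by the scale factor. The function names are illustrative.

```python
BASE_DPI = 72.0  # PDF user space uses 72 points per inch


def scale_to_dpi(images_scale: float) -> float:
    """Effective resolution for a given images_scale value."""
    return BASE_DPI * images_scale


def rendered_size_px(width_pt: float, height_pt: float, images_scale: float):
    """Approximate pixel size of a rendered page (assumes linear scaling)."""
    return round(width_pt * images_scale), round(height_pt * images_scale)


dpi = scale_to_dpi(2.0)                      # 144.0
letter_px = rendered_size_px(612, 792, 2.0)  # US Letter: (1224, 1584)
```

Doubling images_scale quadruples the pixel count per page, so memory use grows quickly at higher scales.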

Picture Extraction

Extract embedded images from PDFs:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import PictureItem

pipeline_options = PdfPipelineOptions(
    generate_picture_images=True,  # Extract embedded images
    images_scale=2.0,              # Image resolution scale
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("document.pdf")

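# Save extracted pictures, numbering them in document order
picture_count = 0
for item, level in result.document.iterate_items():
    if isinstance(item, PictureItem):
        image = item.get_image(result.document)  # PIL image, or None if unavailable
        if image is not None:
            picture_count += 1
            image.save(f"picture_{picture_count}.png")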
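
If you want filenames that trace back to an item's reference, note that a reference string may contain characters that are awkward in file paths. The sketch below assumes self_ref values look like JSON-pointer strings such as "#/pictures/0" and flattens them into a safe name. The helper ref_to_filename is hypothetical.

```python
import re


def ref_to_filename(self_ref: str, suffix: str = ".png") -> str:
    """Turn a reference like "#/pictures/0" into a flat, filesystem-safe name."""
    # Strip leading "#" and "/" characters, then replace every run of
    # characters outside [A-Za-z0-9._-] with a single underscore.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", self_ref.lstrip("#/"))
    return f"{stem}{suffix}"


name = ref_to_filename("#/pictures/0")  # "pictures_0.png"
```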

Encrypted PDFs

Handle password-protected PDFs:
from pydantic import SecretStr
from docling.datamodel.backend_options import PdfBackendOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

backend_options = PdfBackendOptions(
    password=SecretStr("your-password-here")
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(backend_options=backend_options)
    }
)

result = converter.convert("encrypted.pdf")
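
Hard-coding a password in source is rarely desirable. One option is to read it from the environment at runtime, as sketched below. The variable name DOCLING_PDF_PASSWORD and both helper names are assumptions made for this example, not Docling conventions.

```python
import os


def load_pdf_password(env_var: str = "DOCLING_PDF_PASSWORD") -> str:
    """Read the PDF password from an environment variable (hypothetical name)."""
    password = os.environ.get(env_var)
    if not password:
        raise RuntimeError(f"Set {env_var} before converting encrypted PDFs")
    return password


def build_backend_options(env_var: str = "DOCLING_PDF_PASSWORD"):
    """Wrap the password in SecretStr and build PdfBackendOptions."""
    # Imported lazily so load_pdf_password() can be used on its own.
    from pydantic import SecretStr
    from docling.datamodel.backend_options import PdfBackendOptions

    return PdfBackendOptions(password=SecretStr(load_pdf_password(env_var)))
```

Pass the result of build_backend_options() as backend_options to PdfFormatOption, as in the example above.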

Batch Size Configuration

Optimize batch processing for PDFs:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    # Batch sizes for different stages
    ocr_batch_size=4,           # Pages per OCR batch
    layout_batch_size=8,        # Pages per layout batch
    table_batch_size=4,         # Pages per table extraction batch
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
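
Batch sizes usually depend on the hardware you run on. The sketch below keeps the GPU and CPU values suggested in the Performance Tips section of this guide in one lookup table. The names BATCH_PRESETS, batch_sizes_for, and build_pipeline_options are illustrative, and the numbers are starting points rather than tuned values.

```python
# Starting values taken from the Performance Tips section of this guide.
BATCH_PRESETS = {
    "gpu": {"ocr_batch_size": 8, "layout_batch_size": 16},
    "cpu": {"ocr_batch_size": 2, "layout_batch_size": 4},
}


def batch_sizes_for(device: str) -> dict:
    """Return a copy of the batch-size preset for "gpu" or "cpu"."""
    try:
        return dict(BATCH_PRESETS[device.lower()])
    except KeyError:
        raise ValueError(f"Unknown device {device!r}; expected 'gpu' or 'cpu'")


def build_pipeline_options(device: str):
    """Create PdfPipelineOptions with the preset applied."""
    # Imported lazily so batch_sizes_for() works without Docling installed.
    from docling.datamodel.pipeline_options import PdfPipelineOptions

    return PdfPipelineOptions(**batch_sizes_for(device))
```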

Complete Configuration Example

Putting it all together:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    TableFormerMode,
    LayoutOptions,
    EasyOcrOptions,
)
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_HERON
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure all PDF processing options
pipeline_options = PdfPipelineOptions(
    # Core features
    do_ocr=True,
    do_table_structure=True,
    do_code_enrichment=True,
    do_formula_enrichment=True,
    
    # Image generation
    generate_page_images=True,
    generate_picture_images=True,
    images_scale=2.0,
    
    # Performance
    document_timeout=120.0,
    ocr_batch_size=4,
    layout_batch_size=8,
)

# Configure table extraction
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True,
    mode=TableFormerMode.ACCURATE,
)

# Configure layout analysis
pipeline_options.layout_options = LayoutOptions(
    model_spec=DOCLING_LAYOUT_HERON,
    create_orphan_clusters=True,
)

# Configure OCR
pipeline_options.ocr_options = EasyOcrOptions(
    lang=["en", "fr", "de"],
    confidence_threshold=0.5,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=DoclingParseDocumentBackend,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("complex_document.pdf")

Performance Tips

1. Disable unnecessary features

Only enable what you need:
pipeline_options = PdfPipelineOptions(
    do_ocr=False,              # Disable for PDFs with text layer
    do_table_structure=True,   # Only if extracting tables
    do_code_enrichment=False,  # Only for technical docs
    do_formula_enrichment=False,  # Only for scientific docs
)

2. Use appropriate table mode

Balance speed vs. accuracy:
# Fast mode for simple tables
pipeline_options.table_structure_options.mode = TableFormerMode.FAST

# Accurate mode for complex tables
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

3. Optimize batch sizes

Adjust based on your hardware:
# For GPU systems
pipeline_options.ocr_batch_size = 8
pipeline_options.layout_batch_size = 16

# For CPU systems
pipeline_options.ocr_batch_size = 2
pipeline_options.layout_batch_size = 4

4. Set timeouts

Prevent runaway conversions:
pipeline_options.document_timeout = 120.0  # 2 minutes max
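
The first tip can be sketched as a small helper that derives the feature flags from what you know about the input. The function name feature_flags and its keyword arguments are hypothetical. The returned keys are the PdfPipelineOptions fields used throughout this guide.

```python
def feature_flags(
    *, scanned: bool, has_tables: bool, has_code: bool, has_formulas: bool
) -> dict:
    """Map document traits to PdfPipelineOptions keyword arguments."""
    return {
        "do_ocr": scanned,                      # OCR only when there is no text layer
        "do_table_structure": has_tables,       # table reconstruction only if needed
        "do_code_enrichment": has_code,         # technical documents
        "do_formula_enrichment": has_formulas,  # scientific documents
    }


# A born-digital report with tables but no code or formulas.
flags = feature_flags(
    scanned=False, has_tables=True, has_code=False, has_formulas=False
)
```

The dictionary can then be unpacked into the options object, for example PdfPipelineOptions(**flags, document_timeout=120.0).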

Next Steps

  • OCR Configuration: Configure OCR engines for scanned PDFs
  • VLM Models: Use vision-language models for PDF understanding
  • Batch Processing: Optimize PDF batch processing workflows
  • Export Formats: Export PDFs to Markdown, HTML, JSON, and more
