Enable full-page OCR for scanned documents or when layout extraction is unreliable.

Overview

This example demonstrates:
  • Forcing full-page OCR processing
  • Switching between OCR backends (EasyOCR, Tesseract, macOS OCR, RapidOCR)
  • Enabling table structure extraction with OCR
  • Processing scanned PDFs

Basic Full-Page OCR

full_page_ocr.py
from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    TesseractCliOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc_path = Path("tests/data/pdf/2206.01062.pdf")

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True
)

# Force full-page OCR with Tesseract CLI
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)

doc = converter.convert(input_doc_path).document
print(doc.export_to_markdown())
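Note: force_full_page_ocr=True runs OCR over the whole of every page and ignores any embedded text layer. This is slower than the default hybrid mode, which applies OCR only to regions without usable text, but it is more reliable for scanned documents.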

OCR Backend Options

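from docling.datamodel.pipeline_options import EasyOcrOptions

# Each backend has its own options class; assign an instance to
# pipeline_options.ocr_options to switch backends. EasyOCR is shown here.
ocr_options = EasyOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options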
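
The other backends listed in the overview follow the same pattern: each has its own options class that is assigned to pipeline_options.ocr_options. The sketch below looks up the class by a short name. The short names and both helper functions are hypothetical; the class names (TesseractCliOcrOptions, TesseractOcrOptions, OcrMacOptions, RapidOcrOptions) are assumed to match what docling.datamodel.pipeline_options exports in your installed version, so verify them before relying on this.

```python
import importlib

# Hypothetical short names mapped to option class names that are ASSUMED
# to be exported by docling.datamodel.pipeline_options.
OCR_OPTION_CLASSES = {
    "easyocr": "EasyOcrOptions",
    "tesseract_cli": "TesseractCliOcrOptions",
    "tesseract": "TesseractOcrOptions",
    "ocrmac": "OcrMacOptions",
    "rapidocr": "RapidOcrOptions",
}


def ocr_class_name(backend: str) -> str:
    """Return the options class name for a backend short name."""
    try:
        return OCR_OPTION_CLASSES[backend.lower()]
    except KeyError:
        known = ", ".join(sorted(OCR_OPTION_CLASSES))
        raise ValueError(
            f"Unknown OCR backend {backend!r}; expected one of: {known}"
        ) from None


def make_ocr_options(backend: str, force_full_page_ocr: bool = True):
    """Instantiate the options class for the given backend.

    Docling is imported lazily, so the name lookup above works without it.
    """
    module = importlib.import_module("docling.datamodel.pipeline_options")
    options_cls = getattr(module, ocr_class_name(backend))
    return options_cls(force_full_page_ocr=force_full_page_ocr)
```

With this in place, switching backends becomes a one-line change, for example pipeline_options.ocr_options = make_ocr_options("rapidocr").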

When to Use Full-Page OCR

Use force_full_page_ocr=True when:
  • The PDF is a scan with no text layer
  • Layout extraction produces poor results
  • The document contains handwritten content
  • The native PDF text is corrupt or unreliable
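
The first and last conditions can be checked roughly by inspecting whatever text a native PDF extractor returns for a page. The sketch below is a hypothetical heuristic, not part of Docling: the function name, the thresholds, and the choice to treat a high share of replacement or control characters as "corrupt" are all assumptions to tune on your own documents.

```python
def needs_full_page_ocr(
    page_text: str,
    min_chars: int = 20,         # assumed threshold: fewer chars means "no text layer"
    max_bad_ratio: float = 0.2,  # assumed threshold: share of garbage characters
) -> bool:
    """Guess whether a page's native text is missing or unreliable."""
    text = page_text.strip()
    if len(text) < min_chars:
        # Effectively no text layer, as in a scanned page.
        return True
    # Count Unicode replacement characters and non-printable control characters.
    bad = sum(
        1
        for ch in text
        if ch == "\ufffd" or (not ch.isprintable() and not ch.isspace())
    )
    return bad / len(text) > max_bad_ratio
```

A caller could run this on a sample of pages and enable force_full_page_ocr only when most of them return True.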

1. Enable OCR

Set pipeline_options.do_ocr = True.

2. Choose OCR Backend

Select an OCR options class and set force_full_page_ocr=True.

3. Configure Table Extraction

Enable table structure recognition for better results.

4. Convert Document

Process the PDF with full-page OCR enabled.

Complete Example

from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    EasyOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

def convert_with_full_ocr(pdf_path: Path, language: str = "en"):
    """Convert PDF using full-page OCR.
    
    Args:
        pdf_path: Path to PDF file
        language: EasyOCR language code (e.g., 'en', 'es', 'de')
    """
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options = TableStructureOptions(
        do_cell_matching=True
    )
    
    # Use EasyOCR with full-page mode
    ocr_options = EasyOcrOptions(
        force_full_page_ocr=True,
    )
    ocr_options.lang = [language]
    pipeline_options.ocr_options = ocr_options
    
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )
    
    result = converter.convert(pdf_path)
    return result.document

if __name__ == "__main__":
    pdf_path = Path("scanned_document.pdf")
    doc = convert_with_full_ocr(pdf_path, language="en")

    # Export once, then print and save the same Markdown
    markdown = doc.export_to_markdown()
    print(markdown)

    # Save to file with an explicit encoding so non-ASCII text survives
    Path("output.md").write_text(markdown, encoding="utf-8")
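
The language argument above is passed to EasyOCR unchanged. If you swap in a Tesseract backend, note that Tesseract identifies languages by three-letter codes such as eng, spa, and deu rather than en, es, and de. The mapping below is a small hypothetical helper that covers only a few languages; extend it for the ones you need.

```python
from typing import List

# Hypothetical helper: two-letter codes (as used in the example above)
# mapped to the three-letter codes Tesseract uses for its language data.
EASYOCR_TO_TESSERACT = {
    "en": "eng",
    "es": "spa",
    "de": "deu",
    "fr": "fra",
    "it": "ita",
}


def to_tesseract_langs(langs: List[str]) -> List[str]:
    """Translate two-letter language codes to Tesseract codes."""
    missing = [code for code in langs if code not in EASYOCR_TO_TESSERACT]
    if missing:
        raise ValueError(f"No Tesseract code known for: {', '.join(missing)}")
    return [EASYOCR_TO_TESSERACT[code] for code in langs]


# Usage sketch (assumes TesseractCliOcrOptions accepts a `lang` list,
# like the EasyOCR options in the example above):
#   ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
#   ocr_options.lang = to_tesseract_langs(["en", "de"])
```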

Performance Considerations

  • Full-page OCR is slower than the hybrid layout + OCR approach
  • GPU acceleration helps significantly with EasyOCR
  • Tesseract CLI can be faster in CPU-only scenarios
  • RapidOCR offers good CPU performance
  • macOS OCR uses Apple's native OCR APIs and is available only on macOS
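
Because relative speed depends on hardware and document type, it is worth measuring on your own files. The helper below is a minimal sketch, not a Docling API: it times any set of zero-argument callables, so each one can wrap a converter.convert(...) call configured with a different OCR backend. The function name and the dictionary layout are assumptions.

```python
import time
from typing import Callable, Dict


def time_backends(runs: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each callable once and return elapsed wall-clock seconds per name."""
    timings: Dict[str, float] = {}
    for name, run in runs.items():
        start = time.perf_counter()
        run()
        timings[name] = time.perf_counter() - start
    return timings


# Usage sketch (easyocr_converter and tesseract_converter are hypothetical
# DocumentConverter instances built with different ocr_options):
#   timings = time_backends({
#       "easyocr": lambda: easyocr_converter.convert(pdf_path),
#       "tesseract_cli": lambda: tesseract_converter.convert(pdf_path),
#   })
#   for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
#       print(f"{name}: {seconds:.1f}s")
```

The first run of a backend may include model loading, so a warm-up conversion before timing is likely to give a fairer comparison.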

Requirements

EasyOCR

pip install docling easyocr

Tesseract

pip install docling
# TesseractCliOcrOptions calls the tesseract binary, so it must be installed and on PATH:
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: download the installer from GitHub

macOS OCR (macOS only)

pip install docling ocrmac

RapidOCR

pip install docling rapidocr-onnxruntime
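
Before choosing a backend at runtime, you can check which of the optional dependencies above are actually present. This sketch uses only the standard library; the import names (easyocr, ocrmac, rapidocr_onnxruntime) are assumed to match the pip packages listed above, and the backend labels are hypothetical.

```python
import importlib.util
import shutil
from typing import Dict


def available_ocr_backends() -> Dict[str, bool]:
    """Report which OCR backends appear to be installed on this machine."""

    def has_module(name: str) -> bool:
        # find_spec returns None when a top-level module cannot be found.
        return importlib.util.find_spec(name) is not None

    return {
        "easyocr": has_module("easyocr"),
        # The Tesseract CLI backend needs the tesseract binary on PATH.
        "tesseract_cli": shutil.which("tesseract") is not None,
        "ocrmac": has_module("ocrmac"),
        "rapidocr": has_module("rapidocr_onnxruntime"),
    }


if __name__ == "__main__":
    for backend, present in available_ocr_backends().items():
        print(f"{backend}: {'available' if present else 'missing'}")
```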
