OCR Configuration

Overview

Docling supports multiple OCR (Optical Character Recognition) engines for extracting text from scanned PDFs and images:

Tesseract: Industry-standard, multilingual OCR
EasyOCR: Deep learning-based, 80+ languages
RapidOCR: Lightweight with multiple backend options
macOS Vision: Native Apple platform OCR
Auto: Automatically select the best available engine

This guide shows you how to configure OCR for your use case.

Basic OCR Setup

Enable OCR for PDF processing:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,  # Enable OCR
    ocr_options=EasyOcrOptions(
        lang=["en", "fr", "de"],
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("scanned_document.pdf")

OCR Engines

Auto-detection

Let Docling choose the best available engine:

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    OcrAutoOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=OcrAutoOptions(),  # Auto-select engine
)

Auto-detection tries engines in this order: EasyOCR → Tesseract → RapidOCR → macOS Vision (if available). It uses the first one found.
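To see which engines are candidates on a given machine, you can check for the packages and executables from the install steps in this guide. The following is a minimal sketch, not Docling's own selection logic: `engine_availability` and `first_available` are hypothetical helpers, and the import names (`easyocr`, `tesserocr`, `rapidocr_onnxruntime`, `ocrmac`) are assumed to match the pip packages named in this guide.

```python
import importlib.util
import shutil
import sys

# Assumed import names for the pip packages mentioned in this guide.
ENGINE_MODULES = {
    "easyocr": "easyocr",
    "tesserocr": "tesserocr",
    "rapidocr": "rapidocr_onnxruntime",
    "ocrmac": "ocrmac",
}


def engine_availability():
    """Return a dict mapping engine name to True if it looks installed."""
    available = {
        name: importlib.util.find_spec(module) is not None
        for name, module in ENGINE_MODULES.items()
    }
    # The Tesseract CLI engine needs the executable on PATH, not a Python package.
    available["tesseract_cli"] = shutil.which("tesseract") is not None
    # ocrmac only works on macOS, even if the package happens to be installed.
    available["ocrmac"] = available["ocrmac"] and sys.platform == "darwin"
    return available


def first_available(order, available):
    """Return the first engine in `order` that is marked available, else None."""
    for name in order:
        if available.get(name, False):
            return name
    return None


if __name__ == "__main__":
    for name, ok in engine_availability().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

Running the script before configuring `ocr_options` shows which explicit engine options are worth trying if auto-detection does not pick the engine you expected.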

EasyOCR

Deep learning-based OCR with GPU acceleration:

Install EasyOCR

pip install easyocr

Configure EasyOCR

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(
        lang=["en", "fr", "de", "es"],  # Language codes
        use_gpu=True,                    # Enable GPU (None = auto-detect)
        confidence_threshold=0.5,        # Min confidence (0.0-1.0)
        model_storage_directory=None,    # Custom model cache path
        recog_network="standard",        # "standard" or "craft"
        download_enabled=True,           # Allow model downloads
    ),
)

Supported Languages: 80+ languages including English, French, German, Spanish, Chinese, Japanese, Korean, Arabic, and more. See EasyOCR documentation for the full list.
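The effect of `confidence_threshold` can be pictured as a filter over recognized text fragments. This is a sketch only: the `(text, confidence)` pairs are made-up sample data, `filter_by_confidence` is a hypothetical helper, and whether results exactly at the threshold are kept is an assumption here.

```python
def filter_by_confidence(results, threshold=0.5):
    """Keep (text, confidence) pairs whose confidence is at least `threshold`."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [(text, conf) for text, conf in results if conf >= threshold]


# Made-up sample data for illustration only.
sample = [("Invoice", 0.97), ("T0tal", 0.42), ("2024-01-15", 0.81)]
kept = filter_by_confidence(sample, threshold=0.5)
print(kept)
```

Raising the threshold drops more low-confidence fragments (fewer misreads, more gaps); lowering it keeps more text at the cost of noise.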

Tesseract CLI

Command-line Tesseract OCR:

Install Tesseract

macOS:

brew install tesseract
brew install tesseract-lang  # For additional languages

Ubuntu/Debian:

sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-deu  # German

Configure Tesseract CLI

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractCliOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractCliOcrOptions(
        lang=["eng", "fra", "deu"],  # 3-letter ISO codes
        tesseract_cmd="tesseract",   # Path to executable
        path=None,                   # TESSDATA_PREFIX path
        psm=None,                    # Page segmentation mode (0-13)
    ),
)

Language Codes: Use Tesseract language codes. Most are 3-letter ISO 639-2 codes (e.g., eng, fra, deu, spa, jpn); a few are Tesseract-specific names such as chi_sim.
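Before passing codes in `lang`, it can help to confirm that the matching language data is installed. The sketch below relies on the `tesseract --list-langs` command; the helper names are hypothetical, and the sample output string is illustrative rather than captured from a real system.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as 'List of available languages ...'.
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages(tesseract_cmd="tesseract"):
    """Return installed language codes, or [] if the executable is not found."""
    if shutil.which(tesseract_cmd) is None:
        return []
    proc = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


def missing_languages(wanted, installed):
    """Return the codes in `wanted` that are not in `installed`."""
    return [code for code in wanted if code not in installed]


# Illustrative output; the path and language set will differ on your system.
sample_output = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nfra\nosd\n'
installed = parse_list_langs(sample_output)
print(missing_languages(["eng", "fra", "deu"], installed))
```

On a real machine, call `installed_languages()` instead of parsing the sample string, and install any language pack the check reports as missing.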

Tesseract (Python bindings)

Tesseract via the tesserocr library:

Install tesserocr

pip install tesserocr

tesserocr requires Tesseract to be installed system-wide (see Tesseract CLI installation above).

Configure tesserocr

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractOcrOptions(
        lang=["eng", "fra", "deu"],
        path=None,  # TESSDATA_PREFIX
        psm=None,   # Page segmentation mode
    ),
)

RapidOCR

Lightweight OCR with multiple backend options:

Install RapidOCR

pip install rapidocr-onnxruntime  # ONNX backend (recommended)
# OR
pip install rapidocr-openvino     # Intel OpenVINO backend
# OR
pip install rapidocr-paddle       # PaddlePaddle backend

Configure RapidOCR

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=RapidOcrOptions(
        lang=["english", "chinese"],  # Note: RapidOCR doesn't support lang selection yet
        backend="onnxruntime",        # "onnxruntime", "openvino", "paddle", "torch"
        text_score=0.5,               # Min confidence (0.0-1.0)
        use_det=None,                 # Enable text detection
        use_cls=None,                 # Enable text classification
        use_rec=None,                 # Enable text recognition
    ),
)

RapidOCR has known issues with read-only filesystems (e.g., Databricks). Use Tesseract or EasyOCR in such environments.

macOS Vision OCR

Native macOS OCR using Apple’s Vision framework:

Install ocrmac (macOS only)

pip install ocrmac

Configure macOS OCR

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    OcrMacOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=OcrMacOptions(
        lang=["en-US", "fr-FR", "de-DE"],  # Locale codes
        recognition="accurate",            # "accurate" or "fast"
        framework="vision",                # Only "vision" supported
    ),
)

Language Format: Use language-REGION codes (e.g., en-US, fr-FR, de-DE, es-ES).
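Because each engine expects a different code format, a small lookup table avoids mixing them up. This is a sketch under the assumption that you only need the languages shown in this guide's examples; `LANGUAGE_CODES` and `codes_for` are hypothetical names, not part of Docling.

```python
# Codes taken from the examples in this guide; extend as needed.
LANGUAGE_CODES = {
    "english": {"easyocr": "en", "tesseract": "eng", "ocrmac": "en-US"},
    "french": {"easyocr": "fr", "tesseract": "fra", "ocrmac": "fr-FR"},
    "german": {"easyocr": "de", "tesseract": "deu", "ocrmac": "de-DE"},
    "spanish": {"easyocr": "es", "tesseract": "spa", "ocrmac": "es-ES"},
}


def codes_for(engine, languages):
    """Translate language names into the code format a given engine expects."""
    codes = []
    for language in languages:
        try:
            codes.append(LANGUAGE_CODES[language.lower()][engine])
        except KeyError:
            raise ValueError(f"No {engine!r} code known for {language!r}") from None
    return codes


print(codes_for("tesseract", ["English", "German"]))
print(codes_for("ocrmac", ["French"]))
```

The returned list can be passed as the `lang` argument of the matching options class, for example `TesseractCliOcrOptions(lang=codes_for("tesseract", ["English"]))`.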

Full-Page OCR

Force OCR on every page (even if the PDF has embedded text):

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractCliOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractCliOcrOptions(
        force_full_page_ocr=True,  # OCR entire page
        lang=["eng"],
    ),
)

Full-page OCR is slower but necessary for scanned documents with no embedded text layer. For hybrid PDFs (some pages scanned, some digital), Docling automatically detects which pages need OCR.

Bitmap Area Threshold

Control when bitmaps trigger OCR:

from docling.datamodel.pipeline_options import EasyOcrOptions

ocr_options = EasyOcrOptions(
    lang=["en"],
    bitmap_area_threshold=0.05,  # OCR if bitmap covers >5% of page
)

Lower values = more aggressive OCR. Range: 0.0 (always OCR bitmaps) to 1.0 (never OCR).
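The threshold is a fraction of the page area, so the arithmetic can be sketched as follows. This illustrates the comparison only and is not Docling's internal implementation; the helper names are hypothetical and the page size (US Letter at 72 points per inch) is just an example.

```python
def bitmap_coverage(bitmap_width, bitmap_height, page_width, page_height):
    """Return the fraction of the page area covered by one bitmap."""
    page_area = page_width * page_height
    if page_area <= 0:
        raise ValueError("page dimensions must be positive")
    return (bitmap_width * bitmap_height) / page_area


def triggers_ocr(coverage, threshold=0.05):
    """OCR is triggered when coverage exceeds the threshold."""
    return coverage > threshold


# Example page: US Letter, 612 x 792 points.
small_logo = bitmap_coverage(100, 100, 612, 792)   # roughly 2% of the page
large_scan = bitmap_coverage(300, 200, 612, 792)   # roughly 12% of the page

print(triggers_ocr(small_logo))   # below the 5% threshold
print(triggers_ocr(large_scan))   # above the 5% threshold
```

With this comparison, a threshold of 1.0 can never be exceeded, which matches the "never OCR" end of the range described above.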

Language Detection Example

Dynamically select languages based on detection:

import pycountry
from PIL import Image
from tesserocr import PyTessBaseAPI
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractOcrOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

def detect_languages(image):
    """Detect languages in an image using Tesseract."""
    with PyTessBaseAPI() as api:
        api.SetImage(image)
        # Orientation and script information for the page image
        detected = api.DetectOrientationScript()
        # Convert the detected script to language codes here.
        # The implementation depends on your requirements.
        return ["eng", "fra"]  # Placeholder result for this example

# Detect language from an image of the first page
first_page_image = Image.open("first_page.png")
detected_langs = detect_languages(first_page_image)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractOcrOptions(
        lang=detected_langs,
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
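The "convert script to language codes" step could look like the sketch below. It assumes the detection result is a dictionary with a `script_name` key (as tesserocr's `DetectOrientationScript()` is expected to return) or `None` when detection fails. The mapping itself is hypothetical: which languages to try for each script is a choice that depends on your documents.

```python
# Hypothetical mapping from a detected script name to Tesseract language codes.
SCRIPT_TO_LANGS = {
    "Latin": ["eng", "fra", "deu"],
    "Cyrillic": ["rus"],
    "Arabic": ["ara"],
}


def langs_from_osd(osd_result, default=("eng",)):
    """Pick language codes from an orientation/script detection result."""
    if not osd_result:
        # Detection failed or returned nothing: fall back to the default.
        return list(default)
    script = osd_result.get("script_name")
    return list(SCRIPT_TO_LANGS.get(script, default))


# Made-up detection results for illustration.
print(langs_from_osd({"script_name": "Cyrillic", "script_conf": 2.5}))
print(langs_from_osd(None))
```

Inside `detect_languages`, the placeholder return value could then be replaced with `return langs_from_osd(detected)`.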

Custom OCR Models

Use custom-trained models:

RapidOCR:

from docling.datamodel.pipeline_options import RapidOcrOptions

ocr_options = RapidOcrOptions(
    det_model_path="/path/to/custom_det_model.onnx",
    cls_model_path="/path/to/custom_cls_model.onnx",
    rec_model_path="/path/to/custom_rec_model.onnx",
    rec_keys_path="/path/to/custom_keys.txt",
)

EasyOCR:

from docling.datamodel.pipeline_options import EasyOcrOptions

ocr_options = EasyOcrOptions(
    model_storage_directory="/path/to/custom/models",
    recog_network="custom_network_name",
)

OCR Performance Comparison

Engine        Speed   Accuracy     GPU Support  Languages  Memory
EasyOCR       Medium  High         Yes          80+        High
Tesseract     Fast    Medium-High  No           100+       Low
RapidOCR      Fast    Medium       Limited      Limited    Low
macOS Vision  Fast    High         Yes (MPS)    20+        Medium

Troubleshooting

OCR is slow

Use GPU acceleration with EasyOCR: use_gpu=True
Try RapidOCR for faster processing at slightly lower accuracy
Reduce OCR batch size: pipeline_options.ocr_batch_size=2
Use Tesseract for CPU-only environments

Poor OCR accuracy

Increase image scale: pipeline_options.images_scale=2.0
Use EasyOCR for better accuracy on complex layouts
Try force_full_page_ocr=True for scanned documents
Ensure correct language codes are specified

Language not detected

Verify language data files are installed (Tesseract)
Check language codes match engine requirements (Tesseract codes such as eng for Tesseract, short codes such as en for EasyOCR, language-REGION codes such as en-US for macOS Vision)
For EasyOCR, models download automatically on first use

RapidOCR fails on Databricks

RapidOCR writes temporary files, which fails on read-only filesystems. Solution: use Tesseract or EasyOCR instead:

from docling.datamodel.pipeline_options import TesseractCliOcrOptions

ocr_options = TesseractCliOcrOptions(lang=["eng"])
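If the same code runs in both writable and read-only environments, the choice can be made at runtime. The sketch below assumes that a writable system temp directory is a reasonable proxy for whether RapidOCR can write its files; this guide does not say which directory RapidOCR actually uses, so treat that as an assumption. `is_writable` and `pick_engine_name` are hypothetical helpers.

```python
import os
import tempfile


def is_writable(directory):
    """Return True if a temporary file can be created in `directory`."""
    if not os.path.isdir(directory):
        return False
    try:
        with tempfile.TemporaryFile(dir=directory):
            pass
    except OSError:
        return False
    return True


def pick_engine_name(directory=None):
    """Prefer RapidOCR only when the checked directory is writable."""
    directory = directory or tempfile.gettempdir()
    return "rapidocr" if is_writable(directory) else "tesseract_cli"


print(pick_engine_name())
```

The returned name can then decide whether to build `RapidOcrOptions` or `TesseractCliOcrOptions`.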

Next Steps

PDF Processing: Learn about PDF-specific features beyond OCR
VLM Models: Use vision-language models as an alternative to OCR
Batch Processing: Optimize OCR for large document batches
Advanced Options: Configure hardware acceleration and model caching
