Overview

Docling supports multiple OCR (Optical Character Recognition) engines for extracting text from scanned PDFs and images:
  • Tesseract: Industry-standard, multilingual OCR
  • EasyOCR: Deep learning-based, 80+ languages
  • RapidOCR: Lightweight with multiple backend options
  • macOS Vision: Native Apple platform OCR
  • Auto: Automatically select the best available engine
This guide shows you how to configure OCR for your use case.

Basic OCR Setup

Enable OCR for PDF processing:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,  # Enable OCR
    ocr_options=EasyOcrOptions(
        lang=["en", "fr", "de"],
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("scanned_document.pdf")
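
The conversion result carries the parsed document, which can then be exported. A minimal sketch, assuming Docling's `result.document.export_to_markdown()` accessor; the helper name `convert_to_markdown` is hypothetical and not part of Docling:

```python
def convert_to_markdown(converter, source: str) -> str:
    """Convert `source` with a configured converter and return Markdown.

    `converter` is expected to be a docling DocumentConverter, or any object
    exposing the same `convert(...).document.export_to_markdown()` shape.
    """
    result = converter.convert(source)
    return result.document.export_to_markdown()


# Usage with the converter configured above (requires docling and the file):
# markdown = convert_to_markdown(converter, "scanned_document.pdf")
# print(markdown[:500])
```

Passing the converter in as an argument keeps the OCR configuration in one place and lets the same helper be reused with any of the engines described below.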

OCR Engines

Auto-detection

Let Docling choose the best available engine:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    OcrAutoOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=OcrAutoOptions(),  # Auto-select engine
)
Auto-detection tries engines in this order: EasyOCR → Tesseract → RapidOCR → macOS Vision (if available). It uses the first one found.
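
The "first one found" behaviour can be pictured as a simple probe over installed packages. The sketch below is not Docling's internal implementation: the candidate order is taken from the sentence above, and the module names are assumptions about each package's import name.

```python
import importlib.util
from typing import Callable, Optional, Sequence, Tuple

# (engine label, importable module name). Order follows the text above;
# the module names are assumptions, not values read from Docling.
CANDIDATES: Sequence[Tuple[str, str]] = (
    ("easyocr", "easyocr"),
    ("tesseract", "tesserocr"),
    ("rapidocr", "rapidocr_onnxruntime"),
    ("ocrmac", "ocrmac"),
)


def module_installed(name: str) -> bool:
    """Return True if the top-level module `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def first_available(
    candidates: Sequence[Tuple[str, str]] = CANDIDATES,
    is_installed: Callable[[str], bool] = module_installed,
) -> Optional[str]:
    """Return the label of the first candidate whose module is installed."""
    for label, module in candidates:
        if is_installed(module):
            return label
    return None


if __name__ == "__main__":
    print("First available engine:", first_available())
```

Running a probe like this before conversion is a quick way to see which engine an environment would end up with, for example inside a container image.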

EasyOCR

Deep learning-based OCR with GPU acceleration:
1. Install EasyOCR

pip install easyocr

2. Configure EasyOCR

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(
        lang=["en", "fr", "de", "es"],  # Language codes
        use_gpu=True,                    # Enable GPU (None = auto-detect)
        confidence_threshold=0.5,        # Min confidence (0.0-1.0)
        model_storage_directory=None,    # Custom model cache path
        recog_network="standard",        # "standard" or the name of a custom recognition model
        download_enabled=True,           # Allow model downloads
    ),
)
Supported Languages: 80+ languages including English, French, German, Spanish, Chinese, Japanese, Korean, Arabic, and more. See EasyOCR documentation for the full list.
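
To make the effect of `confidence_threshold` concrete, the sketch below filters a list of (text, confidence) pairs. It only illustrates the idea: the function name and sample data are hypothetical, and whether Docling's own comparison is inclusive is an assumption.

```python
from typing import List, Tuple


def filter_by_confidence(
    detections: List[Tuple[str, float]], threshold: float = 0.5
) -> List[str]:
    """Keep the text of detections whose confidence is at least `threshold`."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [text for text, confidence in detections if confidence >= threshold]


# Hypothetical OCR output: (recognized text, confidence score)
sample = [("Invoice", 0.98), ("T0tal", 0.41), ("2024-01-15", 0.87)]

print(filter_by_confidence(sample))       # ['Invoice', '2024-01-15']
print(filter_by_confidence(sample, 0.9))  # ['Invoice']
```

Raising the threshold trades recall for precision: fewer misread fragments survive, but faint or degraded text is more likely to be dropped.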

Tesseract CLI

Command-line Tesseract OCR:
1. Install Tesseract (macOS with Homebrew shown; use your platform's package manager elsewhere)

brew install tesseract
brew install tesseract-lang  # For additional languages

2. Configure Tesseract CLI

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractCliOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractCliOcrOptions(
        lang=["eng", "fra", "deu"],  # Tesseract language names
        tesseract_cmd="tesseract",   # Path to executable
        path=None,                   # TESSDATA_PREFIX path
        psm=None,                    # Page segmentation mode (0-13)
    ),
)
Language Codes: Use Tesseract's language data names. Most are 3-letter ISO 639-2 codes (e.g., eng, fra, deu, spa, jpn); a few add a script suffix (e.g., chi_sim for Simplified Chinese).
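
Before passing `lang` values to Docling, it can help to confirm that the matching language data is installed. A minimal sketch using the Tesseract CLI's `--list-langs` flag; the helper names are hypothetical:

```python
import subprocess
from typing import List


def parse_list_langs(output: str) -> List[str]:
    """Extract language names from `tesseract --list-langs` output.

    The first line is a header ("List of available languages ..."); each
    following non-empty line is one language name.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_tesseract_langs(tesseract_cmd: str = "tesseract") -> List[str]:
    """Run the Tesseract CLI and return its installed language names."""
    proc = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract builds print the list to stderr, so read both streams.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


def missing_langs(wanted: List[str], installed: List[str]) -> List[str]:
    """Return the requested languages that are not installed."""
    return [lang for lang in wanted if lang not in installed]


# Usage (requires the tesseract executable on PATH):
# print(missing_langs(["eng", "fra", "deu"], installed_tesseract_langs()))
```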

Tesseract (Python bindings)

Tesseract via tesserocr library:
1. Install tesserocr

pip install tesserocr

tesserocr requires Tesseract to be installed system-wide (see the Tesseract CLI installation above).

2. Configure tesserocr

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractOcrOptions(
        lang=["eng", "fra", "deu"],
        path=None,  # TESSDATA_PREFIX
        psm=None,   # Page segmentation mode
    ),
)

RapidOCR

Lightweight OCR with multiple backend options:
1. Install RapidOCR (pick one backend)

pip install rapidocr-onnxruntime  # ONNX backend (recommended)
# OR
pip install rapidocr-openvino     # Intel OpenVINO backend
# OR
pip install rapidocr-paddle       # PaddlePaddle backend

2. Configure RapidOCR

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=RapidOcrOptions(
        lang=["english", "chinese"],  # Note: RapidOCR doesn't support lang selection yet
        backend="onnxruntime",        # "onnxruntime", "openvino", "paddle", "torch"
        text_score=0.5,               # Min confidence (0.0-1.0)
        use_det=None,                 # Enable text detection
        use_cls=None,                 # Enable text classification
        use_rec=None,                 # Enable text recognition
    ),
)
RapidOCR has known issues with read-only filesystems (e.g., Databricks). Use Tesseract or EasyOCR in such environments.
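
One way to guard against this is to probe for write access before picking an engine. The sketch below is an illustration only: the source does not say which directory RapidOCR writes to, so checking the system temporary directory is an assumption, and the engine labels are hypothetical names rather than Docling identifiers.

```python
import os
import tempfile
from typing import Optional


def directory_is_writable(path: str) -> bool:
    """Return True if a temporary file can actually be created in `path`."""
    try:
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


def choose_engine_name(
    preferred: str = "rapidocr",
    fallback: str = "tesseract_cli",
    work_dir: Optional[str] = None,
) -> str:
    """Return `preferred` when `work_dir` is writable, else `fallback`."""
    # Assumption: the system temp directory stands in for wherever the
    # engine needs to write.
    work_dir = work_dir or tempfile.gettempdir()
    return preferred if directory_is_writable(work_dir) else fallback


if __name__ == "__main__":
    print("Engine to configure:", choose_engine_name())
```

Creating a real temporary file is more reliable than checking permission bits, because read-only mounts can report writable permissions and still reject writes.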

macOS Vision OCR

Native macOS OCR using Apple’s Vision framework:
1. Install ocrmac (macOS only)

pip install ocrmac

2. Configure macOS OCR

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    OcrMacOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=OcrMacOptions(
        lang=["en-US", "fr-FR", "de-DE"],  # Locale codes
        recognition="accurate",            # "accurate" or "fast"
        framework="vision",                # Only "vision" supported
    ),
)
Language Format: Use language-REGION codes (e.g., en-US, fr-FR, de-DE, es-ES).

Full-Page OCR

Force OCR on every page, even if the PDF has embedded text:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractCliOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractCliOcrOptions(
        force_full_page_ocr=True,  # OCR entire page
        lang=["eng"],
    ),
)
Full-page OCR is slower, but it is the reliable choice for scanned documents with no usable embedded text layer. Without it, Docling runs OCR only where it detects bitmap content, so hybrid PDFs (some pages scanned, some digital) are OCRed only on the pages that need it.

Bitmap Area Threshold

Control when bitmaps trigger OCR:
from docling.datamodel.pipeline_options import EasyOcrOptions

ocr_options = EasyOcrOptions(
    lang=["en"],
    bitmap_area_threshold=0.05,  # OCR if bitmap covers >5% of page
)
Lower values trigger OCR more aggressively. Range: 0.0 (always OCR bitmaps) to 1.0 (never OCR).
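
The threshold is a fraction of the page area, which a small worked example makes concrete. This is a simplified sketch, not Docling's internal computation: it sums bitmap rectangles without handling overlap, and the strict `>` comparison follows the code comment above.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom) in points


def bitmap_coverage(
    page_width: float, page_height: float, bitmaps: Iterable[Rect]
) -> float:
    """Return the fraction of the page covered by the bitmap rectangles."""
    page_area = page_width * page_height
    if page_area <= 0:
        raise ValueError("page dimensions must be positive")
    covered = sum(
        max(0.0, right - left) * max(0.0, bottom - top)
        for left, top, right, bottom in bitmaps
    )
    return min(1.0, covered / page_area)  # overlap is ignored, so cap at 1.0


def should_ocr(coverage: float, threshold: float = 0.05) -> bool:
    """Return True when bitmap coverage exceeds the threshold."""
    return coverage > threshold


# US Letter page: 612 x 792 points
small_logo = bitmap_coverage(612, 792, [(0, 0, 100, 100)])
scanned_figure = bitmap_coverage(612, 792, [(50, 50, 350, 450)])

print(round(small_logo, 3), should_ocr(small_logo))          # 0.021 False
print(round(scanned_figure, 3), should_ocr(scanned_figure))  # 0.248 True
```

With the 0.05 threshold, a small logo is skipped while a large scanned figure triggers OCR.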

Language Detection Example

Dynamically select languages based on detection:
from PIL import Image
from tesserocr import PSM, PyTessBaseAPI

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

def detect_languages(image):
    """Detect the script of an image with Tesseract and return language codes."""
    # Orientation and script detection (OSD) needs the OSD page segmentation
    # mode and the osd language data to be installed.
    with PyTessBaseAPI(psm=PSM.OSD_ONLY) as api:
        api.SetImage(image)
        detected = api.DetectOrientationScript()  # Result includes "script_name"
    # Convert the detected script to language codes.
    # Implementation depends on your requirements; this placeholder
    # ignores `detected` and returns a fixed list.
    return ["eng", "fra"]  # Example

# Detect language from the first page
first_page_image = Image.open("first_page.png")
detected_langs = detect_languages(first_page_image)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractOcrOptions(
        lang=detected_langs,
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
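
The script-to-language step left open in `detect_languages` could be filled in with a lookup table. The sketch below is one possible approach under stated assumptions: the mapping is hypothetical and should be adapted to the languages you expect, and the input is assumed to be a dict with `script_name` and `script_conf` keys, as returned by tesserocr's `DetectOrientationScript()`.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical mapping from detected script names to Tesseract languages.
SCRIPT_TO_LANGS: Dict[str, List[str]] = {
    "Latin": ["eng", "fra", "deu"],
    "Cyrillic": ["rus"],
    "Arabic": ["ara"],
    "Han": ["chi_sim"],
    "Japanese": ["jpn"],
    "Hangul": ["kor"],
}


def langs_for_script(
    osd: Optional[dict],
    default: Sequence[str] = ("eng",),
    min_confidence: float = 1.0,
) -> List[str]:
    """Map an OSD result to language codes, falling back to `default`."""
    if not osd:
        return list(default)  # detection failed or returned nothing
    if osd.get("script_conf", 0.0) < min_confidence:
        return list(default)  # detection too uncertain to trust
    return list(SCRIPT_TO_LANGS.get(osd.get("script_name", ""), default))


print(langs_for_script({"script_name": "Cyrillic", "script_conf": 4.2}))  # ['rus']
print(langs_for_script(None))                                             # ['eng']
```

Falling back to a default list keeps the pipeline usable when detection fails, for example on a nearly blank first page.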

Custom OCR Models

Use custom-trained models:
from docling.datamodel.pipeline_options import RapidOcrOptions

ocr_options = RapidOcrOptions(
    det_model_path="/path/to/custom_det_model.onnx",
    cls_model_path="/path/to/custom_cls_model.onnx",
    rec_model_path="/path/to/custom_rec_model.onnx",
    rec_keys_path="/path/to/custom_keys.txt",
)
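
A wrong model path tends to surface as an obscure error deep inside the OCR backend, so it can be worth checking the files up front. A minimal sketch; `missing_model_files` is a hypothetical helper, and the keyword names simply mirror the options shown above:

```python
from pathlib import Path
from typing import Dict, List


def missing_model_files(paths: Dict[str, str]) -> List[str]:
    """Return the option names whose path does not point to an existing file."""
    return [name for name, path in paths.items() if not Path(path).is_file()]


custom_models = {
    "det_model_path": "/path/to/custom_det_model.onnx",
    "cls_model_path": "/path/to/custom_cls_model.onnx",
    "rec_model_path": "/path/to/custom_rec_model.onnx",
    "rec_keys_path": "/path/to/custom_keys.txt",
}

missing = missing_model_files(custom_models)
if missing:
    print("Missing model files for:", ", ".join(missing))
# else:
#     from docling.datamodel.pipeline_options import RapidOcrOptions
#     ocr_options = RapidOcrOptions(**custom_models)
```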

OCR Performance Comparison

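| Engine       | Speed  | Accuracy    | GPU Support | Languages | Memory |
|--------------|--------|-------------|-------------|-----------|--------|
| EasyOCR      | Medium | High        | Yes         | 80+       | High   |
| Tesseract    | Fast   | Medium-High | No          | 100+      | Low    |
| RapidOCR     | Fast   | Medium      | Limited     | Limited   | Low    |
| macOS Vision | Fast   | High        | Yes (MPS)   | 20+       | Medium |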
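
These ratings are qualitative, so it is worth timing the engines on your own documents. The harness below is a generic sketch: `time_engines` is a hypothetical helper, and each runner is any zero-argument callable, such as a lambda wrapping `converter.convert(...)` for a converter configured with one engine.

```python
import time
from typing import Callable, Dict


def time_engines(
    runners: Dict[str, Callable[[], object]], repeats: int = 3
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each runner."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings: Dict[str, float] = {}
    for name, run in runners.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run()
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


# Usage (assumes converters built as shown in the engine sections):
# timings = time_engines({
#     "easyocr": lambda: easyocr_converter.convert("sample.pdf"),
#     "tesseract": lambda: tesseract_converter.convert("sample.pdf"),
# })
# for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
#     print(f"{name}: {seconds:.2f}s")
```

Taking the best of several runs reduces noise from one-time costs such as model downloads and warm-up on the first call.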

Troubleshooting

Slow OCR

  • Use GPU acceleration with EasyOCR: use_gpu=True
  • Try RapidOCR for faster processing at slightly lower accuracy
  • Reduce the OCR batch size: pipeline_options.ocr_batch_size=2
  • Use Tesseract for CPU-only environments

Poor accuracy

  • Increase the image scale: pipeline_options.images_scale=2.0
  • Use EasyOCR for better accuracy on complex layouts
  • Try force_full_page_ocr=True for scanned documents
  • Ensure the correct language codes are specified

Language problems

  • Verify that the language data files are installed (Tesseract)
  • Check that language codes match the engine's format (3-letter names such as eng for Tesseract, short codes such as en for EasyOCR, locale codes such as en-US for macOS Vision)
  • For EasyOCR, models download automatically on first use

Read-only filesystem errors

RapidOCR writes temporary files, which fails on read-only filesystems. Solution: use Tesseract or EasyOCR instead:
ocr_options = TesseractCliOcrOptions(lang=["eng"])

Next Steps

  • PDF Processing: Learn about PDF-specific features beyond OCR
  • VLM Models: Use vision-language models as an alternative to OCR
  • Batch Processing: Optimize OCR for large document batches
  • Advanced Options: Configure hardware acceleration and model caching
