Overview
Docling supports multiple OCR (Optical Character Recognition) engines for extracting text from scanned PDFs and images:
Tesseract: Industry-standard, multilingual OCR
EasyOCR: Deep learning-based, 80+ languages
RapidOCR: Lightweight with multiple backend options
macOS Vision: Native Apple platform OCR
Auto: Automatically select the best available engine
This guide shows you how to configure OCR for your use case.
Basic OCR Setup
Enable OCR for PDF processing:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,  # Enable OCR
    ocr_options=EasyOcrOptions(
        lang=["en", "fr", "de"],
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("scanned_document.pdf")
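Once conversion finishes, the recognized text is available on the returned result. The sketch below wraps the setup above in a helper and exports the document as Markdown. The helper name ocr_pdf_to_markdown and its parameters are illustrative, not part of Docling. The Docling imports sit inside the function so that the module can still be loaded where Docling is not installed.

```python
from pathlib import Path
from typing import List, Optional, Union


def ocr_pdf_to_markdown(
    pdf_path: Union[str, Path],
    languages: Optional[List[str]] = None,
) -> str:
    """Convert a PDF with EasyOCR enabled and return the text as Markdown.

    Hypothetical helper for illustration; it mirrors the basic setup above.
    """
    # Imported lazily because Docling and EasyOCR are heavy dependencies.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_options=EasyOcrOptions(lang=languages or ["en"]),
    )
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(str(pdf_path))
    return result.document.export_to_markdown()


# Example call (requires Docling, EasyOCR, and a real file):
# markdown = ocr_pdf_to_markdown("scanned_document.pdf", ["en", "fr", "de"])
```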
OCR Engines
Auto-detection
Let Docling choose the best available engine:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    OcrAutoOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=OcrAutoOptions(),  # Auto-select engine
)
Auto-detection tries engines in this order: EasyOCR → Tesseract → RapidOCR → macOS Vision (if available). It uses the first one found.
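To see which engine a "first one found" strategy would pick on a given machine, the idea can be sketched with standard-library probes. This is an illustration of the selection logic only, not Docling's implementation. The import names (easyocr, rapidocr_onnxruntime, ocrmac) and the use of the tesseract executable as the Tesseract probe are assumptions.

```python
import importlib.util
import shutil
import sys
from typing import Callable, Dict, Optional


def _has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# Probes listed in the order described above (dicts preserve insertion order).
# The module and executable names are assumptions made for this sketch.
DEFAULT_PROBES: Dict[str, Callable[[], bool]] = {
    "easyocr": lambda: _has_module("easyocr"),
    "tesseract": lambda: shutil.which("tesseract") is not None,
    "rapidocr": lambda: _has_module("rapidocr_onnxruntime"),
    "ocrmac": lambda: sys.platform == "darwin" and _has_module("ocrmac"),
}


def first_available(probes: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the name of the first engine whose probe succeeds, else None."""
    for name, probe in probes.items():
        if probe():
            return name
    return None


# Example: print(first_available(DEFAULT_PROBES))
```

If the function returns None, no engine is installed, and an explicit options class would fail as well.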
EasyOCR
Deep learning-based OCR with GPU acceleration:
Configure EasyOCR
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(
        lang=["en", "fr", "de", "es"],  # Language codes
        use_gpu=True,  # Enable GPU (None = auto-detect)
        confidence_threshold=0.5,  # Min confidence (0.0-1.0)
        model_storage_directory=None,  # Custom model cache path
        recog_network="standard",  # "standard", or the name of a custom recognition network
        download_enabled=True,  # Allow model downloads
    ),
)
Supported Languages: 80+ languages including English, French, German, Spanish, Chinese, Japanese, Korean, Arabic, and more. See EasyOCR documentation for the full list.
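The effect of confidence_threshold can be pictured as a filter over recognized lines. The sketch below is a simplified model, assuming each line carries a score between 0.0 and 1.0 and that the boundary is inclusive. OcrLine and filter_by_confidence are hypothetical names, not Docling or EasyOCR types.

```python
from typing import Iterable, List, NamedTuple


class OcrLine(NamedTuple):
    """A recognized line of text with its confidence score (illustrative type)."""

    text: str
    confidence: float


def filter_by_confidence(
    lines: Iterable[OcrLine], threshold: float = 0.5
) -> List[OcrLine]:
    """Keep lines whose confidence is at least `threshold` (assumed inclusive)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return [line for line in lines if line.confidence >= threshold]


sample = [OcrLine("Invoice", 0.93), OcrLine("l1|I", 0.31), OcrLine("Total", 0.5)]
kept = [line.text for line in filter_by_confidence(sample)]
```

Raising the threshold trades recall for precision: fewer garbled fragments survive, but faint or low-contrast text may be dropped too.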
Tesseract CLI
Command-line Tesseract OCR:
Install Tesseract
macOS
brew install tesseract
brew install tesseract-lang # For additional languages
Ubuntu/Debian
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-deu # German
Windows
Download the installer from Tesseract GitHub
Configure Tesseract CLI
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractCliOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractCliOcrOptions(
        lang=["eng", "fra", "deu"],  # 3-letter Tesseract language codes
        tesseract_cmd="tesseract",  # Path to executable
        path=None,  # TESSDATA_PREFIX path
        psm=None,  # Page segmentation mode (0-13)
    ),
)
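Language Codes: Tesseract identifies languages by the names of its traineddata files. Most are 3-letter ISO 639-2 codes (e.g., eng, fra, deu, spa, jpn); some carry a script or variant suffix (e.g., chi_sim for Simplified Chinese).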
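Before configuring lang, it can help to confirm that the requested language data is actually installed. The sketch below uses the tesseract --list-langs command and parses its output. The helper names are hypothetical, and the sample output in the usage lines is illustrative rather than captured from a real system.

```python
import subprocess
from typing import Iterable, List


def parse_list_langs(output: str) -> List[str]:
    """Extract language names from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as 'List of available languages ...'.
    if lines and lines[0].lower().startswith("list of"):
        lines = lines[1:]
    return lines


def installed_tesseract_langs(tesseract_cmd: str = "tesseract") -> List[str]:
    """Run the Tesseract CLI and return the languages it reports."""
    proc = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Fall back to stderr in case a build prints the list there.
    return parse_list_langs(proc.stdout or proc.stderr)


def missing_langs(requested: Iterable[str], installed: Iterable[str]) -> List[str]:
    """Return the requested codes that are not installed, in request order."""
    available = set(installed)
    return [code for code in requested if code not in available]


# Illustrative output; the path and language set will differ per system.
sample_output = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nfra\nosd\n'
not_installed = missing_langs(["eng", "deu"], parse_list_langs(sample_output))
```

On a real system, call missing_langs(["eng", "fra", "deu"], installed_tesseract_langs()) and install whatever it returns before running the converter.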
Tesseract (Python bindings)
Tesseract via the tesserocr library:
Install tesserocr
tesserocr requires Tesseract to be installed system-wide (see Tesseract CLI installation above).
Configure tesserocr
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractOcrOptions(
        lang=["eng", "fra", "deu"],
        path=None,  # TESSDATA_PREFIX
        psm=None,  # Page segmentation mode
    ),
)
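Both Tesseract option classes accept a psm value. As a quick reference, the sketch below lists some commonly used page segmentation modes as described by Tesseract's own help text, together with a small range check. The names COMMON_PSM_MODES and validate_psm are hypothetical helpers for illustration, not part of Docling.

```python
from typing import Dict, Optional

# A subset of Tesseract's page segmentation modes (valid values are 0-13).
COMMON_PSM_MODES: Dict[int, str] = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    3: "Fully automatic page segmentation, no OSD (Tesseract default)",
    4: "Single column of text of variable sizes",
    6: "Single uniform block of text",
    7: "Single text line",
    11: "Sparse text, in no particular order",
    13: "Raw line",
}


def validate_psm(psm: Optional[int]) -> Optional[int]:
    """Return psm unchanged if it is None or an integer in 0-13."""
    if psm is None:
        return None  # Let Tesseract use its default mode.
    if isinstance(psm, bool) or not isinstance(psm, int) or not 0 <= psm <= 13:
        raise ValueError(f"psm must be an integer from 0 to 13, got {psm!r}")
    return psm


# Example: pass validate_psm(6) as the psm argument for a page that is
# one uniform block of text.
```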
RapidOCR
Lightweight OCR with multiple backend options:
Install RapidOCR
pip install rapidocr-onnxruntime # ONNX backend (recommended)
# OR
pip install rapidocr-openvino # Intel OpenVINO backend
# OR
pip install rapidocr-paddle # PaddlePaddle backend
Configure RapidOCR
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=RapidOcrOptions(
        lang=["english", "chinese"],  # Note: RapidOCR doesn't support lang selection yet
        backend="onnxruntime",  # "onnxruntime", "openvino", "paddle", "torch"
        text_score=0.5,  # Min confidence (0.0-1.0)
        use_det=None,  # Enable text detection
        use_cls=None,  # Enable text classification
        use_rec=None,  # Enable text recognition
    ),
)
RapidOCR has known issues with read-only filesystems (e.g., Databricks). Use Tesseract or EasyOCR in such environments.
macOS Vision OCR
Native macOS OCR using Apple’s Vision framework:
Install ocrmac (macOS only)
Configure macOS OCR
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    OcrMacOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=OcrMacOptions(
        lang=["en-US", "fr-FR", "de-DE"],  # Locale codes
        recognition="accurate",  # "accurate" or "fast"
        framework="vision",  # Only "vision" supported
    ),
)
Language Format: Use language-REGION codes (e.g., en-US, fr-FR, de-DE, es-ES).
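Because each engine expects a different code format, switching engines usually means rewriting the lang list. The sketch below keeps one small lookup table, built only from the codes shown in this guide, and translates language names per engine. ENGINE_LANG_CODES and lang_codes_for are hypothetical names, and the table would need extending for other languages.

```python
from typing import Dict, Iterable, List

# Codes taken from the examples in this guide; extend as needed.
ENGINE_LANG_CODES: Dict[str, Dict[str, str]] = {
    "english": {"easyocr": "en", "tesseract": "eng", "ocrmac": "en-US"},
    "french": {"easyocr": "fr", "tesseract": "fra", "ocrmac": "fr-FR"},
    "german": {"easyocr": "de", "tesseract": "deu", "ocrmac": "de-DE"},
    "spanish": {"easyocr": "es", "tesseract": "spa", "ocrmac": "es-ES"},
}


def lang_codes_for(engine: str, languages: Iterable[str]) -> List[str]:
    """Translate language names into the code format a given engine expects."""
    codes = []
    for language in languages:
        try:
            codes.append(ENGINE_LANG_CODES[language.lower()][engine])
        except KeyError:
            raise ValueError(
                f"No code known for language {language!r} and engine {engine!r}"
            ) from None
    return codes


tesseract_langs = lang_codes_for("tesseract", ["english", "german"])
mac_langs = lang_codes_for("ocrmac", ["French"])
```

The resulting lists can be passed as the lang argument of the matching options class.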
Full-Page OCR
Force OCR on every page, even if the PDF has embedded text:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractCliOcrOptions,
)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractCliOcrOptions(
        force_full_page_ocr=True,  # OCR entire page
        lang=["eng"],
    ),
)
Full-page OCR is slower but necessary for scanned documents with no embedded text layer. For hybrid PDFs (some pages scanned, some digital), Docling automatically detects which pages need OCR.
Bitmap Area Threshold
Control when bitmaps trigger OCR:
from docling.datamodel.pipeline_options import EasyOcrOptions

ocr_options = EasyOcrOptions(
    lang=["en"],
    bitmap_area_threshold=0.05,  # OCR if bitmap covers >5% of page
)
Lower values make OCR more aggressive: bitmaps trigger OCR once they cover more than this fraction of the page. Range: 0.0 (always OCR bitmaps) to 1.0 (never OCR).
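The threshold can be read as a comparison between bitmap coverage and page area. The sketch below is a simplified model of that decision, assuming coverage is the summed area of bitmap rectangles divided by the page area (overlaps are not merged). It is not Docling's actual implementation, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


def bitmap_coverage(
    bitmaps: Iterable[Rect], page_width: float, page_height: float
) -> float:
    """Return the fraction of the page covered by bitmaps, capped at 1.0."""
    page_area = page_width * page_height
    if page_area <= 0:
        raise ValueError("page dimensions must be positive")
    covered = sum(
        max(0.0, right - left) * max(0.0, bottom - top)
        for left, top, right, bottom in bitmaps
    )
    return min(covered / page_area, 1.0)


def should_ocr_bitmaps(coverage: float, bitmap_area_threshold: float = 0.05) -> bool:
    """OCR is triggered when coverage strictly exceeds the threshold."""
    return coverage > bitmap_area_threshold


# A 300x200 image on a 600x800 page covers 12.5% of the page.
coverage = bitmap_coverage([(0, 0, 300, 200)], page_width=600, page_height=800)
```

With the default-style threshold of 0.05 this page would be sent to OCR; with a threshold of 0.5 it would not.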
Language Detection Example
Dynamically select languages based on detection:
import pycountry  # Handy for converting between ISO language codes
from PIL import Image
from tesserocr import PyTessBaseAPI

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractOcrOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption


def detect_languages(image):
    """Detect languages in an image using Tesseract."""
    with PyTessBaseAPI() as api:
        api.SetImage(image)
        detected = api.DetectOrientationScript()
    # Convert script to language codes
    # Implementation depends on your requirements
    return ["eng", "fra"]  # Example


# Detect language from first page
first_page_image = Image.open("first_page.png")
detected_langs = detect_languages(first_page_image)

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractOcrOptions(
        lang=detected_langs,
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
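The example above leaves the script-to-language conversion open. One possible shape for that step is sketched below, assuming the detection result is a dict with script_name and script_conf keys. The mapping table, the confidence cut-off, and the function name langs_from_osd are all illustrative choices; check which script names your Tesseract build reports before relying on them.

```python
from typing import Any, Dict, List, Mapping, Optional, Sequence

# Illustrative mapping from detected script to candidate Tesseract languages.
SCRIPT_TO_LANGS: Dict[str, List[str]] = {
    "Latin": ["eng", "fra", "deu"],
    "Cyrillic": ["rus"],
    "Arabic": ["ara"],
    "Han": ["chi_sim"],
}


def langs_from_osd(
    osd: Optional[Mapping[str, Any]],
    default: Sequence[str] = ("eng",),
    min_confidence: float = 1.0,  # Arbitrary cut-off chosen for this sketch.
) -> List[str]:
    """Map a script-detection result to language codes, with a fallback."""
    if not osd:
        return list(default)  # Detection failed or returned nothing.
    confidence = float(osd.get("script_conf", 0.0))
    if confidence < min_confidence:
        return list(default)  # Too uncertain to trust the detected script.
    return list(SCRIPT_TO_LANGS.get(osd.get("script_name"), default))


cyrillic_langs = langs_from_osd({"script_name": "Cyrillic", "script_conf": 5.2})
```

Script detection narrows the choice to a family of languages rather than one language, so a script such as Latin still maps to several candidates.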
Custom OCR Models
Use custom-trained models:
# RapidOCR: point to custom detection, classification, and recognition models
from docling.datamodel.pipeline_options import RapidOcrOptions

ocr_options = RapidOcrOptions(
    det_model_path="/path/to/custom_det_model.onnx",
    cls_model_path="/path/to/custom_cls_model.onnx",
    rec_model_path="/path/to/custom_rec_model.onnx",
    rec_keys_path="/path/to/custom_keys.txt",
)

# EasyOCR: load a custom recognition network from a model directory
from docling.datamodel.pipeline_options import EasyOcrOptions

ocr_options = EasyOcrOptions(
    model_storage_directory="/path/to/custom/models",
    recog_network="custom_network_name",
)
Engine        Speed   Accuracy     GPU Support  Languages  Memory
EasyOCR       Medium  High         Yes          80+        High
Tesseract     Fast    Medium-High  No           100+       Low
RapidOCR      Fast    Medium       Limited      Limited    Low
macOS Vision  Fast    High         Yes (MPS)    20+        Medium
Troubleshooting
Use GPU acceleration with EasyOCR: use_gpu=True
Try RapidOCR for faster but slightly lower accuracy
Reduce OCR batch size: pipeline_options.ocr_batch_size=2
Use Tesseract for CPU-only environments
Increase image scale: pipeline_options.images_scale=2.0
Use EasyOCR for better accuracy on complex layouts
Try force_full_page_ocr=True for scanned documents
Ensure correct language codes are specified
Verify language data files are installed (Tesseract)
Check that language codes match engine requirements (Tesseract codes such as eng, two-letter codes such as en for EasyOCR, locale codes such as en-US for macOS Vision)
For EasyOCR, models download automatically on first use
RapidOCR fails on Databricks
RapidOCR writes temporary files, which fails on read-only filesystems. Solution: use Tesseract or EasyOCR instead:
ocr_options = TesseractCliOcrOptions(lang=["eng"])
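If the same code has to run both locally and on a restricted platform, the engine choice can be made at runtime by probing whether a directory is writable. The sketch below is one way to do that with the standard library. Which directory RapidOCR actually writes to is not stated here, so work_dir is a stand-in, and the returned engine labels are plain strings chosen for this example.

```python
import tempfile
from pathlib import Path
from typing import Union


def is_writable(directory: Union[str, Path]) -> bool:
    """Return True if a temporary file can be created in `directory`."""
    try:
        with tempfile.TemporaryFile(dir=str(directory)):
            return True
    except OSError:
        # Covers missing directories, permission errors, and read-only mounts.
        return False


def pick_ocr_engine(work_dir: Union[str, Path]) -> str:
    """Prefer RapidOCR when files can be written, otherwise the Tesseract CLI."""
    return "rapidocr" if is_writable(work_dir) else "tesseract_cli"


engine = pick_ocr_engine(tempfile.gettempdir())
```

The returned label can then drive which options class is constructed, for example RapidOcrOptions or TesseractCliOcrOptions.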
Next Steps
PDF Processing: Learn about PDF-specific features beyond OCR
VLM Models: Use vision-language models as an alternative to OCR
Batch Processing: Optimize OCR for large document batches
Advanced Options: Configure hardware acceleration and model caching