
Overview

AcceleratorOptions configures hardware acceleration for Docling’s AI models, including layout detection, OCR, table structure extraction, and vision-language models. Proper configuration can significantly improve processing speed by leveraging GPUs and optimizing CPU usage.

AcceleratorOptions

Hardware acceleration configuration for model inference. Can be configured via environment variables with DOCLING_ prefix.
from docling.datamodel.accelerator_options import AcceleratorOptions, AcceleratorDevice

options = AcceleratorOptions(
    num_threads=8,
    device="cuda",
    cuda_use_flash_attention2=True
)

Parameters

num_threads
int
default:"4"
Number of CPU threads to use for model inference. Higher values can improve throughput on multi-core systems but may increase memory usage. Can be set via DOCLING_NUM_THREADS or OMP_NUM_THREADS environment variables. Recommended: number of physical CPU cores.
device
str | AcceleratorDevice
default:"'auto'"
Hardware device for model inference. Options:
  • auto - Automatic detection (selects best available device)
  • cpu - CPU only
  • cuda - NVIDIA GPU
  • cuda:N - Specific NVIDIA GPU (e.g., cuda:0, cuda:1)
  • mps - Apple Silicon GPU
  • xpu - Intel GPU
Auto mode selects the best available device. Can be set via DOCLING_DEVICE environment variable.
cuda_use_flash_attention2
bool
default:"False"
Enable Flash Attention 2 optimization for CUDA devices. Provides significant speedup and memory reduction for transformer models on compatible NVIDIA GPUs (Ampere or newer). Requires the flash-attn package. Can be set via DOCLING_CUDA_USE_FLASH_ATTENTION2 environment variable.

AcceleratorDevice

Enum defining available hardware devices for model inference.
from docling.datamodel.accelerator_options import AcceleratorDevice

# Use CUDA GPU
device = AcceleratorDevice.CUDA

# Auto-detect best device
device = AcceleratorDevice.AUTO

Values

AUTO
str
Automatically detect and use the best available device (GPU if available, otherwise CPU).
CPU
str
Force CPU-only processing. Use when GPU is unavailable or for debugging.
CUDA
str
Use NVIDIA CUDA GPU for acceleration. Requires CUDA-compatible GPU and drivers.
MPS
str
Use Apple Metal Performance Shaders for GPU acceleration on Apple Silicon (M1/M2/M3).
XPU
str
Use Intel XPU for GPU acceleration on Intel GPUs.
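The resolution order for AUTO can be approximated as follows. This is a simplified sketch of the idea, not Docling's actual implementation (the real logic lives inside the library), and it falls back to CPU when PyTorch is not installed:

```python
def resolve_auto_device() -> str:
    """Approximate AUTO resolution: prefer CUDA, then MPS, else CPU.
    Sketch only -- Docling's real selection logic may differ."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available: CPU is the only option
    if torch.cuda.is_available():
        return "cuda"
    # MPS backend only exists in newer PyTorch builds, so guard the lookup
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(resolve_auto_device())
```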

Usage

Basic Configuration

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.accelerator_options import AcceleratorOptions

# Configure hardware acceleration
accel_options = AcceleratorOptions(
    num_threads=8,
    device="cuda:0"  # Use first GPU
)

pipeline_options = PdfPipelineOptions(
    accelerator_options=accel_options,
    do_ocr=True,
    do_table_structure=True
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

Auto-Detection

# Let Docling automatically select the best device
accel_options = AcceleratorOptions(
    device="auto",
    num_threads=16
)

Multi-GPU Setup

# Use specific GPU in multi-GPU system
accel_options = AcceleratorOptions(
    device="cuda:1",  # Use second GPU
    cuda_use_flash_attention2=True
)

Apple Silicon Optimization

# Optimize for Apple M1/M2/M3
accel_options = AcceleratorOptions(
    device="mps",
    num_threads=8  # Match total core count (performance + efficiency cores)
)

CPU-Only Mode

# Force CPU processing (e.g., for debugging)
accel_options = AcceleratorOptions(
    device="cpu",
    num_threads=16  # Use more threads on CPU
)

Environment Variables

AcceleratorOptions can be configured via environment variables:
# Set device
export DOCLING_DEVICE=cuda:0

# Set number of threads
export DOCLING_NUM_THREADS=8
# Or use OpenMP standard
export OMP_NUM_THREADS=8

# Enable Flash Attention 2
export DOCLING_CUDA_USE_FLASH_ATTENTION2=true
Then use defaults in code:
from docling.datamodel.accelerator_options import AcceleratorOptions

# Reads from environment variables
options = AcceleratorOptions()
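Conceptually, the environment-driven defaults behave like the sketch below. The precedence (DOCLING_NUM_THREADS over OMP_NUM_THREADS) and the parsing are illustrative assumptions; Docling handles this internally:

```python
import os

def threads_from_env(default: int = 4) -> int:
    """Resolve the thread count from environment variables,
    checking DOCLING_NUM_THREADS before OMP_NUM_THREADS (assumed order)."""
    for var in ("DOCLING_NUM_THREADS", "OMP_NUM_THREADS"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return default

os.environ["DOCLING_NUM_THREADS"] = "8"
print(threads_from_env())  # 8
```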

Performance Recommendations

NVIDIA GPU (CUDA)

Best Configuration:
AcceleratorOptions(
    device="cuda",
    cuda_use_flash_attention2=True,  # If supported
    num_threads=4  # Lower since GPU handles heavy work
)
Requirements:
  • CUDA 11.8+ and compatible drivers
  • For Flash Attention 2: Ampere GPU or newer (RTX 30xx, A100, etc.)
  • Install: pip install flash-attn (optional, for Flash Attention 2)
Expected Speedup: 5-10x over CPU for large documents
Apple Silicon (MPS)

Best Configuration:
AcceleratorOptions(
    device="mps",
    num_threads=8
)
Requirements:
  • macOS 12.3+ (Monterey or later)
  • PyTorch with MPS support
Expected Speedup: 2-4x over CPU
CPU-Only

Best Configuration:
AcceleratorOptions(
    device="cpu",
    num_threads=16  # Match your CPU core count
)
Optimization Tips:
  • Set num_threads to number of physical cores (not hyperthreads)
  • For Intel CPUs: Consider OpenVINO backend for some models
  • Process multiple documents in parallel at application level
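The last tip, application-level parallelism, can be sketched with the standard library. The `convert_one` function here is a hypothetical stand-in for a real conversion call; threads are shown for simplicity (much of the inference runs in native code that releases the GIL), though a ProcessPoolExecutor may suit heavily CPU-bound pipelines better:

```python
from concurrent.futures import ThreadPoolExecutor

def convert_one(path: str) -> str:
    # Placeholder for a real call such as converter.convert(path)
    return f"converted {path}"

paths = ["a.pdf", "b.pdf", "c.pdf"]
# Run conversions concurrently; map preserves input order in the results
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(convert_one, paths))
print(results)
```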
Intel GPU (XPU)

Best Configuration:
AcceleratorOptions(
    device="xpu",
    num_threads=4
)
Requirements:
  • Intel GPU with oneAPI support
  • Intel Extension for PyTorch

Flash Attention 2

Flash Attention 2 is an optimized attention mechanism that provides:
  • Faster inference: 2-4x speedup for transformer models
  • Lower memory: 50% reduction in GPU memory usage
  • Same accuracy: Numerically identical results

Requirements

  • NVIDIA GPU with compute capability 8.0+ (Ampere or newer)
  • CUDA 11.8+
  • Flash Attention package: pip install flash-attn
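A quick way to check the compute-capability requirement is PyTorch's `torch.cuda.get_device_capability`, which returns a `(major, minor)` tuple. This sketch returns False when PyTorch or CUDA is unavailable:

```python
def supports_flash_attention2(device_index: int = 0) -> bool:
    """Flash Attention 2 needs compute capability 8.0+ (Ampere or newer)."""
    try:
        import torch
    except ImportError:
        return False  # no PyTorch installed
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major >= 8

print(supports_flash_attention2())
```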

Enabling Flash Attention 2

accel_options = AcceleratorOptions(
    device="cuda",
    cuda_use_flash_attention2=True
)
Or via environment variable:
export DOCLING_CUDA_USE_FLASH_ATTENTION2=true
Flash Attention 2 is only available on compatible NVIDIA GPUs. If enabled on unsupported hardware, Docling will fall back to standard attention automatically.

Thread Configuration

The num_threads parameter controls CPU parallelism for:
  • Model inference (when using CPU device)
  • Pre/post-processing operations
  • Parallel page processing

Guidelines

1. Identify CPU cores

Find your physical core count (not hyperthreads):
# Linux
lscpu | grep "Core(s) per socket"

# macOS
sysctl hw.physicalcpu

# Python
import os
os.cpu_count() // 2  # Rough estimate
2. Set num_threads

Recommended values:
  • GPU processing: 4-8 threads (CPU handles preprocessing only)
  • CPU processing: Match physical core count
  • Shared server: Leave some cores for other processes
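These guidelines can be folded into a small helper. Note that `os.cpu_count()` reports logical cores, so halving is only a rough estimate of the physical count (it over-corrects on CPUs without hyperthreading):

```python
import os

def recommended_threads(device: str) -> int:
    """Pick a num_threads value following the guidelines above (sketch)."""
    logical = os.cpu_count() or 4
    physical_estimate = max(1, logical // 2)  # rough; assumes 2-way SMT
    if device.startswith(("cuda", "mps", "xpu")):
        # GPU does the heavy lifting; CPU only handles pre/post-processing
        return min(8, max(4, physical_estimate))
    return physical_estimate  # CPU inference: match physical core count

print(recommended_threads("cuda"))
print(recommended_threads("cpu"))
```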
3. Benchmark and adjust

Test different values and measure throughput:
import time

for num_threads in [4, 8, 16]:
    options = AcceleratorOptions(num_threads=num_threads)
    start = time.time()
    # Process document
    print(f"{num_threads} threads: {time.time() - start:.2f}s")

Troubleshooting

CUDA Out of Memory

Solutions:
  1. Reduce batch sizes in pipeline options
  2. Process fewer pages concurrently
  3. Enable Flash Attention 2 (reduces memory usage)
  4. Use a GPU with more VRAM
pipeline_options = PdfPipelineOptions(
    layout_batch_size=2,  # Reduce from default 4
    ocr_batch_size=2,
    table_structure_batch_size=2
)
MPS Not Working on Apple Silicon

Common issues:
  • Ensure macOS 12.3+ (Monterey or later)
  • Update PyTorch to latest version with MPS support
  • Some models may not support MPS - fall back to CPU
# Disable MPS warnings for EasyOCR
from docling.datamodel.pipeline_options import EasyOcrOptions

ocr = EasyOcrOptions(
    suppress_mps_warnings=True
)
Slow CPU Performance

Optimization checklist:
  • Set num_threads to physical core count
  • Use OMP_NUM_THREADS environment variable
  • Disable unnecessary pipeline features
  • Consider GPU acceleration
  • Process documents in parallel (application-level)
Device Not Detected

Verify device availability:
import torch

# Check CUDA
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

# Check MPS (macOS)
print(f"MPS available: {torch.backends.mps.is_available()}")
