
Overview

Docling supports GPU acceleration for significantly faster document processing. This guide covers:
  • Device configuration (CUDA, MPS, XPU)
  • Standard pipeline GPU optimization
  • VLM pipeline GPU acceleration with inference servers
  • Performance benchmarks and best practices
GPU acceleration strategies are actively being improved. Check this guide regularly for updates.
Source: ~/workspace/source/docs/usage/gpu.md:1

Supported Devices

Source: ~/workspace/source/docling/datamodel/accelerator_options.py:14 Docling supports multiple hardware accelerators:
| Device | Description | Platform |
|--------|-------------|----------|
| `auto` | Automatic detection (recommended) | All |
| `cpu` | CPU-only processing | All |
| `cuda` | NVIDIA GPUs | Linux, Windows |
| `cuda:N` | Specific NVIDIA GPU (e.g., `cuda:0`) | Linux, Windows |
| `mps` | Apple Silicon GPU | macOS |
| `xpu` | Intel GPUs | Linux |
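When `auto` is selected, the best available backend is chosen. As an illustration of the fallback order (a sketch, not Docling's actual detection code, which probes the runtime via PyTorch):

```python
def resolve_device(cuda: bool, mps: bool, xpu: bool) -> str:
    """Illustrative fallback order for device='auto': CUDA, then MPS, then XPU, then CPU."""
    if cuda:
        return "cuda"
    if mps:
        return "mps"
    if xpu:
        return "xpu"
    return "cpu"

# On a machine with no accelerator, 'auto' resolves to the CPU.
print(resolve_device(cuda=False, mps=False, xpu=False))  # cpu
```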

Accelerator Configuration

Basic Setup

Source: ~/workspace/source/docs/usage/gpu.md:16
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions

# Automatic device selection (recommended)
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.AUTO
)

# Explicit CUDA device
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA
)

# Specific GPU selection
accelerator_options = AcceleratorOptions(
    device="cuda:1"  # Use second GPU
)

Configuration Options

Source: ~/workspace/source/docling/datamodel/accelerator_options.py:23
class AcceleratorOptions(BaseSettings):
    num_threads: int = 4
    device: Union[str, AcceleratorDevice] = "auto"
    cuda_use_flash_attention2: bool = False

num_threads

Number of CPU threads for model inference
  • Higher values improve throughput on multi-core systems
  • May increase memory usage
  • Recommended: Number of physical CPU cores
  • Can be set via DOCLING_NUM_THREADS or OMP_NUM_THREADS environment variables
accelerator_options = AcceleratorOptions(
    num_threads=8  # Use 8 CPU threads
)

# Or via environment variable
import os
os.environ["DOCLING_NUM_THREADS"] = "8"
Source: ~/workspace/source/docling/datamodel/accelerator_options.py:32
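Since `os.cpu_count()` reports logical cores (including SMT siblings), a rough starting point for `num_threads` is to halve it to approximate physical cores. This heuristic is an assumption to adjust per workload, not a Docling recommendation beyond "number of physical CPU cores":

```python
import os

def default_num_threads() -> int:
    """Approximate physical core count: logical cores / 2, at least 1 (heuristic)."""
    logical = os.cpu_count() or 1
    return max(1, logical // 2)
```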

device

Hardware device for model inference
  • auto: Automatic detection (selects best available)
  • cpu: CPU-only processing
  • cuda: NVIDIA GPU (default device)
  • cuda:N: Specific NVIDIA GPU
  • mps: Apple Silicon GPU
  • xpu: Intel GPU
# Environment variable configuration
import os
os.environ["DOCLING_DEVICE"] = "cuda:0"

accelerator_options = AcceleratorOptions()  # Reads from env
Source: ~/workspace/source/docling/datamodel/accelerator_options.py:44

cuda_use_flash_attention2

Enable Flash Attention 2 optimization for CUDA
  • Significant speedup and memory reduction for transformer models
  • Requires NVIDIA Ampere GPUs or newer (RTX 30XX+, A100, H100, etc.)
  • Requires flash-attn package installation
  • Only applicable to VLM models using transformers
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA,
    cuda_use_flash_attention2=True
)

# Or via environment variable
os.environ["DOCLING_CUDA_USE_FLASH_ATTENTION2"] = "true"
Source: ~/workspace/source/docling/datamodel/accelerator_options.py:56
Flash Attention 2 requires compatible hardware and additional dependencies. Install with:
pip install flash-attn
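Ampere and newer GPUs report CUDA compute capability 8.0 or higher (obtainable in practice from `torch.cuda.get_device_capability()`). A minimal pre-flight check, sketched as a plain function on those version numbers:

```python
def supports_flash_attention_2(major: int, minor: int) -> bool:
    """Flash Attention 2 requires compute capability >= 8.0 (Ampere or newer)."""
    return (major, minor) >= (8, 0)

# RTX 30XX (Ampere) reports (8, 6); Turing (7, 5) does not qualify.
print(supports_flash_attention_2(8, 6))  # True
print(supports_flash_attention_2(7, 5))  # False
```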

Standard Pipeline GPU Acceleration

Configuration

Source: ~/workspace/source/docs/usage/gpu.md:13
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.accelerator_options import (
    AcceleratorDevice,
    AcceleratorOptions
)
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions

# Configure GPU acceleration
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA,  # or AUTO
)

# Configure batch sizes for GPU
pipeline_options = ThreadedPdfPipelineOptions(
    accelerator_options=accelerator_options,
    ocr_batch_size=64,      # default: 4
    layout_batch_size=64,   # default: 4
    table_batch_size=4,     # currently not using GPU batching
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

result = converter.convert("document.pdf")
Source: ~/workspace/source/docs/usage/gpu.md:26

Batch Size Tuning

Higher batch sizes enable GPU batch inference mode for better throughput:
pipeline_options = ThreadedPdfPipelineOptions(
    ocr_batch_size=64,      # OCR model batch size
    layout_batch_size=64,   # Layout detection batch size
    table_batch_size=4,     # Table structure (limited GPU support)
)
Start with batch size 32-64 and adjust based on your GPU memory. Monitor GPU utilization with nvidia-smi (CUDA) or Activity Monitor (macOS).
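One way to pick a starting batch size is to scale it with GPU memory. The thresholds below are illustrative assumptions, not measured values; tune against `nvidia-smi` output on your own hardware:

```python
def suggest_batch_size(gpu_mem_gb: float) -> int:
    """Illustrative heuristic: start at 64 on large-memory GPUs, halve for smaller cards."""
    if gpu_mem_gb >= 24:
        return 64
    if gpu_mem_gb >= 12:
        return 32
    if gpu_mem_gb >= 8:
        return 16
    return 8
```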

OCR GPU Acceleration

Source: ~/workspace/source/docs/usage/gpu.md:46 OCR GPU support depends on the engine:
from docling.datamodel.pipeline_options import (
    ThreadedPdfPipelineOptions,
    RapidOcrOptions
)

pipeline_options = ThreadedPdfPipelineOptions()

# RapidOCR with torch backend supports GPU
pipeline_options.ocr_options = RapidOcrOptions(
    backend="torch",  # GPU-accelerated backend
)
Currently, only RapidOCR with the torch backend is known to support GPU acceleration. Other OCR engines rely on third-party libraries with varying GPU support. See GitHub discussion #2451 for details.

Complete Example

Source: ~/workspace/source/docs/usage/gpu.md:44 For a complete working example, see:
python examples/gpu_standard_pipeline.py

VLM Pipeline GPU Acceleration

Inference Server Setup

Source: ~/workspace/source/docs/usage/gpu.md:62 For optimal GPU utilization with VLM pipelines, use a local inference server:

Supported Servers

| Server | Platform | Endpoint |
|--------|----------|----------|
| vLLM | Linux only | http://localhost:8000/v1/chat/completions |
| LM Studio | Linux, Windows, macOS | http://localhost:1234/v1/chat/completions |
| Ollama | Linux, Windows, macOS | http://localhost:11434/v1/chat/completions |
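All three servers expose the same OpenAI-compatible chat-completions route, so only the port differs. A small helper (hypothetical, for illustration) to build the endpoint from the defaults in the table above:

```python
# Default local ports per server, as listed above.
DEFAULT_PORTS = {"vllm": 8000, "lmstudio": 1234, "ollama": 11434}

def chat_completions_url(server: str, host: str = "localhost") -> str:
    """Build the OpenAI-compatible endpoint for a known local inference server."""
    return f"http://{host}:{DEFAULT_PORTS[server]}/v1/chat/completions"

print(chat_completions_url("ollama"))  # http://localhost:11434/v1/chat/completions
```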

Starting vLLM (Linux)

Source: ~/workspace/source/docs/usage/gpu.md:71 Optimized parameters for Granite Docling:
vllm serve ibm-granite/granite-docling-258M \
  --host 127.0.0.1 --port 8000 \
  --max-num-seqs 512 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.9

Docling VLM Configuration

Source: ~/workspace/source/docs/usage/gpu.md:84
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.settings import settings
from docling.pipeline.vlm_pipeline import VlmPipeline

# Configure VLM options
vlm_options = VlmPipelineOptions(
    enable_remote_services=True,
    vlm_options={
        "url": "http://localhost:8000/v1/chat/completions",
        "params": {
            "model": "ibm-granite/granite-docling-258M",
            "max_tokens": 4096,
        },
        "concurrency": 64,  # default: 1
        "prompt": "Convert this page to docling.",
        "timeout": 90,
    }
)

# IMPORTANT: Set page_batch_size >= concurrency
settings.perf.page_batch_size = 64  # default: 4

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,  # route PDFs through the VLM pipeline
            pipeline_options=vlm_options
        )
    }
)

result = converter.convert("document.pdf")
Source: ~/workspace/source/docs/usage/gpu.md:106
Critical: Ensure settings.perf.page_batch_size >= vlm_options.concurrency for optimal performance.
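A defensive check of this invariant before converting can be sketched as follows (a hypothetical helper; pass in the `concurrency` and `page_batch_size` values you configured):

```python
def check_vlm_throughput_config(concurrency: int, page_batch_size: int) -> None:
    """Raise if page batching would starve the VLM request pool."""
    if page_batch_size < concurrency:
        raise ValueError(
            f"page_batch_size ({page_batch_size}) must be >= concurrency ({concurrency}); "
            "otherwise fewer pages are in flight than the server can process concurrently."
        )

check_vlm_throughput_config(concurrency=64, page_batch_size=64)  # OK, no exception
```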

Complete Example

Source: ~/workspace/source/docs/usage/gpu.md:114 For a complete working example:
python examples/gpu_vlm_pipeline.py

Performance Benchmarks

Source: ~/workspace/source/docs/usage/gpu.md:127

Test Infrastructure

| System | CPU | RAM | GPU |
|--------|-----|-----|-----|
| AWS g6e.2xlarge | 8 vCPUs, AMD EPYC 7R13 | 64GB | NVIDIA L40S 48GB |
| Linux RTX 5090 | 16 vCPUs, AMD Ryzen 7 9800 | 128GB | NVIDIA RTX 5090 |
| Windows RTX 5070 | 16 vCPUs, AMD Ryzen 7 9800 | 64GB | NVIDIA RTX 5070 |

All systems: CUDA 13.0
Source: ~/workspace/source/docs/usage/gpu.md:139

Test Data

| Dataset | Documents | Pages | Tables | Format |
|---------|-----------|-------|--------|--------|
| PDF doc | 1 | 192 | 95 | PDF |
| ViDoRe V3 HR | 14 | 1,110 | 258 | Parquet (images) |

Source: ~/workspace/source/docs/usage/gpu.md:129

Results

Source: ~/workspace/source/docs/usage/gpu.md:152

Standard Pipeline (No OCR)

| System | PDF doc | ViDoRe V3 HR |
|--------|---------|--------------|
| g6e.2xlarge | 3.1 pages/second | - |
| RTX 5090 | 7.9 pages/second | - |
| RTX 5090 (CPU-only*) | 1.5 pages/second | - |
| RTX 5070 | 4.2 pages/second | - |
| RTX 5070 (CPU-only*) | 1.2 pages/second | - |

*CPU-only timing with 16 PyTorch threads

Standard Pipeline (With OCR)

| System | PDF doc | ViDoRe V3 HR |
|--------|---------|--------------|
| RTX 5090 | TBA | 1.6 pages/second |
| RTX 5070 | TBA | 1.1 pages/second |

VLM Pipeline (Granite Docling)

| System | PDF doc | ViDoRe V3 HR |
|--------|---------|--------------|
| g6e.2xlarge | 2.4 pages/second | - |
| RTX 5090 | 3.8 pages/second | 3.6-4.5 pages/second |
| RTX 5070 | 2.0 pages/second | 2.8-3.2 pages/second |

Performance Insights

GPU acceleration provides:
  • 5-6x speedup for standard pipeline (RTX 5090 vs CPU-only)
  • 3-4x speedup for RTX 5070 vs CPU-only
  • Larger speedups on newer/faster GPUs
  • VLM pipeline benefits significantly from inference servers
  • Concurrency settings critical for GPU utilization
  • RTX 5090 achieves 3.6-4.5 pages/second on complex datasets
  • OCR adds processing time but improves text extraction
  • GPU-accelerated OCR (RapidOCR torch) recommended
  • Consider OCR necessity for your use case
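The headline speedups follow directly from the no-OCR benchmark table; for the RTX 5090, for example:

```python
# Throughput from the "Standard Pipeline (No OCR)" table, in pages/second.
gpu_pps = 7.9   # RTX 5090
cpu_pps = 1.5   # same machine, CPU-only (16 PyTorch threads)

speedup = gpu_pps / cpu_pps
print(f"{speedup:.1f}x")  # 5.3x
```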

Optimization Tips

Batch Size

Increase batch sizes to maximize GPU utilization:
layout_batch_size=64
ocr_batch_size=64

Concurrency

For VLM pipelines, set high concurrency:
vlm_options={"concurrency": 64}
settings.perf.page_batch_size = 64

Memory Management

Monitor GPU memory usage:
# NVIDIA
nvidia-smi

# AMD
rocm-smi

Device Selection

Use AUTO for automatic device selection:
device=AcceleratorDevice.AUTO

Troubleshooting

Problem: GPU runs out of memory
Solutions:
  • Reduce batch sizes: layout_batch_size=32
  • Use smaller models
  • Process fewer pages at once
  • Close other GPU-using applications
pipeline_options = ThreadedPdfPipelineOptions(
    layout_batch_size=32,  # Reduce from 64
    ocr_batch_size=32,
)
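When memory errors persist, one generic pattern is to halve the batch size and retry until the conversion fits. A sketch, where `convert` stands in for your actual conversion callable and `MemoryError` stands in for whatever OOM exception your runtime raises:

```python
def convert_with_backoff(convert, batch_size: int = 64, min_batch: int = 1):
    """Retry a conversion callable with progressively smaller batch sizes on OOM."""
    while batch_size >= min_batch:
        try:
            return convert(batch_size)
        except MemoryError:  # substitute the GPU OOM error your runtime raises
            batch_size //= 2
    raise RuntimeError("Conversion failed even at the minimum batch size")
```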
Problem: Processing runs on CPU despite GPU configuration
Checks:
  1. Verify CUDA installation: python -c "import torch; print(torch.cuda.is_available())"
  2. Check device configuration: print(accelerator_options.device)
  3. Monitor GPU usage: nvidia-smi
  4. Ensure correct PyTorch version with CUDA support
# Install PyTorch with CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
Problem: VLM pipeline slower than expected
Checks:
  • Verify inference server is running and accessible
  • Ensure page_batch_size >= concurrency
  • Check server GPU utilization
  • Increase concurrency if GPU underutilized
# Verify settings
print(f"Concurrency: {vlm_options.vlm_options['concurrency']}")
print(f"Page batch size: {settings.perf.page_batch_size}")
Problem: MPS acceleration not working or causing errors
Solutions:
  • Ensure macOS 12.3 or later
  • Check PyTorch MPS support: python -c "import torch; print(torch.backends.mps.is_available())"
  • Some models may not fully support MPS (fallback to CPU)
  • Use device=AcceleratorDevice.AUTO for automatic fallback
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.MPS
)
Problem: Flash Attention 2 fails to install or import
Requirements:
  • NVIDIA Ampere GPU or newer (compute capability >= 8.0)
  • CUDA 11.8 or later
  • PyTorch 2.0 or later
Installation:
pip install flash-attn --no-build-isolation
If installation fails, disable Flash Attention:
accelerator_options = AcceleratorOptions(
    cuda_use_flash_attention2=False
)

Platform-Specific Notes

Linux

  • Best GPU support across all devices (CUDA, XPU)
  • vLLM inference server available
  • Recommended platform for production GPU acceleration
# Check CUDA availability
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"

Windows

  • CUDA support available
  • LM Studio and Ollama inference servers supported
  • vLLM not available (Linux-only)
# Check CUDA availability
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"

macOS

  • Apple Silicon (M1/M2/M3) GPU via MPS
  • LM Studio and Ollama inference servers supported
  • Limited model support compared to CUDA
# Check MPS availability
python -c "import torch; print(torch.backends.mps.is_available())"

Related Pages

  • Model Catalog: GPU-compatible models and engines
  • Pipeline Options: configure batch sizes and concurrency
  • VLM Pipeline: vision-language model pipeline details
  • Performance Tuning: optimize processing performance
