Overview
Docling supports GPU acceleration for significantly faster document processing. This guide covers:

- Device configuration (CUDA, MPS, XPU)
- Standard pipeline GPU optimization
- VLM pipeline GPU acceleration with inference servers
- Performance benchmarks and best practices
GPU acceleration strategies are actively being improved. Check this guide regularly for updates.
Supported Devices
Source: ~/workspace/source/docling/datamodel/accelerator_options.py:14

Docling supports multiple hardware accelerators:

| Device | Description | Platform |
|---|---|---|
| `auto` | Automatic detection (recommended) | All |
| `cpu` | CPU-only processing | All |
| `cuda` | NVIDIA GPUs | Linux, Windows |
| `cuda:N` | Specific NVIDIA GPU (e.g., `cuda:0`) | Linux, Windows |
| `mps` | Apple Silicon GPU | macOS |
| `xpu` | Intel GPUs | Linux |
Accelerator Configuration
Basic Setup
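A minimal setup sketch using the options described below (the values are illustrative; tune them for your hardware):

```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Pick the device and thread count for model inference.
accelerator_options = AcceleratorOptions(
    num_threads=8,                  # illustrative; match your physical core count
    device=AcceleratorDevice.AUTO,  # or CUDA / CPU / MPS / XPU
)

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = accelerator_options

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```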
Source: ~/workspace/source/docs/usage/gpu.md:16

Configuration Options
Source: ~/workspace/source/docling/datamodel/accelerator_options.py:23

num_threads
Number of CPU threads for model inference.

- Higher values improve throughput on multi-core systems
- May increase memory usage
- Recommended: number of physical CPU cores
- Can be set via the `DOCLING_NUM_THREADS` or `OMP_NUM_THREADS` environment variables
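For example, the environment variables can be set from Python before docling loads its models (values are illustrative):

```python
import os

# Set before importing/instantiating docling converters so the
# thread count is picked up when models are created.
os.environ["DOCLING_NUM_THREADS"] = "8"
os.environ["OMP_NUM_THREADS"] = "8"
```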
device
Hardware device for model inference:

- `auto`: Automatic detection (selects best available)
- `cpu`: CPU-only processing
- `cuda`: NVIDIA GPU (default device)
- `cuda:N`: Specific NVIDIA GPU
- `mps`: Apple Silicon GPU
- `xpu`: Intel GPU
cuda_use_flash_attention2
Enable Flash Attention 2 optimization for CUDA.

- Significant speedup and memory reduction for transformer models
- Requires NVIDIA Ampere GPUs or newer (RTX 30XX+, A100, H100, etc.)
- Requires the `flash-attn` package to be installed
- Only applicable to VLM models using transformers
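As a sketch, enabling the option looks like this (assuming CUDA and `flash-attn` are available):

```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions

# Requires an Ampere-or-newer NVIDIA GPU and the flash-attn package;
# only affects VLM models that run through transformers.
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA,
    cuda_use_flash_attention2=True,
)
```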
Standard Pipeline GPU Acceleration
Configuration
Source: ~/workspace/source/docs/usage/gpu.md:13

Batch Size Tuning
Higher batch sizes enable GPU batch inference mode for better throughput.

OCR GPU Acceleration
Source: ~/workspace/source/docs/usage/gpu.md:46

OCR GPU support depends on the engine. Currently, only RapidOCR with the torch backend is known to support GPU acceleration; other OCR engines rely on third-party libraries with varying GPU support. See GitHub discussion #2451 for details.
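A sketch of enabling RapidOCR alongside GPU acceleration (how the torch backend is selected depends on your rapidocr installation and docling version, so that part is left to defaults here):

```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions()  # backend comes from the rapidocr install
pipeline_options.accelerator_options = AcceleratorOptions(device=AcceleratorDevice.CUDA)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```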
Complete Example
Source: ~/workspace/source/docs/usage/gpu.md:44

For a complete working example, see:

VLM Pipeline GPU Acceleration
Inference Server Setup
Source: ~/workspace/source/docs/usage/gpu.md:62

For optimal GPU utilization with VLM pipelines, use a local inference server.

Supported Servers
- vLLM (Linux only)
- LM Studio
- Ollama
Starting vLLM (Linux)
Source: ~/workspace/source/docs/usage/gpu.md:71

Optimized parameters for Granite Docling:

Docling VLM Configuration
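A configuration sketch pointing docling at an OpenAI-compatible server (the URL and model name are assumptions for a local vLLM serving Granite Docling; adjust both to your setup):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    ApiVlmOptions,
    ResponseFormat,
    VlmPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,  # allow calls to the local inference server
    vlm_options=ApiVlmOptions(
        url="http://localhost:8000/v1/chat/completions",     # assumed vLLM endpoint
        params={"model": "ibm-granite/granite-docling-258M"},  # assumed model id
        prompt="Convert this page to docling.",
        timeout=90,
        response_format=ResponseFormat.DOCTAGS,
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
        )
    }
)
```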
Source: ~/workspace/source/docs/usage/gpu.md:84

Complete Example
Source: ~/workspace/source/docs/usage/gpu.md:114

For a complete working example:

Performance Benchmarks
Source: ~/workspace/source/docs/usage/gpu.md:127

Test Infrastructure
| System | CPU | RAM | GPU |
|---|---|---|---|
| AWS g6e.2xlarge | 8 vCPUs AMD EPYC 7R13 | 64GB | NVIDIA L40S 48GB |
| Linux RTX 5090 | 16 vCPU AMD Ryzen 7 9800 | 128GB | NVIDIA RTX 5090 |
| Windows RTX 5070 | 16 vCPU AMD Ryzen 7 9800 | 64GB | NVIDIA RTX 5070 |
Test Data
| Dataset | Documents | Pages | Tables | Format |
|---|---|---|---|---|
| PDF doc | 1 | 192 | 95 | PDF |
| ViDoRe V3 HR | 14 | 1,110 | 258 | Parquet (images) |
Results
Source: ~/workspace/source/docs/usage/gpu.md:152

Standard Pipeline (No OCR)
| System | PDF doc | ViDoRe V3 HR |
|---|---|---|
| g6e.2xlarge | 3.1 pages/second | - |
| RTX 5090 | 7.9 pages/second | - |
| RTX 5090 (CPU-only*) | 1.5 pages/second | - |
| RTX 5070 | 4.2 pages/second | - |
| RTX 5070 (CPU-only*) | 1.2 pages/second | - |
Standard Pipeline (With OCR)
| System | PDF doc | ViDoRe V3 HR |
|---|---|---|
| RTX 5090 | TBA | 1.6 pages/second |
| RTX 5070 | TBA | 1.1 pages/second |
VLM Pipeline (Granite Docling)
| System | PDF doc | ViDoRe V3 HR |
|---|---|---|
| g6e.2xlarge | 2.4 pages/second | - |
| RTX 5090 | 3.8 pages/second | 3.6-4.5 pages/second |
| RTX 5070 | 2.0 pages/second | 2.8-3.2 pages/second |
Performance Insights
GPU vs CPU Speedup
GPU acceleration provides:
- 5-6x speedup for standard pipeline (RTX 5090 vs CPU-only)
- 3-4x speedup for RTX 5070 vs CPU-only
- Larger speedups on newer/faster GPUs
VLM Performance
- VLM pipeline benefits significantly from inference servers
- Concurrency settings critical for GPU utilization
- RTX 5090 achieves 3.6-4.5 pages/second on complex datasets
OCR Impact
- OCR adds processing time but improves text extraction
- GPU-accelerated OCR (RapidOCR torch) recommended
- Consider OCR necessity for your use case
Optimization Tips
Batch Size
Increase batch sizes to maximize GPU utilization:
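A sketch, assuming your docling version exposes `layout_batch_size` on the pipeline options (the value is illustrative; larger batches need more GPU memory):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
# Larger batches keep the GPU busy; reduce this if you hit out-of-memory errors.
pipeline_options.layout_batch_size = 64  # assumed option name; illustrative value

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```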
Concurrency
For VLM pipelines, set high concurrency:
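A sketch, assuming `ApiVlmOptions` accepts a `concurrency` field and that page batching is controlled via `settings.perf.page_batch_size` (both values illustrative; keep the page batch size at least as large as the concurrency):

```python
from docling.datamodel.pipeline_options import ApiVlmOptions
from docling.datamodel.settings import settings

# Keep enough pages in flight to saturate the inference server.
settings.perf.page_batch_size = 16  # should be >= concurrency

vlm_options = ApiVlmOptions(
    url="http://localhost:8000/v1/chat/completions",       # assumed local server
    params={"model": "ibm-granite/granite-docling-258M"},  # assumed model id
    prompt="Convert this page to docling.",
    concurrency=16,  # assumed field; match to server capacity
)
```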
Memory Management
Monitor GPU memory usage, for example with `nvidia-smi` on NVIDIA systems, and reduce batch sizes if you run close to the limit.
Device Selection
Use AUTO for automatic device selection:
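A minimal sketch of the AUTO setting:

```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions

# AUTO picks CUDA, MPS, or XPU when available and falls back to CPU otherwise.
accelerator_options = AcceleratorOptions(device=AcceleratorDevice.AUTO)
```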
Troubleshooting
CUDA Out of Memory
Problem: GPU runs out of memory

Solutions:
- Reduce batch sizes (e.g., `layout_batch_size=32`)
- Use smaller models
- Process fewer pages at once
- Close other GPU-using applications
GPU Not Being Used
Problem: Processing runs on CPU despite GPU configuration

Checks:
- Verify CUDA installation: `python -c "import torch; print(torch.cuda.is_available())"`
- Check device configuration: `print(accelerator_options.device)`
- Monitor GPU usage: `nvidia-smi`
- Ensure correct PyTorch version with CUDA support
Slow VLM Performance
Problem: VLM pipeline slower than expected

Checks:
- Verify inference server is running and accessible
- Ensure `page_batch_size >= concurrency`
- Check server GPU utilization
- Increase concurrency if the GPU is underutilized
Apple Silicon (MPS) Issues
Problem: MPS acceleration not working or causing errors

Solutions:
- Ensure macOS 12.3 or later
- Check PyTorch MPS support: `python -c "import torch; print(torch.backends.mps.is_available())"`
- Some models may not fully support MPS (they fall back to CPU)
- Use `device=AcceleratorDevice.AUTO` for automatic fallback
Flash Attention 2 Installation
Problem: Flash Attention 2 fails to install or import

Requirements:
- NVIDIA Ampere GPU or newer (compute capability >= 8.0)
- CUDA 11.8 or later
- PyTorch 2.0 or later

If installation fails, disable Flash Attention:
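As a sketch, Flash Attention can be turned off in the accelerator options so models fall back to standard attention:

```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions

# Fall back to standard attention when flash-attn cannot be installed.
accelerator_options = AcceleratorOptions(
    device=AcceleratorDevice.CUDA,
    cuda_use_flash_attention2=False,  # the default, shown explicitly here
)
```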
Platform-Specific Notes
Linux
- Best GPU support across all devices (CUDA, XPU)
- vLLM inference server available
- Recommended platform for production GPU acceleration
Windows
- CUDA support available
- LM Studio and Ollama inference servers supported
- vLLM not available (Linux-only)
macOS
- Apple Silicon (M1/M2/M3) GPU via MPS
- LM Studio and Ollama inference servers supported
- Limited model support compared to CUDA
Related Resources
- Model Catalog: GPU-compatible models and engines
- Pipeline Options: Configure batch sizes and concurrency
- VLM Pipeline: Vision-language model pipeline details
- Performance Tuning: Optimize processing performance