Overview
AcceleratorOptions configures hardware acceleration for Docling’s AI models, including layout detection, OCR, table structure extraction, and vision-language models. Proper configuration can significantly improve processing speed by leveraging GPUs and optimizing CPU usage.
AcceleratorOptions
Hardware acceleration configuration for model inference. Can be configured via environment variables with the DOCLING_ prefix.
Parameters

num_threads (int)
Number of CPU threads to use for model inference. Higher values can improve throughput on multi-core systems but may increase memory usage. Can be set via the DOCLING_NUM_THREADS or OMP_NUM_THREADS environment variables. Recommended: the number of physical CPU cores.

device (AcceleratorDevice or str)
Hardware device for model inference. Options:
- auto - Automatic detection (selects the best available device)
- cpu - CPU only
- cuda - NVIDIA GPU
- cuda:N - Specific NVIDIA GPU (e.g., cuda:0, cuda:1)
- mps - Apple Silicon GPU
- xpu - Intel GPU
Can be set via the DOCLING_DEVICE environment variable.

cuda_use_flash_attention2 (bool)
Enable Flash Attention 2 optimization for CUDA devices. Provides significant speedup and memory reduction for transformer models on compatible NVIDIA GPUs (Ampere or newer). Requires the flash-attn package. Can be set via the DOCLING_CUDA_USE_FLASH_ATTENTION2 environment variable.

AcceleratorDevice
Enum defining available hardware devices for model inference.

Values
AUTO - Automatically detect and use the best available device (GPU if available, otherwise CPU).
CPU - Force CPU-only processing. Use when no GPU is available or for debugging.
CUDA - Use an NVIDIA CUDA GPU for acceleration. Requires a CUDA-compatible GPU and drivers.
MPS - Use Apple Metal Performance Shaders for GPU acceleration on Apple Silicon (M1/M2/M3).
XPU - Use Intel XPU for GPU acceleration on Intel GPUs.
Usage
Basic Configuration
Auto-Detection
Multi-GPU Setup
Apple Silicon Optimization
CPU-Only Mode
Environment Variables
AcceleratorOptions can be configured via environment variables:

- DOCLING_NUM_THREADS - number of CPU threads
- DOCLING_DEVICE - hardware device (auto, cpu, cuda, cuda:N, mps, xpu)
- DOCLING_CUDA_USE_FLASH_ATTENTION2 - enable Flash Attention 2 on CUDA devices
- OMP_NUM_THREADS - alternative way to set the thread count

Performance Recommendations
NVIDIA GPU (CUDA)

Best configuration: use the cuda device with Flash Attention 2 enabled (on Ampere or newer GPUs).

Requirements:
- CUDA 11.8+ and compatible drivers
- For Flash Attention 2: Ampere GPU or newer (RTX 30xx, A100, etc.)
- Install: pip install flash-attn (optional, for Flash Attention 2)
Apple Silicon (M1/M2/M3)

Best configuration: use the mps device.

Requirements:
- macOS 12.3+ (Monterey or later)
- PyTorch with MPS support
CPU Only

Best configuration: cpu device with num_threads matched to the physical core count.

Optimization tips:
- Set num_threads to the number of physical cores (not hyperthreads)
- For Intel CPUs: consider the OpenVINO backend for some models
- Process multiple documents in parallel at application level
Intel GPU (XPU)

Best configuration: use the xpu device.

Requirements:
- Intel GPU with oneAPI support
- Intel Extension for PyTorch
Flash Attention 2
Flash Attention 2 is an optimized attention mechanism that provides:

- Faster inference: 2-4x speedup for transformer models
- Lower memory: 50% reduction in GPU memory usage
- Same accuracy: mathematically equivalent results (exact attention, not an approximation)
Requirements
- NVIDIA GPU with compute capability 8.0+ (Ampere or newer)
- CUDA 11.8+
- Flash Attention package: pip install flash-attn
Enabling Flash Attention 2
Flash Attention 2 is only available on compatible NVIDIA GPUs. If enabled on unsupported hardware, Docling will fall back to standard attention automatically.
Thread Configuration
The num_threads parameter controls CPU parallelism for:
- Model inference (when using CPU device)
- Pre/post-processing operations
- Parallel page processing
Guidelines
Set num_threads according to your workload and the selected device.

Recommended values:
- GPU processing: 4-8 threads (CPU handles preprocessing only)
- CPU processing: Match physical core count
- Shared server: Leave some cores for other processes
Troubleshooting
CUDA out of memory
Solutions:
- Reduce batch sizes in pipeline options
- Process fewer pages concurrently
- Enable Flash Attention 2 (reduces memory usage)
- Use a GPU with more VRAM
MPS errors on macOS
Common issues:
- Ensure macOS 12.3+ (Monterey or later)
- Update PyTorch to latest version with MPS support
- Some models may not support MPS; fall back to CPU for those models
Slow CPU performance
Optimization checklist:
- Set num_threads to the physical core count
- Use the OMP_NUM_THREADS environment variable
- Disable unnecessary pipeline features
- Consider GPU acceleration
- Process documents in parallel (application-level)
Device not found
Verify device availability:
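Since Docling runs its models on PyTorch, a quick check with torch (installed as a Docling dependency) shows which devices are usable on the host:

```python
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))
print("MPS available:", torch.backends.mps.is_available())
```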
See Also
- Pipeline Options - Pipeline configuration
- GPU Acceleration - GPU optimization guide
- Installation - Installation and requirements