Overview
The PDF backend provides advanced PDF parsing capabilities with multiple extraction strategies. It serves as the foundation for Docling’s PDF processing pipeline, extracting raw content before ML-based analysis stages.
Architecture
PDF processing in Docling uses a two-tier backend architecture:
- Document Backend (PdfDocumentBackend) - Manages the PDF file and coordinates page access
- Page Backend (PdfPageBackend) - Handles individual page parsing and content extraction
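The split between the two tiers can be pictured with a small stand-in model. This is a minimal sketch, not Docling's implementation: ToyDocumentBackend and ToyPageBackend are hypothetical classes, the method names page_count and load_page are assumed for illustration, and text cells are reduced to plain strings.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ToyPageBackend:
    """Page tier (hypothetical stand-in): owns the content of one page."""
    lines: List[str]
    loaded: bool = True

    def is_valid(self) -> bool:
        return self.loaded

    def get_text_cells(self) -> Iterator[str]:
        # A real page backend would yield TextCell objects; strings keep this short.
        return iter(self.lines)

    def unload(self) -> None:
        # Drop the page content so it can be garbage collected.
        self.lines = []
        self.loaded = False


@dataclass
class ToyDocumentBackend:
    """Document tier (hypothetical stand-in): owns the file, hands out pages."""
    pages: List[List[str]] = field(default_factory=list)

    def page_count(self) -> int:
        return len(self.pages)

    def load_page(self, page_no: int) -> ToyPageBackend:
        # 0-based page index; each call returns an independent page object.
        return ToyPageBackend(lines=list(self.pages[page_no]))


def extract_all_text(doc: ToyDocumentBackend) -> List[str]:
    """Walk the document tier, delegate extraction to the page tier."""
    texts = []
    for page_no in range(doc.page_count()):
        page = doc.load_page(page_no)
        texts.append(" ".join(page.get_text_cells()))
        page.unload()
    return texts
```

The document object never parses text itself; it only knows how many pages exist and how to produce a page object, which is the division of responsibility described above.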
PdfDocumentBackend
Main interface for PDF document parsing.
Methods
Returns the total number of pages in the PDF document.
Loads a specific page for processing (0-indexed).
Parameters:
page_no (int): Page index (0-based)
Check if the PDF was successfully loaded.
Free resources and close the PDF file.
Returns {InputFormat.PDF}
PdfPageBackend
Interface for extracting content from individual PDF pages.
Methods
Extract text within a specific bounding box on the page.
Parameters:
bbox (BoundingBox): Region to extract text from
Get the segmented page representation with text cells and layout information.
Returns: Detailed page structure including character-, word-, and line-level text cells
Get all text cells on the page.
Returns: Iterator of TextCell objects
Get bounding boxes of bitmap/image regions on the page.
Parameters:
scale (float): Scaling factor for coordinates (default: 1.0)
Render the page as an image.
Parameters:
scale (float): Resolution scaling factor (default: 1.0)
cropbox (BoundingBox | None): Optional crop region
Get the page dimensions.
Returns: Size object with width and height
Check if the page was successfully loaded.
Free page resources.
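To make the relationship between text cells and region-based extraction concrete, the sketch below filters a list of cells by how much of each cell falls inside a query rectangle. It is an illustration only and makes no claim about how any Docling backend selects text: Box and Cell are hypothetical stand-ins for BoundingBox and TextCell, a top-left coordinate origin is assumed, and the min_overlap threshold is an invented parameter.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box, top-left origin (hypothetical stand-in for BoundingBox)."""
    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(self.r - self.l, 0.0) * max(self.b - self.t, 0.0)

    def intersection_area(self, other: "Box") -> float:
        # Overlap width and height; negative values mean the boxes are disjoint.
        w = min(self.r, other.r) - max(self.l, other.l)
        h = min(self.b, other.b) - max(self.t, other.t)
        return max(w, 0.0) * max(h, 0.0)


@dataclass(frozen=True)
class Cell:
    """A piece of text with its position (hypothetical stand-in for TextCell)."""
    text: str
    box: Box


def text_in_rect(cells: Iterable[Cell], region: Box, min_overlap: float = 0.5) -> str:
    """Join the text of cells with at least `min_overlap` of their area in `region`."""
    hits: List[str] = []
    for cell in cells:
        area = cell.box.area()
        if area > 0 and cell.box.intersection_area(region) / area >= min_overlap:
            hits.append(cell.text)
    return " ".join(hits)
```

With cells for a title, a body line, and a footer, querying the top of the page returns the first two and skips the footer; a region that clips less than half of a cell excludes it.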
PDF Backend Implementations
Docling provides multiple PDF parsing backends, each with different capabilities:
PYPDFIUM2 Backend
Standard PDF parser using the PyPDFium2 library.
Characteristics:
- Fast and reliable for basic text extraction
- Good compatibility with most PDFs
- Standard text cell extraction
- Lightweight and stable
Best suited for:
- Text-based PDFs with embedded fonts
- Documents with simple layouts
- Fast batch processing
DOCLING_PARSE Backend
Docling’s advanced parsing backend with enhanced capabilities.
Characteristics:
- Enhanced layout analysis
- Better structure preservation
- Improved table detection
- Advanced text cell extraction
- Complex layout handling
Best suited for:
- Complex documents with multi-column layouts
- Documents with tables and figures
- Scientific papers and technical documents
- Production environments requiring high accuracy
Configuration
Backend Options
Configure PDF backend behavior:
Selecting Backend
Backend selection is automatic but can be influenced through pipeline configuration:
Usage Examples
Basic Text Extraction
Extract Page Images
Extract Text from Specific Regions
Process Encrypted PDFs
Concurrent Page Processing
Performance Considerations
Memory Management
- Always call unload() on pages after processing
- Process pages sequentially for low-memory environments
- Use page-level parallelism for faster processing
- Monitor memory when processing large PDFs
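One way to make sure unload() always runs, even when processing raises, is to wrap page loading in a context manager. The sketch below assumes a document object with a load_page method and a page object with unload(); ToyDocument and ToyPage are hypothetical stand-ins that only record calls, and loaded_page is an invented helper name.

```python
from contextlib import contextmanager
from typing import Iterator, List


class ToyPage:
    """Hypothetical page object that records when it is unloaded."""

    def __init__(self, page_no: int, log: List[str]) -> None:
        self.page_no = page_no
        self._log = log

    def unload(self) -> None:
        self._log.append(f"unload {self.page_no}")


class ToyDocument:
    """Hypothetical document object that records page loads."""

    def __init__(self, n_pages: int) -> None:
        self.n_pages = n_pages
        self.log: List[str] = []

    def load_page(self, page_no: int) -> ToyPage:
        self.log.append(f"load {page_no}")
        return ToyPage(page_no, self.log)


@contextmanager
def loaded_page(doc: ToyDocument, page_no: int) -> Iterator[ToyPage]:
    """Load a page and guarantee unload() on exit, including on exceptions."""
    page = doc.load_page(page_no)
    try:
        yield page
    finally:
        page.unload()


def process_sequentially(doc: ToyDocument) -> List[int]:
    """Keep at most one page loaded at a time."""
    done = []
    for page_no in range(doc.n_pages):
        with loaded_page(doc, page_no) as page:
            done.append(page.page_no)
    return done
```

Sequential processing with this helper keeps peak memory near the cost of a single page, at the price of giving up page-level parallelism.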
Text Extraction Speed
- get_text_cells() is faster than multiple get_text_in_rect() calls
- Use the segmented page for batch text access
- Cache page images if used multiple times
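Caching rendered pages can be as simple as memoizing the render call on its arguments. The following is a sketch under assumptions: render_page is a hypothetical placeholder for an expensive render, RENDER_CALLS exists only to show how often it runs, and the cache size of 8 is arbitrary.

```python
from functools import lru_cache
from typing import List, Tuple

# Records every real render so the effect of the cache is observable.
RENDER_CALLS: List[Tuple[int, float]] = []


def render_page(page_no: int, scale: float) -> bytes:
    """Hypothetical placeholder for an expensive page render."""
    RENDER_CALLS.append((page_no, scale))
    return f"page={page_no} scale={scale}".encode()


@lru_cache(maxsize=8)  # bounded on purpose: rendered images are large
def cached_render(page_no: int, scale: float) -> bytes:
    """Return a cached image when the same page and scale were rendered before."""
    return render_page(page_no, scale)
```

Keying the cache on both page number and scale means a preview render and an OCR render of the same page are stored separately; a small maxsize keeps the cache from undoing the memory savings of unloading pages.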
Image Rendering
- Higher scale factors increase memory usage significantly
- Render only needed regions using cropbox
- Use scale=1.0 for preview, scale=2.0+ for OCR
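The memory cost of a scale factor grows with its square, because both width and height are multiplied. A rough estimate, assuming that scale 1.0 maps one PDF point to one pixel and that the image is held as uncompressed 8-bit RGB (both are assumptions, and estimated_image_bytes is a hypothetical helper):

```python
def estimated_image_bytes(
    width_pt: float, height_pt: float, scale: float = 1.0, channels: int = 3
) -> int:
    """Estimate the uncompressed size of a rendered page or crop region."""
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px * height_px * channels


# US Letter page (612 x 792 points)
preview = estimated_image_bytes(612, 792, scale=1.0)   # 1,454,112 bytes
for_ocr = estimated_image_bytes(612, 792, scale=2.0)   # 5,816,448 bytes
# Cropping to a 200 x 100 point region at the same OCR scale
region = estimated_image_bytes(200, 100, scale=2.0)    # 240,000 bytes
```

Doubling the scale quadruples the buffer, while rendering only a small crop region at high scale stays far below the cost of a full-page preview.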
Thread Safety
- Document Backend: Not thread-safe for page loading; use one instance per thread
- Page Backend: Thread-safe for read operations after loading
- Best Practice: Load pages from a single thread per document backend, then run read-only processing on the loaded pages concurrently
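The sketch below follows the first two points above: pages are loaded on the calling thread only, and read-only work on the loaded pages is fanned out to a thread pool. ToyDocument, ToyPage, and word_count are hypothetical stand-ins, and get_text is an assumed read-only accessor used for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class ToyPage:
    """Hypothetical loaded page exposing a read-only accessor."""

    def __init__(self, page_no: int, text: str) -> None:
        self.page_no = page_no
        self._text = text

    def get_text(self) -> str:
        return self._text

    def unload(self) -> None:
        self._text = ""


class ToyDocument:
    """Hypothetical document backend; load_page is treated as not thread-safe."""

    def __init__(self, texts: List[str]) -> None:
        self._texts = texts

    def page_count(self) -> int:
        return len(self._texts)

    def load_page(self, page_no: int) -> ToyPage:
        return ToyPage(page_no, self._texts[page_no])


def word_count(page: ToyPage) -> int:
    """Read-only work that is safe to run on several pages at once."""
    return len(page.get_text().split())


def count_words_concurrently(doc: ToyDocument, max_workers: int = 4) -> List[int]:
    # Loading happens on this thread only.
    pages = [doc.load_page(i) for i in range(doc.page_count())]
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(word_count, pages))
    finally:
        for page in pages:
            page.unload()
```

This pattern holds every page in memory at once, so for large documents it may be preferable to load and process pages in batches.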
Troubleshooting
PDF won't load
Possible causes:
- Corrupted PDF file
- Unsupported PDF features
- Encrypted without password
- Invalid file format
Missing text
Possible causes:
- Image-based/scanned PDF
- Non-embedded fonts
- Encrypted content
Solutions:
- Enable OCR in pipeline options
- Use DOCLING_PARSE backend
- Check if PDF has embedded text layer
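A quick way to check for an embedded text layer is to count the characters that text extraction returns per page; pages with none are candidates for OCR. This is a heuristic sketch: the input is assumed to be a mapping from page number to the extracted cell strings, and has_text_layer, pages_needing_ocr, and the min_chars threshold are hypothetical names.

```python
from typing import Dict, Iterable, List


def has_text_layer(cell_texts: Iterable[str], min_chars: int = 1) -> bool:
    """True if the extracted cells contain at least `min_chars` non-blank characters."""
    total = sum(len(text.strip()) for text in cell_texts)
    return total >= min_chars


def pages_needing_ocr(pages: Dict[int, List[str]], min_chars: int = 1) -> List[int]:
    """Return the page numbers whose extracted text is empty or whitespace only."""
    return sorted(
        page_no
        for page_no, texts in pages.items()
        if not has_text_layer(texts, min_chars)
    )
```

Raising min_chars can help with scanned pages that carry only a stray page number or watermark in their text layer.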
Incorrect text order
Solution:
Use layout analysis pipeline to detect reading order:
See Also
- PdfBackendOptions - Backend configuration
- PdfPipelineOptions - Pipeline settings
- Image Backend - Image processing
- OCR Options - OCR configuration