Overview

The PDF backend parses PDF files and exposes their raw content (text cells, bitmap regions, and rendered page images) through interchangeable parser implementations. It is the foundation of Docling’s PDF processing pipeline: it extracts raw content before the ML-based analysis stages run.

Architecture

PDF processing in Docling uses a two-tier backend architecture:
  1. Document Backend (PdfDocumentBackend) - Manages the PDF file and coordinates page access
  2. Page Backend (PdfPageBackend) - Handles individual page parsing and content extraction
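from docling.backend.pdf_backend import PdfDocumentBackend
from docling.datamodel.backend_options import PdfBackendOptions

# input_document, pdf_path, and bbox are assumed to be defined by the caller
# Initialize document backend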
backend = PdfDocumentBackend(
    in_doc=input_document,
    path_or_stream=pdf_path,
    options=PdfBackendOptions()
)

# Access individual pages
for page_no in range(backend.page_count()):
    page_backend = backend.load_page(page_no)
    # Extract page content
    text = page_backend.get_text_in_rect(bbox)
    cells = page_backend.get_text_cells()
    page_backend.unload()
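
The load, extract, unload cycle above skips unload() if extraction raises. A minimal sketch of a guard follows, assuming only that the document backend exposes load_page() and that pages expose unload() as described on this page. The helper name open_page and the _Fake* classes are hypothetical stand-ins for illustration and are not part of Docling.

```python
from contextlib import contextmanager


@contextmanager
def open_page(backend, page_no):
    """Yield a loaded page and always unload it afterwards (hypothetical helper)."""
    page = backend.load_page(page_no)
    try:
        yield page
    finally:
        page.unload()


# Stand-ins so the sketch runs without a PDF; these are not Docling classes.
class _FakePage:
    def __init__(self):
        self.unloaded = False

    def get_text_cells(self):
        return ["cell-a", "cell-b"]

    def unload(self):
        self.unloaded = True


class _FakeBackend:
    def __init__(self):
        self.pages = [_FakePage(), _FakePage()]

    def page_count(self):
        return len(self.pages)

    def load_page(self, page_no):
        return self.pages[page_no]


fake_backend = _FakeBackend()
with open_page(fake_backend, 0) as fake_page:
    cells = list(fake_page.get_text_cells())
```

With a real backend the same helper would wrap each iteration of the page loop, so a failing page still releases its resources.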

PdfDocumentBackend

Main interface for PDF document parsing.

Methods

page_count()
int
Returns the total number of pages in the PDF document.
num_pages = backend.page_count()
load_page(page_no)
PdfPageBackend
Loads a specific page for processing (0-indexed).
Parameters:
  • page_no (int): Page index (0-based)
Returns: Page backend instance
page = backend.load_page(0)  # First page
is_valid()
bool
Check if the PDF was successfully loaded.
if backend.is_valid():
    ...  # Process document
unload()
None
Free resources and close the PDF file.
backend.unload()
supported_formats()
set[InputFormat]
Returns {InputFormat.PDF}

PdfPageBackend

Interface for extracting content from individual PDF pages.

Methods

get_text_in_rect(bbox)
str
Extract text within a specific bounding box on the page.
Parameters:
  • bbox (BoundingBox): Region to extract text from
Returns: Extracted text string
from docling_core.types.doc import BoundingBox

bbox = BoundingBox(l=100, t=100, r=400, b=200)
text = page.get_text_in_rect(bbox)
get_segmented_page()
SegmentedPdfPage | None
Get the segmented page representation with text cells and layout information.
Returns: Detailed page structure including character-, word-, and line-level text cells, or None if no segmented representation is available
seg_page = page.get_segmented_page()
if seg_page:
    for cell in seg_page.textline_cells:
        print(cell.text)
get_text_cells()
Iterable[TextCell]
Get all text cells on the page.
Returns: Iterator of TextCell objects
for cell in page.get_text_cells():
    print(f"Text: {cell.text}, BBox: {cell.bbox}")
get_bitmap_rects(scale)
Iterable[BoundingBox]
Get bounding boxes of bitmap/image regions on the page.
Parameters:
  • scale (float): Scaling factor for coordinates (default: 1.0)
Returns: Iterator of bounding boxes
for bbox in page.get_bitmap_rects(scale=2.0):
    print(f"Image at: {bbox}")
get_page_image(scale, cropbox)
Image.Image
Render the page as an image.
Parameters:
  • scale (float): Resolution scaling factor (default: 1.0)
  • cropbox (BoundingBox | None): Optional crop region
Returns: PIL Image object
# Full page at 2x resolution
img = page.get_page_image(scale=2.0)

# Cropped region
bbox = BoundingBox(l=0, t=0, r=400, b=600)
cropped = page.get_page_image(scale=1.0, cropbox=bbox)
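
The scale parameter is often easier to choose from a target DPI. The sketch below assumes page dimensions are expressed in PDF points (1/72 inch), so that scale=1.0 corresponds to 72 DPI. That is the usual PDF convention but is not stated on this page, and the helpers scale_for_dpi and rendered_size are hypothetical. The renderer's exact rounding may differ by a pixel.

```python
import math

POINTS_PER_INCH = 72.0  # assumption: page units are PDF points


def scale_for_dpi(dpi):
    """Scale factor that would render a page at roughly `dpi` dots per inch."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt, height_pt, scale):
    """Approximate pixel dimensions of a page rendered at `scale`."""
    return math.ceil(width_pt * scale), math.ceil(height_pt * scale)


# US Letter is 612 x 792 points.
scale = scale_for_dpi(144)
pixels = rendered_size(612, 792, scale)
```

Under that assumption, 144 DPI maps to scale=2.0 and a Letter page to 1224 x 1584 pixels.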
get_size()
Size
Get the page dimensions.
Returns: Size object with width and height
size = page.get_size()
print(f"Page: {size.width} x {size.height}")
is_valid()
bool
Check if the page was successfully loaded.
unload()
None
Free page resources.

PDF Backend Implementations

Docling provides multiple PDF parsing backends, each with different capabilities:

PYPDFIUM2 Backend

Standard PDF parser built on the pypdfium2 library.
Characteristics:
  • Fast and reliable for basic text extraction
  • Good compatibility with most PDFs
  • Standard text cell extraction
  • Lightweight and stable
Best for:
  • Text-based PDFs with embedded fonts
  • Documents with simple layouts
  • Fast batch processing

DOCLING_PARSE Backend

Docling’s advanced parsing backend with enhanced capabilities.
Characteristics:
  • Enhanced layout analysis
  • Better structure preservation
  • Improved table detection
  • Advanced text cell extraction
  • Complex layout handling
Best for:
  • Complex documents with multi-column layouts
  • Documents with tables and figures
  • Scientific papers and technical documents
  • Production environments requiring high accuracy

Configuration

Backend Options

Configure PDF backend behavior:
from docling.datamodel.backend_options import PdfBackendOptions
from pydantic import SecretStr

options = PdfBackendOptions(
    password=SecretStr("document_password"),
    enable_remote_fetch=False,
    enable_local_fetch=False
)
See PdfBackendOptions for details.

Selecting Backend

Backend selection is automatic but can be influenced through pipeline configuration:
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    # Backend automatically selected based on document
)

Usage Examples

Basic Text Extraction

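from pathlib import Path

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

input_doc = InputDocument(
    path_or_stream=Path("document.pdf"),
    format=InputFormat.PDF,
    backend=PyPdfiumDocumentBackend  # backend class used for this document
)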

backend = PyPdfiumDocumentBackend(
    in_doc=input_doc,
    path_or_stream=Path("document.pdf")
)

if backend.is_valid():
    for page_no in range(backend.page_count()):
        page = backend.load_page(page_no)
        
        # Get all text cells
        for cell in page.get_text_cells():
            print(f"Page {page_no + 1}: {cell.text}")
        
        page.unload()
    
    backend.unload()

Extract Page Images

from PIL import Image

backend = PyPdfiumDocumentBackend(...)

for page_no in range(backend.page_count()):
    page = backend.load_page(page_no)
    
    # Render at 2x resolution
    img = page.get_page_image(scale=2.0)
    img.save(f"page_{page_no + 1}.png")
    
    page.unload()

backend.unload()

Extract Text from Specific Regions

from docling_core.types.doc import BoundingBox, CoordOrigin

page = backend.load_page(0)

# Define region (coordinates in PDF units)
header_bbox = BoundingBox(
    l=50, t=50, r=550, b=150,
    coord_origin=CoordOrigin.TOPLEFT
)

header_text = page.get_text_in_rect(header_bbox)
print(f"Header: {header_text}")

page.unload()
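
The coord_origin argument matters because a top-left origin measures y downward from the top of the page, while a bottom-left origin measures y upward. Converting between the two is a flip about the page height. The sketch below uses a plain dataclass, Box, as a hypothetical stand-in for BoundingBox so that it runs without Docling installed. The function name and the 792-point page height are also assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Stand-in for a bounding box: left, top, right, bottom."""
    l: float
    t: float
    r: float
    b: float


def top_left_to_bottom_left(box, page_height):
    """Flip a top-left-origin box into bottom-left-origin coordinates."""
    return Box(l=box.l, t=page_height - box.t, r=box.r, b=page_height - box.b)


# Same header region as in the example above, on a 792-point-tall page.
header = Box(l=50, t=50, r=550, b=150)
flipped = top_left_to_bottom_left(header, page_height=792)
```

After the flip the top edge has the larger y value (742 versus 642), as expected when y grows upward. A region that looks empty is often a box expressed in the wrong origin.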

Process Encrypted PDFs

from docling.datamodel.backend_options import PdfBackendOptions
from pydantic import SecretStr

options = PdfBackendOptions(
    password=SecretStr("secret123")
)

backend = PyPdfiumDocumentBackend(
    in_doc=input_doc,
    path_or_stream=pdf_path,
    options=options
)

Concurrent Page Processing

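import concurrent.futures
import threading

# load_page() on the document backend is not thread-safe, so serialize it
load_lock = threading.Lock()

def process_page(backend, page_no):
    with load_lock:
        page = backend.load_page(page_no)
    try:
        # Reads on a loaded page can run concurrently
        cells = list(page.get_text_cells())
        return page_no, cells
    finally:
        page.unload()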

backend = PyPdfiumDocumentBackend(...)

# Process pages in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(process_page, backend, i)
        for i in range(backend.page_count())
    ]
    
    for future in concurrent.futures.as_completed(futures):
        page_no, cells = future.result()
        print(f"Page {page_no + 1}: {len(cells)} cells")

backend.unload()

Performance Considerations

  • Always call unload() on pages after processing
  • Process pages sequentially for low-memory environments
  • Use page-level parallelism for faster processing
  • Monitor memory when processing large PDFs
  • get_text_cells() is faster than multiple get_text_in_rect() calls
  • Use get_segmented_page() for batch text access
  • Cache page images if used multiple times
  • Higher scale factors increase memory usage significantly
  • Render only needed regions using cropbox
  • Use scale=1.0 for preview, scale=2.0+ for OCR
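
The memory cost of scale grows with its square, because both pixel dimensions are multiplied by it. Below is a back-of-the-envelope sketch, assuming an uncompressed RGB bitmap at 3 bytes per pixel and page sizes in PDF points. The helper bitmap_bytes is hypothetical, and the real footprint depends on the renderer and image mode.

```python
def bitmap_bytes(width_pt, height_pt, scale, bytes_per_pixel=3):
    """Estimate the uncompressed size of a rendered page (RGB by default)."""
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px * height_px * bytes_per_pixel


base = bitmap_bytes(612, 792, scale=1.0)     # US Letter at 1x
doubled = bitmap_bytes(612, 792, scale=2.0)  # same page at 2x
ratio = doubled / base
```

Doubling the scale quadruples the estimate, from roughly 1.5 MB to 5.8 MB per Letter page. This is why rendering only the needed cropbox pays off at OCR resolutions.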

Thread Safety

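  • Document Backend: Not thread-safe for page loading; serialize load_page() calls or use one instance per thread
  • Page Backend: Thread-safe for read operations after loading
  • Best Practice: Create the document backend once, load pages under a lock, and read from the loaded pages concurrently
# Safe: one backend, serialized page loading, concurrent page reads
import threading

backend = PyPdfiumDocumentBackend(...)
load_lock = threading.Lock()

def process(page_no):
    with load_lock:
        page = backend.load_page(page_no)  # Not thread-safe: guard with a lock
    result = list(page.get_text_cells())  # Materialize before unloading
    page.unload()
    return result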

Troubleshooting

PDF fails to load (is_valid() returns False)
Possible causes:
  • Corrupted PDF file
  • Unsupported PDF features
  • Encrypted without password
  • Invalid file format
Solutions:
if not backend.is_valid():
    print("Failed to load PDF")
    # Try with different backend or repair PDF
Text is missing or empty after extraction
Possible causes:
  • Image-based/scanned PDF
  • Non-embedded fonts
  • Encrypted content
Solutions:
  • Enable OCR in pipeline options
  • Use DOCLING_PARSE backend
  • Check if PDF has embedded text layer
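
One way to check for an embedded text layer is to see whether a page's text cells contain any non-whitespace characters. The sketch below assumes only that each cell has a text attribute, as in the get_text_cells() examples on this page. The function has_text_layer and its min_chars threshold are hypothetical, and SimpleNamespace objects stand in for real cells.

```python
from types import SimpleNamespace


def has_text_layer(cells, min_chars=1):
    """Return True once the cells carry at least `min_chars` non-whitespace characters."""
    total = 0
    for cell in cells:
        total += len("".join(cell.text.split()))
        if total >= min_chars:
            return True
    return False


scanned_page = [SimpleNamespace(text="  \n")]           # whitespace only
digital_page = [SimpleNamespace(text="Hello"), SimpleNamespace(text="world")]

scanned_has_text = has_text_layer(scanned_page)
digital_has_text = has_text_layer(digital_page)
```

Pages for which this returns False are candidates for OCR. A higher min_chars can filter out pages whose only text is a stray page number.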
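Reading order is incorrect
Solution: Use the layout analysis pipeline to detect reading order:
from docling.datamodel.pipeline_options import LayoutOptions, PdfPipelineOptions

pipeline_options = PdfPipelineOptions(
    layout_options=LayoutOptions()
)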
