PDF Backend

Overview

The PDF backend provides advanced PDF parsing capabilities with multiple extraction strategies. It serves as the foundation for Docling’s PDF processing pipeline, extracting raw content before ML-based analysis stages.

Architecture

PDF processing in Docling uses a two-tier backend architecture:

Document Backend (PdfDocumentBackend) - Manages the PDF file and coordinates page access
Page Backend (PdfPageBackend) - Handles individual page parsing and content extraction

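from docling.backend.pdf_backend import PdfDocumentBackend
from docling.datamodel.backend_options import PdfBackendOptions
from docling_core.types.doc import BoundingBox

# Initialize document backend.
# input_document and pdf_path are assumed to be defined by the caller.
# PdfDocumentBackend is the common interface; real code instantiates a
# concrete implementation such as PyPdfiumDocumentBackend.
backend = PdfDocumentBackend(
    in_doc=input_document,
    path_or_stream=pdf_path,
    options=PdfBackendOptions()
)

# Region to read text from (example coordinates)
bbox = BoundingBox(l=100, t=100, r=400, b=200)

# Access individual pages
for page_no in range(backend.page_count()):
    page_backend = backend.load_page(page_no)
    # Extract page content
    text = page_backend.get_text_in_rect(bbox)
    cells = page_backend.get_text_cells()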
    page_backend.unload()
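The load/extract/unload cycle above can be wrapped so that unload() runs even when extraction raises. The following is a minimal sketch, not part of Docling: loaded_page is a hypothetical helper, and FakeBackend and FakePage are stand-ins that only mimic the load_page(), page_count(), get_text_cells(), and unload() methods described on this page so the example runs without a PDF.

```python
from contextlib import contextmanager


@contextmanager
def loaded_page(backend, page_no):
    """Yield a loaded page backend and always unload it afterwards."""
    page = backend.load_page(page_no)
    try:
        yield page
    finally:
        page.unload()  # Runs even if the body of the with-block raises


# Stand-ins (assumptions for illustration only); a real document backend
# and its page backends would take their place.
class FakePage:
    def __init__(self):
        self.unloaded = False

    def get_text_cells(self):
        return ["cell-a", "cell-b"]

    def unload(self):
        self.unloaded = True


class FakeBackend:
    def __init__(self):
        self.pages = [FakePage(), FakePage()]

    def page_count(self):
        return len(self.pages)

    def load_page(self, page_no):
        return self.pages[page_no]


demo_backend = FakeBackend()
cell_counts = []
for page_no in range(demo_backend.page_count()):
    with loaded_page(demo_backend, page_no) as page:
        cell_counts.append(len(list(page.get_text_cells())))
```

With a real backend, the same with-block replaces the explicit unload() call at the end of each loop iteration.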

PdfDocumentBackend

Main interface for PDF document parsing.

Methods

page_count()

int

Returns the total number of pages in the PDF document.

num_pages = backend.page_count()

load_page(page_no)

PdfPageBackend

Loads a specific page for processing (0-indexed).

Parameters:

page_no (int): Page index (0-based)

Returns: Page backend instance

page = backend.load_page(0)  # First page

is_valid()

bool

Check if the PDF was successfully loaded.

if backend.is_valid():
    print(f"Loaded {backend.page_count()} pages")  # Process document here

unload()

None

Free resources and close the PDF file.

backend.unload()

supported_formats()

set[InputFormat]

Returns {InputFormat.PDF}

PdfPageBackend

Interface for extracting content from individual PDF pages.

Methods

get_text_in_rect(bbox)

str

Extract text within a specific bounding box on the page.

Parameters:

bbox (BoundingBox): Region to extract text from

Returns: Extracted text string

from docling_core.types.doc import BoundingBox

bbox = BoundingBox(l=100, t=100, r=400, b=200)
text = page.get_text_in_rect(bbox)

get_segmented_page()

SegmentedPdfPage | None

Get the segmented page representation with text cells and layout information.

Returns: Detailed page structure including character-, word-, and line-level text cells

seg_page = page.get_segmented_page()
if seg_page:
    for cell in seg_page.textline_cells:
        print(cell.text)

get_text_cells()

Iterable[TextCell]

Get all text cells on the page.

Returns: Iterable of TextCell objects

for cell in page.get_text_cells():
    print(f"Text: {cell.text}, BBox: {cell.bbox}")

get_bitmap_rects(scale)

Iterable[BoundingBox]

Get bounding boxes of bitmap/image regions on the page.

Parameters:

scale (float): Scaling factor for coordinates (default: 1.0)

Returns: Iterable of bounding boxes

for bbox in page.get_bitmap_rects(scale=2.0):
    print(f"Image at: {bbox}")

get_page_image(scale, cropbox)

Image.Image

Render the page as an image.

Parameters:

scale (float): Resolution scaling factor (default: 1.0)
cropbox (BoundingBox | None): Optional crop region

Returns: PIL Image object

# Full page at 2x resolution
img = page.get_page_image(scale=2.0)

# Cropped region
bbox = BoundingBox(l=0, t=0, r=400, b=600)
cropped = page.get_page_image(scale=1.0, cropbox=bbox)
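When a target resolution is known in DPI rather than as a scale factor, the conversion is simple arithmetic. The sketch below assumes that scale=1.0 renders one pixel per PDF point, that is 72 DPI; this is an assumption about the renderer, so verify it against the installed version. The helper names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def scale_for_dpi(dpi):
    """Scale factor for a target DPI, assuming scale=1.0 means 72 DPI."""
    return dpi / POINTS_PER_INCH


def dpi_for_scale(scale):
    """Inverse of scale_for_dpi under the same assumption."""
    return scale * POINTS_PER_INCH


scale_144 = scale_for_dpi(144)  # Twice the base resolution
scale_300 = scale_for_dpi(300)  # A resolution commonly used for OCR
base_dpi = dpi_for_scale(1.0)
```

Under this assumption, get_page_image(scale=scale_for_dpi(300)) would request a 300 DPI rendering.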

get_size()

Size

Get the page dimensions.

Returns: Size object with width and height

size = page.get_size()
print(f"Page: {size.width} x {size.height}")

is_valid()

bool

Check if the page was successfully loaded.

unload()

None

Free page resources.

PDF Backend Implementations

Docling provides multiple PDF parsing backends, each with different capabilities:

PYPDFIUM2 Backend

Standard PDF parser using the pypdfium2 library.

Characteristics:

Fast and reliable for basic text extraction
Good compatibility with most PDFs
Standard text cell extraction
Lightweight and stable

Best for:

Text-based PDFs with embedded fonts
Documents with simple layouts
Fast batch processing

DOCLING_PARSE Backend

Docling’s advanced parsing backend with enhanced capabilities.

Characteristics:

Enhanced layout analysis
Better structure preservation
Improved table detection
Advanced text cell extraction
Complex layout handling

Best for:

Complex documents with multi-column layouts
Documents with tables and figures
Scientific papers and technical documents
Production environments requiring high accuracy

Configuration

Backend Options

Configure PDF backend behavior:

from docling.datamodel.backend_options import PdfBackendOptions
from pydantic import SecretStr

options = PdfBackendOptions(
    password=SecretStr("document_password"),
    enable_remote_fetch=False,
    enable_local_fetch=False
)

See PdfBackendOptions for details.

Selecting Backend

Backend selection is automatic but can be influenced through pipeline configuration:

from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    # Backend automatically selected based on document
)
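The options above do not name a backend. To choose one explicitly, a backend class can be passed through the per-format options of DocumentConverter. This is a sketch assuming a Docling version in which PdfFormatOption accepts a backend argument; build_converter is a hypothetical helper name, and the imports sit inside the function so that defining it does not load Docling.

```python
def build_converter():
    """Return a DocumentConverter whose PDF backend is chosen explicitly."""
    # Imported lazily: Docling is only needed when the helper is called.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=PyPdfiumDocumentBackend,  # Explicit backend choice
            )
        }
    )
```

Calling build_converter().convert("document.pdf") would then run the pipeline with the selected backend.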

Usage Examples

Basic Text Extraction

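from pathlib import Path

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import InputDocument

input_doc = InputDocument(
    path_or_stream=Path("document.pdf"),
    format=InputFormat.PDF,
    backend=PyPdfiumDocumentBackend,  # InputDocument needs the backend class
)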

backend = PyPdfiumDocumentBackend(
    in_doc=input_doc,
    path_or_stream=Path("document.pdf")
)

if backend.is_valid():
    for page_no in range(backend.page_count()):
        page = backend.load_page(page_no)
        
        # Get all text cells
        for cell in page.get_text_cells():
            print(f"Page {page_no + 1}: {cell.text}")
        
        page.unload()
    
    backend.unload()

Extract Page Images

from PIL import Image

backend = PyPdfiumDocumentBackend(...)

for page_no in range(backend.page_count()):
    page = backend.load_page(page_no)
    
    # Render at 2x resolution
    img = page.get_page_image(scale=2.0)
    img.save(f"page_{page_no + 1}.png")
    
    page.unload()

backend.unload()

Extract Text from Specific Regions

from docling_core.types.doc import BoundingBox, CoordOrigin

page = backend.load_page(0)

# Define region (coordinates in PDF units)
header_bbox = BoundingBox(
    l=50, t=50, r=550, b=150,
    coord_origin=CoordOrigin.TOPLEFT
)

header_text = page.get_text_in_rect(header_bbox)
print(f"Header: {header_text}")

page.unload()
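The region above uses a top-left origin, while PDF coordinates are natively measured from the bottom-left corner. The sketch below shows the arithmetic for converting between the two. Box is a hypothetical stand-in and not Docling's BoundingBox; if the installed docling_core version offers its own conversion helper on BoundingBox, prefer that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a bounding box (not Docling's BoundingBox)."""
    l: float
    t: float
    r: float
    b: float


def top_left_to_bottom_left(box, page_height):
    """Convert a top-left-origin box to bottom-left origin."""
    # Horizontal coordinates are unchanged; vertical ones are mirrored:
    # y_bottom_left = page_height - y_top_left
    return Box(l=box.l, t=page_height - box.t, r=box.r, b=page_height - box.b)


header = Box(l=50, t=50, r=550, b=150)
converted = top_left_to_bottom_left(header, page_height=792)
```

For a 792-point-tall page, the header region 50 to 150 points from the top lies between y=742 and y=642 when measured from the bottom.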

Process Encrypted PDFs

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.backend_options import PdfBackendOptions
from pydantic import SecretStr

options = PdfBackendOptions(
    password=SecretStr("secret123")
)

backend = PyPdfiumDocumentBackend(
    in_doc=input_doc,
    path_or_stream=pdf_path,
    options=options
)
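Hardcoding a password as in the example above is fine for a demo but risky in shared code. One common alternative is to read it from the environment. This is a minimal sketch; password_from_env and the variable names are hypothetical and not part of Docling.

```python
import os


def password_from_env(var_name="PDF_PASSWORD"):
    """Return the password stored in the environment, or None if unset or empty."""
    value = os.environ.get(var_name, "")
    return value or None


# Demo only: set one variable and make sure the other is absent.
os.environ["PDF_PASSWORD_DEMO"] = "secret123"
os.environ.pop("PDF_PASSWORD_NOT_SET_DEMO", None)

demo_password = password_from_env("PDF_PASSWORD_DEMO")
missing_password = password_from_env("PDF_PASSWORD_NOT_SET_DEMO")
```

A non-None result can then be wrapped in SecretStr and passed as the password option, as in the example above.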

Concurrent Page Processing

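import concurrent.futures
import threading

# Page loading is not thread-safe (see Thread Safety), so serialize it
load_lock = threading.Lock()

def process_page(backend, page_no):
    with load_lock:
        page = backend.load_page(page_no)
    try:
        cells = list(page.get_text_cells())
        return page_no, cells
    finally:
        page.unload()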

backend = PyPdfiumDocumentBackend(...)

# Process pages in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(process_page, backend, i)
        for i in range(backend.page_count())
    ]
    
    for future in concurrent.futures.as_completed(futures):
        page_no, cells = future.result()
        print(f"Page {page_no + 1}: {len(cells)} cells")

backend.unload()
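Because as_completed() yields futures in completion order, the output above is not necessarily in page order. The following sketch collects results by page index and returns them sorted. The helper names are hypothetical, load_page() is guarded by a lock in line with the Thread Safety notes, and FakeBackend and FakePage are stand-ins so the example runs without a PDF.

```python
import concurrent.futures
import threading

load_lock = threading.Lock()  # Serializes load_page() across worker threads


def count_cells(backend, page_no):
    with load_lock:
        page = backend.load_page(page_no)
    try:
        return page_no, len(list(page.get_text_cells()))
    finally:
        page.unload()


def cell_counts_in_page_order(backend, max_workers=4):
    counts = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(count_cells, backend, i)
            for i in range(backend.page_count())
        ]
        for future in concurrent.futures.as_completed(futures):
            page_no, n_cells = future.result()  # Re-raises worker exceptions
            counts[page_no] = n_cells
    return [counts[i] for i in sorted(counts)]


# Stand-ins (assumptions for illustration only).
class FakePage:
    def __init__(self, n_cells):
        self.n_cells = n_cells

    def get_text_cells(self):
        return iter(range(self.n_cells))

    def unload(self):
        pass


class FakeBackend:
    def page_count(self):
        return 3

    def load_page(self, page_no):
        return FakePage(page_no + 1)


ordered_counts = cell_counts_in_page_order(FakeBackend())
```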

Performance Considerations

Memory Management

Always call unload() on pages after processing
Process pages sequentially for low-memory environments
Use page-level parallelism for faster processing
Monitor memory when processing large PDFs

Text Extraction Speed

get_text_cells() is faster than multiple get_text_in_rect() calls
Use segmented page for batch text access
Cache page images if used multiple times
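The caching advice above can be sketched with functools.lru_cache, keyed by page number and scale. render_page is a hypothetical stand-in for loading a page and calling get_page_image(); it returns a string here so the example runs without a PDF.

```python
from functools import lru_cache

render_calls = []  # Records every real render, to show cache hits


def render_page(page_no, scale):
    """Hypothetical stand-in for load_page() followed by get_page_image()."""
    render_calls.append((page_no, scale))
    return f"image(page={page_no}, scale={scale})"


@lru_cache(maxsize=8)  # Bound the cache: rendered pages can be large
def cached_page_image(page_no, scale):
    return render_page(page_no, scale)


first = cached_page_image(0, 2.0)
second = cached_page_image(0, 2.0)  # Served from the cache, no new render
other = cached_page_image(0, 1.0)   # Different scale, rendered separately
```

A small maxsize keeps memory bounded; cached_page_image.cache_clear() drops all cached images once the document is unloaded.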

Image Rendering

Higher scale factors increase memory usage significantly
Render only needed regions using cropbox
Use scale=1.0 for preview, scale=2.0+ for OCR
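To make the memory cost of scaling concrete: pixel count grows with the square of the scale factor. The sketch below estimates the size of an uncompressed rendering, assuming one pixel per PDF point at scale=1.0 and three bytes per pixel (RGB). Both are assumptions, and the actual renderer may round or pad differently.

```python
def estimated_image_bytes(width_pt, height_pt, scale, channels=3):
    """Rough size of an uncompressed page rendering, in bytes."""
    width_px = round(width_pt * scale)
    height_px = round(height_pt * scale)
    return width_px * height_px * channels


# US Letter page: 612 x 792 points
letter_1x = estimated_image_bytes(612, 792, 1.0)
letter_2x = estimated_image_bytes(612, 792, 2.0)
```

Doubling the scale quadruples the estimate, from about 1.5 MB to about 5.8 MB per Letter-size page.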

Thread Safety

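Document Backend: Not thread-safe for page loading; serialize load_page() calls with a lock, or use one backend instance per thread
Page Backend: Thread-safe for read operations after loading
Best Practice: Create the document backend once, serialize page loading, and run the read-only page work concurrently

import threading

# Safe: one backend, page loading guarded by a lock
backend = PyPdfiumDocumentBackend(...)
load_lock = threading.Lock()

def process(page_no):
    with load_lock:  # Only one thread loads a page at a time
        page = backend.load_page(page_no)
    result = list(page.get_text_cells())  # Materialize before unloading
    page.unload()
    return result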

Troubleshooting

PDF won't load

Possible causes:

Corrupted PDF file
Unsupported PDF features
Encrypted without password
Invalid file format

Solutions:

if not backend.is_valid():
    print("Failed to load PDF")
    # Try with different backend or repair PDF
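The "try a different backend" step can be sketched as a loop over candidate constructors. first_valid_backend is a hypothetical helper, and FakeBackend stands in for real backend classes. Calling unload() on a backend that failed validation is assumed to be safe here.

```python
def first_valid_backend(factories):
    """Return the first backend that loads, or None if all candidates fail.

    `factories` is a sequence of zero-argument callables, each building one
    candidate backend.
    """
    for make_backend in factories:
        try:
            backend = make_backend()
        except Exception:  # Construction failed, e.g. unreadable input
            continue
        if backend.is_valid():
            return backend
        backend.unload()  # Release the candidate that could not load the file
    return None


# Stand-ins (assumptions for illustration only).
class FakeBackend:
    def __init__(self, name, valid):
        self.name = name
        self.valid = valid
        self.unloaded = False

    def is_valid(self):
        return self.valid

    def unload(self):
        self.unloaded = True


def broken_factory():
    raise ValueError("cannot open file")


rejected = FakeBackend("first", valid=False)
chosen = first_valid_backend(
    [broken_factory, lambda: rejected, lambda: FakeBackend("second", valid=True)]
)
nothing = first_valid_backend([])
```

With real backends, each factory would be a lambda that constructs one backend class with the same in_doc and path_or_stream arguments.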

Missing text

Possible causes:

Image-based/scanned PDF
Non-embedded fonts
Encrypted content

Solutions:

Enable OCR in pipeline options
Use DOCLING_PARSE backend
Check if PDF has embedded text layer
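One way to check for an embedded text layer is to count the non-whitespace characters returned by get_text_cells(). The sketch below is a heuristic only: has_text_layer is a hypothetical helper, the min_chars threshold is arbitrary, and SimpleNamespace objects stand in for text cells with a text attribute.

```python
from types import SimpleNamespace


def has_text_layer(cells, min_chars=20):
    """Heuristic: True if the cells hold at least `min_chars` non-space characters."""
    total = 0
    for cell in cells:
        total += sum(1 for ch in cell.text if not ch.isspace())
        if total >= min_chars:
            return True  # Enough text found; stop early
    return False


text_page = [SimpleNamespace(text="Hello world, this is embedded text.")]
blank_page = [SimpleNamespace(text="  \n")]

text_found = has_text_layer(text_page)
blank_found = has_text_layer(blank_page)
empty_found = has_text_layer([])
```

Pages for which the check returns False are likely scanned images and are candidates for OCR.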

Incorrect text order

Solution: Use layout analysis pipeline to detect reading order:

from docling.datamodel.pipeline_options import LayoutOptions, PdfPipelineOptions

pipeline_options = PdfPipelineOptions(
    layout_options=LayoutOptions()
)
