StandardPdfPipeline

Overview

StandardPdfPipeline is a thread-safe, production-ready PDF conversion pipeline that exploits parallelism between pipeline stages and models. It provides deterministic processing with per-run isolation and explicit back-pressure control.

Key Features

Per-run isolation - Every execute call uses its own bounded queues and worker threads
Deterministic run identifiers - Pages are tagged with an internal run ID so that concurrent runs cannot be confused with one another
Explicit back-pressure & shutdown - Producers block on full queues, and shutdown propagates cleanly through the stages
Minimal shared state - Models are initialized once per pipeline instance and accessed read-only by workers
Thread-safe processing - Concurrent invocations never share mutable state
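
The isolation and back-pressure properties listed above can be illustrated with a minimal standard-library sketch. It is not the pipeline's actual implementation: RunContext, try_enqueue, and the module-level counter are hypothetical names introduced only for this illustration, and a real producer would block on a full queue rather than give up after a short timeout.

```python
import itertools
import queue

# Hypothetical run-ID source for this sketch; each run gets the next integer.
_run_ids = itertools.count(1)


class RunContext:
    """State owned by exactly one run: a run ID and a bounded queue."""

    def __init__(self, queue_max_size: int) -> None:
        self.run_id = next(_run_ids)
        # Bounded queue: put() blocks once queue_max_size items are waiting.
        self.pages = queue.Queue(maxsize=queue_max_size)


def try_enqueue(ctx: RunContext, page_no: int, timeout: float = 0.05) -> bool:
    """Return True if the page was queued, False if the queue stayed full.

    The short timeout only makes the blocking visible in a demo.
    """
    try:
        # Tag every item with the run ID so items from different runs
        # can never be mistaken for one another.
        ctx.pages.put((ctx.run_id, page_no), timeout=timeout)
        return True
    except queue.Full:
        return False
```

With two contexts of size 2, a third try_enqueue on the first context returns False until a consumer removes an item, while the second context accepts pages independently because it owns a separate queue.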

Class Signature

class StandardPdfPipeline(ConvertPipeline):
    def __init__(self, pipeline_options: ThreadedPdfPipelineOptions) -> None

Parameters

pipeline_options

ThreadedPdfPipelineOptions

required

Configuration options for the threaded PDF pipeline

queue_max_size

int

Maximum size of queues between pipeline stages

batch_polling_interval_seconds

float

Timeout, in seconds, for batch polling operations

ocr_batch_size

int

Batch size for OCR processing

layout_batch_size

int

Batch size for layout model processing

table_batch_size

int

Batch size for table structure processing

document_timeout

float | None

Maximum time in seconds for document processing. If exceeded, returns PARTIAL_SUCCESS status

do_ocr

bool

default:"true"

Enable optical character recognition

do_table_structure

bool

default:"true"

Enable table structure recognition

do_code_enrichment

bool

default:"false"

Enable code block enrichment using VLM

do_formula_enrichment

bool

default:"false"

Enable formula enrichment using VLM

generate_page_images

bool

default:"false"

Generate page images in output

generate_picture_images

bool

default:"false"

Generate cropped images for picture elements

generate_table_images

bool

default:"false"

Generate cropped images for table elements (deprecated)

generate_parsed_pages

bool

default:"false"

Keep parsed page data in output

images_scale

float

default:"1.0"

Scale factor for generated images

Methods

execute

Executes the pipeline on an input document.

def execute(
    self,
    in_doc: InputDocument,
    raises_on_error: bool
) -> ConversionResult

in_doc

InputDocument

required

Input document to process

raises_on_error

bool

required

If True, raises exceptions on errors; otherwise captures them in ConversionResult

return

ConversionResult

Conversion result containing the processed document, pages, and status
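
The effect of raises_on_error can be pictured with a small stand-in that does not use docling at all. SketchResult and run_step are hypothetical names for this illustration, and the status strings are placeholders rather than the library's ConversionStatus values.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SketchResult:
    """Hypothetical stand-in for a conversion result."""

    status: str = "success"
    errors: List[str] = field(default_factory=list)


def run_step(step: Callable[[], None], raises_on_error: bool) -> SketchResult:
    """Run one step, either re-raising or recording any exception."""
    result = SketchResult()
    try:
        step()
    except Exception as exc:
        if raises_on_error:
            # Caller asked for exceptions: let it propagate unchanged.
            raise
        # Otherwise capture the failure in the result object.
        result.status = "failure"
        result.errors.append(str(exc))
    return result
```

Passing raises_on_error=False suits batch jobs that should continue past a bad document and inspect the result afterwards; True suits callers that prefer ordinary exception handling.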

get_default_options

Returns default pipeline options.

@classmethod
def get_default_options(cls) -> ThreadedPdfPipelineOptions

return

ThreadedPdfPipelineOptions

Default configuration for StandardPdfPipeline

is_backend_supported

Checks if a backend is supported by this pipeline.

@classmethod
def is_backend_supported(cls, backend: AbstractDocumentBackend) -> bool

backend

AbstractDocumentBackend

required

Backend instance to check

return

bool

True if the backend is a PdfDocumentBackend instance, False otherwise

Pipeline Stages

The StandardPdfPipeline processes documents through the following stages:

Preprocessing - Lazy loading of PDF backends and page initialization
OCR - Optical character recognition on page images
Layout - Document layout analysis and element detection
Table Structure - Table recognition and structure parsing
Assembly - Page assembly and element organization

Each stage runs in a dedicated worker thread with bounded queues for back-pressure control.
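
The thread-per-stage layout described above can be sketched with the standard library alone. This is an illustrative model under simplifying assumptions, not docling's code: run_stages, _stage_worker, and _STOP are hypothetical names, each stage is a plain function, and error handling inside stages is deliberately left out.

```python
import queue
import threading

_STOP = object()  # sentinel that tells each stage to shut down


def _stage_worker(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))  # blocks if the next stage is falling behind


def run_stages(pages, stage_fns, queue_max_size=4):
    """Push pages through stage_fns, one thread per stage, in order."""
    # One bounded queue in front of each stage, plus one for the output.
    queues = [queue.Queue(maxsize=queue_max_size) for _ in range(len(stage_fns) + 1)]
    workers = [
        threading.Thread(target=_stage_worker, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]

    def feed():
        for page in pages:
            queues[0].put(page)
        queues[0].put(_STOP)

    # Feed from a separate thread so the caller can drain the output queue;
    # otherwise a long input could fill every queue and deadlock.
    feeder = threading.Thread(target=feed)
    for t in workers + [feeder]:
        t.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for t in workers + [feeder]:
        t.join()
    return results
```

Because every stage has a single worker and the queues are FIFO, pages leave the last stage in the order they entered the first. The bounded queues keep a fast early stage from running arbitrarily far ahead of a slow later one.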

Usage Example

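from pathlib import Path

from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.document import InputDocument
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# Configure the pipeline
options = ThreadedPdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    document_timeout=300.0,  # 5 minutes
    ocr_batch_size=8,
    layout_batch_size=4,
)

# Create the pipeline (models are initialized once per instance)
pipeline = StandardPdfPipeline(pipeline_options=options)

# Process a document. InputDocument also needs the input format and a
# backend class; the backend chosen here is an assumption, so substitute
# whichever PDF backend your installation provides.
input_doc = InputDocument(
    path_or_stream=Path("document.pdf"),
    format=InputFormat.PDF,
    backend=DoclingParseV4DocumentBackend,
)
result = pipeline.execute(input_doc, raises_on_error=False)

if result.status == ConversionStatus.SUCCESS:
    print(f"Processed {len(result.pages)} pages")
    doc = result.document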

Error Handling

Failed pages are tracked separately and added to ConversionResult.errors
Exceeding document_timeout results in ConversionStatus.PARTIAL_SUCCESS
Complete failures return ConversionStatus.FAILURE
Worker threads that are stuck in blocking calls are abandoned after 15 seconds
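
The last point, abandoning stuck workers, corresponds to a common pattern: join each thread against a shared deadline and stop waiting for any that are still alive. The sketch below shows that pattern in isolation. join_or_abandon is a hypothetical helper, and whether the pipeline applies its 15-second limit per thread or across all threads is not stated here, so the shared deadline is an assumption.

```python
import threading
import time


def join_or_abandon(threads, timeout_s):
    """Wait up to timeout_s in total; return the threads that are still alive."""
    deadline = time.monotonic() + timeout_s
    stuck = []
    for t in threads:
        # Give each thread only the time that is left before the deadline.
        remaining = max(0.0, deadline - time.monotonic())
        t.join(timeout=remaining)
        if t.is_alive():
            stuck.append(t)  # abandoned: the caller stops waiting for it
    return stuck
```

Abandoned threads keep running in the background, so workers used with this pattern are usually created with daemon=True so that they cannot keep the interpreter alive at exit.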

Performance Considerations

Adjust batch sizes based on available memory and GPU capacity
Set document_timeout to prevent indefinite processing
Configure queue_max_size to balance memory usage and throughput
Disable unused features (OCR, table structure) to improve speed
