
Overview

StandardPdfPipeline is a thread-safe, production-ready PDF conversion pipeline that exploits parallelism between pipeline stages and models. It provides deterministic processing with per-run isolation and explicit back-pressure control.

Key Features

  • Per-run isolation - Every execute call uses its own bounded queues and worker threads
  • Deterministic run identifiers - Pages are tracked with an internal run ID to avoid conflicts between runs
  • Explicit back-pressure & shutdown - Producers block when a queue is full, and shutdown propagates cleanly
  • Minimal shared state - Models are initialized once per pipeline instance and accessed read-only by workers
  • Thread-safe processing - Concurrent invocations never share mutable state
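
The back-pressure and per-run isolation described above can be illustrated with the standard library alone. The following is a minimal sketch, not the pipeline's actual implementation: run_once, its max_size parameter, and the doubling "work" are all hypothetical names chosen for illustration.

```python
import queue
import threading
import time


def run_once(items, max_size=2):
    """One isolated run: its own bounded queue and its own worker thread."""
    q = queue.Queue(maxsize=max_size)  # bounded: put() blocks when full
    sentinel = object()                # private shutdown marker for this run
    results = []
    max_depth = 0

    def consumer():
        while True:
            item = q.get()
            if item is sentinel:
                break
            time.sleep(0.001)          # simulate a slow downstream stage
            results.append(item * 2)

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        q.put(item)                    # blocks while the queue is full
        max_depth = max(max_depth, q.qsize())
    q.put(sentinel)                    # explicit shutdown signal
    worker.join()
    return results, max_depth
```

Because the queue, sentinel, and result list are created inside the call, two concurrent calls to run_once share no mutable state, and the queue depth never exceeds max_size no matter how fast the producer is.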

Class Signature

class StandardPdfPipeline(ConvertPipeline):
    def __init__(self, pipeline_options: ThreadedPdfPipelineOptions) -> None

Parameters

  • pipeline_options (ThreadedPdfPipelineOptions, required) - Configuration options for the threaded PDF pipeline

Methods

execute

Executes the pipeline on an input document.
def execute(
    self,
    in_doc: InputDocument,
    raises_on_error: bool
) -> ConversionResult
  • in_doc (InputDocument, required) - Input document to process
  • raises_on_error (bool, required) - If True, raises exceptions on errors; otherwise captures them in ConversionResult
  • Returns ConversionResult - Conversion result containing the processed document, pages, and status
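
The effect of raises_on_error can be pictured with a small stand-alone pattern. This is only a sketch under stated assumptions: SketchResult and execute_sketch are hypothetical stand-ins, not the library's ConversionResult or execute, and the string statuses are placeholders for the real status enum.

```python
from dataclasses import dataclass, field


@dataclass
class SketchResult:
    """Hypothetical stand-in for ConversionResult."""
    status: str = "success"
    errors: list = field(default_factory=list)


def execute_sketch(work, raises_on_error):
    """Run work(); either re-raise its exception or record it in the result."""
    result = SketchResult()
    try:
        work()
    except Exception as exc:
        result.status = "failure"
        result.errors.append(str(exc))
        if raises_on_error:
            raise  # caller asked for exceptions instead of a captured error
    return result
```

With raises_on_error=False the caller inspects the returned status and errors; with raises_on_error=True the caller handles the exception itself.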

get_default_options

Returns default pipeline options.
@classmethod
def get_default_options(cls) -> ThreadedPdfPipelineOptions
  • Returns ThreadedPdfPipelineOptions - Default configuration for StandardPdfPipeline

is_backend_supported

Checks if a backend is supported by this pipeline.
@classmethod
def is_backend_supported(cls, backend: AbstractDocumentBackend) -> bool
  • backend (AbstractDocumentBackend, required) - Backend instance to check
  • Returns bool - True if backend is PdfDocumentBackend, False otherwise

Pipeline Stages

The StandardPdfPipeline processes documents through the following stages:
  1. Preprocessing - Lazy loading of PDF backends and page initialization
  2. OCR - Optical character recognition on page images
  3. Layout - Document layout analysis and element detection
  4. Table Structure - Table recognition and structure parsing
  5. Assembly - Page assembly and element organization
Each stage runs in a dedicated worker thread with bounded queues for back-pressure control.
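
The stage layout above (one worker thread per stage, connected by bounded queues) can be sketched generically. The code below is an illustrative assumption, not the pipeline's source: stage_worker, run_stages, and queue_max_size as a function parameter are hypothetical, and the stage functions are supplied by the caller.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def stage_worker(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))   # blocks if the next stage is behind


def run_stages(pages, stage_fns, queue_max_size=4):
    """Push pages through stage_fns, one dedicated thread per stage."""
    queues = [queue.Queue(maxsize=queue_max_size)
              for _ in range(len(stage_fns) + 1)]
    workers = [
        threading.Thread(target=stage_worker,
                         args=(fn, queues[i], queues[i + 1]), daemon=True)
        for i, fn in enumerate(stage_fns)
    ]
    for w in workers:
        w.start()

    def feed():
        for page in pages:
            queues[0].put(page)  # blocks when the first queue is full
        queues[0].put(_STOP)

    # Feed from a separate thread so this thread can drain the output;
    # feeding and draining in one thread could deadlock on full queues.
    feeder = threading.Thread(target=feed, daemon=True)
    feeder.start()

    out = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        out.append(item)
    feeder.join()
    for w in workers:
        w.join()
    return out
```

With a single worker per stage and FIFO queues, items leave the last stage in the order they entered the first, and a slow stage throttles everything upstream of it.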

Usage Example

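from pathlib import Path

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.document import InputDocument
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# Configure pipeline
options = ThreadedPdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    document_timeout=300.0,  # 5 minutes
    ocr_batch_size=8,
    layout_batch_size=4,
)

# Create pipeline (models are initialized once per pipeline instance)
pipeline = StandardPdfPipeline(pipeline_options=options)

# Process document
# Assumption: InputDocument also needs the input format and a backend class;
# PyPdfiumDocumentBackend is used here as one possible PDF backend.
input_doc = InputDocument(
    path_or_stream=Path("document.pdf"),
    format=InputFormat.PDF,
    backend=PyPdfiumDocumentBackend,
)
result = pipeline.execute(input_doc, raises_on_error=False)

if result.status == ConversionStatus.SUCCESS:
    print(f"Processed {len(result.pages)} pages")
    doc = result.document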

Error Handling

  • Failed pages are tracked separately and added to ConversionResult.errors
  • Exceeding document_timeout results in ConversionStatus.PARTIAL_SUCCESS
  • Complete failures return ConversionStatus.FAILURE
  • Worker threads that remain stuck in blocking calls are abandoned after 15 s
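
Abandoning a stuck worker after a grace period is commonly done by joining with a timeout and then checking whether the thread is still alive. The following is a hedged sketch of that pattern: join_or_abandon is a hypothetical helper, and treating the timeout as a single shared deadline (rather than a per-thread limit) is an assumption, since the text above only states the 15 s figure.

```python
import threading
import time


def join_or_abandon(threads, timeout_s):
    """Wait up to timeout_s in total; return the threads that are still alive.

    Threads should be created with daemon=True so that abandoned ones
    do not keep the interpreter from exiting.
    """
    deadline = time.monotonic() + timeout_s
    for t in threads:
        remaining = max(0.0, deadline - time.monotonic())
        t.join(timeout=remaining)  # returns early if the thread finishes
    return [t for t in threads if t.is_alive()]
```

A caller would pass timeout_s=15.0 to mirror the grace period listed above, then log or report whatever threads come back instead of waiting on them indefinitely.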

Performance Considerations

  • Adjust batch sizes based on available memory and GPU capacity
  • Set document_timeout to prevent indefinite processing
  • Configure queue_max_size to balance memory usage and throughput
  • Disable unused features (OCR, table structure) to improve speed
