Overview
StandardPdfPipeline is a thread-safe, production-ready PDF conversion pipeline that exploits parallelism between pipeline stages and models. It provides deterministic processing with per-run isolation and explicit back-pressure control.
Key Features
- Per-run isolation - Every execute call uses its own bounded queues and worker threads
- Deterministic run identifiers - Pages are tagged with an internal run ID so that concurrent runs do not conflict
- Explicit back-pressure & shutdown - Producers block on full queues with clean propagation
- Minimal shared state - Models initialized once per pipeline instance, read-only access by workers
- Thread-safe processing - Concurrent invocations never share mutable state
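The isolation and back-pressure properties listed above can be illustrated with a minimal, library-independent sketch. It is not the pipeline's actual implementation: run_once, _STOP, and the counter-based run ID are hypothetical names chosen for illustration, and the "processing" is a placeholder. Each call builds its own bounded queue and worker thread, so two calls never share mutable state, and the producer blocks in put() whenever the queue is full.

```python
import itertools
import queue
import threading

_STOP = object()                      # sentinel that tells the worker to exit
_run_counter = itertools.count(1)     # hypothetical source of run identifiers
_run_counter_lock = threading.Lock()


def run_once(pages, queue_max_size=2):
    """Process `pages` using a queue and a worker owned by this call only."""
    with _run_counter_lock:
        run_id = next(_run_counter)

    inbox = queue.Queue(maxsize=queue_max_size)  # bounded: put() blocks when full
    results = []                                 # local to this run

    def worker():
        while True:
            item = inbox.get()
            if item is _STOP:
                return
            # Placeholder for real work; tag each result with the run ID.
            results.append((run_id, item.upper()))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    for page in pages:
        inbox.put(page)   # back-pressure: waits while the queue is full
    inbox.put(_STOP)      # explicit shutdown signal
    thread.join()
    return run_id, results
```

Because the queue, the worker, and the result list are all created inside run_once, calling it from several threads at once needs no locking beyond the run-ID counter.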
Class Signature
Parameters
Configuration options for the threaded PDF pipeline
Methods
execute
Executes the pipeline on an input document.
Input document to process
If True, raises exceptions on errors; otherwise captures them in ConversionResult
Conversion result containing the processed document, pages, and status
get_default_options
Returns default pipeline options.
Default configuration for StandardPdfPipeline
is_backend_supported
Checks if a backend is supported by this pipeline.
Backend instance to check
True if backend is PdfDocumentBackend, False otherwise
Pipeline Stages
The StandardPdfPipeline processes documents through the following stages:
- Preprocessing - Lazy loading of PDF backends and page initialization
- OCR - Optical character recognition on page images
- Layout - Document layout analysis and element detection
- Table Structure - Table recognition and structure parsing
- Assembly - Page assembly and element organization
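The way stages can overlap in time is sketched below with one worker thread per stage, connected by bounded queues. This is an illustrative model only, assuming one thread per stage and a sentinel for shutdown; the stage functions are placeholders that just record their name, and stage_worker, make_stage, and run_stages are hypothetical helpers, not part of the library.

```python
import queue
import threading

_DONE = object()  # sentinel forwarded from stage to stage on shutdown
STAGES = ["preprocess", "ocr", "layout", "table_structure", "assemble"]


def make_stage(name):
    """Return a placeholder stage that records its name on the page."""
    def stage(page):
        return {**page, "trace": page["trace"] + [name]}
    return stage


def stage_worker(fn, inbox, outbox):
    """Apply `fn` to each item from `inbox` and forward it to `outbox`."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown downstream
            return
        outbox.put(fn(item))


def run_stages(page_numbers, queue_max_size=4):
    # One bounded queue in front of each stage, plus one for final output.
    queues = [queue.Queue(maxsize=queue_max_size) for _ in range(len(STAGES) + 1)]
    workers = [
        threading.Thread(
            target=stage_worker,
            args=(make_stage(name), queues[i], queues[i + 1]),
            daemon=True,
        )
        for i, name in enumerate(STAGES)
    ]

    def feed():
        for number in page_numbers:
            queues[0].put({"page_no": number, "trace": []})
        queues[0].put(_DONE)

    # Feed from a separate thread so this thread can drain the output;
    # otherwise full queues could block the producer forever.
    feeder = threading.Thread(target=feed, daemon=True)
    for thread in workers + [feeder]:
        thread.start()

    finished = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        finished.append(item)
    for thread in workers + [feeder]:
        thread.join()
    return finished
```

While page 3 is in the layout stage, page 4 can already be in OCR, which is the kind of inter-stage parallelism the overview describes. With a single worker per stage, pages leave the pipeline in the order they entered.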
Usage Example
Error Handling
- Failed pages are tracked separately and added to ConversionResult.errors
- Exceeding the timeout results in ConversionStatus.PARTIAL_SUCCESS
- Complete failures return ConversionStatus.FAILURE
- Worker threads are abandoned after 15s if stuck in blocking calls
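Abandoning a stuck worker after a grace period can be sketched with the standard threading API. This is a generic pattern rather than the pipeline's own shutdown code: join_or_abandon is a hypothetical helper, and the 15-second figure from the list above would be passed in as timeout_s. The sketch assumes the workers are daemon threads, so an abandoned thread cannot keep the process alive.

```python
import threading
import time


def join_or_abandon(threads, timeout_s):
    """Wait up to `timeout_s` seconds in total; return names of threads still running."""
    deadline = time.monotonic() + timeout_s
    abandoned = []
    for thread in threads:
        remaining = max(0.0, deadline - time.monotonic())
        thread.join(timeout=remaining)   # returns early if the thread finishes
        if thread.is_alive():
            abandoned.append(thread.name)  # still blocked: stop waiting for it
    return abandoned
```

A shared deadline is used instead of a per-thread timeout so that the total wait stays bounded no matter how many workers are stuck.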
Performance Considerations
- Adjust batch sizes based on available memory and GPU capacity
- Set document_timeout to prevent indefinite processing
- Configure queue_max_size to balance memory usage and throughput
- Disable unused features (OCR, table structure) to improve speed
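To make these tuning knobs concrete, here is a stand-in options object. It is not the library's options class: SketchPipelineOptions is hypothetical, only document_timeout and queue_max_size are names taken from this page, and batch_size, do_ocr, do_table_structure, and all defaults are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchPipelineOptions:
    """Hypothetical stand-in for the pipeline's configuration options."""
    document_timeout: Optional[float] = None  # seconds; None means no limit (assumed)
    queue_max_size: int = 100                 # assumed default
    batch_size: int = 4                       # hypothetical field
    do_ocr: bool = True                       # hypothetical field
    do_table_structure: bool = True           # hypothetical field

    def __post_init__(self):
        if self.queue_max_size < 1:
            raise ValueError("queue_max_size must be at least 1")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if self.document_timeout is not None and self.document_timeout <= 0:
            raise ValueError("document_timeout must be positive")

    def enabled_stages(self) -> List[str]:
        """List the stages that would run with these settings."""
        stages = ["preprocess"]
        if self.do_ocr:
            stages.append("ocr")
        stages.append("layout")
        if self.do_table_structure:
            stages.append("table_structure")
        stages.append("assemble")
        return stages
```

For example, SketchPipelineOptions(document_timeout=120.0, queue_max_size=20, do_ocr=False) describes a run that is capped at two minutes, holds fewer pages in flight, and skips OCR for PDFs that already carry a text layer.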