Pipelines in Docling orchestrate the document conversion process, applying backends, ML models, and transformations to produce structured `DoclingDocument` output.
## Pipeline Architecture
All pipelines inherit from `BasePipeline` (`docling/pipeline/base_pipeline.py:45`) and implement a common interface:
```python
class BasePipeline(ABC):
    def execute(self, in_doc: InputDocument, raises_on_error: bool) -> ConversionResult:
        conv_res = ConversionResult(input=in_doc)
        # Build document structure
        conv_res = self._build_document(conv_res)
        # Assemble into DoclingDocument
        conv_res = self._assemble_document(conv_res)
        # Apply enrichment models
        conv_res = self._enrich_document(conv_res)
        return conv_res
```
The pipeline execution follows three main phases:
- **Build**: Extract content using format-specific backends and apply core processing models (OCR, layout analysis, etc.).
- **Assemble**: Structure extracted content into a unified `DoclingDocument` with proper hierarchy and reading order.
- **Enrich**: Apply optional ML models for enhancement (picture classification, chart extraction, etc.).
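These three phases follow the template-method pattern: the base class fixes the order of the phases, and subclasses override the individual steps. A minimal, self-contained sketch of the pattern (the `MiniPipeline` and `WordCountPipeline` names are illustrative, not Docling classes):

```python
from abc import ABC, abstractmethod

class MiniPipeline(ABC):
    """Illustrative three-phase pipeline skeleton."""

    def execute(self, raw: str) -> dict:
        doc = self._build(raw)      # phase 1: extract content
        doc = self._assemble(doc)   # phase 2: structure it
        doc = self._enrich(doc)     # phase 3: optional enhancement
        return doc

    @abstractmethod
    def _build(self, raw: str) -> dict: ...

    def _assemble(self, doc: dict) -> dict:
        return doc  # no-op by default

    def _enrich(self, doc: dict) -> dict:
        return doc  # no-op by default

class WordCountPipeline(MiniPipeline):
    def _build(self, raw: str) -> dict:
        return {"text": raw}

    def _enrich(self, doc: dict) -> dict:
        doc["word_count"] = len(doc["text"].split())
        return doc

result = WordCountPipeline().execute("structured document output")
print(result["word_count"])  # → 3
```

The fixed phase ordering is what lets Docling share enrichment and assembly logic across otherwise very different format pipelines.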
## Pipeline Types
Docling provides several pipeline implementations for different use cases:
### SimplePipeline

**Purpose:** Process formats with declarative backends that directly output `DoclingDocument`.

**Supported formats:** DOCX, HTML, Markdown, CSV, Excel, PowerPoint, JATS XML, USPTO XML, XBRL, AsciiDoc, LaTeX

**Implementation:** `docling/pipeline/simple_pipeline.py:16`

The `SimplePipeline` delegates conversion entirely to the backend:
```python
def _build_document(self, conv_res: ConversionResult) -> ConversionResult:
    # The backend directly produces a DoclingDocument
    conv_res.document = conv_res.input._backend.convert()
    return conv_res
```
**Usage:**

```python
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

converter = DocumentConverter(allowed_formats=[InputFormat.DOCX])
result = converter.convert("document.docx")  # Uses SimplePipeline
```
### StandardPdfPipeline

**Purpose:** High-performance multi-threaded PDF processing with full feature support.

**Supported formats:** PDF, images

**Implementation:** `docling/pipeline/standard_pdf_pipeline.py:433`

The `StandardPdfPipeline` executes multiple stages in parallel across pages:
1. **Preprocess**: Load page backends, scale images
2. **OCR**: Text recognition for scanned content
3. **Layout**: Detect document structure (headings, paragraphs, tables)
4. **Table Structure**: Parse table cells and relationships
5. **Assemble**: Combine page elements
**Key features:**

- Thread-safe multi-stage processing with bounded queues
- Per-page isolation (each execution uses independent threads)
- Configurable batch sizes and concurrency
- Timeout support with graceful degradation
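The bounded-queue mechanism behind these features can be sketched with the standard library alone. This is not Docling's internal code, just an illustration of the pattern: a `queue.Queue(maxsize=...)` placed between two stages blocks the producer when the consumer falls behind, which is exactly the backpressure described above.

```python
import queue
import threading

SENTINEL = object()

def run_two_stage_pipeline(pages, stage1, stage2, max_queue=4):
    """Run stage1 and stage2 concurrently with a bounded queue between them."""
    q = queue.Queue(maxsize=max_queue)  # bounded: producer blocks when full
    results = []

    def producer():
        for page in pages:
            q.put(stage1(page))  # blocks if stage2 is too slow (backpressure)
        q.put(SENTINEL)

    def consumer():
        while True:
            item = q.get()
            if item is SENTINEL:
                break
            results.append(stage2(item))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

out = run_two_stage_pipeline(range(5), lambda p: p * 10, lambda p: p + 1)
print(out)  # → [1, 11, 21, 31, 41]
```

Bounding the queue caps peak memory: at most `max_queue` intermediate page results exist at any moment, no matter how large the document is.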
**Architecture:**

```python
class StandardPdfPipeline(ConvertPipeline):
    def __init__(self, pipeline_options: ThreadedPdfPipelineOptions):
        # Initialize heavy models once
        self.preprocessing_model = PagePreprocessingModel(...)
        self.ocr_model = get_ocr_factory().create_instance(...)
        self.layout_model = get_layout_factory().create_instance(...)
        self.table_model = get_table_structure_factory().create_instance(...)
        self.assemble_model = PageAssembleModel(...)
```
**Usage:**

```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions

# Customize the PDF pipeline
pipeline_options = ThreadedPdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    ocr_batch_size=8,
    layout_batch_size=4,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

result = converter.convert("document.pdf")
```
The StandardPdfPipeline is designed for production use with deterministic run identifiers, explicit backpressure, and clean shutdown semantics. See the source documentation for implementation details.
### VlmPipeline

**Purpose:** Vision-language model based document conversion.

**Supported formats:** PDF, images, HTML, Markdown

**Implementation:** `docling/pipeline/vlm_pipeline.py:61`

The `VlmPipeline` uses vision-language models to understand document content:
**Response formats:**

- **DocTags**: Structured format with bounding boxes
- **Markdown**: Plain markdown output
- **HTML**: HTML representation
- **DeepSeek OCR Markdown**: Extended markdown with position annotations

**Models supported:**

- API-based models (OpenAI, Anthropic, etc.)
- HuggingFace Transformers
- MLX (Apple Silicon)
- vLLM
**Usage:**

```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)
from docling.pipeline.vlm_pipeline import VlmPipeline

# Using a preset configuration
vlm_options = VlmConvertOptions.from_preset("smoldocling")

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=VlmPipelineOptions(vlm_options=vlm_options),
        )
    }
)

result = converter.convert("document.pdf")
```
The `VlmPipeline` supports a `force_backend_text` option that extracts text from the PDF backend using predicted bounding boxes, which can improve accuracy for DocTags output.
### AsrPipeline

**Purpose:** Audio transcription and document conversion.

**Supported formats:** Audio files (MP3, WAV, etc.), WebVTT

**Implementation:** `docling/pipeline/asr_pipeline.py`

Converts audio content to text using automatic speech recognition models.
## Pipeline Options

Each pipeline accepts configuration through `PipelineOptions` subclasses:

### ConvertPipelineOptions

Base options for conversion pipelines:
```python
from docling.datamodel.pipeline_options import ConvertPipelineOptions

options = ConvertPipelineOptions(
    do_picture_classification=True,
    do_picture_description=False,
    do_chart_extraction=False,
    artifacts_path="./models",
    enable_remote_services=False,
)
```
**Common parameters:**

- `artifacts_path`: Local path for model files
- `enable_remote_services`: Allow remote API calls
- `allow_external_plugins`: Enable the plugin system
- `do_picture_classification`: Classify images as charts, diagrams, etc.
- `do_picture_description`: Generate image captions
- `do_chart_extraction`: Extract data from charts
- `accelerator_options`: GPU/hardware acceleration settings
### ThreadedPdfPipelineOptions

Options specific to `StandardPdfPipeline`:
```python
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions

options = ThreadedPdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    do_code_enrichment=False,
    do_formula_enrichment=False,
    # Batch configuration
    ocr_batch_size=8,
    layout_batch_size=4,
    table_batch_size=4,
    # Threading
    queue_max_size=32,
    batch_polling_interval_seconds=0.1,
    # Timeouts
    document_timeout=None,  # Seconds per document
    # Image generation
    generate_page_images=False,
    generate_picture_images=False,
    images_scale=1.0,
)
```
**Performance tuning:**

- `ocr_batch_size`: Pages per OCR batch (higher = better GPU utilization)
- `layout_batch_size`: Pages per layout analysis batch
- `queue_max_size`: Bounded queue size between stages (backpressure control)
- `batch_polling_interval_seconds`: How often to check queues for new items
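As a rough illustration of what the batch sizes mean in practice: a larger batch reduces the number of model invocations per document, trading latency for throughput. A back-of-the-envelope check in plain Python (not the Docling API):

```python
import math

pages = 100
# Model invocations needed per document at various batch sizes
calls = {b: math.ceil(pages / b) for b in (1, 4, 8, 16)}
for batch_size, n_calls in calls.items():
    print(f"batch_size={batch_size:>2} -> {n_calls} model calls per document")
# batch_size= 1 -> 100 model calls per document
# batch_size= 4 -> 25 model calls per document
# batch_size= 8 -> 13 model calls per document
# batch_size=16 -> 7 model calls per document
```

Fewer, larger calls keep a GPU saturated, but each batch must fit in accelerator memory, so the useful ceiling depends on hardware.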
### VlmPipelineOptions

Options for VLM-based pipelines:
```python
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)

vlm_options = VlmConvertOptions.from_preset("smoldocling")

options = VlmPipelineOptions(
    vlm_options=vlm_options,
    force_backend_text=False,
    images_scale=1.0,
    generate_page_images=True,
    generate_picture_images=False,
)
```
## Enrichment Pipeline

All `ConvertPipeline` instances support an enrichment phase that applies ML models to the assembled `DoclingDocument`:
```python
def _enrich_document(self, conv_res: ConversionResult) -> ConversionResult:
    for model in self.enrichment_pipe:
        for element_batch in chunkify(
            _prepare_elements(conv_res, model),
            model.elements_batch_size,
        ):
            for element in model(doc=conv_res.document, element_batch=element_batch):
                pass  # Models modify the document in place
    return conv_res
```
**Available enrichment models:**

- `DocumentPictureClassifier`: Classify images by type
- Picture description models: Generate captions
- `ChartExtractionModelGraniteVision`: Extract data from charts
- `CodeFormulaVlmModel`: Detect and parse code blocks and formulas
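The `chunkify` helper referenced in the enrichment loop simply groups an iterable into fixed-size batches. Docling ships its own implementation; this minimal stand-in only illustrates the behavior:

```python
from itertools import islice

def chunkify(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    it = iter(iterable)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

print(list(chunkify(range(7), 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Batching elements this way lets each enrichment model process `elements_batch_size` items per forward pass instead of one at a time.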
## Custom Pipelines

You can create custom pipelines by subclassing `BasePipeline`:
```python
from docling.backend.abstract_backend import AbstractDocumentBackend
from docling.pipeline.base_pipeline import BasePipeline
from docling.datamodel.pipeline_options import PipelineOptions
from docling.datamodel.document import ConversionResult
from docling.datamodel.base_models import ConversionStatus

class CustomPipeline(BasePipeline):
    def _build_document(self, conv_res: ConversionResult) -> ConversionResult:
        # Your custom processing logic
        return conv_res

    def _determine_status(self, conv_res: ConversionResult) -> ConversionStatus:
        return ConversionStatus.SUCCESS

    @classmethod
    def get_default_options(cls) -> PipelineOptions:
        return PipelineOptions()

    @classmethod
    def is_backend_supported(cls, backend: AbstractDocumentBackend) -> bool:
        return True  # Or implement your own check
```
## Pipeline Lifecycle

The complete execution flow for a pipeline:
```python
def execute(self, in_doc: InputDocument, raises_on_error: bool) -> ConversionResult:
    conv_res = ConversionResult(input=in_doc)
    try:
        # Build: Extract and process content
        conv_res = self._build_document(conv_res)
        # Assemble: Create DoclingDocument
        conv_res = self._assemble_document(conv_res)
        # Enrich: Apply ML models
        conv_res = self._enrich_document(conv_res)
        # Determine final status
        conv_res.status = self._determine_status(conv_res)
    except Exception:
        # Handle errors
        conv_res.status = ConversionStatus.FAILURE
        if raises_on_error:
            raise
    finally:
        # Clean up resources
        self._unload(conv_res)
    return conv_res
```
## Model Initialization

Heavy models (OCR, layout, table structure) are initialized once per pipeline instance and reused:
```python
from docling.document_converter import DocumentConverter

# DocumentConverter caches pipelines by (class, options hash),
# which ensures models are loaded only once
converter = DocumentConverter()

# All PDFs share the same pipeline instance
for pdf_path in pdf_files:
    result = converter.convert(pdf_path)
```
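The caching behavior can be pictured as a dictionary keyed by pipeline class and an options hash. This is a simplified sketch of the idea, not the actual `DocumentConverter` code:

```python
class PipelineCache:
    """Cache pipeline instances by (class, options hash)."""

    def __init__(self):
        self._cache = {}

    def get_or_create(self, pipeline_cls, options: dict):
        key = (pipeline_cls, hash(frozenset(options.items())))
        if key not in self._cache:
            # Heavy model loading would happen here, exactly once per key
            self._cache[key] = pipeline_cls(options)
        return self._cache[key]

class DummyPipeline:
    """Stand-in for a pipeline with expensive model initialization."""
    instances = 0

    def __init__(self, options):
        DummyPipeline.instances += 1

cache = PipelineCache()
a = cache.get_or_create(DummyPipeline, {"do_ocr": True})
b = cache.get_or_create(DummyPipeline, {"do_ocr": True})
print(a is b, DummyPipeline.instances)  # → True 1
```

The practical consequence: reuse one `DocumentConverter` across documents rather than constructing a new one per file, or model loading cost is paid repeatedly.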
## Batch Processing

For `StandardPdfPipeline`, optimal performance comes from:

- **Page batching**: Process multiple pages together for GPU efficiency
- **Stage parallelism**: Different pages at different stages simultaneously
- **Queue backpressure**: Prevent memory overflow with bounded queues
## Timeout Handling

Set document timeouts to prevent hanging on problematic files:
```python
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

options = ThreadedPdfPipelineOptions(document_timeout=300.0)  # 5 minutes

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=options)
    }
)

# Partial results are returned if the timeout is exceeded
result = converter.convert("large_document.pdf")
if result.status == ConversionStatus.PARTIAL_SUCCESS:
    print(f"Converted {len(result.pages)} pages before timeout")
```
- **Architecture**: Understand how pipelines fit in Docling's architecture
- **Backends**: Learn about format-specific document backends
- **DoclingDocument**: Explore the document representation pipelines produce
- **Usage Guide**: See pipeline configuration examples