Overview
VlmPipeline uses Vision-Language Models (VLMs) to convert documents by processing page images directly. It supports multiple VLM backends and response formats for flexible document understanding.
Class Signature
```python
class VlmPipeline(PaginatedPipeline):
    def __init__(self, pipeline_options: VlmPipelineOptions)
```
Parameters

pipeline_options (VlmPipelineOptions, required)
Configuration options for the VLM pipeline.

Fields of VlmPipelineOptions:

- vlm_options (VlmConvertOptions | InlineVlmOptions | ApiVlmOptions, required): VLM model configuration. Use VlmConvertOptions.from_preset() for the new runtime system.
- force_backend_text: If True and using the DOCTAGS response format, extract text from the backend using predicted bounding boxes.
- generate_page_images: Generate page images in the output.
- generate_picture_images: Generate cropped images for picture elements.
- images_scale: Scale factor for generated images.
- document_timeout: Maximum time in seconds for document processing.
- artifacts_path: Path to a directory containing model artifacts.
- accelerator_options: Hardware acceleration configuration (device, num_threads, etc.).
- enable_remote_services: Enable remote model services.
VLM Options
VlmConvertOptions (Recommended)
New runtime system with preset support:
```python
from docling.datamodel.pipeline_options import VlmConvertOptions

# Using a preset
options = VlmConvertOptions.from_preset('smoldocling')

# Custom configuration
options = VlmConvertOptions(
    model_spec=ModelSpec(...),
    force_backend_text=False,
)
```
Legacy Options (Deprecated)
InlineVlmOptions and ApiVlmOptions are deprecated. Migrate to VlmConvertOptions.
Response Formats

The pipeline supports multiple VLM output formats:

DOCTAGS

Structured format with document tags and bounding boxes:

```python
from docling.datamodel.pipeline_options_vlm_model import ResponseFormat

options = VlmConvertOptions.from_preset('smoldocling')  # Uses DOCTAGS
```
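As a rough illustration of how a DOCTAGS-style response pairs element tags with location tokens, here is a toy parser over a simplified, hypothetical tag grammar (the real DocTags vocabulary is richer; the tag names and token format below are illustrative only, not docling's actual output):

```python
import re

# Hypothetical, simplified DocTags-like response: each element carries a
# tag and four <loc_N> tokens encoding its bounding box on the page.
sample = (
    "<doctag>"
    "<title><loc_10><loc_12><loc_200><loc_30>Annual Report</title>"
    "<text><loc_10><loc_40><loc_200><loc_90>Revenue grew in 2023.</text>"
    "</doctag>"
)

def parse_doctags(s):
    """Extract tag, bounding box, and text from the simplified grammar above."""
    elements = []
    for tag, locs, body in re.findall(
        r"<(title|text)>((?:<loc_\d+>){4})(.*?)</\1>", s
    ):
        bbox = [int(n) for n in re.findall(r"<loc_(\d+)>", locs)]
        elements.append({"tag": tag, "bbox": bbox, "text": body})
    return elements
```

With force_backend_text enabled, the text carried by each element would instead come from the PDF backend inside the corresponding bounding box.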
MARKDOWN
Markdown output from VLM:
```python
options = VlmConvertOptions(
    model_spec=ModelSpec(response_format=ResponseFormat.MARKDOWN),
)
```
HTML
HTML output from VLM:
```python
options = VlmConvertOptions(
    model_spec=ModelSpec(response_format=ResponseFormat.HTML),
)
```
DEEPSEEKOCR_MARKDOWN
DeepSeek OCR markdown format with labeled bounding boxes:
```python
options = VlmConvertOptions(
    model_spec=ModelSpec(response_format=ResponseFormat.DEEPSEEKOCR_MARKDOWN),
)
```
Supported labels:

- text: Standard body text
- title: Main document or section titles
- sub_title: Secondary headings
- table: Tabular data
- table_caption: Table descriptions
- figure: Image-based elements
- figure_caption: Figure descriptions
- header / footer: Content in page margins
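How a consumer might act on these labels can be sketched with a small routing table; the grouping into coarse kinds below is illustrative, not part of docling's API:

```python
# Hypothetical routing of DeepSeek OCR labels to coarse element kinds;
# the label set mirrors the list above, the grouping is illustrative.
LABEL_KIND = {
    "text": "body",
    "title": "heading",
    "sub_title": "heading",
    "table": "table",
    "table_caption": "caption",
    "figure": "figure",
    "figure_caption": "caption",
    "header": "furniture",
    "footer": "furniture",
}

def route(blocks):
    """Group (label, content) pairs by coarse kind, dropping page furniture."""
    doc = {}
    for label, content in blocks:
        kind = LABEL_KIND.get(label, "body")
        if kind == "furniture":
            continue  # headers/footers are usually excluded from the body
        doc.setdefault(kind, []).append(content)
    return doc
```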
Methods
execute
Executes the pipeline on an input document.
```python
def execute(
    self,
    in_doc: InputDocument,
    raises_on_error: bool,
) -> ConversionResult
```

- in_doc (InputDocument): Input document to process (typically PDF)
- raises_on_error (bool): If True, raises exceptions on errors; otherwise captures them in ConversionResult

Returns: ConversionResult containing the processed document, pages, and VLM predictions
initialize_page
Initializes page resources and loads backend.
```python
def initialize_page(
    self,
    conv_res: ConversionResult,
    page: Page,
) -> Page
```

- conv_res (ConversionResult): Conversion result context
- page (Page): Page to initialize

Returns: Initialized page with backend and size information
get_default_options
Returns default pipeline options.
```python
@classmethod
def get_default_options(cls) -> VlmPipelineOptions
```

Returns: Default configuration for VlmPipeline
is_backend_supported
Checks if a backend is supported by this pipeline.
```python
@classmethod
def is_backend_supported(cls, backend: AbstractDocumentBackend) -> bool
```

- backend (AbstractDocumentBackend, required): Backend instance to check

Returns: True if the backend is a PdfDocumentBackend, False otherwise
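The check reduces to an isinstance test. A self-contained sketch of the same pattern, using stand-in classes rather than docling's real ones:

```python
class AbstractDocumentBackend:  # stand-in for docling's base backend class
    pass

class PdfDocumentBackend(AbstractDocumentBackend):  # stand-in PDF backend
    pass

class HtmlBackend(AbstractDocumentBackend):  # some other, unsupported backend
    pass

class VlmPipelineSketch:
    @classmethod
    def is_backend_supported(cls, backend: AbstractDocumentBackend) -> bool:
        # The VLM pipeline renders page images, so only paginated,
        # PDF-style backends qualify.
        return isinstance(backend, PdfDocumentBackend)
```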
Inference Frameworks
TRANSFORMERS

```python
from docling.datamodel.pipeline_options_vlm_model import (
    InlineVlmOptions,
    InferenceFramework,
)

options = InlineVlmOptions(
    inference_framework=InferenceFramework.TRANSFORMERS,
    repo_id="HuggingFaceM4/SmolDocling-256M",
)
```

MLX (Apple Silicon)

```python
options = InlineVlmOptions(
    inference_framework=InferenceFramework.MLX,
    repo_id="mlx-community/SmolDocling-256M-4bit",
)
```

VLLM

```python
options = InlineVlmOptions(
    inference_framework=InferenceFramework.VLLM,
    repo_id="HuggingFaceM4/SmolDocling-256M",
)
```
API-based Models
```python
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions

options = ApiVlmOptions(
    api_key="your-api-key",
    model="gpt-4-vision",
)
```
Usage Example
```python
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)
from docling.datamodel.base_models import ConversionStatus
from docling.datamodel.document import InputDocument

# Configure pipeline with a preset
vlm_options = VlmConvertOptions.from_preset('smoldocling')
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_options,
    generate_page_images=True,
    images_scale=2.0,
)

# Create pipeline
pipeline = VlmPipeline(pipeline_options=pipeline_options)

# Process document
input_doc = InputDocument(path_or_stream="document.pdf")
result = pipeline.execute(input_doc, raises_on_error=False)

if result.status == ConversionStatus.SUCCESS:
    doc = result.document
    print(doc.export_to_markdown())
```
Force Backend Text
When using the DOCTAGS format with force_backend_text=True, the pipeline:

1. Uses the VLM to predict bounding boxes and structure
2. Extracts the actual text from the PDF backend using those boxes
3. Combines the structure from the VLM with the text from the backend

This can improve text accuracy for documents with complex layouts.
```python
options = VlmPipelineOptions(
    vlm_options=VlmConvertOptions.from_preset('smoldocling'),
    force_backend_text=True,  # Extract text from the backend
)
```
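The three steps can be sketched with toy data: the VLM supplies structure and bounding boxes (with possibly noisy text), while the backend supplies the authoritative characters for each box. The names and shapes below are illustrative, not docling's internals:

```python
# VLM prediction: structure and bounding boxes, with possibly noisy text.
vlm_elements = [
    {"tag": "title", "bbox": (10, 10, 300, 40), "text": "Annua1 Rep0rt"},
    {"tag": "text",  "bbox": (10, 50, 300, 120), "text": "Revenue qrew..."},
]

# Backend text lookup: the exact characters the PDF stores inside each box.
backend_text = {
    (10, 10, 300, 40): "Annual Report",
    (10, 50, 300, 120): "Revenue grew 12% in 2023.",
}

def merge(vlm_elements, backend_text):
    """Keep VLM structure; replace text with backend text where a box matches."""
    merged = []
    for el in vlm_elements:
        text = backend_text.get(el["bbox"], el["text"])  # fall back to VLM text
        merged.append({**el, "text": text})
    return merged
```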
Performance Considerations

- VLM inference is typically slower than traditional OCR pipelines
- Use GPU acceleration for better performance
- Consider batch processing for multiple documents
- The MLX framework is optimized for Apple Silicon
- VLLM provides the best throughput for high-volume processing
- API-based models offload computation but require network connectivity
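The batch-processing advice can be followed with a plain thread pool around the conversion call; here convert is a stand-in for a call such as pipeline.execute, not a docling API:

```python
from concurrent.futures import ThreadPoolExecutor

def convert(path):
    # Stand-in for the real conversion call; returns (path, status).
    return path, "success"

def convert_batch(paths, max_workers=4):
    # Threads are a reasonable default when the heavy lifting happens in
    # native code or on a remote API; use processes for pure-Python work.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(convert, paths))
```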