Overview

VlmPipeline uses Vision-Language Models (VLMs) to convert documents by processing page images directly. It supports multiple VLM backends and response formats for flexible document understanding.

Class Signature

class VlmPipeline(PaginatedPipeline):
    def __init__(self, pipeline_options: VlmPipelineOptions)

Parameters

pipeline_options
VlmPipelineOptions
required
Configuration options for the VLM pipeline

VLM Options

The current runtime configuration is VlmConvertOptions, which supports named presets:
from docling.datamodel.pipeline_options import VlmConvertOptions, ModelSpec

# Using preset
options = VlmConvertOptions.from_preset('smoldocling')

# Custom configuration
options = VlmConvertOptions(
    model_spec=ModelSpec(...),
    force_backend_text=False
)

Legacy Options (Deprecated)

InlineVlmOptions and ApiVlmOptions are deprecated. Migrate to VlmConvertOptions.

Response Formats

The pipeline supports multiple VLM output formats:

DOCTAGS

Structured format with document tags and bounding boxes:
from docling.datamodel.pipeline_options_vlm_model import ResponseFormat

options = VlmConvertOptions.from_preset('smoldocling')  # Uses DOCTAGS

MARKDOWN

Markdown output from VLM:
options = VlmConvertOptions(
    model_spec=ModelSpec(response_format=ResponseFormat.MARKDOWN)
)

HTML

HTML output from VLM:
options = VlmConvertOptions(
    model_spec=ModelSpec(response_format=ResponseFormat.HTML)
)

DEEPSEEKOCR_MARKDOWN

DeepSeek OCR markdown format with labeled bounding boxes:
options = VlmConvertOptions(
    model_spec=ModelSpec(response_format=ResponseFormat.DEEPSEEKOCR_MARKDOWN)
)
Supported labels:
  • text - Standard body text
  • title - Main document or section titles
  • sub_title - Secondary headings
  • table - Tabular data
  • table_caption - Table descriptions
  • figure - Image-based elements
  • figure_caption - Figure descriptions
  • header / footer - Content in the page margins (running headers and footers)
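As a rough illustration, labeled output of this kind can be parsed into (label, bbox, text) records. The exact markup DeepSeek OCR emits is an assumption here; the "label [x0, y0, x1, y1] content" line shape below is hypothetical, so adapt the regex to your model's real output:

```python
import re

# Hypothetical line shape: "label [x0, y0, x1, y1] content".
LINE_RE = re.compile(
    r"^(?P<label>text|title|sub_title|table|table_caption|"
    r"figure|figure_caption|header|footer)"
    r"\s*\[(?P<x0>\d+),\s*(?P<y0>\d+),\s*(?P<x1>\d+),\s*(?P<y1>\d+)\]"
    r"\s*(?P<content>.*)$"
)

def parse_labeled_markdown(raw: str):
    """Return (label, bbox, content) tuples for lines matching the pattern."""
    items = []
    for line in raw.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            bbox = tuple(int(m.group(k)) for k in ("x0", "y0", "x1", "y1"))
            items.append((m.group("label"), bbox, m.group("content")))
    return items

sample = (
    "title [10, 12, 580, 40] Quarterly Report\n"
    "text [10, 60, 580, 120] Revenue grew in Q3."
)
elements = parse_labeled_markdown(sample)
```

Unrecognised lines are simply skipped, which keeps the parser tolerant of free-form text between labeled blocks.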

Methods

execute

Executes the pipeline on an input document.
def execute(
    self,
    in_doc: InputDocument,
    raises_on_error: bool
) -> ConversionResult
in_doc
InputDocument
required
Input document to process (typically PDF)
raises_on_error
bool
required
If True, raises exceptions on errors; otherwise captures them in ConversionResult
return
ConversionResult
Conversion result containing the processed document, pages, and VLM predictions
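The raises_on_error contract can be pictured with a minimal sketch. The Status and Result classes below are simplified stand-ins for illustration, not the real docling types:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Status(Enum):
    SUCCESS = auto()
    FAILURE = auto()

@dataclass
class Result:
    status: Status
    errors: list = field(default_factory=list)

def execute(convert, raises_on_error: bool) -> Result:
    """Run `convert`; either propagate errors or capture them in the result."""
    try:
        convert()
        return Result(Status.SUCCESS)
    except Exception as exc:
        if raises_on_error:
            raise
        return Result(Status.FAILURE, errors=[str(exc)])

# With raises_on_error=False a failing conversion is captured, not raised:
res = execute(lambda: 1 / 0, raises_on_error=False)
```

Passing raises_on_error=False is the usual choice for batch jobs, where one bad document should not abort the whole run.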

initialize_page

Initializes page resources and loads backend.
def initialize_page(
    self,
    conv_res: ConversionResult,
    page: Page
) -> Page
conv_res
ConversionResult
required
Conversion result context
page
Page
required
Page to initialize
return
Page
Initialized page with backend and size information

get_default_options

Returns default pipeline options.
@classmethod
def get_default_options(cls) -> VlmPipelineOptions
return
VlmPipelineOptions
Default configuration for VlmPipeline

is_backend_supported

Checks if a backend is supported by this pipeline.
@classmethod
def is_backend_supported(cls, backend: AbstractDocumentBackend) -> bool
backend
AbstractDocumentBackend
required
Backend instance to check
return
bool
True if backend is PdfDocumentBackend, False otherwise

Inference Frameworks

The examples below use the legacy InlineVlmOptions and ApiVlmOptions classes; equivalent settings are available through the current VlmConvertOptions runtime (see VLM Options above).

Transformers (HuggingFace)

from docling.datamodel.pipeline_options_vlm_model import (
    InlineVlmOptions,
    InferenceFramework
)

options = InlineVlmOptions(
    inference_framework=InferenceFramework.TRANSFORMERS,
    repo_id="HuggingFaceM4/SmolDocling-256M"
)

MLX (Apple Silicon)

options = InlineVlmOptions(
    inference_framework=InferenceFramework.MLX,
    repo_id="mlx-community/SmolDocling-256M-4bit"
)

VLLM (High-performance)

options = InlineVlmOptions(
    inference_framework=InferenceFramework.VLLM,
    repo_id="HuggingFaceM4/SmolDocling-256M"
)

API-based Models

from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions

options = ApiVlmOptions(
    api_key="your-api-key",
    model="gpt-4-vision"
)

Usage Example

from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions
)
from docling.datamodel.document import InputDocument
from docling.datamodel.base_models import ConversionStatus

# Configure pipeline with preset
vlm_options = VlmConvertOptions.from_preset('smoldocling')

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_options,
    generate_page_images=True,
    images_scale=2.0
)

# Create pipeline
pipeline = VlmPipeline(pipeline_options=pipeline_options)

# Process document
input_doc = InputDocument(path_or_stream="document.pdf")
result = pipeline.execute(input_doc, raises_on_error=False)

if result.status == ConversionStatus.SUCCESS:
    doc = result.document
    print(doc.export_to_markdown())

Force Backend Text

When the DOCTAGS response format is used with force_backend_text=True, the pipeline:
  1. Uses VLM to predict bounding boxes and structure
  2. Extracts actual text from the PDF backend using those boxes
  3. Combines structure from VLM with text from backend
This can improve text accuracy for documents with complex layouts.
options = VlmPipelineOptions(
    vlm_options=VlmConvertOptions.from_preset('smoldocling'),
    force_backend_text=True  # Extract text from backend
)
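The three steps above can be sketched as follows. Here clip_text stands in for the PDF backend's text extraction, and the element and word shapes are simplified assumptions rather than the real docling data structures:

```python
# Sketch of the force_backend_text merge: structure (labels + boxes) comes
# from the VLM, text comes from the PDF backend.

def clip_text(backend_words, bbox):
    """Stub for the backend: join words whose box centre falls inside bbox."""
    x0, y0, x1, y1 = bbox
    picked = []
    for word, (wx0, wy0, wx1, wy1) in backend_words:
        cx, cy = (wx0 + wx1) / 2, (wy0 + wy1) / 2
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            picked.append(word)
    return " ".join(picked)

def merge(vlm_elements, backend_words):
    """Keep the VLM's structure, but replace its text with backend text."""
    merged = []
    for label, bbox, vlm_text in vlm_elements:
        true_text = clip_text(backend_words, bbox)
        merged.append((label, bbox, true_text or vlm_text))  # fall back to VLM text
    return merged

backend_words = [("Hello", (10, 10, 40, 20)), ("world", (45, 10, 80, 20))]
vlm_elements = [("text", (0, 0, 100, 30), "Helo wrld")]  # VLM misread the glyphs
fixed = merge(vlm_elements, backend_words)
```

The fallback to the VLM's own text matters for scanned pages, where the backend has no text layer to clip from.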

Performance Considerations

  • VLM inference is typically slower than traditional OCR pipelines
  • Use GPU acceleration for better performance
  • Consider batch processing for multiple documents
  • MLX framework is optimized for Apple Silicon
  • VLLM provides best throughput for high-volume processing
  • API-based models offload computation but require network connectivity
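A batch-processing setup along the lines suggested above might look like the sketch below; convert_one is a hypothetical stand-in for a real pipeline.execute call:

```python
from concurrent.futures import ThreadPoolExecutor

def convert_one(path: str) -> str:
    """Stand-in for converting one document; returns a status string."""
    return f"converted:{path}"

def convert_batch(paths, max_workers: int = 4):
    """Convert documents concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert_one, paths))

results = convert_batch(["a.pdf", "b.pdf", "c.pdf"])
```

Threads suit API-based models, where the work is network-bound; for local GPU inference a single process feeding batched requests to the model (or a serving layer such as vLLM) usually yields better throughput.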
