Overview
VlmPipeline uses Vision-Language Models (VLMs) to convert documents by processing page images directly. It supports multiple VLM backends and response formats for flexible document understanding.
Class Signature
```python
class VlmPipeline(PaginatedPipeline):
    def __init__(self, pipeline_options: VlmPipelineOptions)
```
Parameters

pipeline_options (VlmPipelineOptions, required)
Configuration options for the VLM pipeline.

Fields of VlmPipelineOptions:

- vlm_options (VlmConvertOptions | InlineVlmOptions | ApiVlmOptions, required): VLM model configuration. Use VlmConvertOptions.from_preset() for the new runtime system.
- force_backend_text: If True and using the DOCTAGS response format, extract text from the backend using predicted bounding boxes.
- generate_page_images: Generate page images in the output.
- generate_picture_images: Generate cropped images for picture elements.
- images_scale: Scale factor for generated images.
- document_timeout: Maximum time in seconds for document processing.
- artifacts_path: Path to a directory containing model artifacts.
- accelerator_options: Hardware acceleration configuration (device, num_threads, etc.).
- enable_remote_services: Enable remote model services.
VLM Options
VlmConvertOptions (Recommended)
New runtime system with preset support:
```python
from docling.datamodel.pipeline_options import VlmConvertOptions

# Using a preset
options = VlmConvertOptions.from_preset('smoldocling')

# Custom configuration
options = VlmConvertOptions(
    model_spec=ModelSpec(...),
    force_backend_text=False,
)
```
Legacy Options (Deprecated)
InlineVlmOptions and ApiVlmOptions are deprecated. Migrate to VlmConvertOptions.
Response Formats

The pipeline supports multiple VLM output formats:

DOCTAGS

Structured format with document tags and bounding boxes:

```python
from docling.datamodel.pipeline_options_vlm_model import ResponseFormat

options = VlmConvertOptions.from_preset('smoldocling')  # Uses DOCTAGS
```
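As a rough illustration of how a DOCTAGS-style response pairs element tags with location tokens, here is a toy parser over a simplified, hypothetical tag grammar (the real DocTags vocabulary is richer; the tag names and token format below are illustrative only, not docling's actual output):

```python
import re

# Hypothetical, simplified DocTags-like response: each element carries a
# tag and four <loc_N> tokens encoding its bounding box on the page.
sample = (
    "<doctag>"
    "<title><loc_10><loc_12><loc_200><loc_30>Annual Report</title>"
    "<text><loc_10><loc_40><loc_200><loc_90>Revenue grew in 2023.</text>"
    "</doctag>"
)

def parse_doctags(s):
    """Extract tag, bounding box, and text from the simplified grammar above."""
    elements = []
    for tag, locs, body in re.findall(
        r"<(title|text)>((?:<loc_\d+>){4})(.*?)</\1>", s
    ):
        bbox = [int(n) for n in re.findall(r"<loc_(\d+)>", locs)]
        elements.append({"tag": tag, "bbox": bbox, "text": body})
    return elements
```

With force_backend_text enabled, the text carried by each element would instead come from the PDF backend inside the corresponding bounding box.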
MARKDOWN
Markdown output from VLM:
```python
options = VlmConvertOptions(
    model_spec=ModelSpec(response_format=ResponseFormat.MARKDOWN),
)
```
HTML
HTML output from VLM:
```python
options = VlmConvertOptions(
    model_spec=ModelSpec(response_format=ResponseFormat.HTML),
)
```
DEEPSEEKOCR_MARKDOWN
DeepSeek OCR markdown format with labeled bounding boxes:
```python
options = VlmConvertOptions(
    model_spec=ModelSpec(response_format=ResponseFormat.DEEPSEEKOCR_MARKDOWN),
)
```
Supported labels:

- text: Standard body text
- title: Main document or section titles
- sub_title: Secondary headings
- table: Tabular data
- table_caption: Table descriptions
- figure: Image-based elements
- figure_caption: Figure descriptions
- header / footer: Content in page margins
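How a consumer might act on these labels can be sketched with a small routing table; the grouping into coarse kinds below is illustrative, not part of docling's API:

```python
# Hypothetical routing of DeepSeek OCR labels to coarse element kinds;
# the label set mirrors the list above, the grouping is illustrative.
LABEL_KIND = {
    "text": "body",
    "title": "heading",
    "sub_title": "heading",
    "table": "table",
    "table_caption": "caption",
    "figure": "figure",
    "figure_caption": "caption",
    "header": "furniture",
    "footer": "furniture",
}

def route(blocks):
    """Group (label, content) pairs by coarse kind, dropping page furniture."""
    doc = {}
    for label, content in blocks:
        kind = LABEL_KIND.get(label, "body")
        if kind == "furniture":
            continue  # headers/footers are usually excluded from the body
        doc.setdefault(kind, []).append(content)
    return doc
```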
Methods
execute
Executes the pipeline on an input document.
```python
def execute(
    self,
    in_doc: InputDocument,
    raises_on_error: bool,
) -> ConversionResult
```

- in_doc (InputDocument): Input document to process (typically PDF)
- raises_on_error (bool): If True, raises exceptions on errors; otherwise captures them in ConversionResult

Returns: ConversionResult containing the processed document, pages, and VLM predictions
initialize_page
Initializes page resources and loads backend.
```python
def initialize_page(
    self,
    conv_res: ConversionResult,
    page: Page,
) -> Page
```

- conv_res (ConversionResult): Conversion result context
- page (Page): Page to initialize

Returns: Initialized page with backend and size information
get_default_options
Returns default pipeline options.
```python
@classmethod
def get_default_options(cls) -> VlmPipelineOptions
```

Returns: Default configuration for VlmPipeline
is_backend_supported
Checks if a backend is supported by this pipeline.
```python
@classmethod
def is_backend_supported(cls, backend: AbstractDocumentBackend) -> bool
```

- backend (AbstractDocumentBackend, required): Backend instance to check

Returns: True if the backend is a PdfDocumentBackend, False otherwise
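The check reduces to an isinstance test. A self-contained sketch of the same pattern, using stand-in classes rather than docling's real ones:

```python
class AbstractDocumentBackend:  # stand-in for docling's base backend class
    pass

class PdfDocumentBackend(AbstractDocumentBackend):  # stand-in PDF backend
    pass

class HtmlBackend(AbstractDocumentBackend):  # some other, unsupported backend
    pass

class VlmPipelineSketch:
    @classmethod
    def is_backend_supported(cls, backend: AbstractDocumentBackend) -> bool:
        # The VLM pipeline renders page images, so only paginated,
        # PDF-style backends qualify.
        return isinstance(backend, PdfDocumentBackend)
```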
Inference Frameworks
TRANSFORMERS

```python
from docling.datamodel.pipeline_options_vlm_model import (
    InlineVlmOptions,
    InferenceFramework,
)

options = InlineVlmOptions(
    inference_framework=InferenceFramework.TRANSFORMERS,
    repo_id="HuggingFaceM4/SmolDocling-256M",
)
```

MLX (Apple Silicon)

```python
options = InlineVlmOptions(
    inference_framework=InferenceFramework.MLX,
    repo_id="mlx-community/SmolDocling-256M-4bit",
)
```

VLLM

```python
options = InlineVlmOptions(
    inference_framework=InferenceFramework.VLLM,
    repo_id="HuggingFaceM4/SmolDocling-256M",
)
```
API-based Models
```python
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions

options = ApiVlmOptions(
    api_key="your-api-key",
    model="gpt-4-vision",
)
```
Usage Example
```python
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)
from docling.datamodel.base_models import ConversionStatus
from docling.datamodel.document import InputDocument

# Configure pipeline with a preset
vlm_options = VlmConvertOptions.from_preset('smoldocling')
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_options,
    generate_page_images=True,
    images_scale=2.0,
)

# Create pipeline
pipeline = VlmPipeline(pipeline_options=pipeline_options)

# Process document
input_doc = InputDocument(path_or_stream="document.pdf")
result = pipeline.execute(input_doc, raises_on_error=False)

if result.status == ConversionStatus.SUCCESS:
    doc = result.document
    print(doc.export_to_markdown())
```
Force Backend Text
When using the DOCTAGS format with force_backend_text=True, the pipeline:

1. Uses the VLM to predict bounding boxes and structure
2. Extracts the actual text from the PDF backend using those boxes
3. Combines the structure from the VLM with the text from the backend

This can improve text accuracy for documents with complex layouts.
```python
options = VlmPipelineOptions(
    vlm_options=VlmConvertOptions.from_preset('smoldocling'),
    force_backend_text=True,  # Extract text from the backend
)
```
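The three steps can be sketched with toy data: the VLM supplies structure and bounding boxes (with possibly noisy text), while the backend supplies the authoritative characters for each box. The names and shapes below are illustrative, not docling's internals:

```python
# VLM prediction: structure and bounding boxes, with possibly noisy text.
vlm_elements = [
    {"tag": "title", "bbox": (10, 10, 300, 40), "text": "Annua1 Rep0rt"},
    {"tag": "text",  "bbox": (10, 50, 300, 120), "text": "Revenue qrew..."},
]

# Backend text lookup: the exact characters the PDF stores inside each box.
backend_text = {
    (10, 10, 300, 40): "Annual Report",
    (10, 50, 300, 120): "Revenue grew 12% in 2023.",
}

def merge(vlm_elements, backend_text):
    """Keep VLM structure; replace text with backend text where a box matches."""
    merged = []
    for el in vlm_elements:
        text = backend_text.get(el["bbox"], el["text"])  # fall back to VLM text
        merged.append({**el, "text": text})
    return merged
```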
Performance Considerations

- VLM inference is typically slower than traditional OCR pipelines
- Use GPU acceleration for better performance
- Consider batch processing for multiple documents
- The MLX framework is optimized for Apple Silicon
- VLLM provides the best throughput for high-volume processing
- API-based models offload computation but require network connectivity
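The batch-processing advice can be followed with a plain thread pool around the conversion call; here convert is a stand-in for a call such as pipeline.execute, not a docling API:

```python
from concurrent.futures import ThreadPoolExecutor

def convert(path):
    # Stand-in for the real conversion call; returns (path, status).
    return path, "success"

def convert_batch(paths, max_workers=4):
    # Threads are a reasonable default when the heavy lifting happens in
    # native code or on a remote API; use processes for pure-Python work.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(convert, paths))
```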