
Overview

Vision-Language Models (VLMs) enable end-to-end document understanding by processing document pages as images and directly generating structured output. This approach is an alternative to the traditional pipeline (layout analysis + OCR + table extraction). Docling’s VlmPipeline supports:
  • Local models: Run models on your hardware (CPU, GPU, Apple Silicon)
  • Multiple frameworks: Transformers, MLX (Apple Silicon optimized)
  • Remote models: Connect to vLLM, Ollama, or cloud APIs
  • Multiple output formats: DocTags (preferred), Markdown, HTML

Quick Start

The simplest way to use VLM with Docling:
# Uses default VLM (Granite Docling)
docling --pipeline vlm document.pdf

# Specify a different model
docling --pipeline vlm --vlm-model smoldocling document.pdf

# With MLX acceleration on Apple Silicon (auto-selected for granite_docling)
docling --pipeline vlm --vlm-model granite_docling document.pdf

Available Models

Docling supports multiple VLM models with different trade-offs:
| Model | Framework | Device | Speed | Accuracy | Output Format |
| --- | --- | --- | --- | --- | --- |
| Granite Docling | Transformers/MLX | CPU/GPU/MPS | Fast | Excellent | DocTags |
| SmolDocling | Transformers/MLX | CPU/GPU/MPS | Very Fast | Good | DocTags |
| Qwen 2.5 VL | MLX | MPS | Medium | Excellent | Markdown |
| Pixtral | Transformers/MLX | CPU/GPU/MPS | Slow | Excellent | Markdown |
| Granite Vision | Transformers | CPU/GPU/MPS | Medium | Good | Markdown |
DocTags is the preferred output format: a structured document representation that Docling converts to a DoclingDocument for consistent export to Markdown, HTML, JSON, and other formats.

Using Presets

The recommended way to configure VLMs is with presets:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Use a preset by name
vlm_options = VlmConvertOptions.from_preset("smoldocling")

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=VlmPipelineOptions(vlm_options=vlm_options),
        ),
    }
)

result = converter.convert("document.pdf")

Available Presets

vlm_options = VlmConvertOptions.from_preset("granite_docling")
Model: ibm-granite/granite-docling-258M
Best for: Production use, balanced speed/accuracy, DocTags output
Auto-selects:
  • MLX variant on Apple Silicon
  • Transformers variant on other platforms

Runtime Selection

Control which inference framework to use:
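In practice the framework choice tracks the host platform: MLX only runs on Apple Silicon, and Transformers covers everything else. A minimal stdlib sketch of that selection logic (the `"mlx"`/`"transformers"` strings mirroring Docling's `InferenceFramework` values are an assumption here; the actual knob is `inference_framework` on `InlineVlmOptions`, shown under Manual Configuration below):

```python
import platform

def pick_inference_framework() -> str:
    """Return a framework name for the current host.

    MLX only runs on Apple Silicon (arm64 macOS); everywhere else,
    fall back to Transformers on CPU/CUDA.
    """
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"  # Apple Silicon: MLX is typically much faster
    return "transformers"  # CPU / CUDA elsewhere
```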

Remote Models (API)

Connect to models hosted on remote servers:

vLLM Server

1. Start vLLM Server

pip install vllm

vllm serve ibm-granite/granite-docling-258M \
  --dtype bfloat16 \
  --max-model-len 4096
2. Configure Docling

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    ApiVlmOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,  # Required!
    vlm_options=ApiVlmOptions(
        url="http://localhost:8000/v1/chat/completions",
        model="ibm-granite/granite-docling-258M",
        api_key=None,  # Set if your server requires authentication
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("document.pdf")
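Before pointing Docling at the server, it can help to confirm the endpoint is actually reachable. A small stdlib check against the OpenAI-compatible `/v1/models` listing that vLLM exposes (the helper name is ours, not part of Docling):

```python
import json
import urllib.request

def server_is_up(base_url: str = "http://localhost:8000") -> bool:
    """Return True if an OpenAI-compatible server answers /v1/models."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            # A healthy server returns a JSON model listing under "data"
            return bool(json.load(resp).get("data"))
    except (OSError, ValueError):
        return False
```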

Ollama

1. Start Ollama with a VLM

# Pull a vision model
ollama pull llava:7b

# Start Ollama server (usually auto-starts)
ollama serve
2. Configure Docling

from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    ApiVlmOptions,
)
from docling.datamodel.vlm_engine_options import VlmEngineType

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,
    vlm_options=ApiVlmOptions(
        url="http://localhost:11434/v1/chat/completions",
        model="llava:7b",
        engine_type=VlmEngineType.API_OLLAMA,
    ),
)

VLM Configuration Options

Customize VLM behavior:
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)

vlm_options = VlmConvertOptions.from_preset("granite_docling")

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_options,
    
    # Image preprocessing
    generate_page_images=True,  # Required for VLM
    force_backend_text=False,   # False: use VLM-generated text; True: reuse the PDF's embedded text
    
    # Picture enrichment (optional)
    do_picture_description=True,
    do_picture_classification=True,
)

Image Scaling

Control input image resolution:
from docling.datamodel.pipeline_options import VlmConvertOptions

vlm_options = VlmConvertOptions.from_preset("granite_docling")
vlm_options.scale = 2.0      # Image scale factor (2.0 = high res)
vlm_options.max_size = 1024  # Max dimension (width/height)
Higher scale = better accuracy but slower processing.
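How `scale` and `max_size` interact can be sketched as follows, under an assumed order of operations (scale first, then clamp the longest edge while preserving aspect ratio); this is an illustration, not taken from the Docling source:

```python
def effective_size(width: int, height: int, scale: float, max_size: int) -> tuple[int, int]:
    """Apply a scale factor, then clamp the longest edge to max_size.

    The order of operations here is an assumption for illustration.
    """
    w, h = width * scale, height * scale
    longest = max(w, h)
    if longest > max_size:
        # Shrink both edges by the same ratio to preserve aspect ratio
        ratio = max_size / longest
        w, h = w * ratio, h * ratio
    return round(w), round(h)

# A US Letter page rasterized at 72 dpi is 612x792 points:
# scale=2.0 gives 1224x1584, which the 1024 cap shrinks back down.
```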

Manual Configuration (Without Presets)

For full control, manually configure the VLM:
from docling.datamodel.pipeline_options_vlm_model import (
    InlineVlmOptions,
    InferenceFramework,
    ResponseFormat,
    TransformersModelType,
)
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.accelerator_options import AcceleratorDevice

pipeline_options = VlmPipelineOptions(
    vlm_options=InlineVlmOptions(
        repo_id="ibm-granite/granite-vision-3.2-2b",
        prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
        response_format=ResponseFormat.MARKDOWN,
        inference_framework=InferenceFramework.TRANSFORMERS,
        transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,
        supported_devices=[
            AcceleratorDevice.CPU,
            AcceleratorDevice.CUDA,
            AcceleratorDevice.MPS,
        ],
        scale=2.0,
        temperature=0.0,
    )
)

Picture Description with VLMs

Use VLMs to caption images within documents:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionVlmEngineOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_picture_description=True,
    picture_description_options=PictureDescriptionVlmEngineOptions.from_preset(
        "smolvlm"  # or "granite_vision", "pixtral", "qwen"
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("document.pdf")

# Access picture descriptions
from docling_core.types.doc import PictureItem

for item, level in result.document.iterate_items():
    if isinstance(item, PictureItem):
        print(f"Picture: {item.self_ref}")
        print(f"Caption: {item.caption_text(doc=result.document)}")

Performance Benchmarks

Benchmarked on an M3 Max with a single test page:

| Model | Framework | Time (sec) | Memory |
| --- | --- | --- | --- |
| SmolDocling | MLX | 6.2 | Low |
| SmolDocling | Transformers | 102.2 | Medium |
| Granite Docling | MLX | ~8 | Low |
| Qwen 2.5 VL 3B | MLX | 23.5 | Medium |
| Pixtral 12B | MLX | 308.9 | High |
| Gemma 3 12B | MLX | 378.5 | High |
| Phi-4 | Transformers (CPU) | 1175.7 | Medium |
MLX provides a 10-20x speedup on Apple Silicon compared to Transformers, so always use MLX on M1/M2/M3 devices.

When to Use VLM vs. Traditional Pipeline

Use VLM When

  • Documents have complex layouts
  • Mixed content types (text, tables, images)
  • Handwritten or unusual fonts
  • You want end-to-end learning
  • Simplicity over control

Use Traditional Pipeline When

  • You need fine-grained control
  • Processing simple, structured PDFs
  • OCR customization is critical
  • Resources are limited (VLM needs more memory)
  • Batch processing at scale

Troubleshooting

Models fail to download

Pre-download models:
docling-tools models download-hf-repo ibm-granite/granite-docling-258M
Then set artifacts_path:
pipeline_options.artifacts_path = "/path/to/models"

Out of memory

  • Use a smaller model (SmolDocling instead of Pixtral)
  • Reduce the scale parameter: vlm_options.scale = 1.0
  • Set max_size: vlm_options.max_size = 768
  • Process fewer pages at once

MLX not being used

Verify the MLX installation:
pip install mlx mlx-vlm
python -c "import mlx.core as mx; print(mx.metal.is_available())"
Should print True on Apple Silicon.

Poor output quality

  • Increase the image scale: vlm_options.scale = 2.0
  • Try a larger model (Qwen, Pixtral)
  • Check that the prompt suits your content
  • Use Granite Docling or SmolDocling for DocTags output (better structure)

Next Steps

PDF Processing

Compare with traditional PDF pipeline options

Export Formats

Export VLM-processed documents to various formats

Batch Processing

Optimize VLM processing for document batches

Advanced Options

Configure hardware acceleration and remote services
