
Overview

Vision-Language Models (VLMs) enable end-to-end document understanding by processing document pages as images and directly generating structured output. This approach is an alternative to the traditional pipeline (layout analysis + OCR + table extraction). Docling’s VlmPipeline supports:
  • Local models: Run models on your hardware (CPU, GPU, Apple Silicon)
  • Multiple frameworks: Transformers, MLX (Apple Silicon optimized)
  • Remote models: Connect to vLLM, Ollama, or cloud APIs
  • Multiple output formats: DocTags (preferred), Markdown, HTML

Quick Start

The simplest way to use VLM with Docling:
# Uses default VLM (Granite Docling)
docling --pipeline vlm document.pdf

# Specify a different model
docling --pipeline vlm --vlm-model smoldocling document.pdf

# With MLX acceleration on Apple Silicon (auto-selected for granite_docling)
docling --pipeline vlm --vlm-model granite_docling document.pdf

Available Models

Docling supports multiple VLM models with different trade-offs:
| Model | Framework | Device | Speed | Accuracy | Output Format |
| --- | --- | --- | --- | --- | --- |
| Granite Docling | Transformers/MLX | CPU/GPU/MPS | Fast | Excellent | DocTags |
| SmolDocling | Transformers/MLX | CPU/GPU/MPS | Very Fast | Good | DocTags |
| Qwen 2.5 VL | MLX | MPS | Medium | Excellent | Markdown |
| Pixtral | Transformers/MLX | CPU/GPU/MPS | Slow | Excellent | Markdown |
| Granite Vision | Transformers | CPU/GPU/MPS | Medium | Good | Markdown |
DocTags is the preferred output format: a structured document representation that Docling converts to a DoclingDocument for consistent export to Markdown, HTML, JSON, and other formats.

Using Presets

The recommended way to configure VLMs is with presets:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Use a preset by name
vlm_options = VlmConvertOptions.from_preset("smoldocling")

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=VlmPipelineOptions(vlm_options=vlm_options),
        ),
    }
)

result = converter.convert("document.pdf")

Available Presets

vlm_options = VlmConvertOptions.from_preset("granite_docling")
Model: ibm-granite/granite-docling-258M
Best for: Production use, balanced speed/accuracy, DocTags output
Auto-selects:
  • MLX variant on Apple Silicon
  • Transformers variant on other platforms

Runtime Selection

Control which inference framework to use:
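In practice the framework choice tracks the host platform: MLX only runs on Apple Silicon, and Transformers covers everything else. A minimal stdlib sketch of that selection logic (the `"mlx"`/`"transformers"` strings mirroring Docling's `InferenceFramework` values are an assumption here; the actual knob is `inference_framework` on `InlineVlmOptions`, shown under Manual Configuration below):

```python
import platform

def pick_inference_framework() -> str:
    """Return a framework name for the current host.

    MLX only runs on Apple Silicon (arm64 macOS); everywhere else,
    fall back to Transformers on CPU/CUDA.
    """
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"  # Apple Silicon: MLX is typically much faster
    return "transformers"  # CPU / CUDA elsewhere
```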

Remote Models (API)

Connect to models hosted on remote servers:

vLLM Server

1. Start vLLM Server

pip install vllm

vllm serve ibm-granite/granite-docling-258M \
  --dtype bfloat16 \
  --max-model-len 4096
2. Configure Docling

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    ApiVlmOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,  # Required!
    vlm_options=ApiVlmOptions(
        url="http://localhost:8000/v1/chat/completions",
        model="ibm-granite/granite-docling-258M",
        api_key=None,  # Set if your server requires authentication
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("document.pdf")
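Before pointing Docling at the server, it can help to confirm the endpoint is actually reachable. A small stdlib check against the OpenAI-compatible `/v1/models` listing that vLLM exposes (the helper name is ours, not part of Docling):

```python
import json
import urllib.request

def server_is_up(base_url: str = "http://localhost:8000") -> bool:
    """Return True if an OpenAI-compatible server answers /v1/models."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            # A healthy server returns a JSON model listing under "data"
            return bool(json.load(resp).get("data"))
    except (OSError, ValueError):
        return False
```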

Ollama

1. Start Ollama with a VLM

# Pull a vision model
ollama pull llava:7b

# Start Ollama server (usually auto-starts)
ollama serve
2. Configure Docling

from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    ApiVlmOptions,
)
from docling.datamodel.vlm_engine_options import VlmEngineType

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,
    vlm_options=ApiVlmOptions(
        url="http://localhost:11434/v1/chat/completions",
        model="llava:7b",
        engine_type=VlmEngineType.API_OLLAMA,
    ),
)

VLM Configuration Options

Customize VLM behavior:
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)

vlm_options = VlmConvertOptions.from_preset("granite_docling")

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_options,
    
    # Image preprocessing
    generate_page_images=True,  # Required for VLM
    force_backend_text=False,   # False: use VLM-generated text; True: reuse the PDF's embedded text
    
    # Picture enrichment (optional)
    do_picture_description=True,
    do_picture_classification=True,
)

Image Scaling

Control input image resolution:
from docling.datamodel.pipeline_options import VlmConvertOptions

vlm_options = VlmConvertOptions.from_preset("granite_docling")
vlm_options.scale = 2.0      # Image scale factor (2.0 = high res)
vlm_options.max_size = 1024  # Max dimension (width/height)
Higher scale = better accuracy but slower processing.
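How `scale` and `max_size` interact can be sketched as follows, under an assumed order of operations (scale first, then clamp the longest edge while preserving aspect ratio); this is an illustration, not taken from the Docling source:

```python
def effective_size(width: int, height: int, scale: float, max_size: int) -> tuple[int, int]:
    """Apply a scale factor, then clamp the longest edge to max_size.

    The order of operations here is an assumption for illustration.
    """
    w, h = width * scale, height * scale
    longest = max(w, h)
    if longest > max_size:
        # Shrink both edges by the same ratio to preserve aspect ratio
        ratio = max_size / longest
        w, h = w * ratio, h * ratio
    return round(w), round(h)

# A US Letter page rasterized at 72 dpi is 612x792 points:
# scale=2.0 gives 1224x1584, which the 1024 cap shrinks back down.
```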

Manual Configuration (Without Presets)

For full control, manually configure the VLM:
from docling.datamodel.pipeline_options_vlm_model import (
    InlineVlmOptions,
    InferenceFramework,
    ResponseFormat,
    TransformersModelType,
)
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.accelerator_options import AcceleratorDevice

pipeline_options = VlmPipelineOptions(
    vlm_options=InlineVlmOptions(
        repo_id="ibm-granite/granite-vision-3.2-2b",
        prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
        response_format=ResponseFormat.MARKDOWN,
        inference_framework=InferenceFramework.TRANSFORMERS,
        transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,
        supported_devices=[
            AcceleratorDevice.CPU,
            AcceleratorDevice.CUDA,
            AcceleratorDevice.MPS,
        ],
        scale=2.0,
        temperature=0.0,
    )
)

Picture Description with VLMs

Use VLMs to caption images within documents:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionVlmEngineOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_picture_description=True,
    picture_description_options=PictureDescriptionVlmEngineOptions.from_preset(
        "smolvlm"  # or "granite_vision", "pixtral", "qwen"
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("document.pdf")

# Access picture descriptions
from docling_core.types.doc import PictureItem

for item, level in result.document.iterate_items():
    if isinstance(item, PictureItem):
        print(f"Picture: {item.self_ref}")
        print(f"Caption: {item.caption_text(doc=result.document)}")

Performance Benchmarks

Benchmarked on an M3 Max with a single test page:

| Model | Framework | Time (sec) | Memory |
| --- | --- | --- | --- |
| SmolDocling | MLX | 6.2 | Low |
| SmolDocling | Transformers | 102.2 | Medium |
| Granite Docling | MLX | ~8 | Low |
| Qwen 2.5 VL 3B | MLX | 23.5 | Medium |
| Pixtral 12B | MLX | 308.9 | High |
| Gemma 3 12B | MLX | 378.5 | High |
| Phi-4 | Transformers (CPU) | 1175.7 | Medium |
MLX provides a 10-20x speedup on Apple Silicon compared to Transformers, so always use MLX on M1/M2/M3 devices.

When to Use VLM vs. Traditional Pipeline

Use VLM When

  • Documents have complex layouts
  • Mixed content types (text, tables, images)
  • Handwritten or unusual fonts
  • You want end-to-end learning
  • Simplicity over control

Use Traditional Pipeline When

  • You need fine-grained control
  • Processing simple, structured PDFs
  • OCR customization is critical
  • Resources are limited (VLM needs more memory)
  • Batch processing at scale

Troubleshooting

Models fail to download

Pre-download models:
docling-tools models download-hf-repo ibm-granite/granite-docling-258M
Then set artifacts_path:
pipeline_options.artifacts_path = "/path/to/models"

Out of memory

  • Use a smaller model (SmolDocling instead of Pixtral)
  • Reduce the scale parameter: vlm_options.scale = 1.0
  • Set max_size: vlm_options.max_size = 768
  • Process fewer pages at once

MLX not being used

Verify the MLX installation:
pip install mlx mlx-vlm
python -c "import mlx.core as mx; print(mx.metal.is_available())"
Should print True on Apple Silicon.

Poor output quality

  • Increase the image scale: vlm_options.scale = 2.0
  • Try a larger model (Qwen, Pixtral)
  • Check that the prompt suits your content
  • Use Granite Docling or SmolDocling for DocTags output (better structure)

Next Steps

PDF Processing

Compare with traditional PDF pipeline options

Export Formats

Export VLM-processed documents to various formats

Batch Processing

Optimize VLM processing for document batches

Advanced Options

Configure hardware acceleration and remote services
