Overview
Vision-Language Models (VLMs) enable end-to-end document understanding by processing document pages as images and directly generating structured output. This approach is an alternative to the traditional pipeline (layout analysis + OCR + table extraction).
Docling’s VlmPipeline supports:
Local models: Run models on your hardware (CPU, GPU, Apple Silicon)
Multiple frameworks: Transformers, MLX (Apple Silicon optimized)
Remote models: Connect to vLLM, Ollama, or cloud APIs
Multiple output formats: DocTags (preferred), Markdown, HTML
Quick Start
The simplest way to use VLM with Docling:
# Uses default VLM (Granite Docling)
docling --pipeline vlm document.pdf
# Specify a different model
docling --pipeline vlm --vlm-model smoldocling document.pdf
# With MLX acceleration on Apple Silicon
docling --pipeline vlm --vlm-model granite_docling document.pdf
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# No configuration needed - uses default VLM (Granite Docling)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
        ),
    }
)
result = converter.convert("document.pdf")
print(result.document.export_to_markdown())
Available Models
Docling supports multiple VLM models with different trade-offs:
Recommended Models
| Model | Framework | Device | Speed | Accuracy | Output Format |
|---|---|---|---|---|---|
| Granite Docling | Transformers/MLX | CPU/GPU/MPS | Fast | Excellent | DocTags |
| SmolDocling | Transformers/MLX | CPU/GPU/MPS | Very Fast | Good | DocTags |
| Qwen 2.5 VL | MLX | MPS | Medium | Excellent | Markdown |
| Pixtral | Transformers/MLX | CPU/GPU/MPS | Slow | Excellent | Markdown |
| Granite Vision | Transformers | CPU/GPU/MPS | Medium | Good | Markdown |
DocTags is the preferred output format - it’s a structured document representation that Docling converts to DoclingDocument for consistent export to Markdown, HTML, JSON, etc.
Using Presets
The recommended way to configure VLMs is with presets:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Use a preset by name
vlm_options = VlmConvertOptions.from_preset("smoldocling")

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=VlmPipelineOptions(vlm_options=vlm_options),
        ),
    }
)
result = converter.convert("document.pdf")
Available Presets
vlm_options = VlmConvertOptions.from_preset("granite_docling")
Model: ibm-granite/granite-docling-258M
Best for: Production use, balanced speed/accuracy, DocTags output
Auto-selects:
MLX variant on Apple Silicon
Transformers variant on other platforms

vlm_options = VlmConvertOptions.from_preset("smoldocling")
Model: ds4sd/SmolDocling-256M-preview
Best for: Fast processing, lightweight deployment, DocTags output
Speed: ~6s per page on M3 Max (MLX)

vlm_options = VlmConvertOptions.from_preset("qwen25vl_3b")
Model: Qwen2.5-VL-3B
Best for: High accuracy, Markdown output, multilingual
Speed: ~23s per page on M3 Max (MLX)

vlm_options = VlmConvertOptions.from_preset("pixtral_12b")
Model: mistral-community/pixtral-12b
Best for: Maximum accuracy, complex documents
Speed: ~309s per page on M3 Max (MLX), slower on CPU
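The trade-offs above can be sketched as a small selection helper. The per-page times and output formats come from this page, but `pick_preset` itself is purely illustrative, not part of the Docling API:

```python
# Illustrative only: pick_preset is a hypothetical helper, not a Docling API.
# Per-page times are the M3 Max (MLX) figures quoted above.
PRESETS = {
    "granite_docling": {"seconds_per_page": 8, "output": "doctags"},
    "smoldocling": {"seconds_per_page": 6, "output": "doctags"},
    "qwen25vl_3b": {"seconds_per_page": 23, "output": "markdown"},
    "pixtral_12b": {"seconds_per_page": 309, "output": "markdown"},
}

def pick_preset(max_seconds_per_page: float, output: str = "doctags") -> str:
    """Return the slowest (roughly most accurate) preset within the time budget."""
    candidates = [
        name for name, p in PRESETS.items()
        if p["output"] == output and p["seconds_per_page"] <= max_seconds_per_page
    ]
    # Among the presets that fit the budget, prefer the slowest
    # (on this page, slower presets are the more accurate ones).
    return max(candidates, key=lambda n: PRESETS[n]["seconds_per_page"])

print(pick_preset(10))                     # granite_docling
print(pick_preset(60, output="markdown"))  # qwen25vl_3b
```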
Runtime Selection
Control which inference framework to use:
Auto (Recommended)
from docling.datamodel.pipeline_options import VlmConvertOptions

# Automatically selects best runtime for your platform
vlm_options = VlmConvertOptions.from_preset("granite_docling")
# Uses MLX on Apple Silicon, Transformers elsewhere

MLX (Apple Silicon)
from docling.datamodel.pipeline_options import VlmConvertOptions
from docling.datamodel.vlm_engine_options import MlxVlmEngineOptions

vlm_options = VlmConvertOptions.from_preset(
    "granite_docling",
    engine_options=MlxVlmEngineOptions(),
)

Requirements: pip install "docling[vlm,mlx]"
Best for: Apple Silicon (M1/M2/M3) - 10-20x faster than Transformers

Transformers
from docling.datamodel.pipeline_options import VlmConvertOptions
from docling.datamodel.vlm_engine_options import TransformersVlmEngineOptions

vlm_options = VlmConvertOptions.from_preset(
    "granite_docling",
    engine_options=TransformersVlmEngineOptions(),
)

Requirements: pip install "docling[vlm]"
Best for: CUDA GPUs, general compatibility
Remote Models (API)
Connect to models hosted on remote servers:
vLLM Server
Start vLLM Server
pip install vllm
vllm serve ibm-granite/granite-docling-258M \
--dtype bfloat16 \
--max-model-len 4096
Configure Docling
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    ApiVlmOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,  # Required for remote models!
    vlm_options=ApiVlmOptions(
        url="http://localhost:8000/v1/chat/completions",
        model="ibm-granite/granite-docling-258M",
        api_key=None,  # Set if your server requires one
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("document.pdf")
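Under the hood, this talks to the OpenAI-compatible `/v1/chat/completions` endpoint that vLLM exposes. The exact prompt and request Docling sends are internal, but as a rough sketch, such a request body looks like this (the payload shape below follows the OpenAI chat API with an inline base64 image; the prompt text is an assumption for illustration):

```python
import base64
import json

def build_chat_request(image_bytes: bytes, model: str, prompt: str) -> str:
    """Build an OpenAI-style chat-completions body with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    body = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
        "temperature": 0.0,
    }
    return json.dumps(body)

payload = build_chat_request(
    b"\x89PNG...",  # rendered page image bytes
    "ibm-granite/granite-docling-258M",
    "Convert this page.",  # illustrative prompt
)
# POST this to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (plus an Authorization header if api_key is set).
```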
Ollama
Start Ollama with a VLM
# Pull a vision model
ollama pull llava:7b
# Start Ollama server (usually auto-starts)
ollama serve
Configure Docling
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    ApiVlmOptions,
)
from docling.datamodel.vlm_engine_options import VlmEngineType

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,
    vlm_options=ApiVlmOptions(
        url="http://localhost:11434/v1/chat/completions",
        model="llava:7b",
        engine_type=VlmEngineType.API_OLLAMA,
    ),
)
VLM Configuration Options
Customize VLM behavior:
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    VlmConvertOptions,
)

vlm_options = VlmConvertOptions.from_preset("granite_docling")

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_options,
    # Image preprocessing
    generate_page_images=True,  # Required for VLM
    force_backend_text=False,  # False: use VLM-generated text; True: reuse the PDF text layer
    # Picture enrichment (optional)
    do_picture_description=True,
    do_picture_classification=True,
)
Image Scaling
Control input image resolution:
from docling.datamodel.pipeline_options import VlmConvertOptions

vlm_options = VlmConvertOptions.from_preset("granite_docling")
vlm_options.scale = 2.0  # Image scale factor (2.0 = high resolution)
vlm_options.max_size = 1024  # Max dimension (width/height) in pixels
Higher scale = better accuracy but slower processing.
Manual Configuration (Without Presets)
For full control, manually configure the VLM:
from docling.datamodel.pipeline_options_vlm_model import (
    InlineVlmOptions,
    InferenceFramework,
    ResponseFormat,
    TransformersModelType,
)
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.accelerator_options import AcceleratorDevice

pipeline_options = VlmPipelineOptions(
    vlm_options=InlineVlmOptions(
        repo_id="ibm-granite/granite-vision-3.2-2b",
        prompt="Convert this page to markdown. Do not miss any text and only output the bare markdown!",
        response_format=ResponseFormat.MARKDOWN,
        inference_framework=InferenceFramework.TRANSFORMERS,
        transformers_model_type=TransformersModelType.AUTOMODEL_VISION2SEQ,
        supported_devices=[
            AcceleratorDevice.CPU,
            AcceleratorDevice.CUDA,
            AcceleratorDevice.MPS,
        ],
        scale=2.0,
        temperature=0.0,
    )
)
Picture Description with VLMs
Use VLMs to caption images within documents:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionVlmEngineOptions,
)
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_picture_description=True,
    picture_description_options=PictureDescriptionVlmEngineOptions.from_preset(
        "smolvlm"  # or "granite_vision", "pixtral", "qwen"
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
result = converter.convert("document.pdf")

# Access picture descriptions
from docling_core.types.doc import PictureItem

for item, level in result.document.iterate_items():
    if isinstance(item, PictureItem):
        print(f"Picture: {item.self_ref}")
        print(f"Caption: {item.caption_text(doc=result.document)}")
Benchmarked on an M3 Max with a single test page:

| Model | Framework | Time (sec) | Memory |
|---|---|---|---|
| SmolDocling | MLX | 6.2 | Low |
| SmolDocling | Transformers | 102.2 | Medium |
| Granite Docling | MLX | ~8 | Low |
| Qwen 2.5 VL 3B | MLX | 23.5 | Medium |
| Pixtral 12B | MLX | 308.9 | High |
| Gemma 3 12B | MLX | 378.5 | High |
| Phi-4 | Transformers (CPU) | 1175.7 | Medium |
MLX provides 10-20x speedup on Apple Silicon compared to Transformers. Always use MLX on M1/M2/M3 devices.
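The speedup figure follows directly from the table; taking the two SmolDocling rows, for example:

```python
# Per-page times from the benchmark table above (seconds, single M3 Max test page).
smoldocling_mlx = 6.2
smoldocling_transformers = 102.2

speedup = smoldocling_transformers / smoldocling_mlx
print(f"MLX speedup: {speedup:.1f}x")  # MLX speedup: 16.5x
```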
When to Use VLM vs. Traditional Pipeline
Use VLM When
Documents have complex layouts
Mixed content types (text, tables, images)
Handwritten or unusual fonts
You want a single end-to-end model
Simplicity over control
Use Traditional Pipeline When
You need fine-grained control
Processing simple, structured PDFs
OCR customization is critical
Resources are limited (VLM needs more memory)
Batch processing at scale
Troubleshooting
Models download slowly or the machine is offline:
Pre-download models: docling-tools models download-hf-repo ibm-granite/granite-docling-258M
Then set artifacts_path: pipeline_options.artifacts_path = "/path/to/models"

Out-of-memory errors:
Use smaller models (SmolDocling instead of Pixtral)
Reduce the scale parameter: vlm_options.scale = 1.0
Set max_size: vlm_options.max_size = 768
Process fewer pages at once

MLX not being used on Apple Silicon:
Verify the MLX installation: pip install mlx mlx-vlm
python -c "import mlx.core as mx; print(mx.metal.is_available())"
Should print True on Apple Silicon.

Poor output quality:
Increase image scale: vlm_options.scale = 2.0
Try a larger model (Qwen, Pixtral)
Check that the prompt is appropriate for your content
Use Granite Docling or SmolDocling for DocTags output (better structure)
Next Steps
PDF Processing Compare with traditional PDF pipeline options
Export Formats Export VLM-processed documents to various formats
Batch Processing Optimize VLM processing for document batches
Advanced Options Configure hardware acceleration and remote services