
Overview

Docling’s document processing pipeline consists of multiple stages, each using specialized models and inference engines. This catalog provides:
  • Processing stages and their purposes
  • Model families and specific models
  • Inference engine compatibility
  • Usage examples and configuration
Source: ~/workspace/source/docs/usage/model_catalog.md:1

Processing Stages

Docling pipelines are composed of these processing stages:

  • Layout - Document structure detection
  • OCR - Optical character recognition
  • Table Structure - Table cell recognition
  • Picture Classifier - Image type classification
  • VLM Convert - Full page conversion with VLMs
  • Picture Description - Image captioning
  • Code & Formula - Code/math extraction
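A pipeline combining these stages is configured by composing per-stage options. A minimal sketch, assuming docling's PDF pipeline API (PdfPipelineOptions, DocumentConverter, PdfFormatOption) and a hypothetical input file:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable or disable individual stages on the PDF pipeline
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True              # OCR stage
pipeline_options.do_table_structure = True  # Table structure stage

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# result = converter.convert("document.pdf")  # hypothetical input file
# print(result.document.export_to_markdown())
```

The stage-specific options described in the sections below plug into this same pipeline-options object.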

Layout Detection

Overview

Detects document elements (paragraphs, tables, figures, headers, etc.) using RT-DETR-based object detection.

Model Family: Object Detection (RT-DETR based)
Inference Engine: docling-ibm-models
Supported Devices: CPU, CUDA, MPS, XPU

Available Models

| Model | Status | Description |
|-------|--------|-------------|
| docling-layout-heron | ⭐ Default | Recommended for most use cases |
| docling-layout-heron-101 | - | Enhanced variant of Heron |
| docling-layout-egret-medium | - | Medium-sized Egret model |
| docling-layout-egret-large | - | Larger Egret model |
| docling-layout-egret-xlarge | - | Extra-large Egret model |
| docling-layout-v2 | Legacy | Previous generation model |

Usage

from docling.datamodel.pipeline_options import LayoutOptions
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_HERON

# Use Heron layout model (default)
layout_options = LayoutOptions(
    model_spec=DOCLING_LAYOUT_HERON
)

Output

Bounding boxes with element labels:
  • TEXT - Body text paragraphs
  • SECTION_HEADER - Section headings
  • TABLE - Tables
  • PICTURE - Images and figures
  • LIST_ITEM - List items
  • FORMULA - Mathematical formulas
  • PAGE_HEADER / PAGE_FOOTER - Headers/footers

OCR (Optical Character Recognition)

Overview

Extracts text from images and scanned documents using various OCR engines.

Model Family: Multiple OCR Engines
Inference Engines: Engine-specific
Supported Devices: Varies by engine

Available Engines

| OCR Engine | Backend | Languages | GPU Support | Notes |
|------------|---------|-----------|-------------|-------|
| Auto | Automatic | Varies | Varies | Automatically selects best available |
| Tesseract | CLI or Python | 100+ | No | Most widely used, good accuracy |
| EasyOCR | PyTorch | 80+ | Yes | GPU-accelerated, good for Asian languages |
| RapidOCR | ONNX/OpenVINO/Paddle | Multiple | Yes (torch) | Fast, multiple backend options |
| macOS Vision | Native macOS | 20+ | Yes | macOS only, excellent quality |
| SuryaOCR | PyTorch | 90+ | Yes | Modern, good for complex layouts |

Usage

from docling.datamodel.pipeline_options import (
    TesseractOcrOptions,
    RapidOcrOptions
)

# Tesseract with multiple languages
ocr_options = TesseractOcrOptions(
    lang=["eng", "deu"]  # English and German
)

# RapidOCR with GPU acceleration
ocr_options = RapidOcrOptions(
    backend="torch",  # GPU-accelerated
    lang=["en"]
)
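These option objects take effect only once attached to a pipeline. A minimal sketch of the wiring, assuming the ocr_options field on docling's PdfPipelineOptions:

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
# Plug the engine-specific options into the pipeline
pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
```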

Table Structure Recognition

TableFormer Models

Recognizes table structure (rows, columns, cells) and relationships.

Model Family: TableFormer
Inference Engine: docling-ibm-models
Supported Devices: CPU, CUDA, XPU (MPS currently disabled)

Available Modes

| Mode | Status | Speed | Accuracy |
|------|--------|-------|----------|
| Accurate | ⭐ Default | Slower | Higher quality |
| Fast | - | Faster | Good quality |

Usage

from docling.datamodel.pipeline_options import (
    TableStructureOptions,
    TableFormerMode
)

# Use accurate mode for best quality
table_options = TableStructureOptions(
    mode=TableFormerMode.ACCURATE,
    do_cell_matching=True  # Align cells with content
)

Object Detection (WIP)

An alternative, object detection-based approach to table structure recognition. This is work in progress.

Picture Classification

Overview

Classifies pictures into semantic categories (charts, diagrams, logos, etc.).

Model Family: Image Classifier (Vision Transformer)
Inference Engine: Transformers (ViT)
Supported Devices: CPU, CUDA, MPS, XPU

Available Models

| Model | Status | Description |
|-------|--------|-------------|
| DocumentFigureClassifier-v2.0 | ⭐ Default | Specialized for document imagery |
Model Card: ds4sd/DocumentFigureClassifier

Supported Classes

  • Chart types (bar, line, pie, scatter)
  • Diagrams and flowcharts
  • Natural images
  • Logos and branding
  • Signatures
  • Technical illustrations

Usage

from docling.models.stages.picture_classifier.document_picture_classifier import (
    DocumentPictureClassifierOptions
)

# Use default picture classifier
classifier_options = DocumentPictureClassifierOptions.from_preset(
    "document_figure_classifier_v2"
)

VLM Convert (Full Page)

Overview

Converts entire document pages to structured formats using vision-language models.

Model Family: Vision-Language Models
Output Formats: DocTags (structured), Markdown (human-readable)
Inference Engines: Transformers, MLX, API (Ollama, LM Studio, OpenAI), vLLM, AUTO_INLINE

Available Models

| Preset ID | Model | Size | Transformers | MLX | API | vLLM | Output |
|-----------|-------|------|--------------|-----|-----|------|--------|
| granite_docling | Granite-Docling-258M | 258M | | | Ollama | | DocTags |
| smoldocling | SmolDocling-256M | 256M | | | | | DocTags |
| deepseek_ocr | DeepSeek-OCR-3B | 3B | | | Ollama, LM Studio | | Markdown |
| granite_vision | Granite-Vision-3.3-2B | 2B | | | Ollama, LM Studio | | Markdown |
| pixtral | Pixtral-12B | 12B | | | | | Markdown |
| got_ocr | GOT-OCR-2.0 | - | | | | | Markdown |
| phi4 | Phi-4-Multimodal | - | | | | | Markdown |
| qwen | Qwen2.5-VL-3B | 3B | | | | | Markdown |
| gemma_12b | Gemma-3-12B | 12B | | | | | Markdown |
| gemma_27b | Gemma-3-27B | 27B | | | | | Markdown |
| dolphin | Dolphin | - | | | | | Markdown |

Usage

from docling.datamodel.pipeline_options import VlmConvertOptions

# Use SmolDocling with auto-selected engine
options = VlmConvertOptions.from_preset("smoldocling")

# Force specific engine
from docling.datamodel.vlm_engine_options import MlxVlmEngineOptions

options = VlmConvertOptions.from_preset(
    "smoldocling",
    engine_options=MlxVlmEngineOptions()
)

Output Formats

DocTags: Structured XML-like format optimized for document understanding
<document>
  <section_header>Introduction</section_header>
  <text>This is a paragraph...</text>
  <table>
    <row><cell>Data</cell></row>
  </table>
</document>
Markdown: Human-readable format for general-purpose conversion
# Introduction

This is a paragraph...

| Column 1 | Column 2 |
|----------|----------|
| Data     | Data     |
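One reason to prefer the structured form for downstream processing: it can be navigated by element rather than re-parsed with regexes. A toy sketch using a generic XML parser on the fragment shown above (real DocTags output should be handled with docling's own tooling; the tag set here is only the snippet from this section):

```python
import xml.etree.ElementTree as ET

doctags = """<document>
  <section_header>Introduction</section_header>
  <text>This is a paragraph...</text>
  <table>
    <row><cell>Data</cell></row>
  </table>
</document>"""

root = ET.fromstring(doctags)

# Collect (tag, text) pairs for the top-level elements
elements = [(child.tag, (child.text or "").strip()) for child in root]

# Navigate to the first table cell structurally
first_cell = root.find("./table/row/cell").text
```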

Picture Description

Overview

Generates natural language descriptions (captions) of images and figures.

Model Family: Vision-Language Models
Inference Engines: Transformers, MLX, API (Ollama, LM Studio), vLLM, AUTO_INLINE

Available Models

| Preset ID | Model | Size | Transformers | MLX | API | vLLM |
|-----------|-------|------|--------------|-----|-----|------|
| smolvlm | SmolVLM-256M | 256M | | | LM Studio | |
| granite_vision | Granite-Vision-3.3-2B | 2B | | | Ollama, LM Studio | |
| pixtral | Pixtral-12B | 12B | | | | |
| qwen | Qwen2.5-VL-3B | 3B | | | | |

Usage

from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions

# Use Granite Vision for detailed descriptions
options = PictureDescriptionVlmOptions.from_preset("granite_vision")

Code & Formula Extraction

Overview

Extracts and recognizes code blocks and mathematical formulas.

Model Family: Vision-Language Models
Inference Engines: Transformers, MLX, AUTO_INLINE

Available Models

| Preset ID | Model | Transformers | MLX |
|-----------|-------|--------------|-----|
| codeformulav2 | CodeFormulaV2 | | |
| granite_docling | Granite-Docling-258M | | |
Model Card: ds4sd/CodeFormula

Usage

from docling.datamodel.pipeline_options import CodeFormulaVlmOptions

# Use specialized CodeFormulaV2 model
options = CodeFormulaVlmOptions.from_preset("codeformulav2")

Inference Engine Compatibility

Object Detection Models

| Stage | Engine | Devices |
|-------|--------|---------|
| Layout | docling-ibm-models | CPU, CUDA, MPS, XPU |
| Table Structure | docling-ibm-models | CPU, CUDA, XPU |
MPS is currently disabled for TableFormer due to performance issues.

Vision-Language Models

VLM inference engine support varies by model:
  • Transformers: Direct HuggingFace transformers integration
  • MLX: Apple Silicon optimized (macOS only)
  • API: OpenAI-compatible endpoints (Ollama, LM Studio, vLLM)
  • vLLM: Linux-only high-performance server
  • AUTO_INLINE: Automatic engine selection
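AUTO_INLINE can be pictured as a preference-ordered fallback: platform-optimized engines first, then the portable Transformers path. A toy sketch of that idea (this is not docling's actual selection code; the preference order and availability flags are illustrative only):

```python
def pick_engine(platform: str, has_mlx: bool, has_vllm: bool) -> str:
    """Illustrative preference order for choosing a VLM inference engine."""
    if platform == "darwin" and has_mlx:
        return "mlx"           # Apple Silicon optimized
    if platform == "linux" and has_vllm:
        return "vllm"          # Linux-only high-performance server
    return "transformers"      # portable fallback, works everywhere
```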

Model Selection Guide

Layout

Recommended: docling-layout-heron
  • Good balance of speed and accuracy
  • Suitable for most document types
  • Use Egret models for specialized needs

OCR

Recommended: Auto or Tesseract
  • Auto: Automatic engine selection
  • Tesseract: Reliable, widely supported
  • RapidOCR (torch): When GPU acceleration is needed
  • macOS Vision: Best quality on macOS

Table Structure

Recommended: Accurate mode
  • Use Accurate for production (better quality)
  • Use Fast for quick prototyping
  • Enable do_cell_matching for best results

VLM Convert

Recommended: granite_docling or smoldocling
  • Granite Docling: Best for structured output (DocTags)
  • SmolDocling: Lightweight alternative
  • DeepSeek OCR: High-quality Markdown (API-only)
  • Larger models (Pixtral, Qwen) for complex documents

Picture Description

Recommended: smolvlm
  • SmolVLM: Fast, good quality, small size
  • Granite Vision: More detailed descriptions
  • Larger models for specialized captioning

Performance Characteristics

Model Sizes and Speed

| Model Type | Size Range | Typical Speed | GPU Benefit |
|------------|------------|---------------|-------------|
| Layout Detection | ~100-500MB | Fast | High |
| OCR Engines | Varies | Fast-Medium | Varies |
| Table Structure | ~100MB | Medium | High |
| Picture Classifier | ~100MB | Fast | Medium |
| Small VLMs (256M) | ~500MB-1GB | Fast | High |
| Medium VLMs (2-3B) | 2-6GB | Medium | Very High |
| Large VLMs (12B+) | 12GB+ | Slow | Critical |

Device Recommendations

CPU Only

  • Layout: Heron
  • OCR: Tesseract/Auto
  • VLM: SmolVLM/SmolDocling (small models only)
  • Expect slower processing

NVIDIA GPU

  • All models supported
  • Use batch processing
  • Consider Flash Attention 2
  • Ideal for VLM pipelines with inference servers

Apple Silicon

  • Layout: All models via MPS
  • VLM: MLX-optimized models (Granite, SmolDocling)
  • Good performance for small-medium models
  • Use MLX engine when available

Intel GPU

  • Layout: All models via XPU
  • Table Structure: Supported
  • Limited VLM support
  • Check compatibility for specific models
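Rather than relying on auto-detection, the target device can be pinned on the pipeline. A sketch, assuming docling's AcceleratorOptions API (import path may differ between docling versions):

```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions()
# AUTO picks CUDA/MPS/CPU based on availability; pin a device
# explicitly (e.g. AcceleratorDevice.CUDA) when needed
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.AUTO
)
```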

Additional Resources


  • Vision Models Guide - VLM-specific documentation
  • GPU Acceleration - GPU acceleration setup
  • Pipeline Options - Advanced configuration
  • Supported Formats - Input format support

Notes

  • DocTags Format: Structured XML-like format optimized for document understanding
  • Markdown Format: Human-readable format for general-purpose conversion
  • Model Updates: New models are added regularly - check the codebase for latest additions
  • Engine Compatibility: Not all engines work on all platforms - AUTO_INLINE handles this automatically
  • Performance: Actual performance varies by hardware, document complexity, and model size
Tip: Use the AUTO_INLINE engine for VLMs to automatically select the best available inference engine for your platform.
