Backends

Backends are format-specific parsers that extract raw content from input documents. Each backend knows how to read a particular document format and provide content to pipelines for processing.

Backend Architecture

All backends inherit from AbstractDocumentBackend (docling/backend/abstract_backend.py:19) and implement a common interface:

from abc import ABC, abstractmethod

from docling.datamodel.base_models import InputFormat

class AbstractDocumentBackend(ABC):
    @abstractmethod
    def is_valid(self) -> bool:
        """Check if the document was loaded successfully."""
        pass

    @classmethod
    @abstractmethod
    def supports_pagination(cls) -> bool:
        """Whether the backend provides page-level access."""
        pass

    @classmethod
    @abstractmethod
    def supported_formats(cls) -> set[InputFormat]:
        """Input formats this backend can handle."""
        pass
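
To see how this interface constrains subclasses, here is a minimal, self-contained sketch. `InputFormat` and `AbstractDocumentBackend` below are simplified stand-ins that mirror the signatures above, not the Docling classes. `TextBackend`, `IncompleteBackend`, and `can_instantiate` are hypothetical names invented for illustration.

```python
from abc import ABC, abstractmethod
from enum import Enum


class InputFormat(Enum):
    """Stand-in for Docling's InputFormat; TXT is a hypothetical member."""
    TXT = "txt"


class AbstractDocumentBackend(ABC):
    """Stand-in mirroring the three abstract methods shown above."""

    @abstractmethod
    def is_valid(self) -> bool: ...

    @classmethod
    @abstractmethod
    def supports_pagination(cls) -> bool: ...

    @classmethod
    @abstractmethod
    def supported_formats(cls) -> set[InputFormat]: ...


class TextBackend(AbstractDocumentBackend):
    """Hypothetical backend that holds plain text in memory."""

    def __init__(self, text: str) -> None:
        self._text = text

    def is_valid(self) -> bool:
        # Treat an empty document as a failed load.
        return len(self._text) > 0

    @classmethod
    def supports_pagination(cls) -> bool:
        return False

    @classmethod
    def supported_formats(cls) -> set[InputFormat]:
        return {InputFormat.TXT}


class IncompleteBackend(AbstractDocumentBackend):
    """Implements only is_valid, so the class remains abstract."""

    def is_valid(self) -> bool:
        return True


def can_instantiate(cls) -> bool:
    """Return False when Python refuses to instantiate an abstract class."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

Because the format and pagination queries are classmethods, a caller can ask a backend class what it supports before creating an instance.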

Backend Types

Docling defines two main backend categories:

Declarative Backends

Purpose: Formats that can be directly converted to DoclingDocument without complex processing.
Base class: DeclarativeDocumentBackend (docling/backend/abstract_backend.py:66)
Key method:

@abstractmethod
def convert(self) -> DoclingDocument:
    """Directly produce a DoclingDocument."""
    pass

Examples:

MsWordDocumentBackend: DOCX files
HTMLDocumentBackend: HTML files
MarkdownDocumentBackend: Markdown files
CsvDocumentBackend: CSV files
MsExcelDocumentBackend: Excel files
MsPowerpointDocumentBackend: PowerPoint files
JatsDocumentBackend: JATS XML scientific papers
XBRLDocumentBackend: XBRL financial reports

Declarative backends are used with SimplePipeline, which delegates conversion entirely to the backend.
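
The delegation can be pictured with a small sketch. Everything here is a stand-in written for illustration: `Document` plays the role of DoclingDocument, `CsvLikeBackend` is a hypothetical declarative backend, and `PassThroughPipeline` only approximates the idea of a pipeline that adds no processing of its own.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for DoclingDocument: a name plus a list of text items."""
    name: str
    texts: list[str] = field(default_factory=list)


class CsvLikeBackend:
    """Hypothetical declarative backend: convert() builds the document itself."""

    def __init__(self, name: str, raw: str) -> None:
        self.name = name
        self.raw = raw

    def convert(self) -> Document:
        doc = Document(name=self.name)
        # Keep every non-empty line as one text item.
        for line in self.raw.splitlines():
            if line.strip():
                doc.texts.append(line.strip())
        return doc


class PassThroughPipeline:
    """Stand-in for a pipeline that performs no processing of its own."""

    def run(self, backend: CsvLikeBackend) -> Document:
        # All the work happens in the backend; the pipeline only forwards it.
        return backend.convert()


doc = PassThroughPipeline().run(CsvLikeBackend("table.csv", "a,b\n1,2\n"))
```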

Paginated Backends

Purpose: Formats that require page-by-page processing with ML models.
Base class: PaginatedDocumentBackend (docling/backend/abstract_backend.py:54)
Key methods:

@abstractmethod
def page_count(self) -> int:
    """Total number of pages in the document."""
    pass

@abstractmethod
def load_page(self, page_no: int) -> PageBackend:
    """Load a specific page for processing."""
    pass

Examples:

DoclingParseDocumentBackend: PDF files (primary backend)
PdfDocumentBackend: Base class for PDF backends
ImageDocumentBackend: Image files treated as single-page documents

Paginated backends are used with StandardPdfPipeline or VlmPipeline, which apply ML models to each page.
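
A rough sketch of how a caller might drive the two key methods follows. `FormFeedBackend` and `TextPage` are hypothetical classes, the zero-based page index is an assumption, and the per-page callable stands in for whatever models a real pipeline would apply.

```python
from typing import Callable


class TextPage:
    """Hypothetical page backend holding the text of one page."""

    def __init__(self, page_no: int, text: str) -> None:
        self.page_no = page_no
        self._text = text

    def get_text(self) -> str:
        return self._text


class FormFeedBackend:
    """Hypothetical paginated backend: pages are separated by form feeds."""

    def __init__(self, raw: str) -> None:
        self._pages = raw.split("\f")

    def page_count(self) -> int:
        return len(self._pages)

    def load_page(self, page_no: int) -> TextPage:
        # Page numbers are assumed to be zero-based in this sketch.
        if not 0 <= page_no < self.page_count():
            raise IndexError(f"page {page_no} out of range")
        return TextPage(page_no, self._pages[page_no])


def process_pages(backend: FormFeedBackend, per_page: Callable[[TextPage], int]) -> list[int]:
    """Load each page in order and apply one processing step to it."""
    return [per_page(backend.load_page(i)) for i in range(backend.page_count())]


backend = FormFeedBackend("first page\fsecond page")
word_counts = process_pages(backend, lambda page: len(page.get_text().split()))
```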

PDF Backends

DoclingParseDocumentBackend

Implementation: docling/backend/docling_parse_backend.py:202
Purpose: High-performance PDF parsing using the docling-parse library.
Features:

Native PDF text extraction
Vector graphics parsing
Bitmap detection
Character-level positioning
Word and line segmentation

Architecture:

class DoclingParseDocumentBackend(PdfDocumentBackend):
    def __init__(self, in_doc, path_or_stream, options):
        # pypdfium2 for rendering
        self._pdoc = pdfium.PdfDocument(path_or_stream)
        
        # docling-parse for text extraction
        self.parser = DoclingPdfParser()
        self.dp_doc = self.parser.load(path_or_stream)
    
    def load_page(self, page_no: int) -> DoclingParsePageBackend:
        ppage = self._pdoc[page_no]  # pypdfium2 page
        return DoclingParsePageBackend(
            dp_doc=self.dp_doc,
            page_obj=ppage,
            page_no=page_no
        )

Page Backend (DoclingParsePageBackend):

class DoclingParsePageBackend(PdfPageBackend):
    def get_segmented_page(self) -> SegmentedPdfPage:
        """Get parsed page with text cells and structure."""
        config = DecodePageConfig(
            create_word_cells=True,
            create_line_cells=True,
            keep_bitmaps=True
        )
        return self._dp_doc.get_page(self._page_no + 1, config=config)
    
    def get_text_cells(self) -> Iterable[TextCell]:
        """Get all text line cells on the page."""
        return self._dpage.textline_cells
    
    def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
        """Get bounding boxes of all images on the page."""
        for img in self._dpage.bitmap_resources:
            yield img.rect.to_bounding_box()
    
    def get_page_image(self, scale: float = 1, cropbox: BoundingBox = None):
        """Render the page as an image."""
        return self._ppage.render(scale=scale * 1.5).to_pil()
    
    def get_text_in_rect(self, bbox: BoundingBox) -> str:
        """Extract text within a specific bounding box."""
        text = ""
        for cell in self._dpage.textline_cells:
            if cell.rect.intersects(bbox):
                text += cell.text + " "
        return text.strip()
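
The rectangle lookup in get_text_in_rect can be tried in isolation with plain dataclasses. `Rect` and `Cell` below are simplified stand-ins for BoundingBox and TextCell, and the top-left coordinate origin is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box with a top-left origin, so l < r and t < b."""
    l: float
    t: float
    r: float
    b: float

    def intersects(self, other: "Rect") -> bool:
        # Two boxes overlap only if they overlap on both axes.
        return (
            self.l < other.r
            and other.l < self.r
            and self.t < other.b
            and other.t < self.b
        )


@dataclass(frozen=True)
class Cell:
    """Stand-in for a text line cell: a box and the text inside it."""
    rect: Rect
    text: str


def text_in_rect(cells: list[Cell], bbox: Rect) -> str:
    """Join the text of every cell whose box overlaps bbox, in input order."""
    return " ".join(cell.text for cell in cells if cell.rect.intersects(bbox))


cells = [
    Cell(Rect(10, 10, 100, 20), "Title"),
    Cell(Rect(10, 30, 100, 40), "First line"),
    Cell(Rect(10, 200, 100, 210), "Footer"),
]
header_text = text_in_rect(cells, Rect(0, 0, 200, 50))
```

Note that a cell is included as soon as it overlaps the query box at all, so text that straddles the box edge is returned in full rather than clipped.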

Usage with options:

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.backend_options import PdfBackendOptions
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend

backend_options = PdfBackendOptions(
    password="secret"  # For encrypted PDFs
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=DoclingParseDocumentBackend,
            backend_options=backend_options
        )
    }
)

Alternative PDF Backends

Docling also provides:

DoclingParseV2Backend: Experimental v2 parser
DoclingParseV4Backend: Experimental v4 parser
LegacyStandardPdfBackend: Legacy implementation

These are primarily for testing and development.

Declarative Backend Examples

HTMLDocumentBackend

Implementation: docling/backend/html_backend.py
Features:

Parses HTML structure into DoclingDocument
Preserves hierarchy (headings, paragraphs, lists)
Extracts tables with structure
Handles embedded images

Usage:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("page.html")

MarkdownDocumentBackend

Implementation: docling/backend/md_backend.py
Features:

Parses Markdown syntax
Converts to semantic document structure
Supports tables, code blocks, images

Options:

from docling.document_converter import DocumentConverter, MarkdownFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.backend_options import MarkdownBackendOptions

backend_options = MarkdownBackendOptions(
    # Backend-specific options if available
)

converter = DocumentConverter(
    format_options={
        InputFormat.MD: MarkdownFormatOption(backend_options=backend_options)
    }
)

MsWordDocumentBackend

Implementation: docling/backend/msword_backend.py
Features:

Extracts text, tables, and images from DOCX
Preserves document structure and styles
Handles equations (via LaTeX/OMML conversion)

DoclingJSONBackend

Implementation: docling/backend/json/docling_json_backend.py
Purpose: Load documents that were previously exported in the Docling JSON format.
Usage:

from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

converter = DocumentConverter(
    allowed_formats=[InputFormat.JSON_DOCLING]
)
result = converter.convert("document.json")

Backend Options

Many backends accept options to customize behavior:

PdfBackendOptions

from docling.datamodel.backend_options import PdfBackendOptions
from pydantic import SecretStr

options = PdfBackendOptions(
    password=SecretStr("pdf_password")  # For encrypted PDFs
)

HTMLBackendOptions

from docling.datamodel.backend_options import HTMLBackendOptions

options = HTMLBackendOptions(
    # HTML-specific options
)

MarkdownBackendOptions

from docling.datamodel.backend_options import MarkdownBackendOptions

options = MarkdownBackendOptions(
    # Markdown-specific options
)

XBRLBackendOptions

from docling.datamodel.backend_options import XBRLBackendOptions

options = XBRLBackendOptions(
    # XBRL-specific options for financial documents
)

Backend Lifecycle

Initialization

Backends are instantiated by the InputDocument during format detection:

backend = BackendClass(
    in_doc=input_document,
    path_or_stream=file_stream,
    options=backend_options
)

Resource Management

Backends must implement proper cleanup:

def unload(self):
    """Release resources (close files, free memory)."""
    if isinstance(self.path_or_stream, BytesIO):
        self.path_or_stream.close()
    self.path_or_stream = None

The pipeline’s _unload() method ensures backends are cleaned up after conversion:

def _unload(self, conv_res: ConversionResult):
    # Unload page backends
    for page in conv_res.pages:
        if page._backend is not None:
            page._backend.unload()
    
    # Unload document backend
    if conv_res.input._backend:
        conv_res.input._backend.unload()
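
The same cleanup pattern can be exercised without Docling. This sketch assumes a hypothetical `StreamBackend` that owns an in-memory stream. `read_and_unload` is an illustrative helper, not part of any library.

```python
from io import BytesIO
from typing import Optional


class StreamBackend:
    """Hypothetical backend that owns an in-memory stream."""

    def __init__(self, stream: BytesIO) -> None:
        self.path_or_stream: Optional[BytesIO] = stream

    def is_valid(self) -> bool:
        return self.path_or_stream is not None

    def unload(self) -> None:
        # Safe to call twice: the second call finds None and does nothing.
        if isinstance(self.path_or_stream, BytesIO):
            self.path_or_stream.close()
        self.path_or_stream = None


def read_and_unload(backend: StreamBackend) -> bytes:
    """Use the backend, then release it even if reading raises."""
    try:
        return backend.path_or_stream.read()
    finally:
        backend.unload()


stream = BytesIO(b"payload")
backend = StreamBackend(stream)
data = read_and_unload(backend)
```

Putting unload() in a finally clause means the stream is closed on both the success path and the error path.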

Creating Custom Backends

You can implement custom backends for new formats:

Declarative Backend Example

from io import BytesIO
from pathlib import Path

from docling.backend.abstract_backend import DeclarativeDocumentBackend
from docling_core.types.doc import DoclingDocument
from docling.datamodel.base_models import InputFormat

class CustomFormatBackend(DeclarativeDocumentBackend):
    def convert(self) -> DoclingDocument:
        # path_or_stream is either an in-memory stream or a filesystem path,
        # so read the raw bytes accordingly
        if isinstance(self.path_or_stream, BytesIO):
            content = self.path_or_stream.getvalue()
        else:
            content = Path(self.path_or_stream).read_bytes()

        # Create DoclingDocument
        doc = DoclingDocument(name=self.file.name)

        # Add content
        # ... your parsing logic, working from `content` ...

        return doc

    def is_valid(self) -> bool:
        return self.path_or_stream is not None

    @classmethod
    def supports_pagination(cls) -> bool:
        return False

    @classmethod
    def supported_formats(cls) -> set[InputFormat]:
        # CUSTOM is a placeholder: use the InputFormat member for your format
        return {InputFormat.CUSTOM}

Paginated Backend Example

from docling.backend.abstract_backend import PaginatedDocumentBackend
from docling.datamodel.base_models import InputFormat

class CustomPagedBackend(PaginatedDocumentBackend):
    def page_count(self) -> int:
        # Return total pages (_page_count is a placeholder you set when loading)
        return self._page_count

    def load_page(self, page_no: int):
        # Return page-specific backend (CustomPageBackend is a class you define)
        return CustomPageBackend(self, page_no)

    def is_valid(self) -> bool:
        return self.page_count() > 0

    @classmethod
    def supports_pagination(cls) -> bool:
        return True

    @classmethod
    def supported_formats(cls) -> set[InputFormat]:
        # CUSTOM is a placeholder: use the InputFormat member for your format
        return {InputFormat.CUSTOM}

Backend Selection

The DocumentConverter selects backends based on the format-to-options mapping:

# Default mapping (in document_converter.py)
format_to_default_options = {
    InputFormat.PDF: PdfFormatOption(
        backend=DoclingParseDocumentBackend,
        pipeline_cls=StandardPdfPipeline
    ),
    InputFormat.DOCX: WordFormatOption(
        backend=MsWordDocumentBackend,
        pipeline_cls=SimplePipeline
    ),
    # ... etc
}

You can override defaults:

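from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, FormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: FormatOption(
            backend=DoclingParseDocumentBackend,  # Custom backend
            pipeline_cls=VlmPipeline,             # Custom pipeline
            pipeline_options=VlmPipelineOptions(...)  # Fill in your options
        )
    }
)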
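
The idea of defaults plus per-format overrides can be sketched with a plain dictionary. This is an illustration only: `Option`, `DEFAULTS`, and `resolve_options` are hypothetical names, formats are plain strings, and the merge rule shown (an override replaces the whole entry for its format) is an assumption rather than a description of how DocumentConverter is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Option:
    """Stand-in for a format option: the backend and pipeline names to use."""
    backend: str
    pipeline: str


# Defaults keyed by format name, loosely mirroring the mapping shown above.
DEFAULTS = {
    "pdf": Option("DoclingParseDocumentBackend", "StandardPdfPipeline"),
    "docx": Option("MsWordDocumentBackend", "SimplePipeline"),
}


def resolve_options(overrides: Optional[dict[str, Option]] = None) -> dict[str, Option]:
    """Return a copy of the defaults with per-format overrides applied on top."""
    merged = dict(DEFAULTS)
    merged.update(overrides or {})
    return merged


options = resolve_options(
    {"pdf": Option("DoclingParseDocumentBackend", "VlmPipeline")}
)
```

Formats that are not overridden keep their default entry, and the defaults themselves are never mutated.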

Backend vs Pipeline Responsibilities

Backends handle:

Format-specific parsing
Raw text extraction
Image access
Structure information (when available)

Pipelines handle:

ML model application (OCR, layout, tables)
Document assembly
Hierarchy construction
Enrichment

This separation allows the same backend (e.g., DoclingParseDocumentBackend) to be used with different pipelines (StandardPdfPipeline vs VlmPipeline) for different processing strategies.

Performance Considerations

Lazy Loading

Paginated backends should load pages on demand:

# Good: Load page when needed
page_backend = doc_backend.load_page(page_no)

# Bad: Load all pages upfront
all_pages = [doc_backend.load_page(i) for i in range(doc_backend.page_count())]
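
One way to keep loading on demand is to hand callers a generator. The sketch below uses a hypothetical `CountingBackend` that records how many pages it has loaded, which makes the difference observable. `iter_pages` is an illustrative helper.

```python
from typing import Iterator


class CountingBackend:
    """Hypothetical paginated backend that counts load_page calls."""

    def __init__(self, pages: list[str]) -> None:
        self._pages = pages
        self.loads = 0

    def page_count(self) -> int:
        return len(self._pages)

    def load_page(self, page_no: int) -> str:
        self.loads += 1
        return self._pages[page_no]


def iter_pages(backend: CountingBackend) -> Iterator[str]:
    """Yield pages one at a time; nothing is loaded until the caller asks."""
    for page_no in range(backend.page_count()):
        yield backend.load_page(page_no)


backend = CountingBackend(["cover", "contents", "chapter 1"])
pages = iter_pages(backend)
first = next(pages)
loads_after_first = backend.loads  # only one page has been loaded so far
```

A caller that stops early, for example after finding the page it needs, never pays for the remaining pages.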

Resource Cleanup

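Backends hold open files and parsed document data until they are unloaded. When you convert through DocumentConverter, the pipeline unloads them for you, so no explicit cleanup is needed:

# The pipeline's _unload() releases page and document backends after conversion
result = converter.convert("document.pdf")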

Thread Safety

PDF backends use locks for thread-safe access to pypdfium2:

from docling.utils.locks import pypdfium2_lock

with pypdfium2_lock:
    page = self._pdoc[page_no]
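
The locking pattern can be demonstrated with the standard library alone. In this sketch `render_lock` plays the role that pypdfium2_lock plays above, and `FakeRenderer` is a hypothetical resource that records whether two calls ever overlap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One module-level lock shared by every caller of the protected resource.
render_lock = threading.Lock()


class FakeRenderer:
    """Hypothetical non-thread-safe resource that tracks overlapping calls."""

    def __init__(self) -> None:
        self.active = 0
        self.max_active = 0
        self.calls = 0

    def render(self, page_no: int) -> int:
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        self.calls += 1
        self.active -= 1
        return page_no


def render_page(renderer: FakeRenderer, page_no: int) -> int:
    # Only one thread at a time may touch the renderer.
    with render_lock:
        return renderer.render(page_no)


renderer = FakeRenderer()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda n: render_page(renderer, n), range(20)))
```

A single shared lock serializes every call, which is simple and safe but means rendering does not run in parallel across threads.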

Related Topics

Pipelines: Learn how pipelines use backends for processing
Architecture: Understand how backends fit in Docling’s architecture
docling-parse: Explore the high-performance PDF parsing library
Usage Guide: See backend configuration examples
