Format-specific document parsing backends in Docling
Backends are format-specific parsers that extract raw content from input documents. Each backend knows how to read a particular document format and provide content to pipelines for processing.
Purpose: Formats that can be directly converted to DoclingDocument without complex processing.
Base class: DeclarativeDocumentBackend (docling/backend/abstract_backend.py:66)
Key method:

    @abstractmethod
    def convert(self) -> DoclingDocument:
        """Directly produce a DoclingDocument."""
        pass

Examples:
MsWordDocumentBackend: DOCX files
HTMLDocumentBackend: HTML files
MarkdownDocumentBackend: Markdown files
CsvDocumentBackend: CSV files
MsExcelDocumentBackend: Excel files
MsPowerpointDocumentBackend: PowerPoint files
JatsDocumentBackend: JATS XML scientific papers
XBRLDocumentBackend: XBRL financial reports
Declarative backends are used with SimplePipeline, which delegates conversion entirely to the backend.
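The delegation described above can be illustrated without Docling installed. The following is a minimal sketch, not Docling's implementation: ToyDocument, ToyDeclarativeBackend, ToyCsvBackend, and simple_pipeline are hypothetical stand-ins for DoclingDocument, DeclarativeDocumentBackend, a concrete backend, and SimplePipeline respectively.

```python
import csv
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from io import StringIO


@dataclass
class ToyDocument:
    """Hypothetical stand-in for DoclingDocument."""
    name: str
    tables: list = field(default_factory=list)


class ToyDeclarativeBackend(ABC):
    """Hypothetical stand-in for DeclarativeDocumentBackend."""

    def __init__(self, name: str, raw: str) -> None:
        self.name = name
        self.raw = raw

    @abstractmethod
    def convert(self) -> ToyDocument:
        """Directly produce a document from the raw input."""


class ToyCsvBackend(ToyDeclarativeBackend):
    """Parses CSV text into a single table of string cells."""

    def convert(self) -> ToyDocument:
        rows = list(csv.reader(StringIO(self.raw)))
        return ToyDocument(name=self.name, tables=[rows])


def simple_pipeline(backend: ToyDeclarativeBackend) -> ToyDocument:
    # No per-page stages: the pipeline hands the whole job to convert().
    return backend.convert()


doc = simple_pipeline(ToyCsvBackend("prices.csv", "item,price\napple,3\n"))
print(doc.tables[0])
```

In this sketch the pipeline function contains no format knowledge at all; supporting a new format would mean adding another subclass with its own convert().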
Purpose: Formats that require page-by-page processing with ML models.
Base class: PaginatedDocumentBackend (docling/backend/abstract_backend.py:54)
Key methods:

    @abstractmethod
    def page_count(self) -> int:
        """Total number of pages in the document."""
        pass

    @abstractmethod
    def load_page(self, page_no: int) -> PageBackend:
        """Load a specific page for processing."""
        pass

Examples:
DoclingParseDocumentBackend: PDF files (primary backend)
PdfDocumentBackend: Base class for PDF backends
ImageDocumentBackend: Image files treated as single-page documents
Paginated backends are used with StandardPdfPipeline or VlmPipeline for advanced processing.
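The page_count() / load_page() contract can be sketched as follows. This is an illustration under stated assumptions rather than Docling code: ToyPageBackend, ToyPaginatedBackend, ToyTextBackend, and run_pipeline are hypothetical names, page numbers are assumed to be zero-based, and a form-feed character is used as an artificial page break.

```python
from abc import ABC, abstractmethod


class ToyPageBackend:
    """Hypothetical page handle; stands in for a PageBackend."""

    def __init__(self, page_no: int, text: str) -> None:
        self.page_no = page_no
        self.text = text


class ToyPaginatedBackend(ABC):
    """Hypothetical stand-in for PaginatedDocumentBackend."""

    @abstractmethod
    def page_count(self) -> int:
        """Total number of pages in the document."""

    @abstractmethod
    def load_page(self, page_no: int) -> ToyPageBackend:
        """Load a specific page for processing."""


class ToyTextBackend(ToyPaginatedBackend):
    """Treats form feeds as page breaks (an assumption for this demo)."""

    def __init__(self, raw: str) -> None:
        self._pages = raw.split("\f")

    def page_count(self) -> int:
        return len(self._pages)

    def load_page(self, page_no: int) -> ToyPageBackend:
        if not 0 <= page_no < self.page_count():
            raise IndexError(f"page {page_no} out of range")
        return ToyPageBackend(page_no, self._pages[page_no])


def run_pipeline(backend: ToyPaginatedBackend, page_model):
    # The pipeline drives the loop; the backend only serves pages on request.
    results = []
    for page_no in range(backend.page_count()):
        page = backend.load_page(page_no)
        results.append(page_model(page))
    return results


word_counts = run_pipeline(
    ToyTextBackend("one two\fthree"),
    lambda page: len(page.text.split()),
)
print(word_counts)
```

Here page_model plays the role of whatever per-page processing a pipeline applies; the backend itself stays unaware of it.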

    class DoclingParsePageBackend(PdfPageBackend):
        def get_segmented_page(self) -> SegmentedPdfPage:
            """Get parsed page with text cells and structure."""
            config = DecodePageConfig(
                create_word_cells=True,
                create_line_cells=True,
                keep_bitmaps=True
            )
            return self._dp_doc.get_page(self._page_no + 1, config=config)

        def get_text_cells(self) -> Iterable[TextCell]:
            """Get all text line cells on the page."""
            return self._dpage.textline_cells

        def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
            """Get bounding boxes of all images on the page."""
            for img in self._dpage.bitmap_resources:
                yield img.rect.to_bounding_box()

        def get_page_image(self, scale: float = 1, cropbox: BoundingBox = None):
            """Render the page as an image."""
            return self._ppage.render(scale=scale * 1.5).to_pil()

        def get_text_in_rect(self, bbox: BoundingBox) -> str:
            """Extract text within a specific bounding box."""
            text = ""
            for cell in self._dpage.textline_cells:
                if cell.rect.intersects(bbox):
                    text += cell.text + " "
            return text.strip()

The pipeline’s _unload() method ensures backends are cleaned up after conversion:

    def _unload(self, conv_res: ConversionResult):
        # Unload page backends
        for page in conv_res.pages:
            if page._backend is not None:
                page._backend.unload()

        # Unload document backend
        if conv_res.input._backend:
            conv_res.input._backend.unload()

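To show why the cleanup order matters, here is a small self-contained sketch of the same idea: page backends are released first, then the document backend. ToyBackend, ToyPage, unload_all, and convert_with_cleanup are hypothetical names, and wrapping the work in try/finally is an assumption made for this illustration, not a statement about how Docling's pipeline invokes _unload().

```python
class ToyBackend:
    """Hypothetical resource holder with an unload() method."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.loaded = True

    def unload(self) -> None:
        self.loaded = False


class ToyPage:
    """Hypothetical page object that owns a page backend."""

    def __init__(self, backend) -> None:
        self._backend = backend


def unload_all(pages, doc_backend) -> None:
    # Release page backends first, then the document backend they came from.
    for page in pages:
        if page._backend is not None:
            page._backend.unload()
    if doc_backend is not None:
        doc_backend.unload()


def convert_with_cleanup(pages, doc_backend, process):
    # Assumption: cleanup should run even when a processing step raises.
    try:
        return [process(page) for page in pages]
    finally:
        unload_all(pages, doc_backend)


doc_backend = ToyBackend("doc")
page_backends = [ToyBackend("p0"), ToyBackend("p1")]
pages = [ToyPage(backend) for backend in page_backends]


def failing_step(page):
    raise ValueError("model failed")


try:
    convert_with_cleanup(pages, doc_backend, failing_step)
except ValueError:
    pass

still_loaded = [b.name for b in page_backends + [doc_backend] if b.loaded]
print(still_loaded)
```

Even though the processing step fails on the first page, no backend in this sketch is left holding its resources.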
This separation allows the same backend (e.g., DoclingParseDocumentBackend) to be used with different pipelines (StandardPdfPipeline vs VlmPipeline) for different processing strategies.

    # Good: Load page when needed
    page_backend = doc_backend.load_page(page_no)

    # Bad: Load all pages upfront
    all_pages = [doc_backend.load_page(i) for i in range(page_count)]

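The difference between the two approaches can be made measurable with a toy backend that counts how many pages are loaded at once. This is a sketch only: CountingBackend, CountingPage, and iter_pages are hypothetical helpers, not part of Docling.

```python
class CountingPage:
    """Hypothetical page handle that reports back when it is unloaded."""

    def __init__(self, owner, page_no: int) -> None:
        self._owner = owner
        self.page_no = page_no

    def unload(self) -> None:
        self._owner.live -= 1


class CountingBackend:
    """Hypothetical backend that tracks the peak number of live pages."""

    def __init__(self, n_pages: int) -> None:
        self._n_pages = n_pages
        self.live = 0
        self.peak = 0

    def page_count(self) -> int:
        return self._n_pages

    def load_page(self, page_no: int) -> CountingPage:
        self.live += 1
        self.peak = max(self.peak, self.live)
        return CountingPage(self, page_no)


def iter_pages(backend):
    """Yield one page at a time and unload it before loading the next."""
    for page_no in range(backend.page_count()):
        page = backend.load_page(page_no)
        try:
            yield page
        finally:
            page.unload()


# Lazy: at most one page is alive at any moment.
lazy = CountingBackend(5)
for page in iter_pages(lazy):
    pass

# Eager: every page is alive before any processing starts.
eager = CountingBackend(5)
all_pages = [eager.load_page(i) for i in range(eager.page_count())]
for page in all_pages:
    page.unload()

print(lazy.peak, eager.peak)
```

With real page backends, each live page may hold parsed cells and rendered bitmaps, so the peak count is a rough proxy for peak memory use.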