Overview
Docling’s architecture is built on a modular design that separates concerns between document parsing, processing pipelines, and output generation. The system is designed to handle multiple document formats through a unified interface while maintaining flexibility and extensibility.
Core Components
The architecture consists of four main components that work together to convert documents:
1. Document Converter
The DocumentConverter class (docling/document_converter.py:198) is the main entry point for all document conversions. It:
- Manages format-specific configurations through FormatOption mappings
- Routes documents to appropriate backends and pipelines based on input format
- Caches initialized pipelines for performance (keyed by pipeline class and options hash)
- Handles both single-document (convert()) and batch (convert_all()) conversion
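The routing behaviour listed above can be illustrated with a minimal, self-contained sketch. It does not use Docling's real classes: MiniConverter, MiniFormatOption, simple_pipeline, and the extension-based detect_format helper are hypothetical stand-ins, assumed here only to show how a format-to-option mapping can drive both single and batch conversion.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator


@dataclass(frozen=True)
class MiniFormatOption:
    """Hypothetical stand-in for a format option: names a backend and a pipeline."""
    backend: str
    pipeline: Callable[[str, str], str]


def simple_pipeline(source: str, backend: str) -> str:
    # Placeholder pipeline: a real one would parse the file and build a document.
    return f"{source} via {backend}/simple"


def detect_format(source: str) -> str:
    # Assumption for illustration: the format is inferred from the file extension only.
    return Path(source).suffix.lstrip(".").lower()


class MiniConverter:
    """Hypothetical converter that routes each source through its format option."""

    def __init__(self, format_options: Dict[str, MiniFormatOption]):
        self.format_options = format_options

    def convert(self, source: str) -> str:
        fmt = detect_format(source)
        if fmt not in self.format_options:
            raise ValueError(f"unsupported format: {fmt!r}")
        option = self.format_options[fmt]
        return option.pipeline(source, option.backend)

    def convert_all(self, sources: Iterable[str]) -> Iterator[str]:
        # Batch conversion reuses the single-document path and yields lazily.
        for source in sources:
            yield self.convert(source)


converter = MiniConverter({
    "docx": MiniFormatOption("word-backend", simple_pipeline),
    "md": MiniFormatOption("markdown-backend", simple_pipeline),
})
results = list(converter.convert_all(["report.docx", "notes.md"]))
```

Keeping the mapping as plain data means that supporting a new format in this sketch only requires adding one entry, without touching the conversion logic.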
2. Backends
Backends are format-specific parsers that extract raw content from input documents. See Backend Architecture for details. Key backend types:
- Declarative backends: Directly produce a DoclingDocument (DOCX, HTML, Markdown, etc.)
- Paginated backends: Provide page-level access for pipeline processing (PDF, images)
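The difference between the two backend types can be sketched with two small interfaces. This is an illustration under stated assumptions, not Docling's actual class hierarchy: DeclarativeBackendSketch, PaginatedBackendSketch, and both concrete classes are hypothetical, and the document is modelled as a plain dict.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DeclarativeBackendSketch(ABC):
    """Hypothetical interface: returns the whole structured document in one call."""

    @abstractmethod
    def convert(self) -> Dict: ...


class PaginatedBackendSketch(ABC):
    """Hypothetical interface: exposes pages so a pipeline can process them one by one."""

    @abstractmethod
    def page_count(self) -> int: ...

    @abstractmethod
    def load_page(self, page_no: int) -> str: ...


class MarkdownSketch(DeclarativeBackendSketch):
    def __init__(self, text: str):
        self.text = text

    def convert(self) -> Dict:
        # The markup already carries structure, so items can be emitted directly.
        lines = [ln for ln in self.text.splitlines() if ln.strip()]
        return {"items": [
            {"type": "heading" if ln.startswith("#") else "paragraph",
             "text": ln.lstrip("# ").strip()}
            for ln in lines
        ]}


class PagedTextSketch(PaginatedBackendSketch):
    def __init__(self, pages: List[str]):
        self.pages = pages

    def page_count(self) -> int:
        return len(self.pages)

    def load_page(self, page_no: int) -> str:
        # Returns raw page content only; structure is left to later pipeline stages.
        return self.pages[page_no]


doc = MarkdownSketch("# Title\n\nBody text").convert()
paged = PagedTextSketch(["page one", "page two"])
raw_pages = [paged.load_page(i) for i in range(paged.page_count())]
```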
3. Pipelines
Pipelines orchestrate the conversion process, applying ML models and transformations to create structured output. See Pipeline Concepts for details. Main pipeline types:
- SimplePipeline: For declarative backends that output documents directly
- StandardPdfPipeline: Multi-threaded PDF processing with OCR, layout analysis, and table extraction
- VlmPipeline: Vision-language model based conversion
- AsrPipeline: Audio transcription and conversion
4. DoclingDocument
The unified document representation format that all conversions produce. See DoclingDocument for details.
Conversion Flow
The typical document conversion follows this flow:
1. Format Detection
The DocumentConverter identifies the input format and retrieves the corresponding FormatOption configuration.
2. Backend Initialization
A format-specific backend is instantiated to parse the input document (e.g., DoclingParseDocumentBackend for PDFs).
3. Pipeline Execution
The pipeline orchestrates the conversion:
- Build: Extract content using the backend
- Assemble: Structure content into a DoclingDocument
- Enrich: Apply ML models for enhancement (optional)
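The build, assemble, and enrich phases can be sketched as three composed functions. This is a simplified illustration, not Docling's pipeline code: the function names, the dict-based document, and the add_word_counts enricher are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional


def build(raw_text: str) -> List[str]:
    # Build: pull raw content units out of the input (here, non-empty lines).
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def assemble(units: List[str]) -> Dict:
    # Assemble: arrange the extracted units into one structured document.
    return {"texts": [{"text": unit} for unit in units]}


def enrich(document: Dict, enrichers: List[Callable[[Dict], None]]) -> Dict:
    # Enrich: optional passes that annotate the assembled document in place.
    for enricher in enrichers:
        enricher(document)
    return document


def add_word_counts(document: Dict) -> None:
    # Hypothetical enricher standing in for an ML model.
    for item in document["texts"]:
        item["words"] = len(item["text"].split())


def run_pipeline(raw_text: str,
                 enrichers: Optional[List[Callable[[Dict], None]]] = None) -> Dict:
    return enrich(assemble(build(raw_text)), enrichers or [])


plain = run_pipeline("Hello world\n\nSecond line here")
enriched = run_pipeline("Hello world\n\nSecond line here", [add_word_counts])
```

Because enrichment runs after assembly and takes a list, omitting the enrichers leaves the assembled document unchanged, which matches the optional nature of that phase.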
Pipeline Stages
For complex formats like PDF, pipelines execute multiple stages:
- Page Initialization: Load page-level backends and extract raw content
- Preprocessing: Scale images, prepare for model input
- OCR: Text recognition for scanned or image-based content
- Layout Analysis: Detect document structure (headings, paragraphs, tables, figures)
- Table Structure: Parse table cells and relationships
- Assembly: Combine page elements into a unified document
- Enrichment: Apply optional models (picture classification, chart extraction, etc.)
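A subset of the stages listed above can be sketched as functions applied to each page, followed by an assembly step. The sketch parallelizes per page with a thread pool, which is a simplifying assumption and not a reproduction of Docling's threading model; every function name and the placeholder logic inside each stage are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Page = Dict[str, object]


def preprocess(page: Page) -> Page:
    # Stand-in for image scaling and model-input preparation.
    page["scaled"] = True
    return page


def ocr(page: Page) -> Page:
    # Stand-in for text recognition: only fills in text the page lacks.
    page.setdefault("text", "<recognized text>")
    return page


def layout(page: Page) -> Page:
    # Stand-in for layout analysis: a toy rule instead of an ML model.
    text = str(page["text"])
    page["label"] = "heading" if text.isupper() else "paragraph"
    return page


PAGE_STAGES = [preprocess, ocr, layout]


def process_page(page: Page) -> Page:
    for stage in PAGE_STAGES:
        page = stage(page)
    return page


def process_document(pages: List[Page], max_workers: int = 4) -> Dict:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so page order is preserved.
        processed = list(pool.map(process_page, pages))
    # Assembly: combine per-page results into one document.
    return {"items": [{"label": p["label"], "text": p["text"]} for p in processed]}


document = process_document([
    {"text": "INTRODUCTION"},
    {"text": "Some body text."},
    {},  # a page with no embedded text, so the OCR stand-in supplies it
])
```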
The StandardPdfPipeline executes these stages in parallel across multiple threads for optimal performance. Pages are processed in batches with configurable concurrency.
Format-to-Pipeline Mapping
Each input format has a default backend and pipeline configuration:
| Format | Backend | Pipeline | Purpose |
|---|---|---|---|
| PDF | DoclingParseDocumentBackend | StandardPdfPipeline | Full PDF processing with OCR and layout |
| IMAGE | ImageDocumentBackend | StandardPdfPipeline | Image-based document conversion |
| DOCX | MsWordDocumentBackend | SimplePipeline | Microsoft Word documents |
| HTML | HTMLDocumentBackend | SimplePipeline | Web content |
| MD | MarkdownDocumentBackend | SimplePipeline | Markdown files |
| AUDIO | NoOpBackend | AsrPipeline | Speech-to-text transcription |
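The defaults in the table can be treated as a lookup that caller-supplied options may override. The sketch below transcribes the table as strings rather than importing real classes; Route, DEFAULT_ROUTES, resolve_route, and the override mechanism are assumptions made for illustration, not Docling's configuration API.

```python
from typing import Dict, NamedTuple, Optional


class Route(NamedTuple):
    backend: str
    pipeline: str


# Defaults transcribed from the table above (class names as strings, not real classes).
DEFAULT_ROUTES: Dict[str, Route] = {
    "PDF": Route("DoclingParseDocumentBackend", "StandardPdfPipeline"),
    "IMAGE": Route("ImageDocumentBackend", "StandardPdfPipeline"),
    "DOCX": Route("MsWordDocumentBackend", "SimplePipeline"),
    "HTML": Route("HTMLDocumentBackend", "SimplePipeline"),
    "MD": Route("MarkdownDocumentBackend", "SimplePipeline"),
    "AUDIO": Route("NoOpBackend", "AsrPipeline"),
}


def resolve_route(fmt: str, overrides: Optional[Dict[str, Route]] = None) -> Route:
    # Caller overrides win over defaults; unknown formats are rejected explicitly.
    routes = {**DEFAULT_ROUTES, **(overrides or {})}
    try:
        return routes[fmt.upper()]
    except KeyError:
        raise ValueError(f"no route configured for format {fmt!r}") from None


pdf_default = resolve_route("pdf")
pdf_custom = resolve_route(
    "pdf", {"PDF": Route("DoclingParseDocumentBackend", "VlmPipeline")}
)
```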
Extensibility
Docling’s architecture supports extension through:
- Custom backends: Implement AbstractDocumentBackend for new formats
- Custom pipelines: Subclass BasePipeline for specialized processing
- Plugin models: Add ML models via the plugin system
- Custom serializers: Define new export formats
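As one example of these extension points, a custom serializer can be pictured as an object that turns the unified document into a new output format. The sketch below is generic and does not use Docling's serializer interfaces: SerializerSketch, both serializer classes, the export helper, and the dict-based document are hypothetical.

```python
from typing import Dict, Protocol


class SerializerSketch(Protocol):
    """Hypothetical contract: anything with serialize(document) -> str."""

    def serialize(self, document: Dict) -> str: ...


class MarkdownSerializerSketch:
    def serialize(self, document: Dict) -> str:
        parts = []
        for item in document["items"]:
            prefix = "# " if item["type"] == "heading" else ""
            parts.append(prefix + item["text"])
        return "\n\n".join(parts)


class OutlineSerializerSketch:
    """A 'new export format': headings only, one per line."""

    def serialize(self, document: Dict) -> str:
        return "\n".join(
            item["text"] for item in document["items"] if item["type"] == "heading"
        )


def export(document: Dict, serializer: SerializerSketch) -> str:
    # The document stays format-neutral; the serializer decides the output shape.
    return serializer.serialize(document)


sample = {"items": [
    {"type": "heading", "text": "Intro"},
    {"type": "paragraph", "text": "Some text."},
    {"type": "heading", "text": "Methods"},
]}
markdown_out = export(sample, MarkdownSerializerSketch())
outline_out = export(sample, OutlineSerializerSketch())
```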
Performance Considerations
Pipeline Caching
The DocumentConverter caches initialized pipelines using a composite key of (pipeline_class, options_hash). This means:
- Pipelines with identical configurations are reused across documents
- Heavy ML models are loaded once per pipeline instance
- Thread-safe access ensures concurrent conversions can share pipelines
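The caching scheme described above can be sketched as a dictionary keyed by (pipeline_class, options_hash) and guarded by a lock. This is an illustration under stated assumptions, not Docling's implementation: PipelineCache, SketchPipeline, and the JSON-based options_hash function are hypothetical, and options are assumed to be a JSON-serializable dict.

```python
import hashlib
import json
import threading
from typing import Any, Dict, Tuple


class SketchPipeline:
    """Hypothetical pipeline whose constructor stands in for costly model loading."""

    instances_created = 0

    def __init__(self, options: Dict[str, Any]):
        SketchPipeline.instances_created += 1
        self.options = options


def options_hash(options: Dict[str, Any]) -> str:
    # Assumption: options are JSON-serializable. Sorting keys makes equal
    # option sets hash identically regardless of insertion order.
    payload = json.dumps(options, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class PipelineCache:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pipelines: Dict[Tuple[type, str], Any] = {}

    def get(self, pipeline_class: type, options: Dict[str, Any]) -> Any:
        key = (pipeline_class, options_hash(options))
        # The lock serializes check-and-create so each key is built only once,
        # even when several threads request the same pipeline concurrently.
        with self._lock:
            if key not in self._pipelines:
                self._pipelines[key] = pipeline_class(options)
            return self._pipelines[key]


cache = PipelineCache()
a = cache.get(SketchPipeline, {"do_ocr": True, "do_table_structure": True})
b = cache.get(SketchPipeline, {"do_table_structure": True, "do_ocr": True})
c = cache.get(SketchPipeline, {"do_ocr": False, "do_table_structure": True})
```

In this sketch a and b are the same object because their options are equal, while c is a separate instance because one option differs.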
Batch Processing
When converting multiple documents, batch processing improves throughput through:
- Document-level parallelism (configurable via settings.perf.doc_batch_concurrency)
- Efficient resource utilization
- Pipeline sharing across documents
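Document-level parallelism can be sketched with a thread pool whose size plays the role of the concurrency setting. The sketch is generic and makes several assumptions: convert_one is a placeholder for a real conversion, batch_size is a hypothetical parameter, and the doc_batch_concurrency argument only mirrors the name of the setting mentioned above without reading Docling's settings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List


def convert_one(source: str) -> str:
    # Placeholder for a real single-document conversion.
    return source.upper()


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def convert_batch(sources: Iterable[str],
                  batch_size: int = 2,
                  doc_batch_concurrency: int = 2) -> Iterator[str]:
    # One pool is shared across all batches, so workers are reused
    # rather than recreated for every batch.
    with ThreadPoolExecutor(max_workers=doc_batch_concurrency) as pool:
        for batch in chunked(list(sources), batch_size):
            # Executor.map yields results in input order within each batch.
            yield from pool.map(convert_one, batch)


converted = list(convert_batch(["a.pdf", "b.docx", "c.md"]))
```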
References
- Docling Technical Report: Deep dive into Docling’s architecture and design decisions
- Backend Architecture: Learn about format-specific document backends
- Pipeline Concepts: Understand pipeline orchestration and processing
- DoclingDocument: Explore the unified document representation