DocumentConverter

Overview

DocumentConverter is the main entry point for converting documents in Docling. It handles various input formats (PDF, DOCX, PPTX, images, HTML, Markdown, etc.) and provides both single-document and batch conversion capabilities. The conversion methods return a ConversionResult instance for each document, which wraps a DoclingDocument object if the conversion was successful, along with metadata about the conversion process.

Class Definition

from docling.document_converter import DocumentConverter

Constructor

`__init__()`

Initialize the converter based on format preferences.

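from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=PdfPipelineOptions(do_ocr=True)
        )
    }
)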

allowed_formats

list[InputFormat]

List of allowed input formats. By default, any format supported by Docling is allowed.

format_options

dict[InputFormat, FormatOption]

Dictionary of format-specific options. Each format can have custom pipeline and backend configurations.

Attributes

allowed_formats

list[InputFormat]

Allowed input formats for conversion.

format_to_options

dict[InputFormat, FormatOption]

Mapping of formats to their configuration options.

initialized_pipelines

dict[tuple[Type[BasePipeline], str], BasePipeline]

Cache of initialized pipelines keyed by (pipeline class, options hash).
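
The caching idea behind initialized_pipelines can be sketched in plain Python. This is an illustration only, not Docling's implementation: DemoOptions, DemoPipeline, PipelineCache, and options_hash are hypothetical names, and hashing a sorted JSON dump with SHA-256 is an assumed scheme chosen for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DemoOptions:
    """Hypothetical stand-in for a pipeline options object."""
    do_ocr: bool = True
    do_table_structure: bool = True


class DemoPipeline:
    """Hypothetical stand-in for a pipeline that is assumed to be costly to build."""

    def __init__(self, options: DemoOptions) -> None:
        self.options = options


def options_hash(options: DemoOptions) -> str:
    """Hash the option values deterministically (assumed scheme: SHA-256 of sorted JSON)."""
    payload = json.dumps(asdict(options), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class PipelineCache:
    """Cache keyed by (pipeline class, options hash), as the attribute description states."""

    def __init__(self) -> None:
        self.initialized = {}

    def get(self, pipeline_cls: type, options: DemoOptions):
        key = (pipeline_cls, options_hash(options))
        if key not in self.initialized:
            # Build the pipeline only on the first request for this key.
            self.initialized[key] = pipeline_cls(options)
        return self.initialized[key]


cache = PipelineCache()
first = cache.get(DemoPipeline, DemoOptions(do_ocr=True))
second = cache.get(DemoPipeline, DemoOptions(do_ocr=True))   # equal options: instance reused
third = cache.get(DemoPipeline, DemoOptions(do_ocr=False))   # different options: new instance
```

Keying on the options hash as well as the class means two formats that share a pipeline class and identical options can share one initialized instance, while any change in options yields a separate entry.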

Methods

`convert()`

Convert one document fetched from a file path, URL, or DocumentStream.

result = converter.convert("path/to/document.pdf")
print(result.document.export_to_markdown())

source

Union[Path, str, DocumentStream]

required

Source of input document given as file path, URL, or DocumentStream.

headers

dict[str, str]

Optional headers given as a dictionary of string key-value pairs, in case of URL input source.

raises_on_error

bool

default:"True"

Whether to raise an error on the first conversion failure. If False, errors are captured in the ConversionResult objects.

max_num_pages

int

default:"sys.maxsize"

Maximum number of pages accepted per document. Documents exceeding this number will not be converted.

max_file_size

int

default:"sys.maxsize"

Maximum file size to convert (in bytes).

page_range

PageRange

Range of pages to convert.

return

ConversionResult

The conversion result, which contains a DoclingDocument in the document attribute, and metadata about the conversion process.

Raises:

ConversionError: An error occurred during conversion.
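
The max_num_pages and max_file_size parameters above act as admission limits. The rule can be sketched as a small predicate; within_limits is a hypothetical helper written for this illustration and is not part of Docling, and treating a value equal to the limit as acceptable is an assumption read from the word "exceeding".

```python
import sys


def within_limits(
    num_pages: int,
    file_size: int,
    max_num_pages: int = sys.maxsize,
    max_file_size: int = sys.maxsize,
) -> bool:
    """Return True if a document stays within both limits (limits assumed inclusive)."""
    return num_pages <= max_num_pages and file_size <= max_file_size


# With the sys.maxsize defaults, the limits are effectively disabled.
unlimited = within_limits(num_pages=5000, file_size=2_000_000_000)

# A 120-page document is over a 100-page limit and would not be converted.
too_long = within_limits(num_pages=120, file_size=1_000_000, max_num_pages=100)
```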

`convert_all()`

Convert multiple documents from file paths, URLs, or DocumentStreams.

sources = ["doc1.pdf", "doc2.docx", "https://example.com/doc3.pdf"]
for result in converter.convert_all(sources):
    print(f"{result.input.file}: {result.status}")
    if result.document:
        print(result.document.export_to_markdown())

source

Iterable[Union[Path, str, DocumentStream]]

required

Source of input documents given as an iterable of file paths, URLs, or DocumentStreams.

headers

dict[str, str]

Optional headers given as a (single) dictionary of string key-value pairs, in case of URL input source.

raises_on_error

bool

default:"True"

Whether to raise an error on the first conversion failure. If False, errors are captured in the ConversionResult objects.

max_num_pages

int

default:"sys.maxsize"

Maximum number of pages accepted per document. Documents exceeding this number will be skipped.

max_file_size

int

default:"sys.maxsize"

Maximum file size accepted per document (in bytes).

page_range

PageRange

Range of pages to convert in each document.

yields

Iterator[ConversionResult]

The conversion results, each containing a DoclingDocument in the document attribute and metadata about the conversion process.

Raises:

ConversionError: An error occurred during conversion.
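
For batch jobs it is often useful to keep going after a failure and sort the results afterwards. The sketch below wraps convert_all() with raises_on_error=False and splits the results by whether they carry errors. convert_batch is a hypothetical helper, and its default limits (100 pages, 20 MiB) are arbitrary values chosen for illustration. It relies only on the convert_all() parameters and the result.errors attribute shown on this page.

```python
from typing import Any, Iterable


def convert_batch(
    converter: Any,
    sources: Iterable[Any],
    max_num_pages: int = 100,
    max_file_size: int = 20 * 1024 * 1024,
) -> tuple:
    """Convert all sources without stopping at the first failure.

    Returns (clean, problematic): results with no recorded errors,
    and results with at least one recorded error.
    """
    clean, problematic = [], []
    results = converter.convert_all(
        sources,
        raises_on_error=False,  # capture failures in each result instead of raising
        max_num_pages=max_num_pages,
        max_file_size=max_file_size,
    )
    for result in results:
        if result.errors:
            problematic.append(result)
        else:
            clean.append(result)
    return clean, problematic
```

A call such as `clean, problematic = convert_batch(converter, ["doc1.pdf", "doc2.docx"])` then leaves the failed inputs available for logging or a retry, for example via `result.input.file` on each entry in `problematic`.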

`convert_string()`

Convert a document given as a string using the specified format. Only Markdown (InputFormat.MD) and HTML (InputFormat.HTML) formats are supported. The content is wrapped in a DocumentStream and passed to the main conversion pipeline.

markdown_content = "# Hello\n\nThis is a test document."
result = converter.convert_string(
    content=markdown_content,
    format=InputFormat.MD,
    name="my-document"
)

content

str

required

The document content as a string.

format

InputFormat

required

The format of the input content. Must be either InputFormat.MD or InputFormat.HTML.

name

str

The filename to associate with the document. If not provided, a timestamp-based name is generated. The appropriate file extension (md or html) is appended if missing.

return

ConversionResult

The conversion result, which contains a DoclingDocument in the document attribute, and metadata about the conversion process.

Raises:

ValueError: If format is neither InputFormat.MD nor InputFormat.HTML.
ConversionError: An error occurred during conversion.
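
The naming rule for the name parameter (generate a timestamp-based name when none is given, append the extension when it is missing) can be sketched as follows. resolve_name is a hypothetical helper, not a Docling function, and the timestamp layout is an assumption, since the exact pattern is not specified here.

```python
from datetime import datetime
from typing import Optional


def resolve_name(name: Optional[str], extension: str) -> str:
    """Return a filename ending in '.<extension>', generating one if name is empty."""
    if not name:
        # Assumed timestamp layout; the actual naming pattern may differ.
        name = datetime.now().strftime("%Y%m%d_%H%M%S")
    suffix = "." + extension
    if not name.lower().endswith(suffix):
        name += suffix
    return name


with_extension_added = resolve_name("my-document", "md")   # "my-document.md"
already_complete = resolve_name("page.html", "html")       # "page.html"
generated = resolve_name(None, "md")                        # timestamp-based, ends in ".md"
```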

`initialize_pipeline()`

Initialize the conversion pipeline for the selected format.

converter.initialize_pipeline(InputFormat.PDF)

format

InputFormat

required

The input format for which to initialize the pipeline.

Raises:

ConversionError: If no pipeline could be initialized for the given format.
RuntimeError: If artifacts_path is set in docling.datamodel.settings.settings when required by the pipeline, but points to a non-directory file.
FileNotFoundError: If local model files are not found.
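
Because pipeline initialization can fail for several reasons, a service may prefer to initialize pipelines at startup and report problems before the first document arrives. The following is a minimal sketch under that assumption; warm_up is a hypothetical helper, and it only calls initialize_pipeline() in the form shown above.

```python
from typing import Any, Iterable


def warm_up(converter: Any, formats: Iterable[Any]) -> dict:
    """Initialize the pipeline for each format up front.

    Returns a dict mapping each format that failed to the exception it raised.
    """
    failures = {}
    for fmt in formats:
        try:
            converter.initialize_pipeline(fmt)
        except Exception as exc:
            # Deliberately broad: intended to cover every error type listed above.
            failures[fmt] = exc
    return failures
```

Calling `warm_up(converter, [InputFormat.PDF])` and checking that the returned dict is empty gives an early signal that model files and artifacts_path are set up correctly.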

Complete Example

from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Initialize converter with custom options
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX, InputFormat.MD],
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=PdfPipelineOptions(
                do_ocr=True,
                do_table_structure=True
            )
        )
    }
)

# Convert a single document
result = converter.convert("document.pdf")
if result.status == ConversionStatus.SUCCESS:
    # Export to markdown
    markdown = result.document.export_to_markdown()
    print(markdown)

    # Export to a JSON-serializable dict
    json_output = result.document.export_to_dict()

# Batch conversion; errors are collected per result instead of raised.
# HTML is not in allowed_formats above, so doc3.html is expected to be rejected.
documents = ["doc1.pdf", "doc2.docx", "doc3.html"]
for result in converter.convert_all(documents, raises_on_error=False):
    print(f"Processing {result.input.file.name}...")
    print(f"Status: {result.status}")
    if result.errors:
        for error in result.errors:
            print(f"Error: {error.error_message}")
