Docling lets you customize PDF conversion by choosing the OCR engine, the PDF backend, and individual pipeline settings.

Overview

This example demonstrates:
  • How to configure OCR options (EasyOCR, Tesseract, macOS OCR)
  • Switching between PDF backends
  • Customizing table structure recognition
  • Setting accelerator options for GPU/CPU

Basic Configuration

custom_convert.py
from pathlib import Path
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc_path = Path("path/to/document.pdf")

# Configure pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True
)
pipeline_options.ocr_options.lang = ["es"]  # OCR language: Spanish
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = doc_converter.convert(input_doc_path)

OCR Engine Options

# Reuses the imports from the basic configuration above.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options.lang = ["en", "de"]  # OCR languages: English and German
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True
)
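
The snippet above only sets languages on whatever OCR engine is already configured. To switch engines, `pipeline_options.ocr_options` can be replaced with an engine-specific options object. The following is a minimal sketch, not an official Docling helper: `make_ocr_options` and `OCR_OPTION_CLASSES` are hypothetical names, and the class names in the table are assumed from Docling 2.x's `docling.datamodel.pipeline_options` module, so check them against your installed version.

```python
from importlib import import_module

# Engine name -> options class in docling.datamodel.pipeline_options.
# Class names are assumed from Docling 2.x; verify against your version.
OCR_OPTION_CLASSES = {
    "easyocr": "EasyOcrOptions",
    "tesseract": "TesseractOcrOptions",
    "tesseract_cli": "TesseractCliOcrOptions",
    "ocrmac": "OcrMacOptions",
}


def make_ocr_options(engine, lang):
    """Return an OCR options object for `engine`, configured with `lang`."""
    if engine not in OCR_OPTION_CLASSES:
        known = ", ".join(sorted(OCR_OPTION_CLASSES))
        raise ValueError(f"unknown OCR engine {engine!r}; expected one of: {known}")
    # Import lazily so the name check above works without Docling installed.
    module = import_module("docling.datamodel.pipeline_options")
    options_cls = getattr(module, OCR_OPTION_CLASSES[engine])
    return options_cls(lang=list(lang))
```

With Docling installed, usage would look like `pipeline_options.ocr_options = make_ocr_options("easyocr", ["en", "de"])`. Each engine also needs its own runtime dependency (for example, a Tesseract installation for the Tesseract options).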

Export Results

1. Convert Document

Process the document with your custom configuration.

2. Export to Multiple Formats

Save results as JSON, Markdown, plain text, and doctags.
from pathlib import Path
import json

output_dir = Path("scratch")
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = result.input.file.stem

# Write every file as UTF-8 so non-ASCII text (e.g. Spanish or German)
# does not depend on the platform's default encoding.

# Export to JSON
with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
    fp.write(json.dumps(result.document.export_to_dict()))

# Export to Markdown
with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
    fp.write(result.document.export_to_markdown())

# Export to plain text
with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
    fp.write(result.document.export_to_markdown(strict_text=True))

# Export to doctags
with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
    fp.write(result.document.export_to_doctags())
Adjust pipeline_options.ocr_options.lang to match your document’s language. Examples: ["en"], ["es"], ["en", "de"].
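
Language codes are not interchangeable between OCR engines: EasyOCR uses short codes such as "en" and "de", while Tesseract uses its trained-data names such as "eng" and "deu". The sketch below shows one way to translate a language list when switching engines. The helper name `to_tesseract_langs` is hypothetical, and the mapping table is a small illustrative subset rather than a complete list.

```python
# Illustrative subset only: EasyOCR-style codes -> Tesseract-style codes.
EASYOCR_TO_TESSERACT = {
    "en": "eng",
    "de": "deu",
    "es": "spa",
    "fr": "fra",
}


def to_tesseract_langs(langs):
    """Translate EasyOCR-style codes, failing loudly on unmapped ones."""
    missing = [code for code in langs if code not in EASYOCR_TO_TESSERACT]
    if missing:
        raise KeyError(f"no Tesseract mapping for: {missing}")
    return [EASYOCR_TO_TESSERACT[code] for code in langs]


print(to_tesseract_langs(["en", "de"]))  # ['eng', 'deu']
```

Raising on an unmapped code, rather than passing it through, avoids handing the OCR engine a code it does not recognize.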

Accelerator Configuration

Tune performance with accelerator options:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions

pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4,
    device=AcceleratorDevice.AUTO  # or CPU, CUDA, MPS
)
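
Rather than hard-coding `num_threads=4`, the thread count can be derived from the machine. This is a rough heuristic offered as an assumption, not a Docling recommendation: `pick_num_threads` is a hypothetical helper, and the `reserve` and `cap` defaults are arbitrary starting points to tune for your workload.

```python
import os


def pick_num_threads(reserve=1, cap=8):
    """Use the available cores, minus `reserve`, bounded to [1, cap]."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(cap, cores - reserve))
```

The result can be passed straight through, for example `AcceleratorOptions(num_threads=pick_num_threads(), device=AcceleratorDevice.AUTO)`.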
