Docling exposes many options for PDF conversion: you can select the OCR engine, switch the PDF backend, and adjust pipeline settings.
Overview
This example demonstrates:
- Configuring OCR options (EasyOCR, Tesseract, macOS OCR)
- Switching between PDF backends
- Customizing table structure recognition
- Setting accelerator options for GPU or CPU execution
Basic Configuration
from pathlib import Path

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

input_doc_path = Path("path/to/document.pdf")

# Configure pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True
)
pipeline_options.ocr_options.lang = ["es"]  # Set language
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = doc_converter.convert(input_doc_path)
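The overview lists switching PDF backends, which the configuration above leaves at its default. The following is a minimal sketch, assuming that your installed Docling version provides `PyPdfiumDocumentBackend` in `docling.backend.pypdfium2_backend` and that `PdfFormatOption` accepts a `backend` argument. The function name `build_pypdfium_converter` is hypothetical and only serves this illustration.

```python
def build_pypdfium_converter(do_ocr: bool = True):
    """Return a DocumentConverter that parses PDFs with the pypdfium2 backend."""
    # Docling is imported locally, so it is only required when this is called.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = do_ocr

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=PyPdfiumDocumentBackend,  # used instead of the default backend
            )
        }
    )
```

Calling `build_pypdfium_converter().convert(input_doc_path)` would then run the same pipeline with a different PDF parser, which can help when comparing text extraction quality between backends.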
OCR Engine Options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options.lang = ["en", "de"]
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True
)
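The snippet above keeps the default OCR engine and only changes its languages. To switch engines, a different options object can be assigned to `pipeline_options.ocr_options`. The sketch below assumes the option classes `EasyOcrOptions`, `TesseractOcrOptions`, `TesseractCliOcrOptions`, and `OcrMacOptions` from `docling.datamodel.pipeline_options`, each accepting a `lang` list. The helper `make_ocr_options` and its engine names are hypothetical.

```python
def make_ocr_options(engine: str, lang: list[str]):
    """Build the OCR options object for the named engine."""
    supported = ("easyocr", "tesseract", "tesseract_cli", "ocrmac")
    if engine not in supported:
        raise ValueError(f"Unknown OCR engine {engine!r}; expected one of {supported}")

    # Imported here so that Docling is only required once an engine is chosen.
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        OcrMacOptions,
        TesseractCliOcrOptions,
        TesseractOcrOptions,
    )

    option_classes = {
        "easyocr": EasyOcrOptions,
        "tesseract": TesseractOcrOptions,  # needs the tesserocr package
        "tesseract_cli": TesseractCliOcrOptions,  # needs the tesseract binary
        "ocrmac": OcrMacOptions,  # macOS only
    }
    return option_classes[engine](lang=lang)
```

For example, `pipeline_options.ocr_options = make_ocr_options("tesseract", ["eng", "deu"])` would select Tesseract. Each engine expects language codes in its own format, so the values passed in `lang` need to match the engine you pick.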
Export Results
Convert Document
Process the document with your custom configuration.
Export to Multiple Formats
Save the results as JSON, Markdown, plain text, and DocTags.
from pathlib import Path
import json

output_dir = Path("scratch")
output_dir.mkdir(parents=True, exist_ok=True)

# `result` is the object returned by doc_converter.convert(...)
doc_filename = result.input.file.stem

# Export to JSON
with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
    fp.write(json.dumps(result.document.export_to_dict()))

# Export to Markdown
with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
    fp.write(result.document.export_to_markdown())

# Export to plain text
with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
    fp.write(result.document.export_to_markdown(strict_text=True))

# Export to doctags
with (output_dir / f"{doc_filename}.doctags").open("w", encoding="utf-8") as fp:
    fp.write(result.document.export_to_doctags())
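The four writes above follow the same pattern, so they could be folded into one small helper. This is only a sketch: `write_exports` is a hypothetical function, not part of Docling, and it takes already exported strings keyed by file suffix.

```python
from pathlib import Path


def write_exports(output_dir: Path, stem: str, exports: dict[str, str]) -> list[Path]:
    """Write each export string to <output_dir>/<stem><suffix> as UTF-8."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for suffix, content in exports.items():
        path = output_dir / f"{stem}{suffix}"
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written
```

A call such as `write_exports(output_dir, doc_filename, {".md": result.document.export_to_markdown(), ".doctags": result.document.export_to_doctags()})` would produce the same files as the corresponding `with` blocks and return their paths.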
Adjust pipeline_options.ocr_options.lang to match your document’s language. Examples: ["en"], ["es"], ["en", "de"].
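The two-letter codes in these examples follow EasyOCR's convention. Other engines typically expect different forms: Tesseract uses three-letter codes such as `eng`, and macOS OCR uses locale tags such as `en-US`. The lookup below is a hypothetical helper covering only four languages, meant as a sketch; check your engine's documentation for the full list of codes.

```python
# Illustrative lookup table; extend it for the languages you need.
LANG_CODES = {
    "en": {"easyocr": "en", "tesseract": "eng", "ocrmac": "en-US"},
    "de": {"easyocr": "de", "tesseract": "deu", "ocrmac": "de-DE"},
    "es": {"easyocr": "es", "tesseract": "spa", "ocrmac": "es-ES"},
    "fr": {"easyocr": "fr", "tesseract": "fra", "ocrmac": "fr-FR"},
}


def to_engine_codes(langs: list[str], engine: str) -> list[str]:
    """Translate two-letter language codes into the form a given engine expects."""
    try:
        return [LANG_CODES[lang][engine] for lang in langs]
    except KeyError as exc:
        # Raised for an unmapped language or an unmapped engine name.
        raise ValueError(f"No mapping for {exc.args[0]!r}") from exc
```

With this table, `to_engine_codes(["en", "de"], "tesseract")` yields the list you would assign to `lang` when using Tesseract.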
Accelerator Configuration
Tune performance with accelerator options:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions

pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4,
    device=AcceleratorDevice.AUTO,  # or CPU, CUDA, MPS
)
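The fixed `num_threads=4` may not suit every machine. One option is to derive the value from the number of available cores. The helper below is a hypothetical sketch using only the standard library; the defaults for `reserve` and `cap` are arbitrary assumptions, not Docling recommendations.

```python
import os


def pick_num_threads(reserve: int = 1, cap: int = 8) -> int:
    """Return a thread count that leaves `reserve` cores free, between 1 and `cap`."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(cap, available - reserve))
```

The result can be passed straight through, for example `AcceleratorOptions(num_threads=pick_num_threads(), device=AcceleratorDevice.AUTO)`.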