The convert_all() method processes multiple documents and returns an iterator:
```python
from pathlib import Path

from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

sources = [
    Path("documents/report_2023.pdf"),
    Path("documents/presentation.pptx"),
    Path("documents/data.xlsx"),
    "https://example.com/whitepaper.pdf",
]

converter = DocumentConverter()

for result in converter.convert_all(sources, raises_on_error=False):
    if result.status == ConversionStatus.SUCCESS:
        print(f"✓ Converted: {result.input.file.name}")
        # Process or save the document
        result.document.save_as_markdown(f"output/{result.input.file.stem}.md")
    else:
        print(f"✗ Failed: {result.input.file.name}")
        for error in result.errors:
            print(f"  Error: {error.error_message}")
```
convert_all() yields results as they’re ready, not all at once. This allows processing huge document collections without exhausting memory.
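The streaming behavior can be sketched with a plain generator. Here `convert_lazily` is a hypothetical stand-in for `convert_all()`, not part of Docling; it shows how a lazy iterator keeps only one result in flight at a time:

```python
from typing import Iterator

def convert_lazily(sources: list[str]) -> Iterator[str]:
    """Hypothetical stand-in for convert_all(): yields one result at a
    time, so only the current document's output is held in memory."""
    for src in sources:
        yield f"converted:{src}"  # placeholder for the real conversion work

results = convert_lazily(["a.pdf", "b.pdf", "c.pdf"])
first = next(results)  # only one document has been "converted" so far
```

Because nothing runs until you ask for the next item, a loop over the iterator processes an arbitrarily long source list with constant memory for results.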
For very large batches, process in chunks to control memory usage:
```python
from pathlib import Path

from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

def process_in_chunks(files: list[Path], chunk_size: int = 50):
    """Process files in chunks to limit memory usage."""
    converter = DocumentConverter()
    for i in range(0, len(files), chunk_size):
        chunk = files[i:i + chunk_size]
        print(f"Processing chunk {i // chunk_size + 1}: {len(chunk)} files")
        for result in converter.convert_all(chunk, raises_on_error=False):
            if result.status == ConversionStatus.SUCCESS:
                # Process and immediately save/discard
                result.document.save_as_markdown(
                    f"output/{result.input.file.stem}.md"
                )
            # Document is garbage-collected after this iteration

input_files = list(Path("documents/").glob("**/*.pdf"))
print(f"Total files: {len(input_files)}")
process_in_chunks(input_files, chunk_size=50)
```
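The slicing loop above can be factored into a reusable helper. This `chunked` generator is an illustrative sketch (not a Docling API) that works for any iterable, not just lists:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

batches = list(chunked(range(5), 2))  # three batches: 2 + 2 + 1 items
```

Because it consumes the input lazily, the helper also pairs naturally with generators such as `Path.glob()`, so the full file list never has to be materialized up front.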
```python
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions(
    do_ocr=False,                   # Disable if PDFs have a text layer
    do_table_structure=False,       # Disable if not extracting tables
    generate_page_images=False,     # Disable if not needed
    generate_picture_images=False,  # Disable if not needed
)
```
2. Use appropriate batch settings
Balance throughput and resource usage:
```python
from docling.datamodel.settings import settings

# For CPU-bound workloads
settings.perf.doc_batch_size = 10
settings.perf.doc_batch_concurrency = 4

# For memory-constrained environments
settings.perf.doc_batch_size = 5
settings.perf.doc_batch_concurrency = 2
```
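The effect of these two knobs can be illustrated with a plain worker pool. Everything here is illustrative (the function name, the `str.upper` stand-in for conversion); Docling manages its own internal batching, but the idea is the same: batch size bounds how many documents are in flight, concurrency bounds how many workers run at once:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(docs: list[str], batch_size: int = 5, concurrency: int = 2) -> list[str]:
    """Process docs in batches of `batch_size`, with at most
    `concurrency` workers active at a time (illustrative sketch)."""
    results = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for start in range(0, len(docs), batch_size):
            batch = docs[start:start + batch_size]
            results.extend(pool.map(str.upper, batch))  # stand-in for conversion
    return results
```

Raising concurrency increases throughput on CPU-bound work until cores are saturated; lowering batch size caps peak memory, since fewer intermediate results exist at once.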
3. Set document timeouts
Prevent individual documents from blocking the pipeline:
```python
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions(
    document_timeout=120.0  # 2 minutes max per document
)
```
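A per-document deadline can also be emulated outside the pipeline. This wrapper is a sketch, not a Docling API (`convert_with_timeout` and its arguments are made-up names), built on the standard-library `concurrent.futures`:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def convert_with_timeout(convert, doc, timeout_s: float):
    """Run `convert(doc)` but stop waiting after `timeout_s` seconds.

    Returns None on timeout so one slow document cannot stall the caller.
    Caveat: the worker thread itself keeps running to completion.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(convert, doc).result(timeout=timeout_s)
    except FutureTimeout:
        return None
    finally:
        pool.shutdown(wait=False)  # don't block on a still-running worker
```

Prefer the built-in `document_timeout` when available, since it can stop work inside the pipeline rather than merely abandoning the wait.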
4. Filter documents upfront
Skip obviously problematic files:
```python
from pathlib import Path

input_files = [
    f for f in Path("documents/").glob("**/*.pdf")
    if f.stat().st_size < 50_000_000  # Skip files > 50 MB
    and f.stat().st_size > 0          # Skip empty files
]
```
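The same filter can be packaged as a reusable predicate. `is_processable` is an illustrative helper, not part of Docling:

```python
from pathlib import Path

MAX_PDF_BYTES = 50_000_000  # 50 MB

def is_processable(path: Path, max_bytes: int = MAX_PDF_BYTES) -> bool:
    """Keep only non-empty .pdf files under the size limit."""
    if path.suffix.lower() != ".pdf":
        return False
    size = path.stat().st_size
    return 0 < size < max_bytes
```

A named predicate keeps the size limits in one place and makes the filter easy to unit-test, e.g. `[f for f in Path("documents/").glob("**/*") if is_processable(f)]`.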