Use Azure Document Intelligence for cloud-based document conversion with advanced OCR and layout analysis
Azure Document Intelligence (formerly Form Recognizer) provides cloud-based document conversion with advanced OCR, layout analysis, and formula extraction capabilities.
Control which file types use Document Intelligence:
from markitdown import MarkItDown
from markitdown.converters._doc_intel_converter import DocumentIntelligenceFileType

# Only use Document Intelligence for PDFs and images
md = MarkItDown(
    docintel_endpoint="https://YOUR_ENDPOINT.cognitiveservices.azure.com/",
    docintel_file_types=[
        DocumentIntelligenceFileType.PDF,
        DocumentIntelligenceFileType.JPEG,
        DocumentIntelligenceFileType.PNG,
    ],
)
Available file types:
DOCX - Word documents
PPTX - PowerPoint presentations
XLSX - Excel spreadsheets
HTML - HTML files
PDF - PDF documents (with OCR)
JPEG - JPEG images (with OCR)
PNG - PNG images (with OCR)
BMP - BMP images (with OCR)
TIFF - TIFF images (with OCR)
By default, Document Intelligence is used for: DOCX, PPTX, XLSX, PDF, JPEG, PNG, BMP, and TIFF files.
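As a rough illustration of the default selection above, the following sketch models how a file might be routed by extension. It is a simplified stand-in, not MarkItDown's actual logic: the `FileType` enum, the `EXTENSIONS` mapping, and `uses_document_intelligence` are hypothetical names introduced here, and the extension-to-type mapping is an assumption.

```python
from enum import Enum
from pathlib import Path


class FileType(Enum):
    """Hypothetical stand-in for DocumentIntelligenceFileType."""
    DOCX = "docx"
    PPTX = "pptx"
    XLSX = "xlsx"
    HTML = "html"
    PDF = "pdf"
    JPEG = "jpeg"
    PNG = "png"
    BMP = "bmp"
    TIFF = "tiff"


# Default selection described in the text: every listed type except HTML.
DEFAULT_TYPES = frozenset(FileType) - {FileType.HTML}

# Assumed extension mapping, for illustration only.
EXTENSIONS = {
    ".docx": FileType.DOCX,
    ".pptx": FileType.PPTX,
    ".xlsx": FileType.XLSX,
    ".html": FileType.HTML,
    ".pdf": FileType.PDF,
    ".jpg": FileType.JPEG,
    ".jpeg": FileType.JPEG,
    ".png": FileType.PNG,
    ".bmp": FileType.BMP,
    ".tiff": FileType.TIFF,
}


def uses_document_intelligence(filename, enabled=DEFAULT_TYPES):
    """Return True if the file's extension maps to an enabled file type."""
    file_type = EXTENSIONS.get(Path(filename).suffix.lower())
    return file_type is not None and file_type in enabled


print(uses_document_intelligence("report.pdf"))   # True
print(uses_document_intelligence("page.html"))    # False
print(uses_document_intelligence("page.html", enabled={FileType.HTML}))  # True
```

With the defaults, an HTML file falls through to the offline converter unless HTML is explicitly listed.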
For OCR-supported file types, Document Intelligence enables:
High-Resolution OCR: Enhanced text extraction from images
Formula Extraction: Mathematical formulas converted to LaTeX
Font Style Detection: Preserves bold, italic, and other styles
from markitdown import MarkItDown

# These features are automatically enabled for PDF, JPEG, PNG, BMP, and TIFF
md = MarkItDown(docintel_endpoint="...")
result = md.convert("math_document.pdf")
# Output includes LaTeX formulas like: $\frac{a}{b}$
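Because formulas arrive as LaTeX embedded in the Markdown text, they can be pulled out with ordinary string processing. The sketch below assumes formulas are delimited by single dollar signs, as in the example above; `extract_inline_formulas` is a hypothetical helper, and the pattern is deliberately simple (it would also match text between two currency amounts).

```python
import re

# Inline math delimited by single dollar signs, e.g. $\frac{a}{b}$.
# The lookarounds skip double-dollar ($$...$$) display blocks.
INLINE_MATH = re.compile(r"(?<!\$)\$([^$\n]+?)\$(?!\$)")


def extract_inline_formulas(markdown):
    """Return the LaTeX source of each inline formula, without delimiters."""
    return [body.strip() for body in INLINE_MATH.findall(markdown)]


sample = r"The ratio is $\frac{a}{b}$ and the area is $\pi r^2$."
print(extract_inline_formulas(sample))
```

In practice the `sample` string would be replaced by `result.markdown` from a conversion.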
from markitdown import MarkItDown
from markitdown.converters._doc_intel_converter import DocumentIntelligenceFileType

# Only use Document Intelligence for specific file types
md = MarkItDown(
    docintel_endpoint="...",
    docintel_file_types=[
        DocumentIntelligenceFileType.PDF,  # Only PDFs
    ],
)

# Other file types will use offline converters
result1 = md.convert("document.pdf")   # Uses Document Intelligence
result2 = md.convert("document.docx")  # Uses offline converter
When Document Intelligence is enabled, it takes priority over built-in converters for supported file types:
md = MarkItDown(
    docintel_endpoint="https://YOUR_ENDPOINT.cognitiveservices.azure.com/"
)
# Document Intelligence converter is registered at the top of the stack
# and will be tried first for supported file types
To explicitly control converter priority:
from markitdown import MarkItDown, PRIORITY_SPECIFIC_FILE_FORMAT
from markitdown.converters import DocumentIntelligenceConverter

md = MarkItDown(enable_builtins=False)
md.enable_builtins()  # Register built-in converters

# Register Document Intelligence with custom priority.
# Lower values are tried first, so -1.0 ranks ahead of the
# built-ins at PRIORITY_SPECIFIC_FILE_FORMAT (0.0).
md.register_converter(
    DocumentIntelligenceConverter(
        endpoint="https://YOUR_ENDPOINT.cognitiveservices.azure.com/"
    ),
    priority=PRIORITY_SPECIFIC_FILE_FORMAT - 1.0,
)
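To make the priority numbers concrete, here is a minimal model of priority-based ordering in which lower values are tried first. It is only a sketch: the `Registration` class, the converter names, and `resolution_order` are hypothetical, and the tie-breaking rule (ties keep registration order) is an assumption that may differ from the library's behavior.

```python
from dataclasses import dataclass

PRIORITY_SPECIFIC_FILE_FORMAT = 0.0  # value quoted in the text for built-ins


@dataclass
class Registration:
    """Hypothetical record pairing a converter name with its priority."""
    name: str
    priority: float


def resolution_order(registrations):
    """Order converters so that lower priority values are tried first.

    sorted() is stable, so equal priorities keep the order given.
    """
    return [r.name for r in sorted(registrations, key=lambda r: r.priority)]


registrations = [
    Registration("builtin_pdf", PRIORITY_SPECIFIC_FILE_FORMAT),
    Registration("builtin_docx", PRIORITY_SPECIFIC_FILE_FORMAT),
    Registration("document_intelligence", -1.0),
]
print(resolution_order(registrations))
# ['document_intelligence', 'builtin_pdf', 'builtin_docx']
```

Even though the Document Intelligence entry is registered last here, its lower priority value places it ahead of both built-ins.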
from markitdown import MarkItDown

md = MarkItDown(
    docintel_endpoint="https://YOUR_ENDPOINT.cognitiveservices.azure.com/"
)
result = md.convert("scanned_invoice.pdf")
print(result.markdown)
# Output includes detected tables in Markdown format
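Once tables are present in the Markdown output, they can be read back into rows for further processing. The sketch below assumes the tables use pipe-table syntax; if the service emits a different table format, this parser would not apply. `parse_markdown_tables` is a hypothetical helper, and the sample invoice text is made up for illustration.

```python
def parse_markdown_tables(markdown):
    """Collect pipe tables as lists of rows, each row a list of cell strings."""
    tables, current = [], []
    for line in markdown.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [cell.strip() for cell in stripped.strip("|").split("|")]
            # Skip the header separator row, e.g. | --- | :--: |
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line ends the table being collected.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Made-up sample standing in for result.markdown
sample = """Invoice 1001

| Item | Qty | Price |
| --- | --- | --- |
| Widget | 2 | 9.50 |
| Gadget | 1 | 20.00 |

Total: 39.00
"""

for row in parse_markdown_tables(sample)[0]:
    print(row)
```

The first row of each parsed table is the header row; the separator row is dropped.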