Supported Formats
- Word Documents: DOCX format with styles and tables
- PowerPoint: PPTX with slides, charts, and notes
- Excel Spreadsheets: XLSX and XLS with multiple sheets
- Outlook Emails: MSG files with metadata
Word Documents (.docx)
Dependencies
Features
- Style Preservation: Headings, bold, italic, and other text formatting
- Tables: Converted to Markdown table format
- Structure: Document hierarchy maintained
- HTML Intermediate: Uses Mammoth to convert to HTML, then to Markdown
Usage Example
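A minimal sketch of a DOCX conversion, assuming the `markitdown` package exposes a `MarkItDown` class whose `convert()` method returns a result with a `text_content` attribute. The helper name `docx_to_markdown` is hypothetical, and the import is deferred so the extension check runs even when the package is absent.

```python
from pathlib import Path


def docx_to_markdown(path: str) -> str:
    """Convert a .docx file to Markdown text (hypothetical helper)."""
    if Path(path).suffix.lower() != ".docx":
        raise ValueError(f"expected a .docx file, got: {path}")

    # Imported lazily so the check above works without the package installed.
    # Assumption: markitdown provides MarkItDown.convert() and result.text_content.
    from markitdown import MarkItDown

    result = MarkItDown().convert(path)
    return result.text_content
```

Calling `docx_to_markdown("report.docx")` would then return the Markdown as a string, provided the DOCX dependencies are installed.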
Implementation Details
The DOCX converter is implemented in _docx_converter.py:
- Converter Class: DocxConverter
- Accepted Extensions: .docx
- MIME Types: application/vnd.openxmlformats-officedocument.wordprocessingml.document
- Pre-processing: Documents are pre-processed before conversion to handle special cases
- Style Mapping: Supports custom style maps via the style_map parameter
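The accepted extension and MIME type listed above suggest a simple acceptance check. The sketch below is one way such a check could look; the function name `accepts_docx` and the constants are hypothetical and are not taken from _docx_converter.py.

```python
from typing import Optional

# Values from the list above; the names are illustrative only.
ACCEPTED_EXTENSIONS = (".docx",)
ACCEPTED_MIME_PREFIXES = (
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
)


def accepts_docx(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Return True if either the extension or the MIME type identifies a DOCX file."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True
    return any(mime.startswith(prefix) for prefix in ACCEPTED_MIME_PREFIXES)
```

Checking both signals means a file with a missing or wrong extension can still be recognized when its MIME type is known.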
Advanced Options
PowerPoint Presentations (.pptx)
Dependencies
Features
- Slide Structure: Each slide marked with slide number
- Headings: Slide titles converted to H1 headings
- Tables: Preserved in Markdown format
- Charts: Chart data extracted into tables
- Images: Images with alt text and optional LLM captioning
- Slide Notes: Speaker notes included
- Base64 Images: Optional inline image embedding
Usage Example
Output Format
Implementation Details
- Converter Class: PptxConverter (_pptx_converter.py)
- Accepted Extensions: .pptx
- MIME Types: application/vnd.openxmlformats-officedocument.presentationml
- Shape Processing: Handles pictures, tables, charts, text frames, and grouped shapes
- Layout Preservation: Shapes sorted by position (top to bottom, left to right)
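The position-based ordering can be illustrated with a small sketch. The `Shape` dataclass and `reading_order` function below are hypothetical stand-ins, not the converter's own types, and the real implementation may treat shapes without a position differently.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    """Illustrative stand-in for a slide shape with a position."""
    name: str
    top: int
    left: int


def reading_order(shapes):
    """Sort shapes top to bottom, then left to right within the same row."""
    return sorted(shapes, key=lambda s: (s.top, s.left))


shapes = [
    Shape("footer", top=900, left=50),
    Shape("right column", top=200, left=600),
    Shape("title", top=40, left=50),
    Shape("left column", top=200, left=50),
]
ordered_names = [s.name for s in reading_order(shapes)]
print(ordered_names)
```

Sorting on the `(top, left)` tuple keeps shapes that share a vertical position in left-to-right order, which approximates how a reader scans a slide.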
Excel Spreadsheets (.xlsx, .xls)
Dependencies
Features
- Multiple Sheets: Each sheet converted to a separate Markdown table
- Sheet Names: Used as H2 headings
- Data Preservation: All cell data maintained
- Pandas Integration: Uses pandas for robust Excel parsing
Usage Example
Output Format
Implementation Details
- Converter Classes: XlsxConverter and XlsConverter (_xlsx_converter.py)
- XLSX Extensions: .xlsx
- XLS Extensions: .xls
- Engines: openpyxl for XLSX, xlrd for XLS
- Process: Excel → pandas DataFrame → HTML → Markdown
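To show the shape of the result (one H2 heading and one table per sheet), here is a simplified sketch. It deliberately skips pandas and the HTML intermediate step and works on plain lists of rows; `sheet_to_markdown` and `workbook_to_markdown` are hypothetical names, and the exact spacing of the converter's real output may differ.

```python
def sheet_to_markdown(name, rows):
    """Render one sheet as an H2 heading followed by a Markdown table."""
    header, *body = rows
    lines = [f"## {name}", ""]
    lines.append("| " + " | ".join(str(c) for c in header) + " |")
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in body:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)


def workbook_to_markdown(sheets):
    """Render every sheet in order, separated by a blank line."""
    return "\n\n".join(sheet_to_markdown(n, r) for n, r in sheets.items())


# Plain-Python stand-in for a workbook: sheet name -> rows (first row is the header).
workbook = {
    "Sales": [["Region", "Total"], ["North", 120], ["South", 95]],
    "Costs": [["Item", "Amount"], ["Rent", 40]],
}
markdown = workbook_to_markdown(workbook)
print(markdown)
```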
Outlook Messages (.msg)
Dependencies
Features
- Email Headers: From, To, Subject extracted
- Message Body: Full email content preserved
- OLE File Parsing: Uses olefile for .msg structure
Usage Example
Output Format
Implementation Details
- Converter Class: OutlookMsgConverter (_outlook_msg_converter.py)
- Accepted Extensions: .msg
- MIME Types: application/vnd.ms-outlook
- Detection: Validates OLE file structure with __properties_version1.0 marker
- Encoding: Handles UTF-16 LE and UTF-8 encodings
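The encoding handling can be sketched as a decode with a fallback. This is a naive illustration, not the converter's code: `decode_msg_text` is a hypothetical helper, and the order of the two attempts is an assumption.

```python
def decode_msg_text(data: bytes) -> str:
    """Decode a .msg text stream, trying UTF-16 LE first and then UTF-8."""
    try:
        return data.decode("utf-16-le").strip()
    except UnicodeDecodeError:
        # Naive fallback: bytes that are not valid UTF-16 LE are read as UTF-8.
        return data.decode("utf-8", errors="ignore").strip()


subject = decode_msg_text("Quarterly report".encode("utf-16-le"))
fallback = decode_msg_text(b"Hello")  # odd length, so not valid UTF-16 LE
print(subject, fallback)
```

One limitation of this approach is that even-length UTF-8 data can decode as UTF-16 LE without raising an error and produce the wrong characters, so a production decoder would need a stronger signal than the exception alone.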
Common Options
LLM Integration for Images
All Office converters that handle images (DOCX, PPTX) support LLM-powered image captioning:
Base64 Image Embedding
For PowerPoint files, you can embed images directly in the Markdown:
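As an illustration of what inline embedding produces, the sketch below builds a Markdown image tag with a base64 data URI. It shows the output form only; `image_to_markdown` is a hypothetical helper and not the converter option that enables this behavior.

```python
import base64


def image_to_markdown(data: bytes, alt_text: str, mime_type: str = "image/png") -> str:
    """Build a Markdown image tag whose source is an inline base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"![{alt_text}](data:{mime_type};base64,{encoded})"


# A few placeholder bytes stand in for real image data.
tag = image_to_markdown(b"\x89PNG", "Company logo")
print(tag)
```

Embedding keeps the Markdown self-contained, at the cost of a much larger output for image-heavy presentations.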
Error Handling
All converters raise MissingDependencyException when required libraries are not installed:
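The general pattern behind this behavior can be sketched with the standard library alone. `MissingDependencyError` below is a hypothetical stand-in for the library's exception, and `load_optional` is an illustrative helper; neither is taken from the converters' source.

```python
import importlib


class MissingDependencyError(Exception):
    """Stand-in for MissingDependencyException (hypothetical, for illustration)."""


def load_optional(module_name: str, install_hint: str):
    """Import an optional module, or raise a descriptive error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"{module_name} is required for this format; install it with: {install_hint}"
        ) from exc


try:
    # The package name is a placeholder that is assumed not to be installed.
    load_optional("definitely_not_installed_pkg", "pip install definitely_not_installed_pkg")
    message = "dependency available"
except MissingDependencyError as err:
    message = str(err)
print(message)
```

Raising a dedicated exception lets calling code tell a missing optional library apart from a genuinely malformed file and show an install hint.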