MarkItDown converts Microsoft Office documents (Word, PowerPoint, Excel, and Outlook messages) to Markdown, preserving structure, tables, and formatting during conversion.

Supported Formats

  • Word Documents: DOCX format with styles and tables
  • PowerPoint: PPTX with slides, charts, and notes
  • Excel Spreadsheets: XLSX and XLS with multiple sheets
  • Outlook Emails: MSG files with metadata

Word Documents (.docx)

Dependencies

pip install mammoth

Features

  • Style Preservation: Headings, bold, italic, and other text formatting
  • Tables: Converted to Markdown table format
  • Structure: Document hierarchy maintained
  • HTML Intermediate: Uses Mammoth to convert to HTML, then to Markdown

Usage Example

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.markdown)

Implementation Details

The DOCX converter is implemented in _docx_converter.py:
  • Converter Class: DocxConverter
  • Accepted Extensions: .docx
  • MIME Types: application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • Pre-processing: Documents are pre-processed before conversion to handle special cases
  • Style Mapping: Supports custom style maps via the style_map parameter

Advanced Options

# Custom style mapping
result = md.convert("document.docx", style_map="b => strong")

PowerPoint Presentations (.pptx)

Dependencies

pip install python-pptx

Features

  • Slide Structure: Each slide marked with slide number
  • Headings: Slide titles converted to H1 headings
  • Tables: Preserved in Markdown format
  • Charts: Chart data extracted into tables
  • Images: Images with alt text and optional LLM captioning
  • Slide Notes: Speaker notes included
  • Base64 Images: Optional inline image embedding

Usage Example

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")
print(result.markdown)

Output Format

<!-- Slide number: 1 -->
# Welcome to Our Presentation

Introductory slide content here.

### Notes:
Speaker notes for this slide.

<!-- Slide number: 2 -->
# Key Features

![Chart showing growth](data:image/png;base64,...)

### Chart: Quarterly Results

| Category | Q1 | Q2 | Q3 | Q4 |
|----------|----|----|----|----|
| Sales | 100 | 120 | 150 | 180 |
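
Because every slide starts with a `<!-- Slide number: N -->` comment, downstream code can split the converted Markdown back into per-slide chunks. The following is a minimal sketch that assumes the marker format shown in the sample output above; `split_slides` is a hypothetical helper, not part of MarkItDown.

```python
import re

# Matches the slide marker comments shown in the sample output.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_slides(markdown: str) -> dict[int, str]:
    """Map each slide number to the Markdown that follows its marker."""
    parts = SLIDE_MARKER.split(markdown)
    # With one capture group, re.split returns:
    # [text_before_first_marker, number, body, number, body, ...]
    slides = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        slides[int(number)] = body.strip()
    return slides


sample = """<!-- Slide number: 1 -->
# Welcome to Our Presentation

Introductory slide content here.

<!-- Slide number: 2 -->
# Key Features
"""

slides = split_slides(sample)
print(slides[2])
```

Any text that appears before the first marker is discarded by this sketch.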

Implementation Details

  • Converter Class: PptxConverter (_pptx_converter.py)
  • Accepted Extensions: .pptx
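  • MIME Types: application/vnd.openxmlformats-officedocument.presentationml (matched as a prefix, so it covers application/vnd.openxmlformats-officedocument.presentationml.presentation)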
  • Shape Processing: Handles pictures, tables, charts, text frames, and grouped shapes
  • Layout Preservation: Shapes sorted by position (top to bottom, left to right)
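
The layout rule above (top to bottom, then left to right) amounts to sorting shapes by a `(top, left)` key. The sketch below illustrates that idea with a hypothetical `Shape` stand-in rather than real python-pptx objects; placing shapes with a missing position first is an assumption of this example, not a documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shape:
    """Hypothetical stand-in for a slide shape: a label plus its position."""
    name: str
    top: Optional[int]
    left: Optional[int]


def reading_order(shapes: list[Shape]) -> list[Shape]:
    """Sort shapes top to bottom, breaking ties left to right."""
    def key(shape: Shape) -> tuple[float, float]:
        # Assumption: shapes without a position sort before positioned ones.
        top = float("-inf") if shape.top is None else shape.top
        left = float("-inf") if shape.left is None else shape.left
        return (top, left)

    return sorted(shapes, key=key)


shapes = [
    Shape("footer", top=900, left=0),
    Shape("right column", top=200, left=500),
    Shape("title", top=50, left=100),
    Shape("left column", top=200, left=100),
]
ordered_names = [shape.name for shape in reading_order(shapes)]
print(ordered_names)
```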

Excel Spreadsheets (.xlsx, .xls)

Dependencies

pip install pandas openpyxl   # for .xlsx
pip install pandas xlrd       # for .xls

Features

  • Multiple Sheets: Each sheet converted to a separate Markdown table
  • Sheet Names: Used as H2 headings
  • Data Preservation: All cell data maintained
  • Pandas Integration: Uses pandas for robust Excel parsing

Usage Example

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("spreadsheet.xlsx")
print(result.markdown)

Output Format

## Sheet1

| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Data1 | Data2 | Data3 |
| Data4 | Data5 | Data6 |

## Sheet2

| Product | Price | Quantity |
|---------|-------|----------|
| Widget | $10 | 50 |
| Gadget | $20 | 30 |

Implementation Details

  • Converter Classes: XlsxConverter and XlsConverter (_xlsx_converter.py)
  • XLSX Extensions: .xlsx
  • XLS Extensions: .xls
  • Engines: openpyxl for XLSX, xlrd for XLS
  • Process: Excel → pandas DataFrame → HTML → Markdown
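
To make the output layout concrete, the sketch below renders in-memory rows as one H2 heading and one Markdown table per sheet. It is a simplified illustration only: it skips the pandas and HTML stages described above, and `sheet_to_markdown` and `workbook_to_markdown` are hypothetical names, not MarkItDown functions.

```python
def sheet_to_markdown(name: str, header: list[str], rows: list[list[object]]) -> str:
    """Render one sheet as an H2 heading followed by a Markdown table."""
    def cells(values) -> str:
        # Escape pipes so cell text cannot break the table layout.
        return "| " + " | ".join(str(v).replace("|", "\\|") for v in values) + " |"

    lines = [f"## {name}", "", cells(header), cells("---" for _ in header)]
    lines.extend(cells(row) for row in rows)
    return "\n".join(lines)


def workbook_to_markdown(sheets: dict[str, tuple[list[str], list[list[object]]]]) -> str:
    """Join every sheet's section, separated by a blank line."""
    return "\n\n".join(
        sheet_to_markdown(name, header, rows)
        for name, (header, rows) in sheets.items()
    )


workbook = {
    "Sheet2": (
        ["Product", "Price", "Quantity"],
        [["Widget", "$10", 50], ["Gadget", "$20", 30]],
    ),
}
rendered = workbook_to_markdown(workbook)
print(rendered)
```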

Outlook Messages (.msg)

Dependencies

pip install olefile

Features

  • Email Headers: From, To, Subject extracted
  • Message Body: Full email content preserved
  • OLE File Parsing: Uses olefile for .msg structure

Usage Example

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("email.msg")
print(result.markdown)
print(result.title)  # Email subject

Output Format

# Email Message

**From:** sender@example.com
**To:** recipient@example.com
**Subject:** Meeting Reminder

## Content

Hi Team,

Just a reminder about tomorrow's meeting at 2 PM.

Best regards,
Sender

Implementation Details

  • Converter Class: OutlookMsgConverter (_outlook_msg_converter.py)
  • Accepted Extensions: .msg
  • MIME Types: application/vnd.ms-outlook
  • Detection: Validates OLE file structure with __properties_version1.0 marker
  • Encoding: Handles UTF-16 LE and UTF-8 encodings
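
One way to handle both encodings is to try them in order and fall back when decoding fails. The sketch below assumes UTF-16 LE is tried first, then UTF-8, then a lossy UTF-8 decode; that order and the `decode_stream` name are assumptions for illustration, not a statement of how the converter is written.

```python
def decode_stream(data: bytes) -> str:
    """Decode raw stream bytes, trying UTF-16 LE first and then UTF-8."""
    for encoding in ("utf-16-le", "utf-8"):
        try:
            return data.decode(encoding).strip()
        except UnicodeDecodeError:
            continue
    # Last resort: keep whatever is valid UTF-8 and drop the rest.
    return data.decode("utf-8", errors="ignore").strip()


subject = decode_stream("Meeting Reminder".encode("utf-16-le"))
print(subject)
```

Note that a try-in-order strategy is a heuristic: many even-length UTF-8 byte strings also decode without error as UTF-16 LE, so callers that know the stream's encoding should decode it explicitly.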

Common Options

All Office converters that handle images (DOCX, PPTX) support LLM-powered image captioning:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")

# Custom caption prompt
result = md.convert(
    "document.pptx",
    llm_prompt="Describe this image in detail, focusing on key elements."
)

For PowerPoint files, you can embed images directly in the Markdown:

result = md.convert("presentation.pptx", keep_data_uris=True)
# Images will be: ![alt](data:image/png;base64,...)
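
When images are embedded as data URIs, the bytes can be recovered from the Markdown with the standard library. This is a minimal sketch assuming the `![alt](data:<mime>;base64,<payload>)` shape shown above; `extract_embedded_images` is a hypothetical helper, and the sample payload is placeholder bytes rather than a real image.

```python
import base64
import re

# Matches Markdown images whose target is a base64 data URI.
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:(?P<mime>[\w/+.-]+);base64,(?P<data>[A-Za-z0-9+/=]+)\)"
)


def extract_embedded_images(markdown: str) -> list[tuple[str, str, bytes]]:
    """Return (alt text, MIME type, decoded bytes) for each embedded image."""
    return [
        (m["alt"], m["mime"], base64.b64decode(m["data"]))
        for m in DATA_URI_IMAGE.finditer(markdown)
    ]


payload = base64.b64encode(b"placeholder bytes").decode("ascii")
sample_md = f"# Key Features\n\n![Chart showing growth](data:image/png;base64,{payload})\n"
images = extract_embedded_images(sample_md)
print(images[0][0], images[0][1], len(images[0][2]))
```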

All converters raise MissingDependencyException when required libraries are not installed:

from markitdown import MarkItDown
from markitdown._exceptions import MissingDependencyException

md = MarkItDown()
try:
    result = md.convert("document.docx")
except MissingDependencyException as e:
    print(f"Missing dependency: {e}")
    print("Install with: pip install mammoth")
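
The handler above hard-codes the hint for DOCX files. A small lookup keyed by file extension can print the matching install command instead. The sketch below uses only the packages named in the Dependencies sections on this page; `INSTALL_HINTS` and `install_hint` are hypothetical names, not part of MarkItDown.

```python
from pathlib import Path
from typing import Optional

# Install commands taken from the Dependencies sections of this page.
INSTALL_HINTS = {
    ".docx": "pip install mammoth",
    ".pptx": "pip install python-pptx",
    ".xlsx": "pip install pandas openpyxl",
    ".xls": "pip install pandas xlrd",
    ".msg": "pip install olefile",
}


def install_hint(path: str) -> Optional[str]:
    """Return the install command for a file's extension, if one is known."""
    return INSTALL_HINTS.get(Path(path).suffix.lower())


print(install_hint("Quarterly Report.DOCX"))
```

Inside the `except` block, `install_hint("document.docx")` could replace the literal string so the message stays correct for other formats.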

Source Code Reference

All Office document converters are located in:
packages/markitdown/src/markitdown/converters/
├── _docx_converter.py         # Word documents
├── _pptx_converter.py         # PowerPoint presentations
├── _xlsx_converter.py         # Excel spreadsheets (XLSX and XLS)
└── _outlook_msg_converter.py  # Outlook messages
