
What is MarkItDown?

MarkItDown is a lightweight Python utility that converts a variety of file formats to Markdown for use with Large Language Models (LLMs) and text analysis pipelines. Built by the Microsoft AutoGen team, it preserves important document structure and content while producing clean, LLM-friendly Markdown. The output is often presentable and readable for humans, but MarkItDown is optimized for consumption by text analysis tools, not for high-fidelity document conversion for human readers.

Quick start

Get up and running in minutes with your first conversion

Installation

Detailed setup instructions for all environments

Python API

Integrate MarkItDown into your applications

CLI reference

Command-line interface documentation

Supported formats

MarkItDown currently supports conversion from a wide range of file types:

Documents

  • PDF files
  • Word documents (.docx)
  • PowerPoint (.pptx)
  • Excel spreadsheets (.xlsx, .xls)
  • EPub books

Media

  • Images (JPEG, PNG)
  • EXIF metadata extraction
  • OCR via LLM integration
  • Audio transcription

Web & text

  • HTML pages
  • YouTube videos
  • Wikipedia articles
  • CSV, JSON, XML
  • ZIP archives

MarkItDown can also convert Outlook messages, Jupyter notebooks, RSS feeds, and more.

Key features

MarkItDown maintains important document structure including:
  • Headings and hierarchy
  • Lists (ordered and unordered)
  • Tables with proper formatting
  • Links and references
  • Code blocks and formatting
Convert documents from multiple sources:
  • Local file paths
  • URLs (HTTP/HTTPS)
  • File URIs
  • Data URIs (base64 encoded)
  • Binary streams and file-like objects
  • HTTP Response objects
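
Of the source types listed above, data URIs are the least familiar, so here is a minimal sketch of how one is built using only the Python standard library. The helper name `to_data_uri` is hypothetical and not part of MarkItDown. The commented-out call at the end assumes, based on the list above, that `convert` accepts a data URI string.

```python
import base64


def to_data_uri(payload: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


uri = to_data_uri(b"Hello, MarkItDown!", "text/plain")
print(uri)

# With MarkItDown installed, the URI could then be used as a source
# (assumption: convert() accepts data URIs, as the list above indicates):
#   from markitdown import MarkItDown
#   print(MarkItDown().convert(uri).text_content)
```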
Enhance conversions with AI:
  • Image description via GPT-4o or other multimodal models
  • Custom prompts for specialized output
  • Optimized for token efficiency
Use Microsoft’s Document Intelligence service for advanced PDF and document processing with superior accuracy and layout understanding.
Extend MarkItDown with:
  • Plugin system for custom converters
  • Priority-based converter registration
  • Custom document converter support
  • Modular optional dependencies
Automatic format detection using:
  • MIME type analysis
  • File extension matching
  • Content-based detection with Magika
  • Charset normalization
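
The layered detection described above can be approximated with the standard library alone. The sketch below is not how MarkItDown or Magika work internally. It tries the file extension first and falls back to a small, hand-picked table of magic-byte prefixes. The function name `guess_mime` and the `MAGIC_PREFIXES` table are assumptions for illustration.

```python
import mimetypes
from typing import Optional

# A few well-known file signatures. Real content detection (for example
# with Magika) covers far more formats. Note that .docx, .pptx, and .xlsx
# files are ZIP containers and would all match the ZIP signature here.
MAGIC_PREFIXES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def guess_mime(filename: str, head: bytes = b"") -> Optional[str]:
    """Guess a MIME type from the extension, then from leading bytes."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is not None:
        return mime
    for prefix, sniffed in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return sniffed
    return None


print(guess_mime("report.pdf"))
print(guess_mime("download", b"%PDF-1.7\n"))
```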

Why Markdown for LLMs?

Markdown is well suited to LLM consumption for several reasons:

Natural language alignment

Markdown is extremely close to plain text with minimal markup, making it easy for both humans and AI models to parse and understand.

Native LLM support

Mainstream LLMs like OpenAI’s GPT-4o natively “speak” Markdown and often incorporate it into their responses unprompted. This suggests they have been trained on vast amounts of Markdown-formatted text.

Token efficiency

Markdown conventions are highly token-efficient compared to HTML or other markup languages. Fewer tokens means:
  • Lower API costs
  • Faster processing
  • Ability to fit more content in context windows
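
As a rough illustration of the size difference, the sketch below compares the same content written as HTML and as Markdown. The `rough_tokens` function is a crude stand-in that splits text into words and punctuation marks. It is not a real LLM tokenizer, so treat the counts as indicative only.

```python
import re

html = (
    "<h1>Report</h1>"
    "<ul><li>First item</li><li>Second item</li></ul>"
    "<p><strong>Done</strong></p>"
)
markdown = "# Report\n- First item\n- Second item\n\n**Done**"


def rough_tokens(text: str) -> int:
    """Count words and individual punctuation marks (not a real tokenizer)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


print("HTML:    ", len(html), "chars,", rough_tokens(html), "rough tokens")
print("Markdown:", len(markdown), "chars,", rough_tokens(markdown), "rough tokens")
```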

Structure preservation

Unlike plain text, Markdown preserves document structure (headings, lists, tables) that helps LLMs understand document organization and relationships between content. For example:
# Heading
- List item 1
- List item 2

**Bold text** and *italic text*

MCP server integration

MarkItDown offers an MCP (Model Context Protocol) server for seamless integration with LLM applications like Claude Desktop. See markitdown-mcp for more information.
The MCP server allows AI assistants to convert documents on-the-fly, enabling powerful document analysis workflows directly within your AI chat interface.

Get started

1. Install MarkItDown

pip install 'markitdown[all]'

2. Convert your first document

markitdown document.pdf > output.md

3. Explore the API

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
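
Building on the snippet above, a common next step is converting a whole folder. The following sketch assumes a hypothetical helper, `convert_tree`, that takes any callable mapping a path to Markdown text, so it runs without MarkItDown installed. The commented usage shows how the `convert` call from the example above might be plugged in.

```python
from pathlib import Path
from typing import Callable, List


def convert_tree(
    src: Path, dest: Path, to_markdown: Callable[[Path], str]
) -> List[Path]:
    """Write one .md file under dest for every file found under src.

    Hypothetical helper: files that differ only by extension (a.txt and
    a.pdf) map to the same output name, and the later one wins.
    """
    written = []
    for path in sorted(p for p in src.rglob("*") if p.is_file()):
        target = (dest / path.relative_to(src)).with_suffix(".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(to_markdown(path), encoding="utf-8")
        written.append(target)
    return written


# With MarkItDown installed, the converter could be supplied like this
# (the folder names "docs" and "out" are placeholders):
#   from markitdown import MarkItDown
#   md = MarkItDown()
#   convert_tree(Path("docs"), Path("out"),
#                lambda p: md.convert(str(p)).text_content)
```

Passing the converter in as a callable keeps the file handling separate from the conversion itself, which also makes the helper easy to exercise with a stand-in function.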

Ready to start?

Follow the quickstart guide to convert your first document
