MarkItDown supports various additional file formats for data, archives, and structured documents.

Supported Formats

  • CSV: Comma-separated values
  • JSON: JSON and JSONL files
  • XML: XML documents via RSS parser
  • ZIP: Archive file contents
  • EPUB: E-book format
  • Jupyter: IPython notebooks

CSV Files

Dependencies

No external dependencies - uses Python’s built-in csv module.

Features

  • Converts CSV to Markdown tables
  • First row treated as header
  • Automatic encoding detection with charset-normalizer
  • Handles irregular row lengths
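
The features above can be illustrated with a minimal sketch built on the standard csv module. It is not the CsvConverter source: the function name csv_to_markdown is hypothetical, and padding short rows and truncating long ones to the header width is an assumption about what "handles irregular row lengths" means.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header = rows[0]
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows[1:]:
        # Assumption: pad short rows with empty cells, drop cells past the header width.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(csv_to_markdown("Name,Age\nAda,36\nBob\n"))
```

Normalizing every row to the header width keeps the table rectangular, which Markdown renderers generally expect.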

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.markdown)

Example

Input CSV:
Name,Age,City
John Doe,30,New York
Jane Smith,25,Los Angeles
Bob Johnson,35,Chicago
Output Markdown:
| Name | Age | City |
| --- | --- | --- |
| John Doe | 30 | New York |
| Jane Smith | 25 | Los Angeles |
| Bob Johnson | 35 | Chicago |

Implementation

  • Converter Class: CsvConverter (_csv_converter.py)
  • Accepted Extensions: .csv
  • MIME Types: text/csv, application/csv
  • Encoding: Uses charset-normalizer if charset not specified
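
One way to read the extension and MIME type entries above is as an acceptance check that a converter runs before attempting a conversion. The sketch below is only an illustration of that idea: the names accepts_csv, ACCEPTED_EXTENSIONS and ACCEPTED_MIME_PREFIXES are hypothetical, and prefix matching on the MIME type is an assumption.

```python
# Hypothetical constants mirroring the values listed above.
ACCEPTED_EXTENSIONS = (".csv",)
ACCEPTED_MIME_PREFIXES = ("text/csv", "application/csv")


def accepts_csv(extension: str = "", mimetype: str = "") -> bool:
    """Return True when either the extension or the MIME type looks like CSV."""
    if extension.lower() in ACCEPTED_EXTENSIONS:
        return True
    # Prefix match so that parameters such as "; charset=utf-8" are tolerated.
    return mimetype.lower().startswith(ACCEPTED_MIME_PREFIXES)


print(accepts_csv(extension=".csv"))
print(accepts_csv(mimetype="text/csv; charset=utf-8"))
print(accepts_csv(extension=".json", mimetype="application/json"))
```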

JSON Files

Dependencies

No external dependencies - uses Python’s built-in json module.

Features

  • Plain text output (not converted to structured Markdown)
  • Preserves JSON formatting
  • Supports both .json and .jsonl files

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.json")
print(result.markdown)  # Raw JSON text

Implementation

  • Converter Class: PlainTextConverter (_plain_text_converter.py)
  • Accepted Extensions: .json, .jsonl
  • MIME Types: application/json
  • Processing: Treated as plain text, no special JSON parsing

Note: JSON files are processed as plain text. For structured JSON-to-Markdown conversion, consider pre-processing with jq or a custom script.
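
As a sketch of what such a custom script could look like, the function below turns a JSON array of flat objects into a Markdown table. It is not part of MarkItDown: the name json_records_to_markdown is hypothetical, and the input shape (a list of flat objects) is an assumption. Nested values are simply passed through str().

```python
import json


def json_records_to_markdown(text: str) -> str:
    """Convert a JSON array of flat objects into a Markdown table."""
    records = json.loads(text)
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise ValueError("expected a JSON array of objects")

    # Collect column names in order of first appearance.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join(["---"] * len(columns)) + " |",
    ]
    for record in records:
        # Missing keys become empty cells.
        cells = [str(record.get(column, "")) for column in columns]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(json_records_to_markdown('[{"a": 1, "b": "x"}, {"a": 2}]'))
```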

XML Files

Dependencies

pip install beautifulsoup4 defusedxml

Features

  • XML files are first checked by the RSS converter
  • Only RSS/Atom feeds are specially formatted
  • Other XML is treated as plain text

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("feed.xml")
print(result.markdown)  # Formatted feed content

Implementation

  • RSS/Atom XML: RssConverter (_rss_converter.py) - see Web Content
  • Other XML: PlainTextConverter (_plain_text_converter.py)
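
The split between feed XML and other XML can be pictured as a check on the document's root element. The sketch below is an approximation, not the RssConverter logic: the function name looks_like_feed is hypothetical, and it uses the standard library parser for brevity, whereas the dependencies above suggest the library relies on defusedxml, which is the safer choice for untrusted input.

```python
import xml.etree.ElementTree as ET

# Atom feeds use a namespaced <feed> root element.
ATOM_FEED_TAG = "{http://www.w3.org/2005/Atom}feed"


def looks_like_feed(xml_text: str) -> bool:
    """Return True if the root element is an RSS <rss> or Atom <feed> element."""
    try:
        root = ET.fromstring(xml_text)
    except (ET.ParseError, ValueError):
        # Not well-formed XML, so it cannot be a feed.
        return False
    return root.tag == "rss" or root.tag == ATOM_FEED_TAG


print(looks_like_feed('<rss version="2.0"><channel></channel></rss>'))
print(looks_like_feed("<config><item/></config>"))
```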

ZIP Archives

Dependencies

No external dependencies - uses Python’s built-in zipfile module.

Features

  • Extracts and converts each file in the archive
  • Recursively processes nested files
  • Skips unsupported formats silently
  • Preserves file paths in output

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.markdown)

Output Format

Content from the zip file `archive.zip`:

## File: README.md

# Project Title

Project description here...

## File: docs/guide.docx

# User Guide

Welcome to the user guide...

## File: data/results.csv

| Name | Value | Status |
|------|-------|--------|
| Test1 | 42 | Pass |
| Test2 | 37 | Pass |

## File: images/logo.png

ImageSize: 512x512
DateTimeOriginal: 2024:02:15 10:00:00

Implementation

  • Converter Class: ZipConverter (_zip_converter.py)
  • Accepted Extensions: .zip
  • MIME Types: application/zip
  • Processing: Each file converted independently using appropriate converter
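
The per-file processing described above can be sketched with the standard zipfile module. This is an illustration rather than the ZipConverter source: zip_to_markdown and the convert callback are hypothetical names, and the callback stands in for whichever converter would handle each entry, returning None for formats it does not support.

```python
import io
import zipfile
from typing import Callable, Optional


def zip_to_markdown(
    data: bytes,
    name: str,
    convert: Callable[[str, bytes], Optional[str]],
) -> str:
    """Convert each archive member independently and join the results."""
    parts = [f"Content from the zip file `{name}`:"]
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            text = convert(info.filename, archive.read(info))
            if text is None:
                continue  # Unsupported format: skipped silently.
            # The file path is preserved in the section heading.
            parts.append(f"## File: {info.filename}\n\n{text.strip()}")
    return "\n\n".join(parts)


# Usage: build a small archive in memory and convert only .txt members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docs/readme.txt", "hello\n")
    archive.writestr("blob.bin", b"\x00\x01")

print(zip_to_markdown(
    buffer.getvalue(),
    "archive.zip",
    lambda path, payload: payload.decode("utf-8") if path.endswith(".txt") else None,
))
```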

Advanced Example

from markitdown import MarkItDown
from openai import OpenAI

# ZIP with images - use LLM for image descriptions
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    exiftool_path="/usr/local/bin/exiftool"
)

result = md.convert("photos.zip")
print(result.markdown)  # Includes image descriptions

EPUB Books

Dependencies

pip install beautifulsoup4 defusedxml

Features

  • Extracts book metadata (title, authors, publisher, etc.)
  • Converts chapters in reading order
  • Preserves chapter structure
  • Handles XHTML content

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("book.epub")
print(result.title)  # Book title
print(result.markdown)  # Full book content

Output Format

**Title:** The Great Novel
**Authors:** Jane Author, John Writer
**Publisher:** Example Press
**Date:** 2024-01-15
**Language:** en
**Description:** An exciting tale of adventure and discovery.
**Identifier:** ISBN:978-0-123456-78-9

# Chapter 1: The Beginning

It was a dark and stormy night...

# Chapter 2: The Journey

The next morning, our hero set out...

Implementation

  • Converter Class: EpubConverter (_epub_converter.py)
  • Accepted Extensions: .epub
  • MIME Types: application/epub, application/epub+zip, application/x-epub+zip
  • Processing:
    1. Extract META-INF/container.xml to find content.opf
    2. Parse metadata from content.opf
    3. Read spine order from content.opf
    4. Convert each chapter (XHTML) to Markdown
    5. Combine with metadata
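
Steps 1 and 3 above can be sketched with the standard library alone. This is a simplified illustration, not the EpubConverter source: the function name epub_spine_paths is hypothetical, it uses xml.etree.ElementTree where the library's dependencies suggest defusedxml, and it assumes a well-formed EPUB whose package file uses the standard OPF namespace.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from typing import List

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def epub_spine_paths(data: bytes) -> List[str]:
    """Return the archive paths of the chapters in reading (spine) order."""
    with zipfile.ZipFile(io.BytesIO(data)) as epub:
        # Step 1: container.xml points at the package file (content.opf).
        container = ET.fromstring(epub.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(epub.read(opf_path))

    # The manifest maps item ids to hrefs relative to the package file.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)

    # Step 3: the spine lists item ids in reading order.
    return [
        posixpath.join(base, manifest[ref.attrib["idref"]])
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.attrib["idref"] in manifest
    ]
```

Each returned path could then be read from the archive and passed to an HTML-to-Markdown step, which corresponds to step 4.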

Metadata Fields

| Field | Description | Example |
| --- | --- | --- |
| title | Book title | The Great Gatsby |
| authors | Author(s) | F. Scott Fitzgerald |
| language | Language code | en, es, fr |
| publisher | Publisher name | Scribner |
| date | Publication date | 1925-04-10 |
| description | Book description | A novel set in... |
| identifier | ISBN or other ID | ISBN:978-0-... |

Jupyter Notebooks

Dependencies

No external dependencies - uses Python’s built-in json module.

Features

  • Converts .ipynb files to Markdown
  • Preserves code cells in code blocks
  • Includes markdown cells directly
  • Extracts first H1 heading as title
  • Handles raw cells
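
Title extraction can be pictured as a scan for the first line that starts with "# ". The sketch below is an assumption about how that might work, not the converter's code: first_h1_title is a hypothetical name, and restricting the search to markdown cells is a design choice made here so that Python comments in code cells are not mistaken for headings.

```python
from typing import Optional


def first_h1_title(cells: list) -> Optional[str]:
    """Return the text of the first '# ' heading found in a markdown cell."""
    for cell in cells:
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", [])
        # Notebook sources may be a list of lines or a single string.
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            if line.startswith("# "):
                return line[2:].strip()
    return None


cells = [
    {"cell_type": "code", "source": ["# not a title\n", "x = 1"]},
    {"cell_type": "markdown", "source": ["Intro\n", "# Data Analysis\n"]},
]
print(first_h1_title(cells))
```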

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("analysis.ipynb")
print(result.title)  # First H1 heading
print(result.markdown)

Output Format

Input Notebook:
{
  "cells": [
    {
      "cell_type": "markdown",
      "source": ["# Data Analysis\n", "This notebook analyzes..."]
    },
    {
      "cell_type": "code",
      "source": ["import pandas as pd\n", "df = pd.read_csv('data.csv')"]
    },
    {
      "cell_type": "markdown",
      "source": ["## Results\n", "The analysis shows..."]
    }
  ]
}
Output Markdown:
# Data Analysis
This notebook analyzes...

```python
import pandas as pd
df = pd.read_csv('data.csv')
```

## Results
The analysis shows...

Implementation

  • Converter Class: IpynbConverter (_ipynb_converter.py)
  • Accepted Extensions: .ipynb
  • MIME Types: application/json (with notebook content detection)
  • Cell Types:
    • markdown: Included directly
    • code: Wrapped in ```python code blocks
    • raw: Wrapped in ``` code blocks
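
The cell-type handling above can be sketched in a few lines using only the json module. This is an illustration of the mapping, not the IpynbConverter source: notebook_to_markdown is a hypothetical name, and joining blocks with a blank line is an assumption about the output layout.

```python
import json

FENCE = "`" * 3  # Built programmatically to avoid nesting literal fences.


def notebook_to_markdown(text: str) -> str:
    """Map notebook cells to Markdown blocks according to their cell type."""
    notebook = json.loads(text)
    blocks = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", [])
        body = "".join(source) if isinstance(source, list) else source
        kind = cell.get("cell_type")
        if kind == "markdown":
            blocks.append(body)  # Included directly.
        elif kind == "code":
            blocks.append(f"{FENCE}python\n{body}\n{FENCE}")
        elif kind == "raw":
            blocks.append(f"{FENCE}\n{body}\n{FENCE}")
    return "\n\n".join(blocks)


example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Data Analysis\n", "This notebook analyzes..."]},
        {"cell_type": "code", "source": ["import pandas as pd"]},
    ]
})
print(notebook_to_markdown(example))
```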

Advanced Example

from markitdown import MarkItDown
import json

# Convert and extract title
md = MarkItDown()
result = md.convert("analysis.ipynb")

if result.title:
    print(f"Notebook: {result.title}")
    
# Count code cells
with open("analysis.ipynb") as f:
    notebook = json.load(f)
    code_cells = sum(1 for cell in notebook['cells'] if cell['cell_type'] == 'code')
    print(f"Code cells: {code_cells}")

# Save markdown
with open("analysis.md", "w") as f:
    f.write(result.markdown)

Plain Text Files

Dependencies

No external dependencies.

Features

  • Supports .txt, .text, .md, .markdown extensions
  • Automatic encoding detection
  • Preserves content as-is

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.txt")
print(result.markdown)  # Original text content

Implementation

  • Converter Class: PlainTextConverter (_plain_text_converter.py)
  • Accepted Extensions: .txt, .text, .md, .markdown, .json, .jsonl
  • MIME Types: text/*, application/json, application/markdown
  • Encoding: Uses charset-normalizer if charset not specified

Comparison Table

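| Format | Dependencies | Structured Output | Metadata Support | Nested Content |
| --- | --- | --- | --- | --- |
| CSV | None | ✓ (Tables) | | |
| JSON | None | ✗ (Plain text) | | |
| XML | BeautifulSoup, defusedxml | ✓ (RSS only) | ✓ (RSS only) | |
| ZIP | None | ✓ (Per file) | | |
| EPUB | BeautifulSoup, defusedxml | ✓ (Chapters) | | |
| Jupyter | None | ✓ (Cells) | | |
| Plain Text | None | | | |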

Common Patterns

Processing Archive Contents

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("project.zip")

# Extract individual file sections
sections = result.markdown.split("## File: ")
for section in sections[1:]:  # Skip header
    lines = section.split('\n')
    filename = lines[0]
    content = '\n'.join(lines[1:])
    print(f"File: {filename}")
    print(f"Content length: {len(content)} characters")

CSV to Formatted Report

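from datetime import datetime

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("sales_data.csv")

# Count data rows: table lines minus the header and separator rows
table_lines = [line for line in result.markdown.splitlines() if line.startswith("|")]
entry_count = max(len(table_lines) - 2, 0)

# Wrap in document
report = f"""# Sales Report

Generated: {datetime.now().strftime('%Y-%m-%d')}

## Data

{result.markdown}

## Summary

Total entries: {entry_count}
"""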

with open("report.md", "w") as f:
    f.write(report)

Book Chapter Extraction

from markitdown import MarkItDown
import re

md = MarkItDown()
result = md.convert("book.epub")

# Split by chapter headings
chapters = re.split(r'\n# ', result.markdown)

for i, chapter in enumerate(chapters[1:], 1):  # Skip metadata
    lines = chapter.split('\n')
    title = lines[0]
    content = '\n'.join(lines[1:])

    # Replace characters that are unsafe in filenames (":", "/", spaces)
    safe_title = re.sub(r'[^\w-]+', '_', title).strip('_')

    # Save each chapter separately
    with open(f"chapter_{i:02d}_{safe_title}.md", "w", encoding="utf-8") as f:
        f.write(f"# {title}\n\n{content}")

Error Handling

from markitdown import MarkItDown
from markitdown._exceptions import (
    MissingDependencyException,
    UnsupportedFormatException,
    FileConversionException
)

md = MarkItDown()

try:
    result = md.convert("file.ext")
    print(result.markdown)
except MissingDependencyException as e:
    print(f"Missing dependency: {e}")
    print("Install with: pip install markitdown[all]")
except UnsupportedFormatException as e:
    print(f"Unsupported format: {e}")
except FileConversionException as e:
    print(f"Conversion failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Source Code Reference

packages/markitdown/src/markitdown/converters/
├── _csv_converter.py           # CSV tables
├── _plain_text_converter.py    # JSON, TXT, MD
├── _rss_converter.py           # XML/RSS/Atom
├── _zip_converter.py           # ZIP archives
├── _epub_converter.py          # EPUB books
└── _ipynb_converter.py         # Jupyter notebooks

Next Steps

Format Overview

See all supported formats

Python API

Learn the programmatic interface
