Other Formats - MarkItDown

MarkItDown supports various additional file formats for data, archives, and structured documents.

Supported Formats

CSV

Comma-separated values

JSON

JSON and JSONL files

XML

XML documents via RSS parser

ZIP

Archive file contents

EPUB

E-book format

Jupyter

IPython notebooks

CSV Files

Dependencies

No additional parsing dependencies - uses Python’s built-in csv module (encoding detection uses charset-normalizer, as noted under Features).

Features

Converts CSV to Markdown tables
First row treated as header
Automatic encoding detection with charset-normalizer
Handles irregular row lengths

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.markdown)

Example

Input CSV:

Name,Age,City
John Doe,30,New York
Jane Smith,25,Los Angeles
Bob Johnson,35,Chicago

Output Markdown:

| Name | Age | City |
| --- | --- | --- |
| John Doe | 30 | New York |
| Jane Smith | 25 | Los Angeles |
| Bob Johnson | 35 | Chicago |
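
The conversion above can be approximated with the standard library alone. The following is a minimal sketch, not the actual CsvConverter code: the function name csv_to_markdown is hypothetical, and padding or truncating irregular rows to the header width is an assumption about what "handles irregular row lengths" means.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, using the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = len(rows[0])
    lines = [
        "| " + " | ".join(rows[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows[1:]:
        # Assumption: pad short rows with empty cells, truncate long ones
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(csv_to_markdown("Name,Age,City\nJohn Doe,30,New York\nJane Smith,25\n"))
```

The second data row in this call is deliberately one cell short, so it is rendered with an empty City cell.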

Implementation

Converter Class: CsvConverter (_csv_converter.py)
Accepted Extensions: .csv
MIME Types: text/csv, application/csv
Encoding: Uses charset-normalizer if charset not specified

JSON Files

Dependencies

No external dependencies - uses Python’s built-in json module.

Features

Plain text output (not converted to structured Markdown)
Preserves JSON formatting
Supports both .json and .jsonl files

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.json")
print(result.markdown)  # Raw JSON text

Implementation

Converter Class: PlainTextConverter (_plain_text_converter.py)
Accepted Extensions: .json, .jsonl
MIME Types: application/json
Processing: Treated as plain text, no special JSON parsing

JSON files are processed as plain text. For structured JSON-to-Markdown conversion, consider pre-processing with jq or a custom script.
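
As one possible shape for such a custom script, the sketch below renders parsed JSON as a nested Markdown bullet list. It is independent of MarkItDown; the function name json_to_markdown and the bullet layout are illustrative choices, not part of the library.

```python
import json


def json_to_markdown(value, indent=0):
    """Return a list of Markdown bullet lines describing parsed JSON."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}- **{key}**:")
                lines.extend(json_to_markdown(item, indent + 1))
            else:
                lines.append(f"{pad}- **{key}**: {json.dumps(item)}")
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}-")
                lines.extend(json_to_markdown(item, indent + 1))
            else:
                lines.append(f"{pad}- {json.dumps(item)}")
    else:
        # Scalars are written using their JSON representation
        lines.append(f"{pad}- {json.dumps(value)}")
    return lines


data = json.loads('{"name": "demo", "tags": ["a", "b"]}')
print("\n".join(json_to_markdown(data)))
```

For .jsonl input, the same function could be applied to each line after parsing it with json.loads.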

XML Files

Dependencies

pip install beautifulsoup4 defusedxml

Features

XML files are processed by the RSS converter
Only RSS/Atom feeds are specially formatted
Other XML treated as plain text

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("feed.xml")
print(result.markdown)  # Formatted feed content

Implementation

RSS/Atom XML: RssConverter (_rss_converter.py) - see Web Content
Other XML: PlainTextConverter (_plain_text_converter.py)
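
To predict which of the two paths a given file is likely to take, the root element can be inspected. This is a rough sketch under assumptions: feed_kind is a hypothetical helper, the check is not how RssConverter itself decides, and the standard-library parser is used only to keep the example self-contained (for untrusted input, defusedxml is the safer choice).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def feed_kind(xml_text):
    """Return "rss", "atom", or None based on the document's root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # Not well-formed XML
    if root.tag == "rss":
        return "rss"
    if root.tag == ATOM_NS + "feed":
        return "atom"
    return None


print(feed_kind('<rss version="2.0"><channel/></rss>'))
print(feed_kind('<feed xmlns="http://www.w3.org/2005/Atom"/>'))
print(feed_kind("<config><option/></config>"))
```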

ZIP Archives

Dependencies

No external dependencies - uses Python’s built-in zipfile module.

Features

Extracts and converts each file in the archive
Recursively processes nested files
Skips unsupported formats silently
Preserves file paths in output

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.markdown)

Output Format

Content from the zip file `archive.zip`:

## File: README.md

# Project Title

Project description here...

## File: docs/guide.docx

# User Guide

Welcome to the user guide...

## File: data/results.csv

| Name | Value | Status |
|------|-------|--------|
| Test1 | 42 | Pass |
| Test2 | 37 | Pass |

## File: images/logo.png

ImageSize: 512x512
DateTimeOriginal: 2024:02:15 10:00:00

Implementation

Converter Class: ZipConverter (_zip_converter.py)
Accepted Extensions: .zip
MIME Types: application/zip
Processing: Each file converted independently using appropriate converter
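
The per-file processing can be pictured with the standard zipfile module. The sketch below is not the ZipConverter implementation: zip_to_sections is a hypothetical name, and UTF-8 decoding stands in for "the appropriate converter", with undecodable entries skipped as a stand-in for unsupported formats.

```python
import io
import zipfile


def zip_to_sections(data: bytes, archive_name: str) -> str:
    """Build one "## File:" section per readable entry in a ZIP archive."""
    parts = [f"Content from the zip file `{archive_name}`:"]
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            try:
                # Stand-in for dispatching to a format-specific converter
                text = archive.read(info).decode("utf-8")
            except UnicodeDecodeError:
                continue  # Skip entries this sketch cannot handle
            parts.append(f"## File: {info.filename}\n\n{text.strip()}")
    return "\n\n".join(parts)


# Build a small archive in memory to try it out
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("README.md", "# Project Title\n")
    demo.writestr("bin/blob.dat", b"\xff\xfe\x00")
print(zip_to_sections(buffer.getvalue(), "archive.zip"))
```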

Advanced Example

from markitdown import MarkItDown
from openai import OpenAI

# ZIP with images - use LLM for image descriptions
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    exiftool_path="/usr/local/bin/exiftool"
)

result = md.convert("photos.zip")
print(result.markdown)  # Includes image descriptions

EPUB Books

Dependencies

pip install beautifulsoup4 defusedxml

Features

Extracts book metadata (title, authors, publisher, etc.)
Converts chapters in reading order
Preserves chapter structure
Handles XHTML content

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("book.epub")
print(result.title)  # Book title
print(result.markdown)  # Full book content

Output Format

**Title:** The Great Novel
**Authors:** Jane Author, John Writer
**Publisher:** Example Press
**Date:** 2024-01-15
**Language:** en
**Description:** An exciting tale of adventure and discovery.
**Identifier:** ISBN:978-0-123456-78-9

# Chapter 1: The Beginning

It was a dark and stormy night...

# Chapter 2: The Journey

The next morning, our hero set out...

Implementation

Converter Class: EpubConverter (_epub_converter.py)
Accepted Extensions: .epub
MIME Types: application/epub, application/epub+zip, application/x-epub+zip
Processing:
1. Extract META-INF/container.xml to find content.opf
2. Parse metadata from content.opf
3. Read spine order from content.opf
4. Convert each chapter (XHTML) to Markdown
5. Combine with metadata
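
Steps 1 to 3 can be sketched with the standard library as follows. This is an illustration under assumptions, not EpubConverter's code: read_epub_outline is a hypothetical helper, only the title is read from the metadata, and the plain ElementTree parser is used to keep the example self-contained where the converter depends on defusedxml.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_outline(data: bytes):
    """Return (title, chapter paths in spine order) for an EPUB given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as book:
        # Step 1: container.xml points at the OPF package file
        container = ET.fromstring(book.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]

        # Step 2: metadata lives in the OPF <metadata> element
        package = ET.fromstring(book.read(opf_path))
        title = package.findtext("opf:metadata/dc:title", default=None, namespaces=NS)

        # Step 3: the spine lists manifest item ids in reading order
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall("opf:manifest/opf:item", NS)
        }
        base = posixpath.dirname(opf_path)
        chapters = [
            posixpath.join(base, manifest[ref.attrib["idref"]])
            for ref in package.findall("opf:spine/opf:itemref", NS)
            if ref.attrib["idref"] in manifest
        ]
    return title, chapters
```

Each returned chapter path could then be read from the archive and converted from XHTML to Markdown (step 4).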

Metadata Fields

| Field | Description | Example |
| --- | --- | --- |
| `title` | Book title | `The Great Gatsby` |
| `authors` | Author(s) | `F. Scott Fitzgerald` |
| `language` | Language code | `en`, `es`, `fr` |
| `publisher` | Publisher name | `Scribner` |
| `date` | Publication date | `1925-04-10` |
| `description` | Book description | `A novel set in...` |
| `identifier` | ISBN or other ID | `ISBN:978-0-...` |

Jupyter Notebooks

Dependencies

No external dependencies - uses Python’s built-in json module.

Features

Converts .ipynb files to Markdown
Preserves code cells in code blocks
Includes markdown cells directly
Extracts first H1 heading as title
Handles raw cells
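
The cell handling listed above might look roughly like the following. It is a sketch under assumptions rather than the IpynbConverter source: notebook_to_markdown is a hypothetical name, and joining blocks with blank lines is a guess at the layout.

```python
FENCE = "`" * 3  # Built at runtime so this example does not contain a literal fence


def notebook_to_markdown(notebook: dict):
    """Return (title, markdown) for a parsed .ipynb document."""
    blocks = []
    title = None
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # The notebook format allows "source" to be a string or a list of lines
        text = "".join(source) if isinstance(source, list) else source
        kind = cell.get("cell_type")
        if kind == "markdown":
            blocks.append(text)
            if title is None:
                # The first H1 heading found becomes the title
                for line in text.splitlines():
                    if line.startswith("# "):
                        title = line[2:].strip()
                        break
        elif kind == "code":
            blocks.append(f"{FENCE}python\n{text}\n{FENCE}")
        elif kind == "raw":
            blocks.append(f"{FENCE}\n{text}\n{FENCE}")
    return title, "\n\n".join(blocks)


demo = {"cells": [
    {"cell_type": "markdown", "source": ["# Data Analysis\n", "This notebook analyzes..."]},
    {"cell_type": "code", "source": ["import pandas as pd\n"]},
]}
title, markdown = notebook_to_markdown(demo)
print(title)
print(markdown)
```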

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("analysis.ipynb")
print(result.title)  # First H1 heading
print(result.markdown)

Output Format

Input Notebook:

{
  "cells": [
    {
      "cell_type": "markdown",
      "source": ["# Data Analysis\n", "This notebook analyzes..."]
    },
    {
      "cell_type": "code",
      "source": ["import pandas as pd\n", "df = pd.read_csv('data.csv')"]
    },
    {
      "cell_type": "markdown",
      "source": ["## Results\n", "The analysis shows..."]
    }
  ]
}

Output Markdown:

# Data Analysis
This notebook analyzes...

```python
import pandas as pd
df = pd.read_csv('data.csv')
```

## Results
The analysis shows...

Implementation

Converter Class: IpynbConverter (_ipynb_converter.py)
Accepted Extensions: .ipynb
MIME Types: application/json (with notebook content detection)
Cell Types:
markdown: Included directly
code: Wrapped in ```python code blocks
raw: Wrapped in ``` code blocks

Advanced Example

from markitdown import MarkItDown
import json

# Convert and extract title
md = MarkItDown()
result = md.convert("analysis.ipynb")

if result.title:
    print(f"Notebook: {result.title}")
    
# Count code cells
with open("analysis.ipynb") as f:
    notebook = json.load(f)
    code_cells = sum(1 for cell in notebook['cells'] if cell['cell_type'] == 'code')
    print(f"Code cells: {code_cells}")

# Save markdown
with open("analysis.md", "w") as f:
    f.write(result.markdown)

Plain Text Files

Dependencies

No external dependencies.

Features

Supports .txt, .text, .md, .markdown extensions
Automatic encoding detection
Preserves content as-is

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.txt")
print(result.markdown)  # Original text content

Implementation

Converter Class: PlainTextConverter (_plain_text_converter.py)
Accepted Extensions: .txt, .text, .md, .markdown, .json, .jsonl
MIME Types: text/*, application/json, application/markdown
Encoding: Uses charset-normalizer if charset not specified
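
The encoding fallback can be sketched as below. The helper decode_text is hypothetical and does not mirror PlainTextConverter's code; from_bytes(...).best() is the charset-normalizer call for picking the most likely encoding, and falling back to UTF-8 with replacement characters when the package is missing or finds no match is an assumption made for this example.

```python
def decode_text(data: bytes, charset=None) -> str:
    """Decode bytes using a declared charset, or detect one when none is given."""
    if charset:
        return data.decode(charset)
    try:
        from charset_normalizer import from_bytes
    except ImportError:
        # Assumption: fall back to lenient UTF-8 when detection is unavailable
        return data.decode("utf-8", errors="replace")
    match = from_bytes(data).best()
    if match is None:
        return data.decode("utf-8", errors="replace")
    return str(match)  # str() on a match yields the decoded text


print(decode_text("héllo".encode("latin-1"), charset="latin-1"))
print(decode_text(b"plain ascii text"))
```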

Comparison Table

| Format | Dependencies | Structured Output | Metadata Support | Nested Content |
| --- | --- | --- | --- | --- |
| CSV | None | ✓ (Tables) | ✗ | ✗ |
| JSON | None | ✗ (Plain text) | ✗ | ✗ |
| XML | BeautifulSoup, defusedxml | ✓ (RSS only) | ✓ (RSS only) | ✗ |
| ZIP | None | ✓ (Per file) | ✗ | ✓ |
| EPUB | BeautifulSoup, defusedxml | ✓ (Chapters) | ✓ | ✓ |
| Jupyter | None | ✓ (Cells) | ✓ | ✗ |
| Plain Text | None | ✗ | ✗ | ✗ |

Common Patterns

Processing Archive Contents

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("project.zip")

# Extract individual file sections
sections = result.markdown.split("## File: ")
for section in sections[1:]:  # Skip header
    lines = section.split('\n')
    filename = lines[0]
    content = '\n'.join(lines[1:])
    print(f"File: {filename}")
    print(f"Content length: {len(content)} characters")

CSV to Formatted Report

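from datetime import datetime

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("sales_data.csv")

# Count data rows: table lines minus the header and separator rows
table_lines = [line for line in result.markdown.splitlines() if line.startswith("|")]
total_entries = max(len(table_lines) - 2, 0)

# Wrap in document
report = f"""# Sales Report

Generated: {datetime.now().strftime('%Y-%m-%d')}

## Data

{result.markdown}

## Summary

Total entries: {total_entries}
"""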

with open("report.md", "w") as f:
    f.write(report)

Book Chapter Extraction

from markitdown import MarkItDown
import re

md = MarkItDown()
result = md.convert("book.epub")

# Split by top-level chapter headings
chapters = re.split(r'\n# ', result.markdown)

for i, chapter in enumerate(chapters[1:], 1):  # Skip metadata
    lines = chapter.split('\n')
    title = lines[0]
    content = '\n'.join(lines[1:])

    # Replace characters that are unsafe in filenames (e.g. ':' or '/')
    safe_title = re.sub(r'[^\w-]+', '_', title).strip('_')

    # Save each chapter separately
    with open(f"chapter_{i:02d}_{safe_title}.md", "w", encoding="utf-8") as f:
        f.write(f"# {title}\n\n{content}")

Error Handling

from markitdown import MarkItDown
from markitdown._exceptions import (
    MissingDependencyException,
    UnsupportedFormatException,
    FileConversionException
)

md = MarkItDown()

try:
    result = md.convert("file.ext")
    print(result.markdown)
except MissingDependencyException as e:
    print(f"Missing dependency: {e}")
    print("Install with: pip install markitdown[all]")
except UnsupportedFormatException as e:
    print(f"Unsupported format: {e}")
except FileConversionException as e:
    print(f"Conversion failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Source Code Reference

packages/markitdown/src/markitdown/converters/
├── _csv_converter.py           # CSV tables
├── _plain_text_converter.py    # JSON, TXT, MD
├── _rss_converter.py           # XML/RSS/Atom
├── _zip_converter.py           # ZIP archives
├── _epub_converter.py          # EPUB books
└── _ipynb_converter.py         # Jupyter notebooks
