MarkItDown provides specialized converters for various web content types, extracting meaningful content while removing navigation, ads, and other non-essential elements.

Supported Formats

HTML Pages

Generic HTML documents and web pages

RSS/Atom Feeds

News feeds and blog syndication

Wikipedia

Wikipedia articles with clean extraction

YouTube

Video metadata and transcripts

Bing Search

Bing search results pages

HTML Pages

Dependencies

pip install beautifulsoup4

Features

  • Extracts main content from <body> tag
  • Removes <script> and <style> blocks
  • Converts HTML elements to Markdown equivalents
  • Preserves links, images, tables, and formatting
  • Extracts page title from <title> tag

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("page.html")
print(result.markdown)
print(result.title)  # Page title

Implementation

  • Converter Class: HtmlConverter (_html_converter.py)
  • Accepted Extensions: .html, .htm
  • MIME Types: text/html, application/xhtml
  • Markdown Engine: Custom _CustomMarkdownify based on markdownify library
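
As a rough illustration of the features listed above (body text only, `<script>` and `<style>` dropped, title read from `<title>`), here is a minimal sketch. It uses the standard library's `html.parser` as a stand-in so it runs without extra packages; it is not the `HtmlConverter` implementation, and the names `TitleAndBodyText` and `extract_title_and_text` are hypothetical.

```python
from html.parser import HTMLParser


class TitleAndBodyText(HTMLParser):
    """Collect <title> text and visible <body> text, skipping <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = None
        self.chunks = []
        self._open_tags = []  # tags currently open, outermost first

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags like <br>.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or self.SKIPPED.intersection(self._open_tags):
            return
        if "title" in self._open_tags:
            self.title = text
        elif "body" in self._open_tags:
            self.chunks.append(text)


def extract_title_and_text(html):
    """Return (title, body_text) for an HTML string."""
    parser = TitleAndBodyText()
    parser.feed(html)
    parser.close()
    return parser.title, "\n".join(parser.chunks)
```

The sketch returns plain text only; converting elements such as links and tables to Markdown is the job of the markdownify-based engine named above.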

RSS/Atom Feeds

Dependencies

pip install beautifulsoup4 defusedxml

Features

  • Supports both RSS 2.0 and Atom 1.0 formats
  • Extracts feed title and description
  • Converts each item/entry to a section
  • Parses HTML content within feed items
  • Preserves publication dates

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("feed.rss")
print(result.markdown)
print(result.title)  # Feed title

Output Format

RSS Feed:
# Tech Blog
Latest technology news and tutorials

## New Python 3.12 Features
Published on: Thu, 15 Feb 2024 10:00:00 GMT
Python 3.12 introduces several exciting new features including...

## Understanding Async Programming
Published on: Wed, 14 Feb 2024 15:30:00 GMT
Asynchronous programming can be challenging...

Atom Feed:
# Personal Blog
Thoughts and tutorials

## My Journey Learning Rust
Updated on: 2024-02-15T10:00:00Z
Rust has been an interesting language to learn...

## Web Development Tips
Updated on: 2024-02-14T14:00:00Z  
Here are some tips for modern web development...

Implementation

  • Converter Class: RssConverter (_rss_converter.py)
  • Accepted Extensions: .rss, .atom, .xml
  • MIME Types: application/rss+xml, application/atom+xml, text/xml
  • Detection: Checks for <rss> or <feed> root elements
  • XML Parser: defusedxml.minidom (secure XML parsing)
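
The detection rule and the RSS output layout described above can be sketched as follows. This is an illustration under stated assumptions, not the `RssConverter` code: it uses the standard library's `xml.dom.minidom` so it runs without extra packages (for untrusted feeds, `defusedxml.minidom` offers the same `parseString` entry point), and the helper names `detect_feed_type`, `child_text`, and `rss_to_markdown` are hypothetical.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def detect_feed_type(xml_text):
    """Return 'rss', 'atom', or None based on the document's root element."""
    try:
        root = minidom.parseString(xml_text).documentElement.tagName
    except ExpatError:
        return None  # not well-formed XML
    return {"rss": "rss", "feed": "atom"}.get(root)


def child_text(parent, tag):
    """Text of the first direct child element named `tag`, or None."""
    for node in parent.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == tag:
            return "".join(
                c.data
                for c in node.childNodes
                if c.nodeType in (c.TEXT_NODE, c.CDATA_SECTION_NODE)
            ).strip()
    return None


def rss_to_markdown(xml_text):
    """Render an RSS 2.0 document as one H1 plus one H2 section per item."""
    channel = minidom.parseString(xml_text).getElementsByTagName("channel")[0]
    lines = [f"# {child_text(channel, 'title')}"]
    description = child_text(channel, "description")
    if description:
        lines.append(description)
    for item in channel.getElementsByTagName("item"):
        lines += ["", f"## {child_text(item, 'title')}"]
        pub_date = child_text(item, "pubDate")
        if pub_date:
            lines.append(f"Published on: {pub_date}")
        body = child_text(item, "description")
        if body:
            lines.append(body)
    return "\n".join(lines)
```

Item descriptions are emitted as-is here; the converter additionally parses HTML inside feed items, which this sketch leaves out.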

Wikipedia

Dependencies

pip install beautifulsoup4

Features

  • Extracts main article content only
  • Removes navigation, sidebars, and infoboxes
  • Preserves article structure and formatting
  • Includes article title as H1 heading
  • Works with Wikipedia language editions that use a two- or three-letter subdomain (such as en or de)

Usage

import requests
import io
from markitdown import MarkItDown

md = MarkItDown()
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

response = requests.get(url)
result = md.convert_stream(
    stream=io.BytesIO(response.content),
    url=url
)

print(result.markdown)
print(result.title)  # "Python (programming language)"

Output Format

# Python (programming language)

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.

## History

Python was conceived in the late 1980s by Guido van Rossum...

## Features and philosophy

Python is a multi-paradigm programming language...

Implementation

  • Converter Class: WikipediaConverter (_wikipedia_converter.py)
  • URL Pattern: https?://[a-z]{2,3}\.wikipedia\.org/ (regular expression; dots escaped so they match literally)
  • Content Selector: <div id="mw-content-text">
  • Title Selector: <span class="mw-page-title-main">
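
To make the URL pattern concrete, the sketch below compiles it with escaped dots and shows which URLs it accepts. The function name `is_wikipedia_url` is hypothetical, and the anchoring with `^` is an assumption about how the pattern is applied.

```python
import re

# Dots are escaped so "." matches a literal dot rather than any character.
WIKIPEDIA_URL = re.compile(r"^https?://[a-z]{2,3}\.wikipedia\.org/")


def is_wikipedia_url(url):
    """True when the URL points at a two- or three-letter language edition."""
    return WIKIPEDIA_URL.match(url) is not None
```

Under this pattern a subdomain longer than three letters (for example `simple`) does not match, so such a page would be handled by the generic HTML converter instead.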

YouTube

Dependencies

pip install beautifulsoup4 youtube-transcript-api

Features

  • Extracts video title and description
  • Retrieves video metadata (views, keywords, duration)
  • Retrieves the video transcript in a preferred language, with fallbacks to other languages
  • Falls back gracefully if transcript unavailable

Usage

import requests
import io
from markitdown import MarkItDown

md = MarkItDown()
url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

response = requests.get(url)
result = md.convert_stream(
    stream=io.BytesIO(response.content),
    url=url
)

print(result.markdown)
print(result.title)  # Video title

Output Format

# YouTube

## Learn Python in 10 Minutes

### Video Metadata
- **Views:** 1,234,567
- **Keywords:** python, programming, tutorial
- **Runtime:** PT10M30S

### Description
In this video, we'll cover the basics of Python programming including variables, loops, and functions. Perfect for beginners!

### Transcript
Welcome to this Python tutorial. Today we're going to learn the basics of Python programming. Let's start with variables. A variable is a container for storing data values...
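
The Runtime value in the sample output (`PT10M30S`) is an ISO 8601 duration. If a more readable form is wanted downstream, a small helper like the one below could convert it. This is a post-processing sketch, not part of MarkItDown; the names `runtime_to_seconds` and `format_runtime` are hypothetical, and only the hour, minute, and second components are handled.

```python
import re

# Matches durations such as PT10M30S or PT1H2M3S (time components only).
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def runtime_to_seconds(value):
    """Convert an ISO 8601 time duration like 'PT10M30S' to seconds."""
    match = _DURATION.match(value)
    if match is None or value == "PT":
        raise ValueError(f"Unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def format_runtime(value):
    """Format a duration as M:SS, or H:MM:SS when it exceeds an hour."""
    hours, rest = divmod(runtime_to_seconds(value), 3600)
    minutes, seconds = divmod(rest, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"
```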

Implementation

  • Converter Class: YouTubeConverter (_youtube_converter.py)
  • URL Pattern: https://www.youtube.com/watch?v=*
  • Metadata Source: Meta tags and ytInitialData JSON
  • Transcript API: youtube-transcript-api library
  • Languages: Defaults to ["en"], customizable via youtube_transcript_languages
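
A transcript lookup needs the video ID, which sits in the `v` query parameter of the watch URL. The following sketch shows one way to extract it with the standard library; the function name `youtube_video_id` is hypothetical and the strict host check mirrors the URL pattern listed above rather than the converter's actual code.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url):
    """Return the `v` parameter of a YouTube watch URL, or None."""
    parsed = urlparse(url)
    if parsed.netloc != "www.youtube.com" or parsed.path != "/watch":
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None
```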

Transcript Options

# Specify preferred languages (in order of preference).
# `md`, `response`, and `url` come from the Usage example above.
result = md.convert_stream(
    stream=io.BytesIO(response.content),
    url=url,
    youtube_transcript_languages=["es", "fr", "en"]
)

The converter will:
  1. Try to fetch transcript in first language (Spanish)
  2. Fall back to second language (French) if unavailable
  3. Fall back to third language (English) if unavailable
  4. Attempt auto-translation if no direct transcript exists
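
The fallback order above can be sketched as a small selection function. This is only a model of the described behavior: `pick_transcript` is a hypothetical name, `available` stands in for whatever languages the transcript service reports, and choosing the first available language as the translation source is an assumption.

```python
def pick_transcript(preferred, available):
    """Choose a transcript language.

    Returns (language, needs_translation), or None when nothing is available.
    """
    # Steps 1-3: take the first preferred language that exists as-is.
    for language in preferred:
        if language in available:
            return language, False
    # Step 4: no direct match, so fall back to translating another transcript.
    if available:
        return available[0], True
    return None
```

For example, with `preferred=["es", "fr", "en"]` and only French and English available, the function selects French without translation.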

Bing Search Results

Dependencies

pip install beautifulsoup4

Features

  • Extracts organic search results
  • Removes ads and navigation
  • Decodes Bing redirect URLs to actual destination URLs
  • Preserves result snippets and links

Better Alternative: Using the Bing Search API directly is recommended over scraping. This converter is provided for convenience but may break if Bing changes its HTML structure.

Usage

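import requests
import io
from urllib.parse import quote_plus
from markitdown import MarkItDown

md = MarkItDown()
query = "python programming"
# URL-encode the query so spaces and special characters are transmitted safely
url = f"https://www.bing.com/search?q={quote_plus(query)}"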

response = requests.get(url)
result = md.convert_stream(
    stream=io.BytesIO(response.content),
    url=url
)

print(result.markdown)

Output Format

## A Bing search for 'python programming' found the following results:

[**Python.org**](https://www.python.org/)
Welcome to Python.org. The official home of the Python Programming Language.

[**Learn Python - Free Interactive Python Tutorial**](https://www.learnpython.org/)
Learn Python, one of today's most in-demand programming languages...

[**Python Tutorial - W3Schools**](https://www.w3schools.com/python/)
Well organized and easy to understand Web building tutorials...

Implementation

  • Converter Class: BingSerpConverter (_bing_serp_converter.py)
  • URL Pattern: https://www.bing.com/search?q=*
  • Result Selector: Elements with class b_algo
  • URL Decoding: Base64 decodes redirect URLs from u parameter
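
The redirect decoding step might look like the sketch below. It assumes the `u` value is a short prefix followed by URL-safe Base64 with the padding stripped; the two-character prefix length is an assumption, and `decode_bing_redirect` is a hypothetical name rather than the converter's own function.

```python
import base64
from urllib.parse import parse_qs, urlparse


def decode_bing_redirect(href, prefix_len=2):
    """Return the destination encoded in the `u` parameter, or `href` unchanged."""
    values = parse_qs(urlparse(href).query).get("u")
    if not values:
        return href  # not a redirect link
    token = values[0][prefix_len:]
    # Restore the padding that URL-safe Base64 tokens usually omit.
    padded = token + "=" * (-len(token) % 4)
    try:
        return base64.urlsafe_b64decode(padded).decode("utf-8")
    except ValueError:  # covers invalid Base64 and invalid UTF-8
        return href
```

Returning the original link on failure keeps the result usable even when the encoding changes.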

Common Patterns

Fetching Web Content

import requests
import io
from markitdown import MarkItDown

def convert_url(url: str) -> str:
    """Fetch and convert any web URL to Markdown."""
    md = MarkItDown()
    
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    response.raise_for_status()
    
    result = md.convert_stream(
        stream=io.BytesIO(response.content),
        url=url
    )
    
    return result.markdown

# Usage
markdown = convert_url("https://en.wikipedia.org/wiki/Markdown")
print(markdown)

Processing Multiple URLs

import requests
import io
from markitdown import MarkItDown
from urllib.parse import urlparse

md = MarkItDown()
urls = [
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "https://www.youtube.com/watch?v=example",
    "https://example.com/feed.xml"
]

for url in urls:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    result = md.convert_stream(
        stream=io.BytesIO(response.content),
        url=url
    )
    
    # Save with domain as filename (URLs sharing a domain overwrite each other)
    domain = urlparse(url).netloc.replace('www.', '')
    filename = f"{domain}.md"
    
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(f"# {result.title or 'Untitled'}\n\n")
        f.write(f"Source: {url}\n\n")
        f.write(result.markdown)
    
    print(f"Saved {filename}")
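
Because the loop above names each file after the domain alone, two pages from the same site would write to the same file. One possible way around that is to fold the path and query into the filename, as in this sketch; `filename_for` is a hypothetical helper, not part of MarkItDown, and it requires Python 3.9+ for `str.removeprefix`.

```python
import re
from urllib.parse import urlparse


def filename_for(url):
    """Build a filesystem-friendly .md filename from a URL's domain, path, and query."""
    parsed = urlparse(url)
    raw = parsed.netloc.removeprefix("www.") + parsed.path
    if parsed.query:
        raw += f"_{parsed.query}"
    # Collapse every run of characters outside [A-Za-z0-9.-] into one underscore.
    slug = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return f"{slug}.md"
```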

Error Handling

import requests
import io
from markitdown import MarkItDown
from markitdown._exceptions import MissingDependencyException

md = MarkItDown()
url = "https://example.com/"  # replace with the page to convert

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    
    result = md.convert_stream(
        stream=io.BytesIO(response.content),
        url=url
    )
    print(result.markdown)
    
except requests.RequestException as e:
    print(f"Failed to fetch URL: {e}")
except MissingDependencyException as e:
    print(f"Missing dependency: {e}")
    print("Install with: pip install beautifulsoup4")
except Exception as e:
    print(f"Conversion error: {e}")

Implementation Notes

Converter Priority

MarkItDown checks converters in this order for web content:
  1. WikipediaConverter - Checks URL matches Wikipedia domain
  2. YouTubeConverter - Checks URL matches YouTube watch page
  3. BingSerpConverter - Checks URL matches Bing search
  4. RssConverter - Checks for RSS/Atom root elements
  5. HtmlConverter - Generic fallback for all HTML
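
The ordering above amounts to a first-match dispatch: specific URL checks run first and the generic HTML converter catches everything else. The sketch below models that idea only. The predicates are simplified stand-ins (the RSS check in particular is a crude string test rather than real XML parsing), and `pick_converter` is a hypothetical name, not MarkItDown's registration mechanism.

```python
import re


def _is_wikipedia(url, body):
    return re.match(r"^https?://[a-z]{2,3}\.wikipedia\.org/", url) is not None


def _is_youtube(url, body):
    return url.startswith("https://www.youtube.com/watch?")


def _is_bing(url, body):
    return url.startswith("https://www.bing.com/search?q=")


def _is_feed(url, body):
    # Simplification: look for an <rss> or <feed> tag near the start of the body.
    return re.search(r"<(rss|feed)[\s>]", body[:2048]) is not None


# Checked top to bottom; the first predicate that accepts wins.
CHECKS = [
    ("WikipediaConverter", _is_wikipedia),
    ("YouTubeConverter", _is_youtube),
    ("BingSerpConverter", _is_bing),
    ("RssConverter", _is_feed),
    ("HtmlConverter", lambda url, body: True),  # generic fallback
]


def pick_converter(url, body=""):
    """Return the name of the first converter whose check accepts the input."""
    for name, accepts in CHECKS:
        if accepts(url, body):
            return name
```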

Source Files

packages/markitdown/src/markitdown/converters/
├── _html_converter.py       # Generic HTML
├── _rss_converter.py        # RSS/Atom feeds
├── _wikipedia_converter.py  # Wikipedia articles
├── _youtube_converter.py    # YouTube videos
├── _bing_serp_converter.py  # Bing search results
└── _markdownify.py          # HTML to Markdown conversion

Next Steps

Other Formats

CSV, JSON, XML, ZIP, EPUB, Jupyter notebooks

Python API

Learn more about the programmatic interface
