Web Content - MarkItDown

MarkItDown provides specialized converters for various web content types, extracting meaningful content while removing navigation, ads, and other non-essential elements.

Supported Formats

HTML Pages

Generic HTML documents and web pages

RSS/Atom Feeds

News feeds and blog syndication

Wikipedia

Wikipedia articles with clean extraction

YouTube

Video metadata and transcripts

Bing Search

Bing search results pages

HTML Pages

Dependencies

pip install beautifulsoup4

Features

Extracts main content from <body> tag
Removes <script> and <style> blocks
Converts HTML elements to Markdown equivalents
Preserves links, images, tables, and formatting
Extracts page title from <title> tag

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("page.html")
print(result.markdown)
print(result.title)  # Page title

Implementation

Converter Class: HtmlConverter (_html_converter.py)
Accepted Extensions: .html, .htm
MIME Types: text/html, application/xhtml
Markdown Engine: Custom _CustomMarkdownify based on markdownify library
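
The extension and MIME type lists above suggest a simple acceptance check. The following is a minimal sketch of one, not the library's actual code: the function name `looks_like_html` is hypothetical, and treating the MIME entries as prefixes (so that `application/xhtml` also covers `application/xhtml+xml`) is an assumption.

```python
from typing import Optional

ACCEPTED_EXTENSIONS = (".html", ".htm")
# ASSUMPTION: MIME entries are matched as prefixes, not exact strings.
ACCEPTED_MIME_PREFIXES = ("text/html", "application/xhtml")


def looks_like_html(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Hypothetical helper: True if the extension or MIME type indicates HTML."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True
    return any(mime.startswith(prefix) for prefix in ACCEPTED_MIME_PREFIXES)


if __name__ == "__main__":
    print(looks_like_html(".htm", None))
    print(looks_like_html(None, "application/xhtml+xml; charset=utf-8"))
    print(looks_like_html(".pdf", "application/pdf"))
```

Under the prefix assumption, a server response with `Content-Type: text/html; charset=utf-8` is still recognized even though the header carries extra parameters.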

RSS/Atom Feeds

Dependencies

pip install beautifulsoup4 defusedxml

Features

Supports both RSS 2.0 and Atom 1.0 formats
Extracts feed title and description
Converts each item/entry to a section
Parses HTML content within feed items
Preserves publication dates

Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("feed.rss")
print(result.markdown)
print(result.title)  # Feed title

Output Format

RSS Feed:

# Tech Blog
Latest technology news and tutorials

## New Python 3.12 Features
Published on: Mon, 15 Feb 2024 10:00:00 GMT
Python 3.12 introduces several exciting new features including...

## Understanding Async Programming  
Published on: Sun, 14 Feb 2024 15:30:00 GMT
Asynchronous programming can be challenging...

Atom Feed:

# Personal Blog
Thoughts and tutorials

## My Journey Learning Rust
Updated on: 2024-02-15T10:00:00Z
Rust has been an interesting language to learn...

## Web Development Tips
Updated on: 2024-02-14T14:00:00Z  
Here are some tips for modern web development...

Implementation

Converter Class: RssConverter (_rss_converter.py)
Accepted Extensions: .rss, .atom, .xml
MIME Types: application/rss+xml, application/atom+xml, text/xml
Detection: Checks for <rss> or <feed> root elements
XML Parser: defusedxml.minidom (secure XML parsing)
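
The root-element detection described above can be sketched roughly as follows. This is an illustration only: `detect_feed_type` is a hypothetical name, and the sketch uses the standard library's `xml.dom.minidom` so that it runs without extra packages. The standard library parser is not hardened against malicious XML, so for untrusted feeds the `defusedxml.minidom` module named above is the safer choice; it is designed as a drop-in replacement for the `parseString` call.

```python
from typing import Optional
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def detect_feed_type(xml_text: str) -> Optional[str]:
    """Hypothetical helper: return "rss", "atom", or None from the root element."""
    try:
        doc = minidom.parseString(xml_text)
    except ExpatError:
        return None  # not well-formed XML, so not a feed
    root = doc.documentElement
    if root.tagName == "rss":
        return "rss"
    if root.tagName == "feed":
        return "atom"
    return None


if __name__ == "__main__":
    print(detect_feed_type('<rss version="2.0"><channel/></rss>'))
    print(detect_feed_type('<feed xmlns="http://www.w3.org/2005/Atom"/>'))
```

A check like this matters for the `.xml` extension and the `text/xml` MIME type, which many non-feed documents also use.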

Wikipedia

Dependencies

pip install beautifulsoup4

Features

Extracts main article content only
Removes navigation, sidebars, and infoboxes
Preserves article structure and formatting
Includes article title as H1 heading
Works with any language Wikipedia

Usage

import requests
import io
from markitdown import MarkItDown

md = MarkItDown()
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

response = requests.get(url)
result = md.convert_stream(
    stream=io.BytesIO(response.content),
    url=url
)

print(result.markdown)
print(result.title)  # "Python (programming language)"

Output Format

# Python (programming language)

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.

## History

Python was conceived in the late 1980s by Guido van Rossum...

## Features and philosophy

Python is a multi-paradigm programming language...

Implementation

Converter Class: WikipediaConverter (_wikipedia_converter.py)
URL Pattern: https?://[a-z]{2,3}\.wikipedia\.org/
Content Selector: <div id="mw-content-text">
Title Selector: <span class="mw-page-title-main">
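
To make the URL pattern concrete, here is a small sketch of how such a check might behave. The function name `is_wikipedia_url` is hypothetical, and the sketch assumes the dots in the pattern are meant literally and that the match is anchored at the start of the URL.

```python
import re

# ASSUMPTION: literal dots and a start-of-string anchor.
WIKIPEDIA_URL = re.compile(r"^https?://[a-z]{2,3}\.wikipedia\.org/")


def is_wikipedia_url(url: str) -> bool:
    """Hypothetical helper: True if the URL matches the Wikipedia pattern."""
    return WIKIPEDIA_URL.search(url) is not None


if __name__ == "__main__":
    for candidate in (
        "https://en.wikipedia.org/wiki/Markdown",
        "https://de.wikipedia.org/wiki/Python_(Programmiersprache)",
        "https://simple.wikipedia.org/wiki/Markdown",
    ):
        print(candidate, is_wikipedia_url(candidate))
```

Note that a two-to-three-letter subdomain rule would not match longer subdomains such as `simple.wikipedia.org`; under this pattern those pages would presumably be handled by the generic HTML converter instead.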

YouTube

Dependencies

pip install beautifulsoup4 youtube-transcript-api

Features

Extracts video title and description
Retrieves video metadata (views, keywords, duration)
Retrieves the video transcript in a preferred language, with fallbacks
Falls back gracefully if transcript unavailable

Usage

import requests
import io
from markitdown import MarkItDown

md = MarkItDown()
url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

response = requests.get(url)
result = md.convert_stream(
    stream=io.BytesIO(response.content),
    url=url
)

print(result.markdown)
print(result.title)  # Video title

Output Format

# YouTube

## Learn Python in 10 Minutes

### Video Metadata
- **Views:** 1,234,567
- **Keywords:** python, programming, tutorial
- **Runtime:** PT10M30S

### Description
In this video, we'll cover the basics of Python programming including variables, loops, and functions. Perfect for beginners!

### Transcript
Welcome to this Python tutorial. Today we're going to learn the basics of Python programming. Let's start with variables. A variable is a container for storing data values...

Implementation

Converter Class: YouTubeConverter (_youtube_converter.py)
URL Pattern: https://www.youtube.com/watch?v=*
Metadata Source: Meta tags and ytInitialData JSON
Transcript API: youtube-transcript-api library
Languages: Defaults to ["en"], customizable via youtube_transcript_languages
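
Transcript lookups through youtube-transcript-api are keyed by video ID, which lives in the `v` query parameter of the watch URL. The sketch below shows one way to pull it out with the standard library; `extract_video_id` is a hypothetical helper, not a MarkItDown function, and the strict host and path check is an assumption based on the URL pattern listed above.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Hypothetical helper: return the `v` parameter of a YouTube watch URL."""
    parsed = urlparse(url)
    # ASSUMPTION: only www.youtube.com/watch URLs qualify.
    if parsed.netloc != "www.youtube.com" or parsed.path != "/watch":
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


if __name__ == "__main__":
    print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=42s"))
```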

Transcript Options

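# Specify preferred languages (in order of preference)
# Reuses md, url, and response from the Usage example above
result = md.convert_stream(
    stream=io.BytesIO(response.content),  # a binary stream, not the Response object
    url=url,
    youtube_transcript_languages=["es", "fr", "en"]
)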

The converter will:

Try to fetch transcript in first language (Spanish)
Fall back to second language (French) if unavailable
Fall back to third language (English) if unavailable
Attempt auto-translation if no direct transcript exists
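
The ordered fallback described above can be sketched as a first-match search over the preference list. This is an illustration under stated assumptions: `pick_transcript` and the `available` mapping (language code to transcript text) are hypothetical, and the auto-translation step is deliberately left out.

```python
from typing import Mapping, Optional, Sequence, Tuple


def pick_transcript(
    available: Mapping[str, str],
    preferred: Sequence[str] = ("en",),
) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (language, text) for the first preferred
    language that has a transcript, or None if none of them do."""
    for lang in preferred:
        if lang in available:
            return lang, available[lang]
    return None


if __name__ == "__main__":
    transcripts = {"fr": "Bonjour", "en": "Hello"}
    # Spanish is missing, so the French transcript is chosen.
    print(pick_transcript(transcripts, ["es", "fr", "en"]))
```

Because the list is scanned in order, putting the most desirable language first is what determines the result when several transcripts exist.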

Bing Search Results

Dependencies

pip install beautifulsoup4

Features

Extracts organic search results
Removes ads and navigation
Decodes Bing redirect URLs to actual destination URLs
Preserves result snippets and links

Better Alternative: Using the Bing Search API directly is recommended over scraping. This converter is provided for convenience, but it may break if Bing changes its HTML structure.

Usage

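import requests
import io
from urllib.parse import quote_plus
from markitdown import MarkItDown

md = MarkItDown()
query = "python programming"
# URL-encode the query so spaces and special characters are safe in the URL
url = f"https://www.bing.com/search?q={quote_plus(query)}"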

response = requests.get(url)
result = md.convert_stream(
    stream=io.BytesIO(response.content),
    url=url
)

print(result.markdown)

Output Format

## A Bing search for 'python programming' found the following results:

[**Python.org**](https://www.python.org/)
Welcome to Python.org. The official home of the Python Programming Language.

[**Learn Python - Free Interactive Python Tutorial**](https://www.learnpython.org/)
Learn Python, one of today's most in-demand programming languages...

[**Python Tutorial - W3Schools**](https://www.w3schools.com/python/)
Well organized and easy to understand Web building tutorials...

Implementation

Converter Class: BingSerpConverter (_bing_serp_converter.py)
URL Pattern: https://www.bing.com/search?q=*
Result Selector: Elements with class b_algo
URL Decoding: Base64 decodes redirect URLs from u parameter
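
As a rough illustration of the redirect decoding step, the sketch below recovers a destination URL from the `u` query parameter. Several details are assumptions rather than documented behavior: the function name `decode_bing_redirect` is hypothetical, the value is assumed to carry a two-character prefix before the payload, and the payload is assumed to be URL-safe Base64 with its padding stripped.

```python
import base64
from urllib.parse import parse_qs, urlparse


def decode_bing_redirect(href: str) -> str:
    """Hypothetical helper: return the destination URL encoded in a redirect
    link's `u` parameter, or the original href if it cannot be decoded."""
    values = parse_qs(urlparse(href).query).get("u")
    if not values:
        return href
    # ASSUMPTION: the first two characters are a prefix, not part of the payload.
    payload = values[0][2:].strip()
    # Restore the padding that URL-embedded Base64 usually omits.
    payload += "=" * (-len(payload) % 4)
    try:
        return base64.urlsafe_b64decode(payload).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return href


if __name__ == "__main__":
    token = base64.urlsafe_b64encode(b"https://www.python.org/").decode().rstrip("=")
    print(decode_bing_redirect(f"https://www.bing.com/ck/a?u=a1{token}&ntb=1"))
```

Returning the original link on failure keeps the output usable even if the redirect format differs from these assumptions.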

Common Patterns

Fetching Web Content

import requests
import io
from markitdown import MarkItDown

def convert_url(url: str) -> str:
    """Fetch and convert any web URL to Markdown."""
    md = MarkItDown()
    
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    response.raise_for_status()
    
    result = md.convert_stream(
        stream=io.BytesIO(response.content),
        url=url
    )
    
    return result.markdown

# Usage
markdown = convert_url("https://en.wikipedia.org/wiki/Markdown")
print(markdown)

Processing Multiple URLs

import requests
import io
from markitdown import MarkItDown
from urllib.parse import urlparse

md = MarkItDown()
urls = [
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "https://www.youtube.com/watch?v=example",
    "https://example.com/feed.xml"
]

for url in urls:
    response = requests.get(url)
    result = md.convert_stream(
        stream=io.BytesIO(response.content),
        url=url
    )
    
    # Save with domain as filename (URLs that share a domain overwrite each other)
    domain = urlparse(url).netloc.replace('www.', '')
    filename = f"{domain}.md"
    
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(f"# {result.title or 'Untitled'}\n\n")
        f.write(f"Source: {url}\n\n")
        f.write(result.markdown)
    
    print(f"Saved {filename}")
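
Naming files by domain alone means two URLs from the same site write to the same file. One possible way around that, sketched here with a hypothetical `url_to_filename` helper, is to build the name from the host, path, and query string together.

```python
import re
from urllib.parse import urlparse


def url_to_filename(url: str, suffix: str = ".md") -> str:
    """Hypothetical helper: derive a filesystem-friendly name from a URL."""
    parsed = urlparse(url)
    host = parsed.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    raw = f"{host}{parsed.path}"
    if parsed.query:
        raw += f"_{parsed.query}"
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-")
    return f"{slug or 'untitled'}{suffix}"


if __name__ == "__main__":
    print(url_to_filename("https://en.wikipedia.org/wiki/Markdown"))
    print(url_to_filename("https://www.youtube.com/watch?v=example"))
```

Distinct URLs can still collapse to the same slug in rare cases (for example, paths that differ only in punctuation), so a production script might append a short hash as well.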

Error Handling

import requests
import io
from markitdown import MarkItDown
from markitdown._exceptions import MissingDependencyException

md = MarkItDown()
url = "https://example.com/"  # placeholder; replace with the page to convert

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    
    result = md.convert_stream(
        stream=io.BytesIO(response.content),
        url=url
    )
    print(result.markdown)
    
except requests.RequestException as e:
    print(f"Failed to fetch URL: {e}")
except MissingDependencyException as e:
    print(f"Missing dependency: {e}")
    print("Install with: pip install beautifulsoup4")
except Exception as e:
    print(f"Conversion error: {e}")

Implementation Notes

Converter Priority

MarkItDown checks converters in this order for web content:

WikipediaConverter - Checks URL matches Wikipedia domain
YouTubeConverter - Checks URL matches YouTube watch page
BingSerpConverter - Checks URL matches Bing search
RssConverter - Checks for RSS/Atom root elements
HtmlConverter - Generic fallback for all HTML
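
The ordering above amounts to a first-match dispatch: specific converters are tried before the generic fallback. The sketch below models that idea only. The check functions are simplified stand-ins written for this example (they are not the converters' real acceptance logic), and `pick_converter` is a hypothetical name.

```python
import re
from typing import Callable, List, Tuple

Check = Callable[[str, str], bool]  # (url, body) -> accepts?

# Simplified, hypothetical checks listed in the documented priority order.
CONVERTER_CHECKS: List[Tuple[str, Check]] = [
    ("WikipediaConverter",
     lambda url, body: re.search(r"^https?://[a-z]{2,3}\.wikipedia\.org/", url) is not None),
    ("YouTubeConverter",
     lambda url, body: url.startswith("https://www.youtube.com/watch?")),
    ("BingSerpConverter",
     lambda url, body: url.startswith("https://www.bing.com/search?q=")),
    ("RssConverter",
     lambda url, body: re.search(r"<(rss|feed)[\s>]", body) is not None),
    ("HtmlConverter", lambda url, body: True),  # generic fallback
]


def pick_converter(url: str, body: str = "") -> str:
    """Return the name of the first converter whose check accepts the input."""
    for name, accepts in CONVERTER_CHECKS:
        if accepts(url, body):
            return name
    # Unreachable while the always-true fallback stays last in the list.
    raise LookupError("no converter accepted the input")


if __name__ == "__main__":
    print(pick_converter("https://en.wikipedia.org/wiki/Markdown"))
    print(pick_converter("https://example.com/feed.xml", '<rss version="2.0"/>'))
    print(pick_converter("https://example.com/", "<html></html>"))
```

The order matters because a Wikipedia or YouTube page is also valid HTML; if the generic check ran first, the specialized converters would never be reached.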

Source Files

packages/markitdown/src/markitdown/converters/
├── _html_converter.py       # Generic HTML
├── _rss_converter.py        # RSS/Atom feeds
├── _wikipedia_converter.py  # Wikipedia articles
├── _youtube_converter.py    # YouTube videos
├── _bing_serp_converter.py  # Bing search results
└── _markdownify.py          # HTML to Markdown conversion
