Docs Eater

🔮 Documentation Site Scraper for AI Upload

A powerful Python CLI tool that scrapes entire documentation sites and outputs clean markdown files ready for uploading to Gemini, Claude, or any AI for personalized walkthroughs.

Overview

Docs Eater intelligently crawls documentation websites, extracts main content, converts it to clean markdown, and creates both individual files and a single combined document perfect for AI context.

  • Smart Crawling: Automatically discovers and follows documentation links
  • Clean Output: Removes navigation, sidebars, and clutter - keeps only content
  • AI-Ready Format: Single combined file with frontmatter and table of contents
  • Configurable: Control max pages, delays, output format, and more

Features

Intelligent Content Extraction

Docs Eater knows how to find the actual content:
  • Removes unwanted elements (nav, header, footer, sidebar)
  • Tries multiple content selectors (main, article, .content, etc.)
  • Extracts clean page titles
  • Preserves code blocks with language detection
  • Maintains heading hierarchy
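The extraction behavior above can be sketched with BeautifulSoup. This is a simplified illustration, not the tool's exact source; the function name is hypothetical and the selector list is shortened:

```python
from bs4 import BeautifulSoup

# Elements stripped before looking for the content area
UNWANTED_TAGS = ["nav", "header", "footer", "aside"]
# Content selectors tried in priority order (abbreviated list)
CONTENT_SELECTORS = ["main", "article", '[role="main"]', ".content"]

def extract_content(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove navigation chrome so it never leaks into the markdown
    for tag in soup.find_all(UNWANTED_TAGS):
        tag.decompose()
    # Try each selector; return the first match
    for selector in CONTENT_SELECTORS:
        node = soup.select_one(selector)
        if node:
            return str(node)
    # Fall back to the whole <body> if nothing matched
    return str(soup.body or soup)
```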
Crawling Process

1. Start with Base URL: Begins crawling from your provided documentation URL
2. Extract Links: Finds all links on each page
3. Filter Relevance: Skips non-doc patterns (blog, login, PDFs, images, etc.)
4. Stay in Scope: Only follows links within the same domain and base path
5. Respect Limits: Stops at max pages to avoid overwhelming API limits
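The crawl steps above amount to a breadth-first search with a visited set. A minimal sketch, assuming a `fetch_links(url)` callable that returns the links found on a page (the name is illustrative):

```python
from collections import deque
from urllib.parse import urlparse

def crawl(base_url, fetch_links, max_pages=100):
    """BFS crawl: stay on the same domain and base path, stop at max_pages."""
    base = urlparse(base_url)
    visited, order = set(), []
    queue = deque([base_url])
    while queue and len(order) < max_pages:  # respect the page limit
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            parsed = urlparse(link)
            # Stay in scope: same domain, under the base path
            if parsed.netloc == base.netloc and parsed.path.startswith(base.path):
                if link not in visited:
                    queue.append(link)
    return order
```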

Multiple Output Formats

_COMPLETE_DOCS.md - Single file with:
  • Metadata header (source URL, scrape time, page count)
  • Full table of contents
  • All pages concatenated with separators
  • Perfect for uploading to Gemini/Claude

Usage

Basic Usage

python docs_eater.py https://docs.anthropic.com

Command-Line Options

| Option | Short | Description | Default |
| --- | --- | --- | --- |
| url | - | Base URL of the documentation site | Required |
| --output | -o | Output directory | docs_<domain>_<timestamp> |
| --max-pages | -m | Maximum pages to scrape | 100 |
| --delay | -d | Delay between requests (seconds) | 0.5 |
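A parser matching this table could look like the following. This is a sketch; the script's actual argparse setup may differ:

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(
        description="Scrape a documentation site to clean markdown")
    p.add_argument("url", help="Base URL of the documentation site")
    p.add_argument("--output", "-o", default=None,
                   help="Output directory (default: docs_<domain>_<timestamp>)")
    p.add_argument("--max-pages", "-m", type=int, default=100,
                   help="Maximum pages to scrape")
    p.add_argument("--delay", "-d", type=float, default=0.5,
                   help="Delay between requests in seconds")
    return p
```

With this setup, `docs_eater.py https://docs.example.com -m 50` would scrape at most 50 pages with the default 0.5 s delay.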

Examples

python docs_eater.py https://docs.anthropic.com
Output:
docs_docs.anthropic.com_20260305_103045/
├── _COMPLETE_DOCS.md
├── _manifest.json
├── index.md
├── api_reference.md
├── claude_quickstart.md
└── ...
python docs_eater.py https://docs.stripe.com/api --max-pages 50
Scrapes up to 50 pages from Stripe's API documentation.
python docs_eater.py https://fastapi.tiangolo.com -o fastapi_docs -m 75
Creates fastapi_docs/ directory with up to 75 pages.

Installation

1. Install Dependencies: pip install requests beautifulsoup4 markdownify
2. Download Script: Save docs_eater.py to your tools directory
3. Make Executable (optional): chmod +x docs_eater.py
4. Run: python docs_eater.py <url>

How It Works

Architecture

class DocsEater:
    def __init__(self, base_url, output_dir, max_pages, delay):
        """Initialize the crawler with its settings."""

    def normalize_url(self, url):
        """Remove fragments and trailing slashes for deduplication."""

    def is_valid_docs_url(self, url):
        """Check whether a URL should be scraped:
        same domain? under the base path? not a blog/login/asset?
        """

    def extract_content(self, soup, url):
        """Find the main content area, remove nav/header/footer/aside,
        and return clean HTML."""

    def html_to_markdown(self, html_content, title, url):
        """Convert HTML to markdown, add frontmatter, clean up formatting."""

    def scrape_page(self, url):
        """Fetch a page, extract its title and content, find new links."""

    def crawl(self):
        """BFS crawl through pages: track visited URLs, respect the
        max_pages limit, and save results as we go."""
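One plausible implementation of `normalize_url` (an assumption about the method outlined above, not the tool's confirmed source): drop the fragment and any trailing slash so two spellings of the same page dedupe to one visited-set entry.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Strip the #fragment and trailing slash for deduplication."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    path = path.rstrip("/") or "/"  # keep a bare "/" for the site root
    return urlunsplit((scheme, netloc, path, query, ""))
```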

Content Extraction

The tool tries multiple selectors to find main content:
content_selectors = [
    'main',
    'article',
    '[role="main"]',
    '.content',
    '.main-content',
    '.documentation',
    '.docs-content',
    '.markdown-body',
    '.prose',
    '#content',
    '#main',
    '.article-content',
]
Falls back to <body> if none of them match.

Skip Patterns

Automatically skips:
skip_patterns = [
    r'/api/v\d+/',  # API endpoints
    r'\.(pdf|zip|tar|gz|png|jpg|jpeg|gif|svg|ico|css|js|woff|ttf)$',
    r'/changelog',
    r'/releases',
    r'/blog/',
    r'/search',
    r'/login',
    r'/signup',
    r'/auth/',
    r'#',  # Anchor links
]
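Combining the skip patterns with the same-domain and base-path checks, the URL filter can be sketched like this (a simplified illustration; the function signature is assumed):

```python
import re
from urllib.parse import urlparse

SKIP_PATTERNS = [
    r'/api/v\d+/',  # API endpoints
    r'\.(pdf|zip|tar|gz|png|jpg|jpeg|gif|svg|ico|css|js|woff|ttf)$',
    r'/changelog', r'/releases', r'/blog/', r'/search',
    r'/login', r'/signup', r'/auth/',
    r'#',  # Anchor links
]

def is_valid_docs_url(url: str, base_url: str) -> bool:
    base, target = urlparse(base_url), urlparse(url)
    if target.netloc != base.netloc:            # same domain?
        return False
    if not target.path.startswith(base.path):   # under the base path?
        return False
    # Not a blog/login/asset/anchor link?
    return not any(re.search(p, url) for p in SKIP_PATTERNS)
```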

Output Structure

Combined File Format

# Complete Documentation: docs.example.com

Source: https://docs.example.com
Scraped: 2026-03-05T10:30:00
Total Pages: 47

---

## Table of Contents

1. [Getting Started](#getting-started)
2. [API Reference](#api-reference)
3. [Authentication](#authentication)
...

---

## Getting Started

*Source: https://docs.example.com/getting-started*

[Page content here...]

---

## API Reference

*Source: https://docs.example.com/api*

[Page content here...]

---

Individual File Format

---
title: Getting Started
source: https://docs.example.com/getting-started
scraped: 2026-03-05T10:30:00.123456
---

# Getting Started

[Clean markdown content...]

Use Cases

AI Context

Upload _COMPLETE_DOCS.md to Gemini/Claude for instant expertise on any tool or framework

Offline Docs

Keep local markdown copies of documentation for offline reference

Documentation Backup

Archive documentation versions before they change

Custom Learning

Create personalized learning materials from official docs

Best Practices

1. Start Small: Test with --max-pages 10 first to verify output quality
2. Respect Servers: Use appropriate --delay values (0.5-1s recommended)
3. Check Output: Review _manifest.json to see failed pages and adjust filters
4. Upload to AI: Use _COMPLETE_DOCS.md for AI uploads - it's already formatted for context
Some documentation sites have rate limiting. If you see many failures, increase the --delay value.

File Size Considerations

# Check combined file size
ls -lh docs_*/_COMPLETE_DOCS.md

# If > 10MB, consider:
# 1. Reducing --max-pages
# 2. Using individual files instead
# 3. Splitting by section
AI models have context limits. Gemini 1.5 Pro supports up to 2M tokens (~1.5M words), but smaller files are faster to process.
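Splitting by section (option 3 above) can be done on the page headings of the combined file. A sketch under the combined-file format shown earlier; note it treats the "Table of Contents" block as one more "## " section:

```python
def split_combined(text: str, pages_per_chunk: int = 10):
    """Split a combined docs file into chunks of N '## ' sections,
    repeating the metadata header at the top of each chunk."""
    parts = text.split("\n## ")
    header = parts[0]
    pages = ["## " + p for p in parts[1:]]
    chunks = []
    for i in range(0, len(pages), pages_per_chunk):
        chunks.append("\n".join([header] + pages[i:i + pages_per_chunk]))
    return chunks
```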

Troubleshooting

Script won't run?
Check dependencies:
pip list | grep -E "requests|beautifulsoup4|markdownify"
Install missing packages:
pip install requests beautifulsoup4 markdownify

No pages scraped?
  • Verify the base URL is accessible
  • Check if the site requires authentication
  • Try a simpler URL (homepage first)

Many failed pages?
  • Increase --delay to respect rate limits
  • Check _manifest.json for error messages
  • Some pages might require JavaScript rendering (not supported)

Messy output?
  • Documentation sites vary in structure
  • Try adjusting the content selectors in the code
  • Some cleanup may be needed post-scrape

Example Session

$ python docs_eater.py https://docs.anthropic.com --max-pages 20

🔮 DOCS EATER
==================================================
📍 Base URL: https://docs.anthropic.com
📁 Output: docs_docs.anthropic.com_20260305_103045
📄 Max pages: 20
==================================================

[1/20] 📖 https://docs.anthropic.com...
  ✅ Saved: index.md
[2/20] 📖 https://docs.anthropic.com/en/docs/intro-to-claude...
  ✅ Saved: en_docs_intro_to_claude.md
[3/20] 📖 https://docs.anthropic.com/en/api/getting-started...
  ✅ Saved: en_api_getting_started.md
...

==================================================
🎉 SCRAPING COMPLETE!
==================================================
✅ Pages scraped: 18
❌ Failed: 2
📁 Output folder: docs_docs.anthropic.com_20260305_103045

📄 Key files:
   • _COMPLETE_DOCS.md - Single file with ALL docs (upload this to Gemini!)
   • _manifest.json - Index of all scraped pages
   • Individual .md files for each page

📊 Combined file size: 2.34 MB
