Docs Eater
Documentation Site Scraper for AI Upload
A powerful Python CLI tool that scrapes entire documentation sites and outputs clean markdown files ready for uploading to Gemini, Claude, or any AI for personalized walkthroughs.
Overview
Docs Eater intelligently crawls documentation websites, extracts main content, converts it to clean markdown, and creates both individual files and a single combined document perfect for AI context.
Smart Crawling: Automatically discovers and follows documentation links
Clean Output: Removes navigation, sidebars, and clutter - keeps only content
AI-Ready Format: Single combined file with frontmatter and table of contents
Configurable: Control max pages, delays, output format, and more
Features
Docs Eater knows how to find the actual content:
Removes unwanted elements (nav, header, footer, sidebar)
Tries multiple content selectors (main, article, .content, etc.)
Extracts clean page titles
Preserves code blocks with language detection
Maintains heading hierarchy
Smart Link Following
Start with Base URL
Begins crawling from your provided documentation URL
Extract Links
Finds all links on each page
Filter Relevance
Skips non-doc patterns (blog, login, PDFs, images, etc.)
Stay in Scope
Only follows links within the same domain and base path
Respect Limits
Stops at the --max-pages limit so output stays within AI upload and context limits
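The "Stay in Scope" rule above can be sketched with `urlparse`. The function name `in_scope` is illustrative, not necessarily the tool's exact implementation:

```python
from urllib.parse import urlparse

def in_scope(url: str, base_url: str) -> bool:
    """Illustrative check: same domain as the base URL and under its path."""
    base, target = urlparse(base_url), urlparse(url)
    if target.netloc != base.netloc:
        return False
    # Stay under the base path, e.g. /api for https://docs.stripe.com/api
    return target.path.startswith(base.path.rstrip("/"))

print(in_scope("https://docs.stripe.com/api/charges", "https://docs.stripe.com/api"))  # True
print(in_scope("https://stripe.com/blog", "https://docs.stripe.com/api"))              # False
```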
Combined File
Individual Files
Manifest
_COMPLETE_DOCS.md - Single file with:
Metadata header (source URL, scrape time, page count)
Full table of contents
All pages concatenated with separators
Perfect for uploading to Gemini/Claude
Separate .md file for each page:
YAML frontmatter with title, source, timestamp
Clean markdown content
SEO-friendly filenames
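Those filenames could be derived from the URL path roughly like this (a sketch; the script's actual naming scheme may differ in details):

```python
import re
from urllib.parse import urlparse

def url_to_filename(url: str) -> str:
    """Turn a page URL into a safe markdown filename (illustrative)."""
    path = urlparse(url).path.strip("/")
    if not path:
        return "index.md"
    # Collapse every run of non-alphanumeric characters into an underscore
    slug = re.sub(r"[^a-zA-Z0-9]+", "_", path).strip("_").lower()
    return f"{slug}.md"

print(url_to_filename("https://docs.anthropic.com/en/docs/intro-to-claude"))
# en_docs_intro_to_claude.md
```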
_manifest.json - Index of everything:

{
  "base_url": "https://docs.example.com",
  "scraped_at": "2026-03-05T10:30:00",
  "total_pages": 47,
  "failed_pages": 2,
  "pages": [ ... ],
  "failed": [ ... ]
}
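Writing such a manifest takes only a few lines. A sketch, assuming `pages` and `failed` lists are collected during the crawl (`write_manifest` is an illustrative helper name):

```python
import json
from datetime import datetime

def write_manifest(path, base_url, pages, failed):
    """Dump a crawl summary to _manifest.json (illustrative helper)."""
    manifest = {
        "base_url": base_url,
        "scraped_at": datetime.now().isoformat(),
        "total_pages": len(pages),
        "failed_pages": len(failed),
        "pages": pages,
        "failed": failed,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
```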
Basic Usage
python docs_eater.py https://docs.anthropic.com
Command-Line Options
| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `url` | - | Base URL of the documentation site | Required |
| `--output` | `-o` | Output directory | `docs_<domain>_<timestamp>` |
| `--max-pages` | `-m` | Maximum pages to scrape | `100` |
| `--delay` | `-d` | Delay between requests (seconds) | `0.5` |
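These options map directly onto argparse. A minimal sketch of the expected interface (names inferred from the options above, not copied from the script):

```python
import argparse
from datetime import datetime
from urllib.parse import urlparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Scrape a documentation site to markdown")
    parser.add_argument("url", help="Base URL of the documentation site")
    parser.add_argument("--output", "-o", help="Output directory")
    parser.add_argument("--max-pages", "-m", type=int, default=100,
                        help="Maximum pages to scrape")
    parser.add_argument("--delay", "-d", type=float, default=0.5,
                        help="Delay between requests (seconds)")
    args = parser.parse_args(argv)
    if args.output is None:
        # Default: docs_<domain>_<timestamp>
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        args.output = f"docs_{urlparse(args.url).netloc}_{stamp}"
    return args
```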
Examples
python docs_eater.py https://docs.anthropic.com
Output: docs_docs.anthropic.com_20260305_103045/
├── _COMPLETE_DOCS.md
├── _manifest.json
├── index.md
├── api_reference.md
├── claude_quickstart.md
└── ...
python docs_eater.py https://docs.stripe.com/api --max-pages 50
Scrapes up to 50 pages from Stripe's API documentation.
python docs_eater.py https://fastapi.tiangolo.com -o fastapi_docs -m 75
Creates fastapi_docs/ directory with up to 75 pages.
Installation
Install Dependencies
pip install requests beautifulsoup4 markdownify
Download Script
Save docs_eater.py to your tools directory
Make Executable (Optional)
Run
python docs_eater.py <url>
How It Works
Architecture
class DocsEater:
    def __init__(self, base_url, output_dir, max_pages, delay):
        # Initialize crawler with settings
        ...

    def normalize_url(self, url):
        # Remove fragments, trailing slashes for deduplication
        ...

    def is_valid_docs_url(self, url):
        # Check if URL should be scraped:
        # - Same domain?
        # - Under base path?
        # - Not a blog/login/asset?
        ...

    def extract_content(self, soup, url):
        # Find main content area
        # Remove nav, header, footer, aside
        # Return clean HTML
        ...

    def html_to_markdown(self, html_content, title, url):
        # Convert HTML to markdown
        # Add frontmatter
        # Clean up formatting
        ...

    def scrape_page(self, url):
        # Fetch page
        # Extract title and content
        # Find new links
        # Return results
        ...

    def crawl(self):
        # BFS crawl through pages
        # Track visited URLs
        # Respect max_pages limit
        # Save as we go
        ...
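The crawl step amounts to a breadth-first traversal with a visited set. A condensed, standalone sketch (error handling and saving omitted; `scrape_page` here is an assumed callback returning content and outgoing links):

```python
from collections import deque

def crawl(start_url, scrape_page, max_pages=100):
    """BFS over documentation pages, bounded by max_pages (illustrative)."""
    queue = deque([start_url])
    visited, results = set(), {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        content, links = scrape_page(url)
        results[url] = content
        # Enqueue newly discovered links; duplicates are filtered on pop
        queue.extend(link for link in links if link not in visited)
    return results
```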
The tool tries multiple selectors to find main content:
content_selectors = [
    'main',
    'article',
    '[role="main"]',
    '.content',
    '.main-content',
    '.documentation',
    '.docs-content',
    '.markdown-body',
    '.prose',
    '#content',
    '#main',
    '.article-content',
]
Fallback to <body> if none found.
Skip Patterns
Automatically skips:
skip_patterns = [
    r'/api/v\d+/',  # API endpoints
    r'\.(pdf|zip|tar|gz|png|jpg|jpeg|gif|svg|ico|css|js|woff|ttf)$',
    r'/changelog',
    r'/releases',
    r'/blog/',
    r'/search',
    r'/login',
    r'/signup',
    r'/auth/',
    r'#',  # Anchor links
]
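Applying these patterns is a single `re.search` loop. A sketch with an abbreviated pattern list (`should_skip` is an illustrative name):

```python
import re

SKIP_PATTERNS = [
    r"/api/v\d+/",
    r"\.(pdf|zip|png|jpg|svg|css|js)$",
    r"/blog/",
    r"/login",
    r"/signup",
    r"#",
]

def should_skip(url: str) -> bool:
    """True if the URL matches any non-docs pattern (illustrative)."""
    return any(re.search(p, url) for p in SKIP_PATTERNS)

print(should_skip("https://docs.example.com/assets/logo.png"))  # True
print(should_skip("https://docs.example.com/getting-started"))  # False
```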
Output Structure
# Complete Documentation: docs.example.com
Source: https://docs.example.com
Scraped: 2026-03-05T10:30:00
Total Pages: 47
---
## Table of Contents
1. [Getting Started](#getting-started)
2. [API Reference](#api-reference)
3. [Authentication](#authentication)
...
---
## Getting Started
*Source: https://docs.example.com/getting-started*
[Page content here...]
---
## API Reference
*Source: https://docs.example.com/api*
[Page content here...]
---
---
title: Getting Started
source: https://docs.example.com/getting-started
scraped: 2026-03-05T10:30:00.123456
---
# Getting Started
[Clean markdown content...]
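Producing a per-page file like that is a small string template. A sketch, with `page_markdown` as an illustrative helper name:

```python
from datetime import datetime

def page_markdown(title: str, source: str, body: str) -> str:
    """Render one page with YAML frontmatter (illustrative)."""
    frontmatter = (
        "---\n"
        f"title: {title}\n"
        f"source: {source}\n"
        f"scraped: {datetime.now().isoformat()}\n"
        "---\n\n"
    )
    return frontmatter + body

doc = page_markdown("Getting Started",
                    "https://docs.example.com/getting-started",
                    "# Getting Started\n\nContent...")
```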
Use Cases
AI Context: Upload _COMPLETE_DOCS.md to Gemini/Claude for instant expertise on any tool or framework
Offline Docs: Keep local markdown copies of documentation for offline reference
Documentation Backup: Archive documentation versions before they change
Custom Learning: Create personalized learning materials from official docs
Best Practices
Start Small
Test with --max-pages 10 first to verify output quality
Respect Servers
Use appropriate --delay values (0.5-1s recommended)
Check Output
Review _manifest.json to see failed pages and adjust filters
Upload to AI
Use _COMPLETE_DOCS.md for AI uploads - it's formatted for exactly that
Some documentation sites have rate limiting. If you see many failures, increase the --delay value.
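If you would rather retry than rerun the whole crawl, a simple exponential-backoff wrapper around requests can help. This is a sketch, not part of the shipped script; the `get` parameter is injectable purely so the helper is testable:

```python
import time
import requests

def fetch_with_backoff(url, get=requests.get, retries=3, base_delay=0.5):
    """Retry transient failures (429/5xx) with exponential backoff (illustrative)."""
    last = None
    for attempt in range(retries):
        last = get(url, timeout=30)
        if last.status_code < 400:
            return last
        if last.status_code not in (429, 500, 502, 503):
            break  # Non-transient error: give up immediately
        time.sleep(base_delay * (2 ** attempt))
    last.raise_for_status()
```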
File Size Considerations
# Check combined file size
ls -lh docs_*/_COMPLETE_DOCS.md
# If > 10MB, consider:
# 1. Reducing --max-pages
# 2. Using individual files instead
# 3. Splitting by section
AI models have context limits. Gemini 1.5 Pro supports up to 2M tokens (~1.5M words), but smaller files are faster to process.
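A rough token estimate (roughly 4 characters per token for English text, a common heuristic) can tell you whether the combined file fits a model's context window before you upload. A sketch:

```python
import os

def estimate_tokens(path: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate from file size (illustrative heuristic)."""
    return int(os.path.getsize(path) / chars_per_token)
```

By this heuristic, a 2.34 MB combined file is on the order of 600k tokens.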
Troubleshooting
Verify the base URL is accessible
Check if site requires authentication
Try a simpler URL (homepage first)
Increase --delay to respect rate limits
Check _manifest.json for error messages
Some pages might require JavaScript (not supported)
Documentation sites vary in structure
Try adjusting content selectors in the code
Some cleanup might be needed post-scrape
Example Session
$ python docs_eater.py https://docs.anthropic.com --max-pages 20

DOCS EATER
==================================================
Base URL: https://docs.anthropic.com
Output: docs_docs.anthropic.com_20260305_103045
Max pages: 20
==================================================
[1/20] https://docs.anthropic.com...
    ✓ Saved: index.md
[2/20] https://docs.anthropic.com/en/docs/intro-to-claude...
    ✓ Saved: en_docs_intro_to_claude.md
[3/20] https://docs.anthropic.com/en/api/getting-started...
    ✓ Saved: en_api_getting_started.md
...
==================================================
SCRAPING COMPLETE!
==================================================
✓ Pages scraped: 18
✗ Failed: 2
Output folder: docs_docs.anthropic.com_20260305_103045
Key files:
  • _COMPLETE_DOCS.md - Single file with ALL docs (upload this to Gemini!)
  • _manifest.json - Index of all scraped pages
  • Individual .md files for each page
Combined file size: 2.34 MB
Download Docs Eater: get the Python script and start scraping documentation.