## Supported Source Types

- **Websites**: HTTP/HTTPS documentation sites
- **GitHub**: repositories, wikis, and specific branches
- **npm**: npm package READMEs and docs
- **PyPI**: Python package documentation
- **Local Files**: filesystem directories and files
- **ZIP Archives**: compressed documentation bundles
## Web Documentation

Scrape documentation from HTTP/HTTPS URLs.

### Basic Usage

### Advanced Options

- Maximum number of pages to scrape
- Maximum link depth to follow
- Only scrape URLs matching these glob patterns. Example: `["**/api/**", "**/reference/**"]`
- Skip URLs matching these patterns. Example: `["**/blog/**", "**/changelog/**"]`

### Examples
- React Docs
- Next.js Docs
- Python Docs
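The include/exclude options take glob patterns. As a rough sketch of how such filtering might work (using Python's `fnmatch`, where `**` behaves like `*`; the scraper's actual glob semantics may differ):

```python
from fnmatch import fnmatch

def should_scrape(path: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if a URL path passes the include/exclude glob filters."""
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    # An empty include list means "everything not excluded".
    return not include or any(fnmatch(path, pat) for pat in include)
```

Exclude patterns win over include patterns, which is why adding `**/blog/**` to the exclude list drops blog posts even when a broad include pattern would otherwise match them.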
## GitHub Repositories

Index documentation directly from GitHub repositories.

### URL Formats

### GitHub Authentication

For private repositories, or to avoid rate limits, configure GitHub authentication. See Authentication for complete GitHub auth setup.

### Supported File Types

The GitHub strategy processes:

- Markdown files (`.md`, `.mdx`)
- Documentation files (`.txt`, `.rst`)
- Source code (`.js`, `.ts`, `.py`, `.java`, etc.)
- Configuration files (`.json`, `.yaml`, `.toml`)

### Examples
- Full Repository
- Docs Directory
- Private Repo
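The URL format examples themselves did not survive on this page, but GitHub web URLs conventionally look like `https://github.com/<owner>/<repo>`, optionally followed by `/tree/<branch>/<path>` to target a branch or subdirectory. A hypothetical parser for that convention (an illustration, not the tool's actual code):

```python
from urllib.parse import urlparse

def parse_github_url(url: str) -> dict:
    """Split a github.com URL into owner, repo, and optional branch/path,
    assuming the standard /<owner>/<repo>[/tree/<branch>[/<path>]] layout."""
    parts = urlparse(url).path.strip("/").split("/")
    info = {"owner": parts[0], "repo": parts[1], "branch": None, "path": None}
    if len(parts) > 3 and parts[2] == "tree":
        info["branch"] = parts[3]
        info["path"] = "/".join(parts[4:]) or None
    return info
```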
## npm Packages

Index npm package documentation and READMEs.

### URL Format

### Examples

### What Gets Indexed

- README.md and documentation files
- `package.json` metadata
- TypeScript type definitions
- Extracted from the published npm tarball

npm packages are downloaded as tarballs and extracted. Documentation files are automatically detected and processed.
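Conceptually, detecting documentation files in an extracted tarball listing might look like the following sketch (the file-name heuristics here are assumptions for illustration, not the tool's actual rules):

```python
import os

# Hypothetical heuristics: well-known basenames plus documentation extensions.
DOC_BASENAMES = {"readme", "changelog"}
DOC_EXTS = {".md", ".mdx", ".txt"}

def find_doc_files(members: list[str]) -> list[str]:
    """Pick likely documentation files out of a tarball member list."""
    docs = []
    for member in members:
        stem, ext = os.path.splitext(os.path.basename(member))
        if ext.lower() in DOC_EXTS or stem.lower() in DOC_BASENAMES:
            docs.append(member)
    return docs
```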
## PyPI Packages

Index Python package documentation from PyPI.

### URL Format

### Examples

### What Gets Indexed

- README and documentation files
- Package metadata
- Python source code docstrings
- Extracted from PyPI distributions
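Docstrings can be pulled from Python source without importing the package by walking the AST; a minimal sketch of the idea (not necessarily how this tool implements it):

```python
import ast

def extract_docstrings(source: str) -> dict:
    """Map module, function, and class names to their docstrings."""
    tree = ast.parse(source)
    docs = {}
    mod_doc = ast.get_docstring(tree)
    if mod_doc:
        docs["<module>"] = mod_doc
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc:
                docs[node.name] = doc
    return docs
```

Because nothing is executed, this works even when the package's dependencies are not installed.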
## Local Files

Index documentation from your local filesystem.

### URL Format

### Platform Examples

- macOS/Linux
- Windows

#### Single File

#### Directory

### Supported File Types

- **Documentation**: `.md`, `.mdx`, `.txt`, `.rst`, `.asciidoc`
- **Source Code**: `.js`, `.ts`, `.py`, `.java`, `.go`, `.rs`, etc.
- **Office Docs**: `.docx`, `.xlsx`, `.pptx`
- **Others**: `.pdf`, `.html`, `.json`, `.yaml`

### Docker Volumes

When using Docker, mount local directories:

### Examples
- Project Docs
- Internal Wiki
- Single File
## ZIP Archives

Index documentation from ZIP files.

### URL Format

### Example

### Processing

ZIP archives are processed the same way as local directories. All supported file types are extracted and indexed.
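The "treat a ZIP like a directory" step can be sketched with the standard library (the supported-extension set here is abbreviated from the lists above):

```python
import zipfile

# Abbreviated subset of the supported extensions listed under Local Files.
SUPPORTED = {".md", ".mdx", ".txt", ".rst", ".html", ".json", ".yaml", ".pdf"}

def indexable_members(zip_source) -> list:
    """List archive members whose extensions the indexer supports,
    mirroring how a local directory would be walked."""
    with zipfile.ZipFile(zip_source) as zf:
        return [
            name for name in zf.namelist()
            if not name.endswith("/")  # skip directory entries
            and "." + name.rsplit(".", 1)[-1].lower() in SUPPORTED
        ]
```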
## Best Practices

### Version Naming

Use semantic versioning patterns:

- Exact: `1.2.3`, `18.3.1`
- X-Range: `18.x`, `3.x` (matches the latest minor/patch)
- Latest: `latest` (for the most recent version)
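A minimal sketch of checking a concrete version against these patterns. Note the simplification: since an X-range matches the *latest* minor/patch, a real resolver would pick the newest matching version, whereas this helper only tests membership:

```python
def matches_version(version: str, pattern: str) -> bool:
    """Check a concrete version against exact, X-range, or "latest" patterns."""
    if pattern == "latest":
        return True  # resolved to the most recent version at scrape time
    if pattern.endswith(".x"):
        prefix = pattern[:-2]
        return version == prefix or version.startswith(prefix + ".")
    return version == pattern
```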
### URL Filtering

Use include/exclude patterns to:

- Focus on relevant sections (`**/api/**`, `**/reference/**`)
- Skip non-documentation pages (`**/blog/**`, `**/changelog/**`)
- Reduce indexing time and storage
- Improve search relevance
### Organize Libraries

Use the organization field to group related libraries:
### Refresh vs Re-scrape

- Use `refresh` for updates to existing documentation (faster, uses ETags)
- Use `scrape` for new libraries or complete re-indexing
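The ETag-based refresh relies on HTTP conditional requests: the stored ETag is sent in an `If-None-Match` header, and a `304 Not Modified` response means the cached page can be skipped. A sketch of that decision logic (illustrative, not the tool's internals):

```python
def conditional_headers(cached_etag=None):
    """Build headers for a refresh request. With a stored ETag the server
    can answer 304 Not Modified instead of resending the page body."""
    return {"If-None-Match": cached_etag} if cached_etag else {}

def needs_reindex(status: int) -> bool:
    """A 304 response means the cached copy is still current."""
    return status != 304
```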
## Troubleshooting

### Rate Limiting

**Problem**: HTTP 429 errors or slow scraping

**Solutions**:

- Reduce `--max-concurrency` (default: 5)
- Use `--delay` to add delays between requests
- For GitHub: add a `GITHUB_TOKEN` for higher rate limits
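A common way to respond to HTTP 429 is to honor the server's `Retry-After` header when present, and otherwise back off exponentially. A sketch of the delay calculation (illustrative, not this tool's internals):

```python
def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying a rate-limited request.

    Prefer the server's Retry-After value; otherwise double the delay
    on each attempt, capped so a long outage doesn't stall forever.
    """
    if retry_after is not None:
        return retry_after
    return min(cap, base * (2 ** attempt))
```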
### Authentication Errors

**Problem**: 401/403 errors for private content

**Solutions**:

- Add appropriate credentials (`GITHUB_TOKEN`, etc.)
- Verify the token has the required permissions
- Check that the token has not expired
### File Not Found

**Problem**: Local file paths fail

**Solutions**:

- Use absolute paths, not relative ones
- Check that the file or directory exists and is readable
- For Docker: verify the volume mount is correct
- Check that the path uses forward slashes
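On the forward-slash point: Windows backslash paths can be normalized with `pathlib`. An illustrative helper (the exact `file://` form the tool accepts may differ):

```python
from pathlib import PureWindowsPath

def to_file_url(path: str) -> str:
    """Normalize a local path, including Windows backslash paths,
    to a forward-slash file:// URL."""
    if "\\" in path or (len(path) > 1 and path[1] == ":"):
        return PureWindowsPath(path).as_uri()
    return "file://" + path
```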
### Large Repositories

**Problem**: Scraping takes too long or times out

**Solutions**:

- Use `--include` patterns to limit scope
- Increase `--max-pages` if needed
- For GitHub: target specific directories instead of the full repository
- Consider splitting into multiple libraries
## Next Steps

- **Search Documentation**: learn how to search your indexed documentation
- **CLI Reference**: complete `scrape` command reference
- **Configuration**: configure scraper options and defaults
- **MCP Tools**: use scraping from your AI assistant
