## Supported Source Types

- **Websites**: HTTP/HTTPS documentation sites
- **GitHub**: repositories, wikis, and specific branches
- **npm**: npm package READMEs and docs
- **PyPI**: Python package documentation
- **Local Files**: filesystem directories and files
- **ZIP Archives**: compressed documentation bundles
## Web Documentation

Scrape documentation from HTTP/HTTPS URLs.

### Basic Usage

### Advanced Options

- Maximum number of pages to scrape
- Maximum link depth to follow
- Only scrape URLs matching these glob patterns. Example: `["**/api/**", "**/reference/**"]`
- Skip URLs matching these patterns. Example: `["**/blog/**", "**/changelog/**"]`

### Examples
- React Docs
- Next.js Docs
- Python Docs
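The include/exclude options take glob patterns. As a rough sketch of how such filtering might work (using Python's `fnmatch`, where `**` behaves like `*`; the scraper's actual glob semantics may differ):

```python
from fnmatch import fnmatch

def should_scrape(path: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if a URL path passes the include/exclude glob filters."""
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    # An empty include list means "everything not excluded".
    return not include or any(fnmatch(path, pat) for pat in include)
```

Exclude patterns win over include patterns, which is why adding `**/blog/**` to the exclude list drops blog posts even when a broad include pattern would otherwise match them.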
## GitHub Repositories

Index documentation directly from GitHub repositories.

### URL Formats

### GitHub Authentication

For private repositories, or to avoid rate limits, configure GitHub authentication. See Authentication for complete GitHub auth setup.

### Supported File Types

The GitHub strategy processes:

- Markdown files (`.md`, `.mdx`)
- Documentation files (`.txt`, `.rst`)
- Source code (`.js`, `.ts`, `.py`, `.java`, etc.)
- Configuration files (`.json`, `.yaml`, `.toml`)

### Examples
- Full Repository
- Docs Directory
- Private Repo
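The URL format examples themselves did not survive on this page, but GitHub web URLs conventionally look like `https://github.com/<owner>/<repo>`, optionally followed by `/tree/<branch>/<path>` to target a branch or subdirectory. A hypothetical parser for that convention (an illustration, not the tool's actual code):

```python
from urllib.parse import urlparse

def parse_github_url(url: str) -> dict:
    """Split a github.com URL into owner, repo, and optional branch/path,
    assuming the standard /<owner>/<repo>[/tree/<branch>[/<path>]] layout."""
    parts = urlparse(url).path.strip("/").split("/")
    info = {"owner": parts[0], "repo": parts[1], "branch": None, "path": None}
    if len(parts) > 3 and parts[2] == "tree":
        info["branch"] = parts[3]
        info["path"] = "/".join(parts[4:]) or None
    return info
```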
## npm Packages

Index npm package documentation and READMEs.

### URL Format

### Examples

### What Gets Indexed

- README.md and documentation files
- `package.json` metadata
- TypeScript type definitions
- Extracted from the published npm tarball

npm packages are downloaded as tarballs and extracted. Documentation files are automatically detected and processed.
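Conceptually, detecting documentation files in an extracted tarball listing might look like the following sketch (the file-name heuristics here are assumptions for illustration, not the tool's actual rules):

```python
import os

# Hypothetical heuristics: well-known basenames plus documentation extensions.
DOC_BASENAMES = {"readme", "changelog"}
DOC_EXTS = {".md", ".mdx", ".txt"}

def find_doc_files(members: list[str]) -> list[str]:
    """Pick likely documentation files out of a tarball member list."""
    docs = []
    for member in members:
        stem, ext = os.path.splitext(os.path.basename(member))
        if ext.lower() in DOC_EXTS or stem.lower() in DOC_BASENAMES:
            docs.append(member)
    return docs
```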
## PyPI Packages

Index Python package documentation from PyPI.

### URL Format

### Examples

### What Gets Indexed

- README and documentation files
- Package metadata
- Python source code docstrings
- Extracted from PyPI distributions
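Docstrings can be pulled from Python source without importing the package by walking the AST; a minimal sketch of the idea (not necessarily how this tool implements it):

```python
import ast

def extract_docstrings(source: str) -> dict:
    """Map module, function, and class names to their docstrings."""
    tree = ast.parse(source)
    docs = {}
    mod_doc = ast.get_docstring(tree)
    if mod_doc:
        docs["<module>"] = mod_doc
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc:
                docs[node.name] = doc
    return docs
```

Because nothing is executed, this works even when the package's dependencies are not installed.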
## Local Files

Index documentation from your local filesystem.

### URL Format

### Platform Examples

- macOS/Linux
- Windows

#### Single File

#### Directory

### Supported File Types

- **Documentation**: `.md`, `.mdx`, `.txt`, `.rst`, `.asciidoc`
- **Source Code**: `.js`, `.ts`, `.py`, `.java`, `.go`, `.rs`, etc.
- **Office Docs**: `.docx`, `.xlsx`, `.pptx`
- **Others**: `.pdf`, `.html`, `.json`, `.yaml`

### Docker Volumes

When using Docker, mount local directories:

### Examples
- Project Docs
- Internal Wiki
- Single File
## ZIP Archives

Index documentation from ZIP files.

### URL Format

### Example

### Processing

ZIP archives are processed the same way as local directories. All supported file types are extracted and indexed.
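The "treat a ZIP like a directory" step can be sketched with the standard library (the supported-extension set here is abbreviated from the lists above):

```python
import zipfile

# Abbreviated subset of the supported extensions listed under Local Files.
SUPPORTED = {".md", ".mdx", ".txt", ".rst", ".html", ".json", ".yaml", ".pdf"}

def indexable_members(zip_source) -> list:
    """List archive members whose extensions the indexer supports,
    mirroring how a local directory would be walked."""
    with zipfile.ZipFile(zip_source) as zf:
        return [
            name for name in zf.namelist()
            if not name.endswith("/")  # skip directory entries
            and "." + name.rsplit(".", 1)[-1].lower() in SUPPORTED
        ]
```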
## Best Practices

### Version Naming

Use semantic versioning patterns:

- Exact: `1.2.3`, `18.3.1`
- X-Range: `18.x`, `3.x` (matches the latest minor/patch)
- Latest: `latest` (for the most recent version)
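A minimal sketch of checking a concrete version against these patterns. Note the simplification: since an X-range matches the *latest* minor/patch, a real resolver would pick the newest matching version, whereas this helper only tests membership:

```python
def matches_version(version: str, pattern: str) -> bool:
    """Check a concrete version against exact, X-range, or "latest" patterns."""
    if pattern == "latest":
        return True  # resolved to the most recent version at scrape time
    if pattern.endswith(".x"):
        prefix = pattern[:-2]
        return version == prefix or version.startswith(prefix + ".")
    return version == pattern
```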
### URL Filtering

Use include/exclude patterns to:

- Focus on relevant sections (`**/api/**`, `**/reference/**`)
- Skip non-documentation pages (`**/blog/**`, `**/changelog/**`)
- Reduce indexing time and storage
- Improve search relevance
### Organize Libraries

Use the organization field to group related libraries:
### Refresh vs Re-scrape

- Use `refresh` for updates to existing documentation (faster, uses ETags)
- Use `scrape` for new libraries or complete re-indexing
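The ETag-based refresh relies on HTTP conditional requests: the stored ETag is sent in an `If-None-Match` header, and a `304 Not Modified` response means the cached page can be skipped. A sketch of that decision logic (illustrative, not the tool's internals):

```python
def conditional_headers(cached_etag=None):
    """Build headers for a refresh request. With a stored ETag the server
    can answer 304 Not Modified instead of resending the page body."""
    return {"If-None-Match": cached_etag} if cached_etag else {}

def needs_reindex(status: int) -> bool:
    """A 304 response means the cached copy is still current."""
    return status != 304
```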
## Troubleshooting

### Rate Limiting

**Problem**: HTTP 429 errors or slow scraping

**Solutions**:

- Reduce `--max-concurrency` (default: 5)
- Use `--delay` to add delays between requests
- For GitHub: add a `GITHUB_TOKEN` for higher rate limits
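A common way to respond to HTTP 429 is to honor the server's `Retry-After` header when present, and otherwise back off exponentially. A sketch of the delay calculation (illustrative, not this tool's internals):

```python
def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retrying a rate-limited request.

    Prefer the server's Retry-After value; otherwise double the delay
    on each attempt, capped so a long outage doesn't stall forever.
    """
    if retry_after is not None:
        return retry_after
    return min(cap, base * (2 ** attempt))
```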
### Authentication Errors

**Problem**: 401/403 errors for private content

**Solutions**:

- Add appropriate credentials (`GITHUB_TOKEN`, etc.)
- Verify the token has the required permissions
- Check that the token has not expired
### File Not Found

**Problem**: Local file paths fail

**Solutions**:

- Use absolute paths, not relative ones
- Check that the file or directory exists and is readable
- For Docker: verify the volume mount is correct
- Check that the path uses forward slashes
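On the forward-slash point: Windows backslash paths can be normalized with `pathlib`. An illustrative helper (the exact `file://` form the tool accepts may differ):

```python
from pathlib import PureWindowsPath

def to_file_url(path: str) -> str:
    """Normalize a local path, including Windows backslash paths,
    to a forward-slash file:// URL."""
    if "\\" in path or (len(path) > 1 and path[1] == ":"):
        return PureWindowsPath(path).as_uri()
    return "file://" + path
```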
### Large Repositories

**Problem**: Scraping takes too long or times out

**Solutions**:

- Use `--include` patterns to limit scope
- Increase `--max-pages` if needed
- For GitHub: target specific directories instead of the full repository
- Consider splitting into multiple libraries
## Next Steps

- **Search Documentation**: learn how to search your indexed documentation
- **CLI Reference**: complete `scrape` command reference
- **Configuration**: configure scraper options and defaults
- **MCP Tools**: use scraping from your AI assistant
