
Usage

docs-mcp-server scrape <library> <url> [options]

Arguments

library
string
required
Library name to use for indexing.
url
string
required
URL or file:// path to scrape. Can be:
  • HTTP/HTTPS URL: https://react.dev/reference/react
  • Local file: file:///Users/me/docs/index.html
  • Local directory: file:///Users/me/docs/my-library

Options

Basic Options

--version, -v
string
Version of the library (optional). If not specified, the library will be indexed without a version.
--max-pages, -p
number
default: 1000
Maximum number of pages to scrape. Prevents runaway crawls.
--max-depth, -d
number
default: 10
Maximum navigation depth from the starting URL. Links beyond this depth will not be followed.
--max-concurrency, -c
number
default: 5
Maximum number of concurrent page requests. Higher values speed up scraping but may overwhelm the server.
--clean
boolean
default: true
Clear existing documents before scraping. Set to false to append to existing documentation.

Crawling Options

--scope
string
default: "subpages"
Defines the crawling boundary:
  • subpages - Only crawl URLs under the starting URL path
  • hostname - Crawl any URL on the same hostname
  • domain - Crawl any URL on the same domain (including subdomains)
--follow-redirects
boolean
default: true
Follow HTTP redirects during scraping. Use --no-follow-redirects to disable.
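To index pages exactly as served, redirect following can be disabled with the negated form mentioned above:

```shell
# Skip redirected pages instead of following them to their targets
docs-mcp-server scrape mylib https://docs.example.com \
  --no-follow-redirects
```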
--ignore-errors
boolean
default: true
Continue scraping even if individual pages fail to load.
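Error tolerance is on by default. To stop the crawl at the first failed page instead, the flag can be set to false explicitly; this sketch assumes the same boolean-value convention shown elsewhere in this reference (e.g., `--clean false`):

```shell
# Strict mode: abort the crawl as soon as any page fails to load,
# using the explicit boolean-value form (as with --clean false)
docs-mcp-server scrape mylib https://docs.example.com \
  --ignore-errors false
```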

Content Processing

--scrape-mode
string
default: "auto"
HTML processing strategy:
  • auto - Automatically detect the best extraction method
  • markdown - Convert HTML to Markdown
  • html - Use raw HTML
  • text - Extract plain text only
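For example, to force Markdown conversion rather than relying on automatic detection:

```shell
# Convert every scraped page from HTML to Markdown before indexing
docs-mcp-server scrape mylib https://docs.example.com \
  --scrape-mode markdown
```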

URL Filtering

--include-pattern
string[]
Glob or regex pattern for URLs to include. Can be specified multiple times. Regex patterns must be wrapped in slashes (e.g., /pattern/). Example:
--include-pattern "*/api/*" --include-pattern "/docs.*html/"
--exclude-pattern
string[]
Glob or regex pattern for URLs to exclude. Takes precedence over include patterns. Can be specified multiple times. Regex patterns must be wrapped in slashes. Example:
--exclude-pattern "*/blog/*" --exclude-pattern "*/changelog/*"

Request Headers

--header
string[]
Custom HTTP header to send with each request. Can be specified multiple times. Format: Header-Name: Header-Value. Example:
--header "Authorization: Bearer token123" \
--header "User-Agent: MyBot/1.0"

Embedding Configuration

--embedding-model
string
Embedding model configuration in the format provider:model-name. Examples:
  • openai:text-embedding-3-small
  • vertex:text-embedding-004
  • bedrock:amazon.titan-embed-text-v1
  • azure:text-embedding-ada-002
See Embedding Models for full details.

Remote Worker

--server-url
string
URL of an external pipeline worker RPC endpoint. Example: http://localhost:8080/api. When specified, the scraping job is sent to the remote worker instead of being processed locally.

Examples

Basic Scraping

# Scrape React documentation
docs-mcp-server scrape react https://react.dev/reference/react

# Scrape with version
docs-mcp-server scrape react https://react.dev/reference/react --version 18.0.0

Local Files

# Scrape single HTML file
docs-mcp-server scrape mylib file:///Users/me/docs/index.html

# Scrape directory
docs-mcp-server scrape mylib file:///Users/me/docs/my-library
When using Docker, mount the local directory:
docker run -v /Users/me/docs:/docs \
  ghcr.io/arabold/docs-mcp-server \
  scrape mylib file:///docs

Advanced Crawling

# Limit scope to API documentation only
docs-mcp-server scrape mylib https://docs.example.com/api \
  --scope subpages \
  --max-pages 500 \
  --max-depth 5

# Use include/exclude patterns
docs-mcp-server scrape mylib https://docs.example.com \
  --include-pattern "*/api/*" \
  --include-pattern "*/reference/*" \
  --exclude-pattern "*/blog/*"

Custom Headers and Authentication

# Scrape with authentication
docs-mcp-server scrape private-docs https://private.example.com \
  --header "Authorization: Bearer $TOKEN" \
  --header "X-Custom-Header: value"

Performance Tuning

# Fast scraping with high concurrency
docs-mcp-server scrape mylib https://docs.example.com \
  --max-concurrency 10 \
  --max-pages 2000

# Conservative scraping (slow but safe)
docs-mcp-server scrape mylib https://docs.example.com \
  --max-concurrency 2 \
  --max-depth 3

Using Custom Embedding Models

# OpenAI embeddings
docs-mcp-server scrape react https://react.dev \
  --embedding-model openai:text-embedding-3-small

# Vertex AI embeddings
docs-mcp-server scrape react https://react.dev \
  --embedding-model vertex:text-embedding-004

Remote Worker

# Send scraping job to remote worker
docs-mcp-server scrape react https://react.dev \
  --server-url http://worker.example.com:8080/api

Output

The command displays progress during scraping:
⏳ Initializing scraping job...
🚀 Scraping react v18.0.0...
📄 Scraping react v18.0.0: 45/100 pages
📄 Scraping react v18.0.0: 100/100 pages
✅ Successfully scraped 100 pages

Behavior

Clean Mode (Default)

By default, --clean true removes all existing documents for the library/version before scraping:
# This removes existing React 18.0.0 docs before scraping
docs-mcp-server scrape react https://react.dev --version 18.0.0

Append Mode

Use --clean false to append to existing documentation:
# This adds to existing React docs without removing them
docs-mcp-server scrape react https://react.dev/hooks --clean false

URL Scope

The --scope option controls crawling boundaries:
# Only crawl /api/* and below
docs-mcp-server scrape mylib https://example.com/api --scope subpages
# Result: https://example.com/api/endpoint ✓
# Result: https://example.com/guide ✗

# Crawl entire example.com hostname
docs-mcp-server scrape mylib https://example.com/api --scope hostname
# Result: https://example.com/guide ✓
# Result: https://api.example.com ✗

# Crawl all *.example.com domains
docs-mcp-server scrape mylib https://example.com/api --scope domain
# Result: https://api.example.com ✓
# Result: https://other.com ✗

Pattern Matching

Patterns support both glob and regex:
# Glob patterns
--include-pattern "*/api/*"        # Matches /docs/api/endpoint
--exclude-pattern "*.pdf"          # Excludes PDF files

# Regex patterns (wrapped in slashes)
--include-pattern "/api\/v[0-9]+/"  # Matches /api/v1, /api/v2, etc.
--exclude-pattern "/\.(jpg|png)$/"  # Excludes image files

Error Handling

Invalid URL
 Scraping failed: Invalid URL format
Authentication Required
 Scraping failed: HTTP 401 Unauthorized
# Solution: Add authentication header
docs-mcp-server scrape mylib https://example.com \
  --header "Authorization: Bearer $TOKEN"
Embedding Model Not Configured
 OpenAI API key not configured
# Solution: Set environment variable or use --embedding-model
export OPENAI_API_KEY=sk-...

See Also

  • refresh - Update existing documentation with changed pages
  • list - View indexed libraries
  • search - Query indexed documentation
  • Embedding Models - Configure embedding providers
