The scraping tools allow you to index new documentation, refresh existing versions, and remove indexed content.
These tools are disabled in read-only mode. Configure readOnly: false to enable write operations.

scrape_docs

Index documentation from a URL for a library version.
MCP Tool Name: scrape_docs
Source: src/tools/ScrapeTool.ts

Parameters

  • library (string, required): Library name to index
  • url (string, required): Documentation root URL to scrape (must be a valid HTTP/HTTPS URL)
  • version (string | null, optional): Library version (e.g., "5.4.2", "18.0.0"). Omit or pass null for unversioned docs.
  • options (object, optional): Scraping configuration options:
      - maxPages (number, default: 1000): Maximum number of pages to scrape
      - maxDepth (number, default: 10): Maximum navigation depth from the root URL
      - scope ('subpages' | 'hostname' | 'domain', default: "subpages"): Crawling boundary:
          · subpages: only crawl URLs within the same path as the starting URL
          · hostname: crawl any URL on the same hostname
          · domain: crawl any URL on the same top-level domain (including subdomains)
      - followRedirects (boolean, default: true): Whether to follow HTTP redirects (3xx responses). When false, throws RedirectError.
      - maxConcurrency (number, default: 5): Maximum number of concurrent requests
      - ignoreErrors (boolean, default: true): Continue scraping if individual pages fail
      - scrapeMode ('fetch' | 'playwright' | 'auto', default: "auto"): HTML processing strategy:
          · fetch: simple DOM parser (faster, less JS support)
          · playwright: headless browser (slower, full JS support)
          · auto: automatically select the best strategy
      - includePatterns (string[], optional): Regex patterns for including URLs. Patterns must be wrapped in slashes: /pattern/
      - excludePatterns (string[], optional): Regex patterns for excluding URLs; exclude takes precedence over include. Patterns must be wrapped in slashes: /pattern/
      - headers (Record<string, string>, optional): Custom HTTP headers to send with each request (e.g., for authentication)
      - clean (boolean, default: true): If true, clears existing documents for the library version before scraping. If false, appends to existing documents.
  • waitForCompletion (boolean, default: false): Whether to wait for scraping to complete. When false (the default for MCP), returns a jobId immediately.
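
To make the includePatterns/excludePatterns semantics concrete, here is a small sketch of URL filtering with exclude-over-include precedence. This is a hypothetical helper for illustration, not the library's actual implementation:

```typescript
// Hypothetical sketch: URL filtering where exclude takes precedence over include.
// Patterns are written /like-this/, as required by the parameter docs.

function parsePattern(p: string): RegExp {
  // Strip the surrounding slashes required by includePatterns/excludePatterns.
  const m = p.match(/^\/(.*)\/$/);
  if (!m) throw new Error(`Pattern must be wrapped in slashes: ${p}`);
  return new RegExp(m[1]);
}

function shouldCrawl(
  url: string,
  includePatterns: string[] = [],
  excludePatterns: string[] = [],
): boolean {
  // Exclude always wins over include.
  if (excludePatterns.some((p) => parsePattern(p).test(url))) return false;
  // With no include patterns, everything not excluded is allowed.
  if (includePatterns.length === 0) return true;
  return includePatterns.some((p) => parsePattern(p).test(url));
}
```

For example, with includePatterns: ["/learn/"] and excludePatterns: ["/state/"], a URL matching both patterns is skipped.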

Response Structure

  • jobId (string): UUID of the scraping job (returned when waitForCompletion is false)
  • pagesScraped (number): Number of pages successfully scraped (returned when waitForCompletion is true)

TypeScript Types

interface ScrapeToolOptions {
  library: string;
  version?: string | null;
  url: string;
  options?: {
    maxPages?: number;
    maxDepth?: number;
    scope?: "subpages" | "hostname" | "domain";
    followRedirects?: boolean;
    maxConcurrency?: number;
    ignoreErrors?: boolean;
    scrapeMode?: ScrapeMode;
    includePatterns?: string[];
    excludePatterns?: string[];
    headers?: Record<string, string>;
    clean?: boolean;
  };
  waitForCompletion?: boolean;
}

enum ScrapeMode {
  Fetch = "fetch",
  Playwright = "playwright",
  Auto = "auto"
}

type ScrapeExecuteResult = ScrapeResult | { jobId: string };
See src/tools/ScrapeTool.ts:8-73 for complete type definitions.

Example Request

{
  "name": "scrape_docs",
  "arguments": {
    "library": "react",
    "version": "18.2.0",
    "url": "https://react.dev/",
    "maxPages": 500,
    "maxDepth": 5,
    "scope": "subpages",
    "followRedirects": true
  }
}

Example Response

{
  "jobId": "550e8400-e29b-41d4-a716-446655440000"
}

MCP Output

When called through MCP (see src/mcp/mcpServer.ts:42):
🚀 Scraping job started with ID: 550e8400-e29b-41d4-a716-446655440000.
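
If a client needs the job ID out of this text output, a simple extraction looks like the following (a hypothetical client-side helper, not part of the server's API):

```typescript
// Hypothetical client-side helper: pull the job UUID out of the MCP text response.
function extractJobId(mcpText: string): string | null {
  const m = mcpText.match(
    /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i,
  );
  return m ? m[0] : null;
}
```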

Error Cases

Invalid version format:
{
  "error": "Invalid version format for scraping: 'invalid'. Use 'X.Y.Z', 'X.Y.Z-prerelease', 'X.Y', 'X', or omit."
}
Invalid URL:
{
  "error": "URL must be a valid HTTP/HTTPS URL."
}

Version Format Support

The tool accepts and normalizes various version formats:
  • Full semver: "18.2.0" → "18.2.0"
  • Partial version: "18.2" → "18.2.0" (coerced)
  • Major only: "18" → "18.0.0" (coerced)
  • Unversioned: null or "" (empty string)
  • Prerelease: "18.3.0-rc.1" → "18.3.0-rc.1"
See src/tools/ScrapeTool.ts:98-124 for version normalization logic.
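
The coercion rules above can be sketched as follows. This is a simplified illustration of the documented behavior, not the actual ScrapeTool.ts logic:

```typescript
// Simplified sketch of the documented version coercion rules;
// not the actual ScrapeTool implementation.
function normalizeVersion(version: string | null | undefined): string {
  if (version == null || version === "") return ""; // unversioned
  const m = version.match(/^(\d+)(?:\.(\d+))?(?:\.(\d+))?(-.+)?$/);
  if (!m) {
    throw new Error(`Invalid version format for scraping: '${version}'`);
  }
  // Missing minor/patch components are coerced to 0.
  const [, major, minor = "0", patch = "0", prerelease = ""] = m;
  return `${major}.${minor}.${patch}${prerelease}`;
}
```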

Scope Examples

Given starting URL https://react.dev/reference/react/hooks:
subpages (default):
  • ✓ https://react.dev/reference/react/hooks/useState
  • ✓ https://react.dev/reference/react/hooks/useEffect
  • ✗ https://react.dev/learn (different path)
  • ✗ https://legacy.reactjs.org/ (different hostname)
hostname:
  • ✓ https://react.dev/reference/react/hooks
  • ✓ https://react.dev/learn
  • ✗ https://legacy.reactjs.org/ (different hostname)
domain:
  • ✓ https://react.dev/reference/react/hooks
  • ✓ https://legacy.react.dev/ (subdomain)
  • ✗ https://reactjs.org/ (different domain)
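
The three boundaries above can be sketched as a predicate. This is a hypothetical helper for illustration only; the baseDomain heuristic (last two host labels) is naive, and real crawlers use a public-suffix list:

```typescript
// Hypothetical sketch of the three crawl scopes; not the library's actual code.
type Scope = "subpages" | "hostname" | "domain";

// Naive registrable-domain heuristic: last two host labels ("react.dev").
function baseDomain(hostname: string): string {
  return hostname.split(".").slice(-2).join(".");
}

function isInScope(start: string, candidate: string, scope: Scope): boolean {
  const s = new URL(start);
  const c = new URL(candidate);
  switch (scope) {
    case "subpages": {
      // Same hostname, and the candidate path sits under the starting path.
      const dir = s.pathname.endsWith("/") ? s.pathname : s.pathname + "/";
      return c.hostname === s.hostname && (c.pathname + "/").startsWith(dir);
    }
    case "hostname":
      return c.hostname === s.hostname;
    case "domain":
      return baseDomain(c.hostname) === baseDomain(s.hostname);
  }
}
```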

refresh_version

Re-scrape a previously indexed library version, updating only changed pages.
MCP Tool Name: refresh_version
Source: src/tools/RefreshTool.ts (uses the same underlying ScrapeTool)

Parameters

  • library (string, required): Library name to refresh
  • version (string, optional): Version to refresh (defaults to latest if omitted)

Response Structure

  • jobId (string): UUID of the refresh job

Example Request

{
  "name": "refresh_version",
  "arguments": {
    "library": "react",
    "version": "18.2.0"
  }
}

MCP Output

🔄 Refresh job started with ID: 550e8400-e29b-41d4-a716-446655440000.

Difference from scrape_docs

  • scrape_docs: Indexes from scratch (clears existing data by default)
  • refresh_version: Re-scrapes existing version, only updates changed pages

remove_docs

Remove indexed documentation for a library version.
MCP Tool Name: remove_docs
Source: src/tools/RemoveTool.ts
This is a destructive operation. All documents for the specified version will be permanently deleted.

Parameters

  • library (string, required): Library name
  • version (string, optional): Version to remove (defaults to latest if omitted)

Response Structure

  • message (string): Confirmation message

Example Request

{
  "name": "remove_docs",
  "arguments": {
    "library": "react",
    "version": "17.0.2"
  }
}

MCP Output

Removed documentation for react@17.0.2.

Error Cases

Library not found:
{
  "error": "Library 'nonexistent' not found in store."
}
Version not found:
{
  "error": "Version '99.0.0' not found for library 'react'."
}

Job Management

Since scraping operations are asynchronous, use the job management tools to monitor progress:

Example Workflow

// Helper used below: resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// 1. Start scraping
const { jobId } = await scrapeDocs({
  library: "react",
  version: "18.2.0",
  url: "https://react.dev/"
});

// 2. Check progress
let { job } = await getJobInfo({ jobId });
console.log(`Status: ${job.status}, Progress: ${job.progress?.pages}/${job.progress?.totalPages}`);

// 3. Wait for completion (re-fetch the job each iteration so its status updates)
while (job.status === "running" || job.status === "queued") {
  await sleep(5000);
  ({ job } = await getJobInfo({ jobId }));
  console.log(`Progress: ${job.progress?.pages}/${job.progress?.totalPages}`);
}

console.log(`Completed: ${job.status}`);
