The scraping tools allow you to index new documentation, refresh existing versions, and remove indexed content.
These tools are disabled in read-only mode. Configure readOnly: false to enable write operations.

scrape_docs

Index documentation from a URL for a library version.
MCP Tool Name: scrape_docs
Source: src/tools/ScrapeTool.ts

Parameters

  • library (string, required): Library name to index
  • url (string, required): Documentation root URL to scrape (must be a valid HTTP/HTTPS URL)
  • version (string | null, optional): Library version (e.g., "5.4.2", "18.0.0"). Omit or pass null for unversioned docs.
  • options (object, optional): Scraping configuration options:
      - maxPages (number, default: 1000): Maximum number of pages to scrape
      - maxDepth (number, default: 10): Maximum navigation depth from the root URL
      - scope ('subpages' | 'hostname' | 'domain', default: "subpages"): Crawling boundary:
          · subpages: only crawl URLs within the same path as the starting URL
          · hostname: crawl any URL on the same hostname
          · domain: crawl any URL on the same top-level domain (including subdomains)
      - followRedirects (boolean, default: true): Whether to follow HTTP redirects (3xx responses). When false, throws RedirectError.
      - maxConcurrency (number, default: 5): Maximum number of concurrent requests
      - ignoreErrors (boolean, default: true): Continue scraping if individual pages fail
      - scrapeMode ('fetch' | 'playwright' | 'auto', default: "auto"): HTML processing strategy:
          · fetch: simple DOM parser (faster, less JS support)
          · playwright: headless browser (slower, full JS support)
          · auto: automatically select the best strategy
      - includePatterns (string[], optional): Regex patterns for including URLs. Patterns must be wrapped in slashes: /pattern/
      - excludePatterns (string[], optional): Regex patterns for excluding URLs; exclude takes precedence over include. Patterns must be wrapped in slashes: /pattern/
      - headers (Record<string, string>, optional): Custom HTTP headers to send with each request (e.g., for authentication)
      - clean (boolean, default: true): If true, clears existing documents for the library version before scraping. If false, appends to existing documents.
  • waitForCompletion (boolean, default: false): Whether to wait for scraping to complete. When false (the default for MCP), returns a jobId immediately.
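
To make the includePatterns/excludePatterns semantics concrete, here is a small sketch of URL filtering with exclude-over-include precedence. This is a hypothetical helper for illustration, not the library's actual implementation:

```typescript
// Hypothetical sketch: URL filtering where exclude takes precedence over include.
// Patterns are written /like-this/, as required by the parameter docs.

function parsePattern(p: string): RegExp {
  // Strip the surrounding slashes required by includePatterns/excludePatterns.
  const m = p.match(/^\/(.*)\/$/);
  if (!m) throw new Error(`Pattern must be wrapped in slashes: ${p}`);
  return new RegExp(m[1]);
}

function shouldCrawl(
  url: string,
  includePatterns: string[] = [],
  excludePatterns: string[] = [],
): boolean {
  // Exclude always wins over include.
  if (excludePatterns.some((p) => parsePattern(p).test(url))) return false;
  // With no include patterns, everything not excluded is allowed.
  if (includePatterns.length === 0) return true;
  return includePatterns.some((p) => parsePattern(p).test(url));
}
```

For example, with includePatterns: ["/learn/"] and excludePatterns: ["/state/"], a URL matching both patterns is skipped.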

Response Structure

  • jobId (string): UUID of the scraping job (returned when waitForCompletion is false)
  • pagesScraped (number): Number of pages successfully scraped (returned when waitForCompletion is true)

TypeScript Types

interface ScrapeToolOptions {
  library: string;
  version?: string | null;
  url: string;
  options?: {
    maxPages?: number;
    maxDepth?: number;
    scope?: "subpages" | "hostname" | "domain";
    followRedirects?: boolean;
    maxConcurrency?: number;
    ignoreErrors?: boolean;
    scrapeMode?: ScrapeMode;
    includePatterns?: string[];
    excludePatterns?: string[];
    headers?: Record<string, string>;
    clean?: boolean;
  };
  waitForCompletion?: boolean;
}

enum ScrapeMode {
  Fetch = "fetch",
  Playwright = "playwright",
  Auto = "auto"
}

type ScrapeExecuteResult = ScrapeResult | { jobId: string };
See src/tools/ScrapeTool.ts:8-73 for complete type definitions.

Example Request

{
  "name": "scrape_docs",
  "arguments": {
    "library": "react",
    "version": "18.2.0",
    "url": "https://react.dev/",
    "maxPages": 500,
    "maxDepth": 5,
    "scope": "subpages",
    "followRedirects": true
  }
}

Example Response

{
  "jobId": "550e8400-e29b-41d4-a716-446655440000"
}

MCP Output

When called through MCP (see src/mcp/mcpServer.ts:42):
🚀 Scraping job started with ID: 550e8400-e29b-41d4-a716-446655440000.
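
If a client needs the job ID out of this text output, a simple extraction looks like the following (a hypothetical client-side helper, not part of the server's API):

```typescript
// Hypothetical client-side helper: pull the job UUID out of the MCP text response.
function extractJobId(mcpText: string): string | null {
  const m = mcpText.match(
    /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i,
  );
  return m ? m[0] : null;
}
```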

Error Cases

Invalid version format:
{
  "error": "Invalid version format for scraping: 'invalid'. Use 'X.Y.Z', 'X.Y.Z-prerelease', 'X.Y', 'X', or omit."
}
Invalid URL:
{
  "error": "URL must be a valid HTTP/HTTPS URL."
}

Version Format Support

The tool accepts and normalizes various version formats:
  • Full semver: "18.2.0" → "18.2.0"
  • Partial version: "18.2" → "18.2.0" (coerced)
  • Major only: "18" → "18.0.0" (coerced)
  • Unversioned: null or "" (empty string)
  • Prerelease: "18.3.0-rc.1" → "18.3.0-rc.1"
See src/tools/ScrapeTool.ts:98-124 for version normalization logic.
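
The coercion rules above can be sketched as follows. This is a simplified illustration of the documented behavior, not the actual ScrapeTool.ts logic:

```typescript
// Simplified sketch of the documented version coercion rules;
// not the actual ScrapeTool implementation.
function normalizeVersion(version: string | null | undefined): string {
  if (version == null || version === "") return ""; // unversioned
  const m = version.match(/^(\d+)(?:\.(\d+))?(?:\.(\d+))?(-.+)?$/);
  if (!m) {
    throw new Error(`Invalid version format for scraping: '${version}'`);
  }
  // Missing minor/patch components are coerced to 0.
  const [, major, minor = "0", patch = "0", prerelease = ""] = m;
  return `${major}.${minor}.${patch}${prerelease}`;
}
```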

Scope Examples

Given starting URL https://react.dev/reference/react/hooks:
subpages (default):
  • ✓ https://react.dev/reference/react/hooks/useState
  • ✓ https://react.dev/reference/react/hooks/useEffect
  • ✗ https://react.dev/learn (different path)
  • ✗ https://legacy.reactjs.org/ (different hostname)
hostname:
  • ✓ https://react.dev/reference/react/hooks
  • ✓ https://react.dev/learn
  • ✗ https://legacy.reactjs.org/ (different hostname)
domain:
  • ✓ https://react.dev/reference/react/hooks
  • ✓ https://legacy.react.dev/ (subdomain)
  • ✗ https://reactjs.org/ (different domain)
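
The three boundaries above can be sketched as a predicate. This is a hypothetical helper for illustration only; the baseDomain heuristic (last two host labels) is naive, and real crawlers use a public-suffix list:

```typescript
// Hypothetical sketch of the three crawl scopes; not the library's actual code.
type Scope = "subpages" | "hostname" | "domain";

// Naive registrable-domain heuristic: last two host labels ("react.dev").
function baseDomain(hostname: string): string {
  return hostname.split(".").slice(-2).join(".");
}

function isInScope(start: string, candidate: string, scope: Scope): boolean {
  const s = new URL(start);
  const c = new URL(candidate);
  switch (scope) {
    case "subpages": {
      // Same hostname, and the candidate path sits under the starting path.
      const dir = s.pathname.endsWith("/") ? s.pathname : s.pathname + "/";
      return c.hostname === s.hostname && (c.pathname + "/").startsWith(dir);
    }
    case "hostname":
      return c.hostname === s.hostname;
    case "domain":
      return baseDomain(c.hostname) === baseDomain(s.hostname);
  }
}
```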

refresh_version

Re-scrape a previously indexed library version, updating only changed pages.
MCP Tool Name: refresh_version
Source: src/tools/RefreshTool.ts (uses the same underlying ScrapeTool)

Parameters

  • library (string, required): Library name to refresh
  • version (string, optional): Version to refresh (defaults to latest if omitted)

Response Structure

  • jobId (string): UUID of the refresh job

Example Request

{
  "name": "refresh_version",
  "arguments": {
    "library": "react",
    "version": "18.2.0"
  }
}

MCP Output

🔄 Refresh job started with ID: 550e8400-e29b-41d4-a716-446655440000.

Difference from scrape_docs

  • scrape_docs: Indexes from scratch (clears existing data by default)
  • refresh_version: Re-scrapes existing version, only updates changed pages

remove_docs

Remove indexed documentation for a library version.
MCP Tool Name: remove_docs
Source: src/tools/RemoveTool.ts
This is a destructive operation. All documents for the specified version will be permanently deleted.

Parameters

  • library (string, required): Library name
  • version (string, optional): Version to remove (defaults to latest if omitted)

Response Structure

  • message (string): Confirmation message

Example Request

{
  "name": "remove_docs",
  "arguments": {
    "library": "react",
    "version": "17.0.2"
  }
}

MCP Output

Removed documentation for react@17.0.2.

Error Cases

Library not found:
{
  "error": "Library 'nonexistent' not found in store."
}
Version not found:
{
  "error": "Version '99.0.0' not found for library 'react'."
}

Job Management

Since scraping operations are asynchronous, use the job management tools to monitor progress:

Example Workflow

// Helper used below: resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// 1. Start scraping
const { jobId } = await scrapeDocs({
  library: "react",
  version: "18.2.0",
  url: "https://react.dev/"
});

// 2. Check progress
let { job } = await getJobInfo({ jobId });
console.log(`Status: ${job.status}, Progress: ${job.progress?.pages}/${job.progress?.totalPages}`);

// 3. Wait for completion (re-fetch the job each iteration so its status updates)
while (job.status === "running" || job.status === "queued") {
  await sleep(5000);
  ({ job } = await getJobInfo({ jobId }));
  console.log(`Progress: ${job.progress?.pages}/${job.progress?.totalPages}`);
}

console.log(`Completed: ${job.status}`);
