## Usage
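The synopsis itself did not survive conversion. A sketch of the general shape, where the binary name `docs-mcp` and the subcommand `scrape` are placeholders inferred from context (the sibling commands under See Also), and the argument order follows the Arguments section below:

```shell
# Placeholder names; consult your installation's --help for exact spellings.
docs-mcp scrape <library> <url> [options]
```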
## Arguments

- Library name to use for indexing.
- URL or `file://` path to scrape. Can be one of:
  - An HTTP/HTTPS URL: `https://react.dev/reference/react`
  - A local file: `file:///Users/me/docs/index.html`
  - A local directory: `file:///Users/me/docs/my-library`
## Options

### Basic Options

- Version of the library (optional). If not specified, the library will be indexed without a version.
- Maximum number of pages to scrape. Prevents runaway crawls.
- Maximum navigation depth from the starting URL. Links beyond this depth will not be followed.
- Maximum number of concurrent page requests. Higher values speed up scraping but may overwhelm the server.
- Clear existing documents before scraping. Set to `false` to append to existing documentation.

### Crawling Options
- `--scope` defines the crawling boundary:
  - `subpages`: only crawl URLs under the starting URL path
  - `hostname`: crawl any URL on the same hostname
  - `domain`: crawl any URL on the same domain (including subdomains)
- Follow HTTP redirects during scraping. Use `--no-follow-redirects` to disable.
- Continue scraping even if individual pages fail to load.
### Content Processing

- HTML processing strategy:
  - `auto`: automatically detect the best extraction method
  - `markdown`: convert HTML to Markdown
  - `html`: use raw HTML
  - `text`: extract plain text only
### URL Filtering

- Glob or regex pattern for URLs to include. Can be specified multiple times. Regex patterns must be wrapped in slashes (e.g., `/pattern/`).
- Glob or regex pattern for URLs to exclude. Takes precedence over include patterns. Can be specified multiple times. Regex patterns must be wrapped in slashes.
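The original examples were stripped from this page. A hypothetical invocation, assuming the binary name `docs-mcp` and the flag spellings `--include-pattern` and `--exclude-pattern` (neither is confirmed here):

```shell
# Glob include; regex exclude (regex wrapped in slashes, per the rule above)
docs-mcp scrape react https://react.dev/reference/react \
  --include-pattern "reference/react/*" \
  --exclude-pattern "/.*deprecated.*/"
```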
### Request Headers

- Custom HTTP header to send with each request. Can be specified multiple times. Format: `Header-Name: Header-Value`.

### Embedding Configuration

- Embedding model configuration in the format `provider:model-name`. Examples:
  - `openai:text-embedding-3-small`
  - `vertex:text-embedding-004`
  - `bedrock:amazon.titan-embed-text-v1`
  - `azure:text-embedding-ada-002`
### Remote Worker

- URL of an external pipeline worker RPC endpoint, e.g. `http://localhost:8080/api`. When specified, the scraping job is sent to the remote worker instead of being processed locally.

## Examples
### Basic Scraping
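A minimal invocation, using the placeholder binary name `docs-mcp` and a `--version` flag whose exact spelling this page does not confirm:

```shell
# Index the React reference docs without a version
docs-mcp scrape react https://react.dev/reference/react

# Index them under an explicit version label
docs-mcp scrape react https://react.dev/reference/react --version 18.2.0
```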
### Local Files
When using Docker, mount the local directory into the container so the `file://` path resolves inside it.
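Sketches for both cases, reusing the `file://` paths from the Arguments section; the binary and Docker image names are placeholders:

```shell
# Index a single local file, then a whole directory
docs-mcp scrape my-library file:///Users/me/docs/index.html
docs-mcp scrape my-library file:///Users/me/docs/my-library

# With Docker, bind-mount the directory so the path exists inside the container
docker run -v /Users/me/docs/my-library:/docs my-docs-image \
  scrape my-library file:///docs
```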
### Advanced Crawling
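A hypothetical crawl that widens the boundary and caps its size. `--scope` and its values are documented above; `--max-depth` and `--max-pages` are assumed spellings for the depth and page-limit options:

```shell
docs-mcp scrape react https://react.dev/ \
  --scope hostname \
  --max-depth 3 \
  --max-pages 500
```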
### Custom Headers and Authentication
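The `Header-Name: Header-Value` format comes from the Request Headers section; the `--header` flag spelling itself is an assumption:

```shell
# Repeat the header flag once per header
docs-mcp scrape internal-docs https://docs.example.com/ \
  --header "Authorization: Bearer $API_TOKEN" \
  --header "X-Custom-Header: value"
```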
### Performance Tuning
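A sketch that raises concurrency while keeping a page cap, assuming `--max-concurrency` and `--max-pages` as flag names:

```shell
# Faster crawl; note that higher concurrency may overwhelm the target server
docs-mcp scrape react https://react.dev/reference/react \
  --max-concurrency 8 \
  --max-pages 1000
```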
### Using Custom Embedding Models
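The `provider:model-name` format and the model identifiers are documented above; the `--embedding-model` flag name is assumed:

```shell
docs-mcp scrape react https://react.dev/reference/react \
  --embedding-model openai:text-embedding-3-small
```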
### Remote Worker
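The endpoint URL is taken from the Remote Worker option description; the `--server-url` flag name is an assumption:

```shell
# Send the job to an external pipeline worker instead of processing locally
docs-mcp scrape react https://react.dev/reference/react \
  --server-url http://localhost:8080/api
```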
## Output
The command displays progress during scraping.

## Behavior
### Clean Mode (Default)
By default, `--clean true` removes all existing documents for the library/version before scraping.
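For instance (placeholder binary name; `--clean true` is the documented default, so writing it out is optional):

```shell
# Re-index from scratch: existing documents for "react" are removed first
docs-mcp scrape react https://react.dev/reference/react --clean true
```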
### Append Mode
Use `--clean false` to append to existing documentation.
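For example, indexing a second section of the same site without discarding the first (binary name is a placeholder):

```shell
docs-mcp scrape react https://react.dev/reference/react
# Later: add more pages to the same library without removing the first batch
docs-mcp scrape react https://react.dev/learn --clean false
```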
### URL Scope
The `--scope` option controls crawling boundaries.
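The three documented values, shown as hypothetical invocations (binary name is a placeholder):

```shell
docs-mcp scrape react https://react.dev/reference/react --scope subpages  # stay under /reference/react
docs-mcp scrape react https://react.dev/reference/react --scope hostname  # anywhere on react.dev
docs-mcp scrape react https://react.dev/reference/react --scope domain    # react.dev and its subdomains
```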
### Pattern Matching
Patterns support both glob syntax and regular expressions.

## Error Handling
### Invalid URL

If the URL argument is not a valid HTTP(S) URL or `file://` path, the command reports an error instead of starting a crawl.

## See Also
- refresh - Update existing documentation with changed pages
- list - View indexed libraries
- search - Query indexed documentation
- Embedding Models - Configure embedding providers
