crawl()
The main entry point for starting a crawl. Returns a snapshot ID that can be used to load results.
Parameters
The URL to start crawling from. Must be a valid HTTP or HTTPS URL.
Configuration object controlling crawler behavior. See CrawlOptions below.
Optional event context for monitoring crawl progress. Provides an emit function for receiving events.
Returns
A unique identifier for this crawl snapshot. Use it to load the graph and metrics later.
Example
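A minimal sketch of calling crawl(). The required option names (maxPages, maxDepth) are assumptions not confirmed by this reference, and a stub stands in for the real crawler so the snippet is self-contained:

```typescript
// Hypothetical option names — the real CrawlOptions interface may differ.
interface CrawlOptions {
  maxPages: number; // stop once this many pages have been crawled
  maxDepth: number; // pages beyond this depth are not fetched
}

// Stub standing in for the real crawl() so the example runs on its own;
// like the documented API, it resolves with a snapshot ID.
async function crawl(url: string, options: CrawlOptions): Promise<string> {
  return `snapshot-${options.maxPages}`;
}

async function runExample(): Promise<string> {
  // Crawl up to 100 pages, at most 3 links deep from the start URL.
  const snapshotId = await crawl("https://example.com", {
    maxPages: 100,
    maxDepth: 3,
  });
  return snapshotId; // use this ID later to load the graph and metrics
}
```

The returned ID is the handle for everything downstream: hold on to it rather than re-crawling.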
CrawlOptions
Configuration interface for controlling crawler behavior.
Required Options
Maximum number of pages to crawl. Once this limit is reached, the crawler stops.
Maximum depth to crawl from the start URL. Pages beyond this depth are not fetched.
Concurrency & Rate Limiting
Number of concurrent requests. Maximum is 10 for safety.
Minimum delay in milliseconds between requests to the same domain.
Scope Control
Whitelist of domains to crawl. If set, only these domains are crawled.
Blacklist of domains to exclude from crawling.
Whether to include subdomains of the start URL’s domain.
URL Processing
Remove query parameters from URLs before processing. Useful for deduplication.
Maximum number of redirects to follow per URL.
Detection Features
Enable soft 404 detection for pages that return 200 but contain error content.
Enable crawl trap detection to avoid infinite crawl loops.
Sitemap & Robots
URL to sitemap.xml, or 'true' to auto-discover at /sitemap.xml.
Ignore robots.txt restrictions. Use responsibly.
Advanced Options
Maximum response size in bytes. Larger responses are truncated.
HTTP proxy URL for all requests.
Custom User-Agent header for requests.
Type of snapshot to create. Incremental snapshots compare against previousGraph.
Graph from a previous crawl for incremental comparison.
Enable debug output to console.
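The options above can be gathered into one configuration object. Every field name in this sketch is an assumption except previousGraph, which the reference mentions by name; the interface is illustrative, not the library's actual declaration:

```typescript
// Illustrative CrawlOptions shape — field names are assumptions mapped
// one-to-one onto the descriptions in this reference.
interface CrawlOptions {
  // Required
  maxPages: number;           // stop after this many pages
  maxDepth: number;           // don't fetch beyond this depth
  // Concurrency & rate limiting
  concurrency?: number;       // concurrent requests (maximum is 10)
  delayMs?: number;           // minimum delay between same-domain requests
  // Scope control
  allowedDomains?: string[];  // whitelist: only these domains are crawled
  blockedDomains?: string[];  // blacklist: excluded from crawling
  includeSubdomains?: boolean;
  // URL processing
  stripQueryParams?: boolean; // useful for deduplication
  maxRedirects?: number;      // redirects followed per URL
  // Detection features
  detectSoft404?: boolean;    // flag 200 responses with error content
  detectCrawlTraps?: boolean; // avoid infinite crawl loops
  // Sitemap & robots
  sitemap?: string | true;    // sitemap URL, or true to auto-discover
  ignoreRobots?: boolean;     // use responsibly
  // Advanced
  maxResponseSize?: number;   // bytes; larger responses are truncated
  proxy?: string;             // HTTP proxy URL for all requests
  userAgent?: string;         // custom User-Agent header
  snapshotType?: "full" | "incremental";
  previousGraph?: unknown;    // graph from a previous crawl
  debug?: boolean;            // debug output to console
}

const options: CrawlOptions = {
  maxPages: 500,
  maxDepth: 4,
  concurrency: 5,   // stays under the safety cap of 10
  delayMs: 250,
  includeSubdomains: true,
  stripQueryParams: true,
  detectCrawlTraps: true,
};
```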
Crawler Class
The underlying Crawler class that crawl() uses. You can instantiate it directly for more control.
Example
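A sketch of direct instantiation. The constructor-takes-options shape and the run() method name are assumptions, and the class body is a stub so the example is self-contained:

```typescript
// Hypothetical option names, as elsewhere in these examples.
interface CrawlOptions {
  maxPages: number;
  maxDepth: number;
}

// Stub Crawler — the real class performs the fetching; here run()
// just resolves with a snapshot ID like the documented crawl() does.
class Crawler {
  constructor(private readonly options: CrawlOptions) {}

  async run(startUrl: string): Promise<string> {
    return `snapshot-${new URL(startUrl).hostname}`;
  }
}

async function directExample(): Promise<string> {
  const crawler = new Crawler({ maxPages: 50, maxDepth: 2 });
  // Holding the instance lets you reuse one configuration across
  // several start URLs, which crawl() alone doesn't offer.
  return crawler.run("https://example.com");
}
```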
Event Context
Provide an EngineContext to receive real-time events during the crawl:
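The reference confirms the context provides an emit function; the event object's shape is an assumption, and crawl() is stubbed (options omitted for brevity) so the snippet runs standalone:

```typescript
// Assumed event shape — the real library's event payload may differ.
interface CrawlEvent {
  type: string;
  url?: string;
}

interface EngineContext {
  emit: (event: CrawlEvent) => void;
}

// Stub: the real crawl() would emit events as it fetches pages.
async function crawl(url: string, context: EngineContext): Promise<void> {
  context.emit({ type: "fetch-start", url });
  context.emit({ type: "fetch-success", url });
}

const seen: CrawlEvent[] = [];
const context: EngineContext = {
  emit: (event) => {
    seen.push(event); // a real monitor might log, stream, or aggregate these
  },
};

async function monitorExample(): Promise<number> {
  await crawl("https://example.com", context);
  return seen.length;
}
```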
Event Types
Emitted when a URL starts being fetched.
Emitted when a URL is successfully fetched.
Emitted when a URL fetch fails.
Emitted when the crawl limit is reached.
Emitted when a URL is added to the crawl queue.
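The reference describes five events but not their string identifiers; the names in this dispatch sketch are assumptions chosen to mirror the descriptions above:

```typescript
// Assumed event identifiers, one per documented event.
type EventType =
  | "fetch-start"    // a URL starts being fetched
  | "fetch-success"  // a URL was successfully fetched
  | "fetch-error"    // a URL fetch failed
  | "limit-reached"  // the crawl limit was reached
  | "url-queued";    // a URL was added to the crawl queue

// Map each event to a short status label; the exhaustive switch means
// the compiler flags any event type left unhandled.
function describe(type: EventType): string {
  switch (type) {
    case "fetch-start":   return "fetching";
    case "fetch-success": return "fetched";
    case "fetch-error":   return "failed";
    case "limit-reached": return "limit reached";
    case "url-queued":    return "queued";
  }
}
```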