
crawl()

The main entry point for starting a crawl. Returns a snapshot ID that can be used to load results.
function crawl(
  startUrl: string, 
  options: CrawlOptions, 
  context?: EngineContext
): Promise<number>
Parameters

startUrl (string, required)
The URL to start crawling from. Must be a valid HTTP or HTTPS URL.

options (CrawlOptions, required)
Configuration object controlling crawler behavior. See CrawlOptions below.

context (EngineContext, optional)
Event context for monitoring crawl progress. Provides an emit function for receiving events.

Returns

snapshotId (number)
A unique identifier for this crawl snapshot. Use it to load the graph and metrics later.

Example

import { crawl } from '@crawlith/core';

const snapshotId = await crawl('https://example.com', {
  limit: 100,
  depth: 3,
  concurrency: 5,
  detectSoft404: true
});

console.log('Snapshot ID:', snapshotId);

CrawlOptions

Configuration interface for controlling crawler behavior.
interface CrawlOptions {
  limit: number;
  depth: number;
  concurrency?: number;
  ignoreRobots?: boolean;
  stripQuery?: boolean;
  previousGraph?: Graph;
  sitemap?: string;
  debug?: boolean;
  detectSoft404?: boolean;
  detectTraps?: boolean;
  rate?: number;
  maxBytes?: number;
  allowedDomains?: string[];
  deniedDomains?: string[];
  includeSubdomains?: boolean;
  proxyUrl?: string;
  maxRedirects?: number;
  userAgent?: string;
  snapshotType?: 'full' | 'partial' | 'incremental';
}

Required Options

limit (number, required)
Maximum number of pages to crawl. Once this limit is reached, the crawler stops.
{ limit: 500 }

depth (number, required)
Maximum depth to crawl from the start URL. Pages beyond this depth are not fetched.
{ depth: 4 }

Concurrency & Rate Limiting

concurrency (number, default: 2)
Number of concurrent requests. Maximum is 10 for safety.
{ concurrency: 5 }

rate (number, optional)
Minimum delay in milliseconds between requests to the same domain.
{ rate: 1000 } // 1 second between requests
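To make the semantics concrete, here is a minimal sketch of a per-domain delay calculator that mirrors what `rate` enforces. This is an illustration only, not @crawlith/core's internal scheduler; the function name and shape are assumptions.

```typescript
// Illustration: how long to wait before the next request to a domain,
// given a per-domain `rate` (minimum gap in milliseconds).
function nextDelayMs(
  lastRequestAt: Map<string, number>, // domain -> timestamp of last request
  domain: string,
  now: number,
  rate: number
): number {
  const last = lastRequestAt.get(domain);
  if (last === undefined) return 0; // first request to this domain: no wait
  const elapsed = now - last;
  return Math.max(0, rate - elapsed); // wait out the remainder of the window
}

const seen = new Map([['example.com', 1000]]);
console.log(nextDelayMs(seen, 'example.com', 1400, 1000)); // 600 — 400ms elapsed, 600ms left
console.log(nextDelayMs(seen, 'other.com', 1400, 1000));   // 0 — different domain, no wait
```

Note that the delay is tracked per domain, so a higher `concurrency` can still be fully utilized across different hosts.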

Scope Control

allowedDomains (string[], optional)
Whitelist of domains to crawl. If set, only these domains are crawled.
{ allowedDomains: ['example.com', 'blog.example.com'] }

deniedDomains (string[], optional)
Blacklist of domains to exclude from crawling.
{ deniedDomains: ['ads.example.com'] }

includeSubdomains (boolean, default: false)
Whether to include subdomains of the start URL's domain.
{ includeSubdomains: true }
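One plausible reading of how these three options combine can be sketched as a scope check. This is an assumption for illustration, not @crawlith/core's actual logic; in particular, the precedence of the denylist over the allowlist is a guess.

```typescript
// Illustration: deciding whether a URL is in crawl scope under one
// plausible interpretation of allowedDomains / deniedDomains / includeSubdomains.
function isInScope(
  url: string,
  startDomain: string,
  opts: { allowedDomains?: string[]; deniedDomains?: string[]; includeSubdomains?: boolean }
): boolean {
  const host = new URL(url).hostname;
  if (opts.deniedDomains?.includes(host)) return false;        // denylist wins
  if (opts.allowedDomains) return opts.allowedDomains.includes(host); // allowlist replaces the default scope
  if (host === startDomain) return true;
  return !!opts.includeSubdomains && host.endsWith('.' + startDomain);
}

console.log(isInScope('https://blog.example.com/a', 'example.com', { includeSubdomains: true })); // true
console.log(isInScope('https://blog.example.com/a', 'example.com', {}));                          // false
```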

URL Processing

stripQuery (boolean, default: false)
Remove query parameters from URLs before processing. Useful for deduplication.
{ stripQuery: true }
// https://example.com/page?ref=123 → https://example.com/page

maxRedirects (number, optional)
Maximum number of redirects to follow per URL.
{ maxRedirects: 5 }
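The transformation stripQuery applies can be reproduced with the standard WHATWG URL API. A minimal sketch (not the library's internal code):

```typescript
// Illustration: drop the query string so URLs that differ only in
// parameters (tracking refs, session IDs) dedupe to one page.
function stripQuery(rawUrl: string): string {
  const u = new URL(rawUrl);
  u.search = ''; // clears ?key=value pairs; host and path are untouched
  return u.toString();
}

console.log(stripQuery('https://example.com/page?ref=123')); // https://example.com/page
```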

Detection Features

detectSoft404 (boolean, default: false)
Enable soft 404 detection for pages that return 200 but contain error content.
{ detectSoft404: true }

detectTraps (boolean, default: false)
Enable crawl trap detection to avoid infinite crawl loops.
{ detectTraps: true }

Sitemap & Robots

sitemap (string, optional)
URL to a sitemap.xml, or 'true' to auto-discover at /sitemap.xml.
{ sitemap: 'true' } // Auto-discover
{ sitemap: 'https://example.com/sitemap.xml' } // Explicit URL

ignoreRobots (boolean, default: false)
Ignore robots.txt restrictions. Use responsibly.
{ ignoreRobots: true }

Advanced Options

maxBytes (number, optional)
Maximum response size in bytes. Larger responses are truncated.
{ maxBytes: 5000000 } // 5 MB limit

proxyUrl (string, optional)
HTTP proxy URL for all requests.
{ proxyUrl: 'http://proxy.example.com:8080' }

userAgent (string, optional)
Custom User-Agent header for requests.
{ userAgent: 'MyCrawler/1.0' }

snapshotType ('full' | 'partial' | 'incremental', optional)
Type of snapshot to create. Incremental snapshots compare against previousGraph.
{ snapshotType: 'incremental' }

previousGraph (Graph, optional)
Graph from a previous crawl for incremental comparison.
const prevGraph = loadGraphFromSnapshot(previousSnapshotId);
const newSnapshot = await crawl(url, {
  limit: 1000,
  depth: 4,
  previousGraph: prevGraph,
  snapshotType: 'incremental'
});

debug (boolean, default: false)
Enable debug output to console.
{ debug: true }

Crawler Class

The underlying Crawler class that crawl() uses. You can instantiate it directly for more control.
class Crawler {
  constructor(
    startUrl: string, 
    options: CrawlOptions, 
    context?: EngineContext
  )
  
  async run(): Promise<number>
}

Example

import { Crawler } from '@crawlith/core';

const crawler = new Crawler('https://example.com', {
  limit: 100,
  depth: 3,
  concurrency: 5
});

const snapshotId = await crawler.run();
console.log('Crawl complete:', snapshotId);

Event Context

Provide an EngineContext to receive real-time events during the crawl:
interface EngineContext {
  emit: (event: CrawlEvent) => void;
}

Event Types

crawl:start
Emitted when a URL starts being fetched.
{ type: 'crawl:start', url: string }

crawl:success
Emitted when a URL is successfully fetched.
{
  type: 'crawl:success',
  url: string,
  status: number,
  durationMs: number,
  depth: number
}

crawl:error
Emitted when a URL fetch fails.
{
  type: 'crawl:error',
  url: string,
  error: string,
  depth: number
}

crawl:limit-reached
Emitted when the crawl limit is reached.
{ type: 'crawl:limit-reached', limit: number }

queue:enqueue
Emitted when a URL is added to the crawl queue.
{ type: 'queue:enqueue', url: string, depth: number }
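The shapes above can be expressed as a TypeScript discriminated union, which lets handlers narrow on the type field. The name CrawlEvent appears in the EngineContext signature; this exact type definition is an assumption reconstructed from the shapes listed, not the library's published type.

```typescript
// Assumed reconstruction of CrawlEvent from the documented event shapes.
type CrawlEvent =
  | { type: 'crawl:start'; url: string }
  | { type: 'crawl:success'; url: string; status: number; durationMs: number; depth: number }
  | { type: 'crawl:error'; url: string; error: string; depth: number }
  | { type: 'crawl:limit-reached'; limit: number }
  | { type: 'queue:enqueue'; url: string; depth: number };

// Switching on `type` narrows the event, so each branch sees only its fields.
function describe(event: CrawlEvent): string {
  switch (event.type) {
    case 'crawl:success':
      return `${event.url} -> ${event.status} in ${event.durationMs}ms`;
    case 'crawl:error':
      return `${event.url} failed: ${event.error}`;
    default:
      return event.type; // start, limit-reached, enqueue
  }
}

console.log(describe({ type: 'crawl:success', url: 'https://example.com', status: 200, durationMs: 120, depth: 0 }));
// https://example.com -> 200 in 120ms
```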

Example with Events

const context = {
  emit: (event: any) => {
    if (event.type === 'crawl:success') {
      console.log(`✓ ${event.url} [${event.status}] ${event.durationMs}ms`);
    }
  }
};

await crawl('https://example.com', { limit: 50, depth: 2 }, context);
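Building on the example above, a context can also aggregate events instead of logging them. A small sketch using only the documented event types; the stats shape is an illustration, not part of the API:

```typescript
// Tally crawl outcomes via the emit callback.
const stats = { succeeded: 0, failed: 0, queued: 0 };

const context = {
  emit: (event: { type: string }) => {
    if (event.type === 'crawl:success') stats.succeeded++;
    else if (event.type === 'crawl:error') stats.failed++;
    else if (event.type === 'queue:enqueue') stats.queued++;
  }
};

// In a real run, pass it as the third argument:
// await crawl('https://example.com', { limit: 50, depth: 2 }, context);
// console.log(stats);
```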
