
crawl()

The main entry point for starting a crawl. Returns a snapshot ID that can be used to load results.
function crawl(
  startUrl: string, 
  options: CrawlOptions, 
  context?: EngineContext
): Promise<number>
Parameters

startUrl (string, required)
The URL to start crawling from. Must be a valid HTTP or HTTPS URL.

options (CrawlOptions, required)
Configuration object controlling crawler behavior. See CrawlOptions below.

context (EngineContext, optional)
Event context for monitoring crawl progress. Provides an emit function for receiving events.

Returns

snapshotId (number)
A unique identifier for this crawl snapshot. Use it to load the graph and metrics later.

Example

import { crawl } from '@crawlith/core';

const snapshotId = await crawl('https://example.com', {
  limit: 100,
  depth: 3,
  concurrency: 5,
  detectSoft404: true
});

console.log('Snapshot ID:', snapshotId);

CrawlOptions

Configuration interface for controlling crawler behavior.
interface CrawlOptions {
  limit: number;
  depth: number;
  concurrency?: number;
  ignoreRobots?: boolean;
  stripQuery?: boolean;
  previousGraph?: Graph;
  sitemap?: string;
  debug?: boolean;
  detectSoft404?: boolean;
  detectTraps?: boolean;
  rate?: number;
  maxBytes?: number;
  allowedDomains?: string[];
  deniedDomains?: string[];
  includeSubdomains?: boolean;
  proxyUrl?: string;
  maxRedirects?: number;
  userAgent?: string;
  snapshotType?: 'full' | 'partial' | 'incremental';
}

Required Options

limit (number, required)
Maximum number of pages to crawl. Once this limit is reached, the crawler stops.
{ limit: 500 }

depth (number, required)
Maximum depth to crawl from the start URL. Pages beyond this depth are not fetched.
{ depth: 4 }

Concurrency & Rate Limiting

concurrency (number, default: 2)
Number of concurrent requests. Maximum is 10 for safety.
{ concurrency: 5 }

rate (number, optional)
Minimum delay in milliseconds between requests to the same domain.
{ rate: 1000 } // 1 second between requests
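To make the semantics concrete, here is a minimal sketch of a per-domain delay calculator that mirrors what `rate` enforces. This is an illustration only, not @crawlith/core's internal scheduler; the function name and shape are assumptions.

```typescript
// Illustration: how long to wait before the next request to a domain,
// given a per-domain `rate` (minimum gap in milliseconds).
function nextDelayMs(
  lastRequestAt: Map<string, number>, // domain -> timestamp of last request
  domain: string,
  now: number,
  rate: number
): number {
  const last = lastRequestAt.get(domain);
  if (last === undefined) return 0; // first request to this domain: no wait
  const elapsed = now - last;
  return Math.max(0, rate - elapsed); // wait out the remainder of the window
}

const seen = new Map([['example.com', 1000]]);
console.log(nextDelayMs(seen, 'example.com', 1400, 1000)); // 600 — 400ms elapsed, 600ms left
console.log(nextDelayMs(seen, 'other.com', 1400, 1000));   // 0 — different domain, no wait
```

Note that the delay is tracked per domain, so a higher `concurrency` can still be fully utilized across different hosts.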

Scope Control

allowedDomains (string[], optional)
Whitelist of domains to crawl. If set, only these domains are crawled.
{ allowedDomains: ['example.com', 'blog.example.com'] }

deniedDomains (string[], optional)
Blacklist of domains to exclude from crawling.
{ deniedDomains: ['ads.example.com'] }

includeSubdomains (boolean, default: false)
Whether to include subdomains of the start URL's domain.
{ includeSubdomains: true }
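One plausible reading of how these three options combine can be sketched as a scope check. This is an assumption for illustration, not @crawlith/core's actual logic; in particular, the precedence of the denylist over the allowlist is a guess.

```typescript
// Illustration: deciding whether a URL is in crawl scope under one
// plausible interpretation of allowedDomains / deniedDomains / includeSubdomains.
function isInScope(
  url: string,
  startDomain: string,
  opts: { allowedDomains?: string[]; deniedDomains?: string[]; includeSubdomains?: boolean }
): boolean {
  const host = new URL(url).hostname;
  if (opts.deniedDomains?.includes(host)) return false;        // denylist wins
  if (opts.allowedDomains) return opts.allowedDomains.includes(host); // allowlist replaces the default scope
  if (host === startDomain) return true;
  return !!opts.includeSubdomains && host.endsWith('.' + startDomain);
}

console.log(isInScope('https://blog.example.com/a', 'example.com', { includeSubdomains: true })); // true
console.log(isInScope('https://blog.example.com/a', 'example.com', {}));                          // false
```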

URL Processing

stripQuery (boolean, default: false)
Remove query parameters from URLs before processing. Useful for deduplication.
{ stripQuery: true }
// https://example.com/page?ref=123 → https://example.com/page

maxRedirects (number, optional)
Maximum number of redirects to follow per URL.
{ maxRedirects: 5 }
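The transformation stripQuery applies can be reproduced with the standard WHATWG URL API. A minimal sketch (not the library's internal code):

```typescript
// Illustration: drop the query string so URLs that differ only in
// parameters (tracking refs, session IDs) dedupe to one page.
function stripQuery(rawUrl: string): string {
  const u = new URL(rawUrl);
  u.search = ''; // clears ?key=value pairs; host and path are untouched
  return u.toString();
}

console.log(stripQuery('https://example.com/page?ref=123')); // https://example.com/page
```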

Detection Features

detectSoft404 (boolean, default: false)
Enable soft 404 detection for pages that return 200 but contain error content.
{ detectSoft404: true }

detectTraps (boolean, default: false)
Enable crawl trap detection to avoid infinite crawl loops.
{ detectTraps: true }

Sitemap & Robots

sitemap (string, optional)
URL to a sitemap.xml, or 'true' to auto-discover at /sitemap.xml.
{ sitemap: 'true' } // Auto-discover
{ sitemap: 'https://example.com/sitemap.xml' } // Explicit URL

ignoreRobots (boolean, default: false)
Ignore robots.txt restrictions. Use responsibly.
{ ignoreRobots: true }

Advanced Options

maxBytes (number, optional)
Maximum response size in bytes. Larger responses are truncated.
{ maxBytes: 5000000 } // 5 MB limit

proxyUrl (string, optional)
HTTP proxy URL for all requests.
{ proxyUrl: 'http://proxy.example.com:8080' }

userAgent (string, optional)
Custom User-Agent header for requests.
{ userAgent: 'MyCrawler/1.0' }

snapshotType ('full' | 'partial' | 'incremental', optional)
Type of snapshot to create. Incremental snapshots compare against previousGraph.
{ snapshotType: 'incremental' }

previousGraph (Graph, optional)
Graph from a previous crawl for incremental comparison.
const prevGraph = loadGraphFromSnapshot(previousSnapshotId);
const newSnapshot = await crawl(url, {
  limit: 1000,
  depth: 4,
  previousGraph: prevGraph,
  snapshotType: 'incremental'
});

debug (boolean, default: false)
Enable debug output to console.
{ debug: true }

Crawler Class

The underlying Crawler class that crawl() uses. You can instantiate it directly for more control.
class Crawler {
  constructor(
    startUrl: string, 
    options: CrawlOptions, 
    context?: EngineContext
  )
  
  async run(): Promise<number>
}

Example

import { Crawler } from '@crawlith/core';

const crawler = new Crawler('https://example.com', {
  limit: 100,
  depth: 3,
  concurrency: 5
});

const snapshotId = await crawler.run();
console.log('Crawl complete:', snapshotId);

Event Context

Provide an EngineContext to receive real-time events during the crawl:
interface EngineContext {
  emit: (event: CrawlEvent) => void;
}

Event Types

crawl:start
Emitted when a URL starts being fetched.
{ type: 'crawl:start', url: string }

crawl:success
Emitted when a URL is successfully fetched.
{
  type: 'crawl:success',
  url: string,
  status: number,
  durationMs: number,
  depth: number
}

crawl:error
Emitted when a URL fetch fails.
{
  type: 'crawl:error',
  url: string,
  error: string,
  depth: number
}

crawl:limit-reached
Emitted when the crawl limit is reached.
{ type: 'crawl:limit-reached', limit: number }

queue:enqueue
Emitted when a URL is added to the crawl queue.
{ type: 'queue:enqueue', url: string, depth: number }
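The shapes above can be expressed as a TypeScript discriminated union, which lets handlers narrow on the type field. The name CrawlEvent appears in the EngineContext signature; this exact type definition is an assumption reconstructed from the shapes listed, not the library's published type.

```typescript
// Assumed reconstruction of CrawlEvent from the documented event shapes.
type CrawlEvent =
  | { type: 'crawl:start'; url: string }
  | { type: 'crawl:success'; url: string; status: number; durationMs: number; depth: number }
  | { type: 'crawl:error'; url: string; error: string; depth: number }
  | { type: 'crawl:limit-reached'; limit: number }
  | { type: 'queue:enqueue'; url: string; depth: number };

// Switching on `type` narrows the event, so each branch sees only its fields.
function describe(event: CrawlEvent): string {
  switch (event.type) {
    case 'crawl:success':
      return `${event.url} -> ${event.status} in ${event.durationMs}ms`;
    case 'crawl:error':
      return `${event.url} failed: ${event.error}`;
    default:
      return event.type; // start, limit-reached, enqueue
  }
}

console.log(describe({ type: 'crawl:success', url: 'https://example.com', status: 200, durationMs: 120, depth: 0 }));
// https://example.com -> 200 in 120ms
```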

Example with Events

const context = {
  emit: (event: any) => {
    if (event.type === 'crawl:success') {
      console.log(`✓ ${event.url} [${event.status}] ${event.durationMs}ms`);
    }
  }
};

await crawl('https://example.com', { limit: 50, depth: 2 }, context);
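Building on the example above, a context can also aggregate events instead of logging them. A small sketch using only the documented event types; the stats shape is an illustration, not part of the API:

```typescript
// Tally crawl outcomes via the emit callback.
const stats = { succeeded: 0, failed: 0, queued: 0 };

const context = {
  emit: (event: { type: string }) => {
    if (event.type === 'crawl:success') stats.succeeded++;
    else if (event.type === 'crawl:error') stats.failed++;
    else if (event.type === 'queue:enqueue') stats.queued++;
  }
};

// In a real run, pass it as the third argument:
// await crawl('https://example.com', { limit: 50, depth: 2 }, context);
// console.log(stats);
```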
