
Overview

Crawlith uses a breadth-first search (BFS) algorithm to systematically discover and crawl web pages. The crawler respects robots.txt directives, supports sitemap discovery, and applies per-host token-bucket rate limiting to crawl responsibly.

How It Works

Breadth-First Search Algorithm

The crawler starts from a seed URL and explores pages level by level, so each page is discovered at its shortest click distance from the seed:
// From crawler.ts:190-202
private addToQueue(url: string, depth: number): void {
  if (this.scopeManager!.isUrlEligible(url) !== 'allowed') return;
  if (!this.uniqueQueue.has(url)) {
    this.uniqueQueue.add(url);
    this.queue.push({ url, depth });
    this.context.emit({ type: 'queue:enqueue', url, depth });

    const currentDiscovery = this.discoveryDepths.get(url);
    if (currentDiscovery === undefined || depth < currentDiscovery) {
      this.discoveryDepths.set(url, depth);
    }
  }
}
Why BFS matters: pages at shallower depths are typically more important in a site’s information architecture. BFS ensures you discover critical pages first, even when a page limit stops the crawl early.
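The level-order behavior boils down to FIFO queue discipline plus deduplication. Here is a minimal sketch (hypothetical names, not Crawlith’s actual internals):

```typescript
// Minimal BFS frontier: FIFO order plus a seen-set, mirroring the
// dedupe-then-enqueue shape of addToQueue above. Hypothetical sketch.
type QueueItem = { url: string; depth: number };

class BfsQueue {
  private queue: QueueItem[] = [];
  private seen = new Set<string>();

  enqueue(url: string, depth: number): void {
    if (this.seen.has(url)) return; // already discovered, skip
    this.seen.add(url);
    this.queue.push({ url, depth });
  }

  dequeue(): QueueItem | undefined {
    return this.queue.shift(); // FIFO => level-by-level traversal
  }
}

// Usage: shallower URLs come off the queue first, duplicates are dropped.
const q = new BfsQueue();
q.enqueue('https://example.com/', 0);
q.enqueue('https://example.com/about', 1);
q.enqueue('https://example.com/', 0); // duplicate, ignored
console.log(q.dequeue()?.depth); // 0
console.log(q.dequeue()?.depth); // 1
```

Because the frontier is FIFO, every depth-0 page is processed before any depth-1 page, which is what makes a page limit cut off the least important pages first.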

Robots.txt Compliance

Crawlith automatically fetches and parses robots.txt before crawling:
// From crawler.ts:164-175
async fetchRobots(): Promise<void> {
  try {
    const robotsUrl = new URL('/robots.txt', this.rootOrigin).toString();
    const res = await this.fetcher!.fetch(robotsUrl, { maxBytes: 500000 });
    if (res && typeof res.status === 'number' && res.status >= 200 && res.status < 300) {
      this.robots = robotsParser(robotsUrl, res.body);
    }
  } catch {
    console.warn('Failed to fetch robots.txt, proceeding...');
  }
}
The crawler checks both the exact path and the trailing-slash variant to handle normalization edge cases:
// From crawler.ts:568-571
const isBlocked = this.robots && (
  !this.robots.isAllowed(item.url, 'crawlith') ||
  (!item.url.endsWith('/') && !this.robots.isAllowed(item.url + '/', 'crawlith'))
);
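The effect of that dual check can be sketched in isolation, with a stand-in function in place of the parser’s isAllowed (hypothetical names; the real check calls robots-parser as shown above):

```typescript
// Sketch of the dual-path robots check. `isAllowed` stands in for
// the robots parser; this is illustrative, not Crawlith's API.
function isBlockedByRobots(
  isAllowed: (url: string) => boolean,
  url: string
): boolean {
  // Check the URL as-is...
  if (!isAllowed(url)) return true;
  // ...and, if it lacks a trailing slash, the slash-terminated variant,
  // so /admin and /admin/ are treated alike.
  if (!url.endsWith('/') && !isAllowed(url + '/')) return true;
  return false;
}

// A rule that disallows /admin/ also blocks the slash-less /admin:
const isAllowed = (u: string) => !u.endsWith('/admin/');
console.log(isBlockedByRobots(isAllowed, 'https://example.com/admin')); // true
```

Without the second check, a Disallow rule written with a trailing slash could be bypassed by linking to the same path without one.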

Sitemap Support

Crawlith can seed the crawl queue from XML sitemaps, including sitemap indexes:
// From sitemap.ts:22-28
private async processSitemap(url: string, visited: Set<string>, urls: Set<string>) {
  if (visited.has(url)) return;
  visited.add(url);

  // Hard limit on number of sitemaps to fetch to prevent abuse
  if (visited.size > 50) return;
  // ...
}
Features:
  • Recursive sitemap index processing
  • Loop detection and a hard cap (max 50 sitemaps)
  • Automatic URL normalization from sitemap entries
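The recursive traversal with loop detection can be sketched as follows. fetchChildren is a hypothetical stand-in for fetching and parsing one sitemap document; the real implementation fetches XML over HTTP:

```typescript
// Sketch of sitemap-index traversal: a visited-set guards against
// loops, and a hard cap bounds the number of sitemaps fetched.
// `fetchChildren` returns the child sitemap URLs and page URLs
// found in one sitemap document (hypothetical interface).
type SitemapFetch = (url: string) => { sitemaps: string[]; pages: string[] };

function collectSitemapUrls(
  root: string,
  fetchChildren: SitemapFetch,
  maxSitemaps = 50
): Set<string> {
  const visited = new Set<string>();
  const pages = new Set<string>();
  const stack = [root];
  while (stack.length > 0 && visited.size < maxSitemaps) {
    const url = stack.pop()!;
    if (visited.has(url)) continue; // loop detection
    visited.add(url);
    const { sitemaps, pages: found } = fetchChildren(url);
    found.forEach((p) => pages.add(p));
    stack.push(...sitemaps); // recurse into nested sitemap indexes
  }
  return pages;
}
```

A sitemap index that (directly or indirectly) references itself is visited only once, and the cap stops runaway indexes regardless of shape.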

Rate Limiting

Crawlith implements token-bucket rate limiting per host with crawl-delay support:
// From rateLimiter.ts:9-16
async waitForToken(host: string, crawlDelay: number = 0): Promise<void> {
  const effectiveRate = crawlDelay > 0 ? Math.min(this.rate, 1 / crawlDelay) : this.rate;
  const interval = 1000 / effectiveRate;

  if (!this.buckets.has(host)) {
    this.buckets.set(host, { tokens: this.rate - 1, lastRefill: Date.now() });
    return;
  }
  // ...
}
Key features:
  • Per-host token buckets (prevents overwhelming specific servers)
  • Respects Crawl-delay directive from robots.txt
  • Smooth request distribution using token refill algorithm
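The refill algorithm can be sketched as a simplified, synchronous token bucket. Time is passed in explicitly for clarity; the real limiter is async, keyed per host, and also folds in Crawl-delay as shown above:

```typescript
// Simplified token bucket (hypothetical sketch, not Crawlith's API).
// Capacity and refill rate are both `rate` tokens per second.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(private rate: number, now: number) {
    this.tokens = rate; // start full
    this.lastRefill = now;
  }

  // Refill proportionally to elapsed time, capped at capacity,
  // then try to spend one token for a request.
  tryAcquire(now: number): boolean {
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.rate, this.tokens + elapsedSec * this.rate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Continuous refill is what gives the “smooth request distribution”: instead of releasing a burst every second, capacity trickles back between requests.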

CLI Usage

Basic Crawl

crawlith crawl https://example.com

Crawl with Sitemap

# Auto-discover sitemap.xml
crawlith crawl https://example.com --sitemap

# Use specific sitemap URL
crawlith crawl https://example.com --sitemap https://example.com/sitemap_index.xml

Configure Rate Limiting

# Limit to 5 requests per second per host
crawlith crawl https://example.com --rate 5

# Use higher concurrency (respects rate limit)
crawlith crawl https://example.com --rate 10 --concurrency 5

Ignore Robots.txt

crawlith crawl https://example.com --ignore-robots
Only use --ignore-robots on sites you own or have permission to crawl. Ignoring robots.txt may violate a site’s terms of service.

Control Crawl Scope

# Limit pages and depth
crawlith crawl https://example.com --limit 1000 --depth 4

# Include subdomains
crawlith crawl https://example.com --include-subdomains

# Allow specific domains
crawlith crawl https://example.com --allow "example.com,blog.example.com"

Advanced Options

Incremental Crawls

Use ETag and Last-Modified headers to skip unchanged pages
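One way such a cache can drive conditional requests is sketched below. The PageRecord shape and function name are hypothetical; Crawlith’s actual cache format may differ:

```typescript
// Sketch: building conditional-request headers from a prior crawl
// record. A 304 Not Modified response means the page is unchanged
// and can be skipped. Hypothetical shape, for illustration only.
interface PageRecord {
  etag?: string;
  lastModified?: string;
}

function conditionalHeaders(
  prev: PageRecord | undefined
): Record<string, string> {
  const headers: Record<string, string> = {};
  if (prev?.etag) headers['If-None-Match'] = prev.etag;
  if (prev?.lastModified) headers['If-Modified-Since'] = prev.lastModified;
  return headers;
}
```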

Trap Detection

Detect and avoid crawl traps with pattern analysis
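One common heuristic of this kind is flagging repeated path segments (e.g. /a/b/a/b/a/b), which often indicate a link loop. The sketch below is hypothetical; the actual detector’s signals are not documented here:

```typescript
// Trap heuristic sketch: a path segment repeated `maxRepeats` times
// suggests a self-referential link structure. Hypothetical example,
// not Crawlith's actual trap detector.
function looksLikeTrap(url: string, maxRepeats = 3): boolean {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  const counts = new Map<string, number>();
  for (const seg of segments) {
    const n = (counts.get(seg) ?? 0) + 1;
    counts.set(seg, n);
    if (n >= maxRepeats) return true;
  }
  return false;
}
```

Real detectors typically combine several signals (query-string growth, calendar patterns, URL length), since any single heuristic has false positives.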

Configuration Reference

| Option | Default | Description |
| --- | --- | --- |
| --limit | 500 | Maximum pages to crawl |
| --depth | 5 | Maximum click depth from start URL |
| --rate | 2 | Requests per second per host |
| --concurrency | 2 | Maximum concurrent requests |
| --max-redirects | 2 | Maximum redirect hops to follow |
| --max-bytes | 2000000 | Maximum response size (bytes) |
| --sitemap | - | Sitemap URL or true for auto-discovery |
| --ignore-robots | false | Skip robots.txt checking |
| --include-subdomains | false | Crawl subdomains of start URL |

See Also

Graph Analysis

Analyze crawl results with PageRank and orphan detection

Export Data

Export crawl results to JSON, CSV, Markdown, or HTML
