
Overview

Crawlith can be configured through command-line options to control crawling behavior, output formats, security boundaries, and export settings. All configuration happens at runtime through the CLI.

Command-Line Options

Core Crawling Options

crawlith crawl https://example.com [options]

  • -l, --limit <number>: Maximum number of pages to crawl (default: 500)
  • -d, --depth <number>: Maximum click depth from the start URL (default: 5)
  • -c, --concurrency <number>: Maximum concurrent HTTP requests (default: 2)
  • --no-query: Strip query parameters during URL normalization
  • --ignore-robots: Ignore robots.txt rules
  • --incremental: Use ETag/Last-Modified for efficient re-crawling
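
Query stripping during normalization matters because `?utm=...` variants of the same page would otherwise be crawled and counted separately. A minimal sketch of this kind of URL normalization (illustrative only, not Crawlith's actual code):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str, keep_query: bool = True) -> str:
    """Normalize a URL for deduplication: lowercase the host,
    drop the fragment, and optionally strip the query string."""
    parts = urlsplit(url)
    query = parts.query if keep_query else ""
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", query, ""))

print(normalize("https://Example.com/page?utm=x#top", keep_query=False))
# https://example.com/page
```

With `keep_query=False` (the `--no-query` behavior), tracking-parameter variants collapse to a single canonical URL.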

Sitemap Integration

# Auto-discover sitemap at /sitemap.xml
crawlith crawl https://example.com --sitemap

# Use custom sitemap URL
crawlith crawl https://example.com --sitemap https://example.com/sitemap_index.xml
Sitemap URLs are crawled with depth 0 and added to the initial queue.
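
Seeding from a sitemap amounts to parsing the `<loc>` entries of a urlset and enqueueing each URL at depth 0. A simplified sketch of that step (illustrative only, not Crawlith's implementation):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract <loc> entries from a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]

sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# Seed the frontier at depth 0, as described above.
queue = [(url, 0) for url in sitemap_urls(sample)]
print(queue)
```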

Detection & Analysis

  • --detect-soft404: Detect soft 404 pages using content heuristics
  • --detect-traps: Identify crawl traps (infinite parameter spaces)
  • --orphans: Enable orphan page detection
  • --orphan-severity: Calculate severity scores for orphan pages
  • --include-soft-orphans: Include near-orphans in detection
  • --min-inbound <number>: Minimum inbound links to avoid near-orphan status (default: 2)
  • --compute-hits: Calculate hub and authority scores using the HITS algorithm
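
The HITS algorithm behind --compute-hits alternates between two mutually recursive scores: a page's authority is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authorities it links to, with both normalized each round. A compact sketch (illustrative, not Crawlith's code):

```python
def hits(graph: dict[str, list[str]], iters: int = 50):
    """Compute hub and authority scores by iterative updates,
    L2-normalizing each score vector every round."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # authority(p) = sum of hub scores of pages linking to p
        auth = {n: sum(hub[u] for u, vs in graph.items() if n in vs) for n in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # hub(p) = sum of authority scores of pages p links to
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

links = {"/": ["/a", "/b"], "/a": ["/b"], "/b": []}
hub, auth = hits(links)
# "/" scores highest as a hub; "/b" scores highest as an authority.
```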

Content Clustering

crawlith crawl https://example.com \
  --cluster-threshold 10 \
  --min-cluster-size 3 \
  --no-collapse

  • --cluster-threshold <number>: Maximum Hamming distance between SimHash fingerprints for two pages to cluster together (default: 10)
  • --min-cluster-size <number>: Minimum pages to form a cluster (default: 3)
  • --no-collapse: Prevent collapsing duplicate clusters before PageRank calculation
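
SimHash clustering compares fixed-width content fingerprints: two pages are near-duplicates when their fingerprints differ in at most --cluster-threshold bits. A sketch of that distance test (illustrative only):

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two SimHash fingerprints."""
    return bin(a ^ b).count("1")

def same_cluster(fp_a: int, fp_b: int, threshold: int = 10) -> bool:
    """Near-duplicate test: fingerprints within `threshold` bits."""
    return hamming(fp_a, fp_b) <= threshold

print(hamming(0b1011, 0b0010))  # 2
```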

Output & Logging

# JSON output for programmatic parsing
crawlith crawl https://example.com --format json

# Debug mode with detailed logs
crawlith crawl https://example.com --log-level debug

  • --format <type>: Output format, pretty or json (default: pretty)
  • --log-level <level>: Logging verbosity: normal, verbose, or debug (default: normal)
  • -o, --output <path>: Export directory (default: ./crawlith-reports)
  • --score-breakdown: Display health score component weights
  • --fail-on-critical: Exit with code 1 if critical issues are found

Export Formats

# Export multiple formats during crawl
crawlith crawl https://example.com --export json,html,csv,markdown,visualize

# Or export after crawl completes
crawlith export https://example.com --export json,html

Available formats:
  • json - Full graph and metrics data
  • html - Interactive HTML report
  • csv - Tabular data export
  • markdown - Human-readable documentation
  • visualize - D3.js force-directed graph

Configuration Examples

Basic Site Audit

crawlith crawl https://example.com \
  --limit 1000 \
  --depth 10 \
  --orphans \
  --detect-soft404 \
  --export html,markdown

Fast Shallow Crawl

crawlith crawl https://example.com \
  --limit 100 \
  --depth 2 \
  --concurrency 5 \
  --rate 5 \
  --format json

Deep Analysis with All Features

crawlith crawl https://example.com \
  --limit 5000 \
  --depth 15 \
  --orphans \
  --orphan-severity \
  --detect-soft404 \
  --detect-traps \
  --compute-hits \
  --export json,html,visualize \
  --fail-on-critical

Incremental Re-Crawl

# First crawl
crawlith crawl https://example.com --limit 1000

# Later re-crawl (uses ETag/Last-Modified caching)
crawlith crawl https://example.com --limit 1000 --incremental
Incremental mode sends If-None-Match and If-Modified-Since headers, so the server can answer HTTP 304 Not Modified for unchanged pages instead of resending their content.
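
The revalidation headers are built from the ETag and Last-Modified values cached on the previous crawl. Conceptually (a hypothetical helper, not Crawlith's API):

```python
def conditional_headers(cached: dict) -> dict:
    """Build revalidation headers from a prior response's ETag /
    Last-Modified values; a 304 reply means the cached copy is fresh."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

print(conditional_headers({"etag": '"abc123"'}))
# {'If-None-Match': '"abc123"'}
```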

Security-Focused Configuration

crawlith crawl https://example.com \
  --allow example.com,cdn.example.com \
  --deny api.example.com \
  --max-bytes 1000000 \
  --max-redirects 2 \
  --rate 1
See Security for detailed information on security features.

Storage Location

All crawl data is stored in a persistent SQLite database:
~/.crawlith/crawlith.db
Lock files are stored in:
~/.crawlith/locks/
See Troubleshooting for lock file issues.

User-Agent Configuration

# Custom User-Agent string
crawlith crawl https://example.com --ua "MyBot/1.0"
Default User-Agent: crawlith/1.0

Comparison Mode

Compare two previously exported graph JSON files:
crawlith crawl --compare old.json new.json
Outputs:
  • Added URLs
  • Removed URLs
  • Status changes
  • Metric deltas (PageRank, health scores, etc.)
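
Conceptually, the comparison is a set diff over the two snapshots' URL keys plus a field-by-field check on URLs present in both. A simplified sketch (the `status` field name is assumed for illustration; the exported JSON schema may differ):

```python
def compare_graphs(old: dict[str, dict], new: dict[str, dict]) -> dict:
    """Diff two {url: {"status": ...}} snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    status_changes = {
        url: (old[url]["status"], new[url]["status"])
        for url in set(old) & set(new)
        if old[url]["status"] != new[url]["status"]
    }
    return {"added": added, "removed": removed, "status_changes": status_changes}

old = {"/a": {"status": 200}, "/b": {"status": 200}}
new = {"/a": {"status": 404}, "/c": {"status": 200}}
diff = compare_graphs(old, new)
print(diff)  # /c added, /b removed, /a changed 200 -> 404
```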

Best Practices

Start with --limit 500 for initial exploration. Increase to 1000-5000 for comprehensive audits. Very large sites (10,000+ pages) should be crawled in segments, or with lower depth limits so the crawl budget goes to the most important pages first.
The default of --concurrency 2 is safe for most sites. Increase to 5-10 for fast servers you control, and always respect rate limits with --rate (see Rate Limiting).
Enable features based on your needs:
  • --orphans: Always recommended for SEO audits
  • --detect-soft404: If you suspect thin content pages
  • --detect-traps: For dynamic sites with URL parameters
  • --compute-hits: For advanced link analysis (adds processing time)
Use --allow to create a whitelist:
crawlith crawl https://example.com --allow example.com,cdn.example.com
This ensures only these domains are fetched, even if linked from crawled pages.
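
Conceptually, the allow/deny check is a host filter applied before every fetch, with deny taking precedence. A simplified sketch (exact-host matching assumed; Crawlith's actual matching rules may differ):

```python
from urllib.parse import urlsplit

def host_allowed(url: str, allow: set[str], deny: set[str]) -> bool:
    """Deny wins over allow; an empty allow set permits any host
    not explicitly denied."""
    host = urlsplit(url).netloc.lower()
    if host in deny:
        return False
    return not allow or host in allow

allow = {"example.com", "cdn.example.com"}
print(host_allowed("https://api.example.com/v1", allow, set()))  # False
```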
