
Usage

crawlith crawl [url] [options]
The crawl command is Crawlith's core workflow. It performs a comprehensive crawl of a website, building an internal link graph, calculating metrics, and analyzing SEO structure. The results are stored in the local database and can be visualized with the ui command.

Arguments

url
string
required
The URL to crawl. Must be a valid HTTP/HTTPS URL.

Core Options

--limit
number
default:"500"
Maximum number of pages to crawl. Use this to control crawl size.
crawlith crawl https://example.com --limit 1000
--depth
number
default:"5"
Maximum click depth from the starting URL. Controls how deep the crawler traverses your site hierarchy.
crawlith crawl https://example.com --depth 3
--concurrency
number
default:"2"
Maximum number of concurrent requests. Increase for faster crawls on robust infrastructure.
crawlith crawl https://example.com --concurrency 5
--rate
number
default:"2"
Requests per second per host. Controls crawl politeness.
crawlith crawl https://example.com --rate 5
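
Concurrency and rate work together: --concurrency caps the number of in-flight requests, while --rate caps requests per second per host. A combined invocation (the values here are illustrative, not recommendations):

```shell
crawlith crawl https://example.com \
  --concurrency 4 \
  --rate 2
```

Raising --concurrency without also raising --rate keeps the crawl polite per host while still parallelizing across hosts.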

Output & Export Options

--output
string
default:"./crawlith-reports"
Output directory for exports. Used in conjunction with --export.
crawlith crawl https://example.com --output ./reports
--export
string
Export formats (comma-separated). Available formats: json, markdown, csv, html, visualize.
crawlith crawl https://example.com --export json,html,markdown
--format
string
default:"pretty"
Output format for terminal display. Options: pretty, json.
crawlith crawl https://example.com --format json
--log-level
string
default:"normal"
Log verbosity level. Options: normal, verbose, debug.
crawlith crawl https://example.com --log-level verbose

Crawl Behavior

--no-query
boolean
Strip query parameters from URLs during crawl. Useful for avoiding duplicate content from URL parameters.
crawlith crawl https://example.com --no-query
--ignore-robots
boolean
Ignore robots.txt rules. Use responsibly and only on sites you own or have permission to crawl.
crawlith crawl https://example.com --ignore-robots
--incremental
boolean
Perform an incremental crawl using the previous snapshot. Only crawls changed/new pages.
crawlith crawl https://example.com --incremental
--sitemap
string
Sitemap URL to use for crawl discovery. If provided without a value, defaults to /sitemap.xml.
# Use default sitemap.xml
crawlith crawl https://example.com --sitemap

# Use custom sitemap URL
crawlith crawl https://example.com --sitemap https://example.com/sitemap_index.xml

Network & Request Options

--max-bytes
number
default:"2000000"
Maximum response size in bytes (default: 2MB). Prevents crawling of extremely large files.
crawlith crawl https://example.com --max-bytes 5000000
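
--max-bytes takes a raw byte count, and the 2000000 default suggests decimal megabytes (1 MB = 1,000,000 bytes). Shell arithmetic can compute the value inline:

```shell
# 5 MB (decimal) expressed in bytes for --max-bytes
echo $((5 * 1000 * 1000))
# prints 5000000
```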
--max-redirects
number
default:"2"
Maximum number of redirect hops to follow.
crawlith crawl https://example.com --max-redirects 5
--proxy
string
Proxy URL for all requests (format: http://user:pass@host:port).
crawlith crawl https://example.com --proxy http://proxy.example.com:8080
--ua
string
Custom User-Agent string for requests.
crawlith crawl https://example.com --ua "MyBot/1.0"

Domain Filtering

--allow
string
Comma-separated list of domains to allow during the crawl.
crawlith crawl https://example.com --allow example.com,cdn.example.com
--deny
string
Comma-separated list of domains to exclude from the crawl.
crawlith crawl https://example.com --deny ads.example.com,tracking.example.com
--include-subdomains
boolean
Include subdomains in the crawl scope.
crawlith crawl https://example.com --include-subdomains
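
These filters can be combined, for example to scope a crawl to a site and its asset subdomain while skipping a tracking host (hostnames here are illustrative):

```shell
crawlith crawl https://example.com \
  --include-subdomains \
  --allow example.com,cdn.example.com \
  --deny tracking.example.com
```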

Analysis Features

--orphans
boolean
Enable orphan page detection. Identifies pages with few or no inbound internal links.
crawlith crawl https://example.com --orphans
--orphan-severity
boolean
Enable orphan severity scoring. Requires --orphans to be enabled.
crawlith crawl https://example.com --orphans --orphan-severity
--include-soft-orphans
boolean
Include soft orphan detection (pages with few inbound links).
crawlith crawl https://example.com --orphans --include-soft-orphans
--min-inbound
number
default:"2"
Near-orphan threshold override. Pages with fewer than this many inbound links are flagged.
crawlith crawl https://example.com --orphans --min-inbound 3
--detect-soft404
boolean
Detect soft 404 pages (pages that return 200 but contain 404-like content).
crawlith crawl https://example.com --detect-soft404
--detect-traps
boolean
Detect and cluster crawl traps (infinite loops, pagination issues, etc.).
crawlith crawl https://example.com --detect-traps
--cluster-threshold
number
default:"10"
Hamming distance threshold for content cluster detection.
crawlith crawl https://example.com --cluster-threshold 15
--min-cluster-size
number
default:"3"
Minimum number of pages required to form a cluster.
crawlith crawl https://example.com --min-cluster-size 5
--compute-hits
boolean
Compute Hub and Authority scores using the HITS algorithm.
crawlith crawl https://example.com --compute-hits
--no-collapse
boolean
Do not collapse duplicate clusters before calculating PageRank.
crawlith crawl https://example.com --no-collapse

Advanced Options

--score-breakdown
boolean
Print health score component weights and breakdown after crawl.
crawlith crawl https://example.com --score-breakdown
--fail-on-critical
boolean
Exit with code 1 if critical issues are detected. Useful for CI/CD pipelines.
crawlith crawl https://example.com --fail-on-critical
--force
boolean
Force run and override existing process lock. Use with caution.
crawlith crawl https://example.com --force
--compare
string[]
Internal: Compare two graph JSON files. Requires exactly two file paths.
crawlith crawl --compare old.json new.json

Examples

Basic Crawl

crawlith crawl https://example.com

Large Site Crawl with Exports

crawlith crawl https://example.com \
  --limit 5000 \
  --depth 10 \
  --concurrency 5 \
  --export json,html,markdown

Comprehensive SEO Audit

crawlith crawl https://example.com \
  --orphans \
  --orphan-severity \
  --detect-soft404 \
  --detect-traps \
  --compute-hits \
  --score-breakdown

Incremental Crawl for Large Sites

crawlith crawl https://example.com \
  --incremental \
  --sitemap \
  --limit 10000

CI/CD Integration

crawlith crawl https://staging.example.com \
  --fail-on-critical \
  --orphans \
  --detect-soft404 \
  --format json > report.json
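Because --fail-on-critical makes the crawl exit non-zero when critical issues are found, a wrapper script can gate a deploy on the exit status. A minimal sketch of the pattern, with `false` standing in for a failing `crawlith crawl ... --fail-on-critical` run:

```shell
# Sketch: gate a CI step on the crawl's exit status.
# `false` stands in for `crawlith crawl ... --fail-on-critical`,
# which exits with code 1 when critical issues are detected.
false
status=$?
if [ "$status" -ne 0 ]; then
  echo "critical issues detected (exit $status); blocking deploy"
else
  echo "crawl clean; proceeding"
fi
```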
After a successful crawl, use crawlith ui https://example.com to explore the results in an interactive dashboard.
Use --ignore-robots responsibly and only on sites you own or have explicit permission to crawl. Respecting robots.txt is an important web crawling convention.
For large sites, start with a smaller --limit to test your configuration before running a full crawl. You can always use --incremental later to complete the crawl.
