Usage
The crawl command is Crawlith's core functionality. It performs a comprehensive crawl of a website, building an internal link graph, calculating metrics, and analyzing SEO structure. Results are stored in the local database and can be visualized with the ui command.
Arguments
The URL to crawl. Must be a valid HTTP/HTTPS URL.
Core Options
Maximum number of pages to crawl. Use this to control crawl size.
Maximum click depth from the starting URL. Controls how deep the crawler traverses your site hierarchy.
Maximum number of concurrent requests. Increase for faster crawls on robust infrastructure.
Requests per second per host. Controls crawl politeness.
Output & Export Options
Output directory for exports. Used in conjunction with --export.
Export formats (comma-separated). Available formats: json, markdown, csv, html, visualize.
Output format for terminal display. Options: pretty, json.
Log verbosity level. Options: normal, verbose, debug.
Crawl Behavior
Strip query parameters from URLs during crawl. Useful for avoiding duplicate content from URL parameters.
Ignore robots.txt rules. Use responsibly and only on sites you own or have permission to crawl.
Perform an incremental crawl using the previous snapshot. Only crawls changed/new pages.
Sitemap URL to use for crawl discovery. If provided without a value, defaults to /sitemap.xml.
Network & Request Options
Maximum response size in bytes (default: 2MB). Prevents crawling of extremely large files.
Maximum number of redirect hops to follow.
Proxy URL for all requests (format: http://user:pass@host:port).
Custom User-Agent string for requests.
Domain Filtering
Whitelist of domains (comma-separated) to allow during crawl.
Blacklist of domains (comma-separated) to deny during crawl.
Include subdomains in the crawl scope.
Analysis Features
Enable orphan page detection. Identifies pages with few or no inbound internal links.
Enable orphan severity scoring. Requires --orphans to be enabled.
Include soft orphan detection (pages with few inbound links).
Near-orphan threshold override. Pages with fewer than this many inbound links are flagged.
Detect soft 404 pages (pages that return 200 but contain 404-like content).
Detect and cluster crawl traps (infinite loops, pagination issues, etc.).
Hamming distance threshold for content cluster detection.
Minimum number of pages required to form a cluster.
Compute Hub and Authority scores using the HITS algorithm.
Do not collapse duplicate clusters before calculating PageRank.
Advanced Options
Print health score component weights and breakdown after crawl.
Exit with code 1 if critical issues are detected. Useful for CI/CD pipelines.
Force run and override existing process lock. Use with caution.
Internal: Compare two graph JSON files. Requires exactly two file paths.
Examples
Basic Crawl
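A minimal crawl needs only the start URL and uses default limits for pages, depth, and concurrency:

```shell
# Crawl a site with default settings
crawlith crawl https://example.com
```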
Large Site Crawl with Exports
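A sketch of a large-site crawl with exports. Only --export appears by name in the options above; --max-pages, --concurrency, and --output are illustrative names for the page-limit, concurrency, and output-directory options described earlier:

```shell
# Cap the crawl, raise concurrency, and export results to ./reports
# (--max-pages, --concurrency, and --output are illustrative names)
crawlith crawl https://example.com \
  --max-pages 10000 \
  --concurrency 20 \
  --output ./reports \
  --export json,csv
```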
Comprehensive SEO Audit
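A sketch combining the analysis features above. --orphans appears by name in the option descriptions; the other flag names here are illustrative stand-ins for the soft-404, crawl-trap, and HITS options:

```shell
# Enable the full analysis suite and export an HTML report
# (--soft-404, --traps, and --hits are illustrative names)
crawlith crawl https://example.com \
  --orphans \
  --soft-404 \
  --traps \
  --hits \
  --export html
```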
Incremental Crawl for Large Sites
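For repeat crawls of large sites, the incremental mode described above re-crawls only changed or new pages against the previous snapshot (--incremental is an illustrative name for that option):

```shell
# Re-crawl only pages that changed since the last snapshot
# (--incremental is an illustrative flag name)
crawlith crawl https://example.com --incremental
```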
CI/CD Integration
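Because the fail-fast option described above exits with code 1 when critical issues are detected, a single command is enough to gate a pipeline; most CI systems fail the job on any nonzero exit status (--fail-on-critical is an illustrative name for that option):

```shell
# Nonzero exit on critical issues fails the CI job automatically
# (--fail-on-critical is an illustrative flag name)
crawlith crawl https://staging.example.com --fail-on-critical
```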
After a successful crawl, use crawlith ui https://example.com to explore the results in an interactive dashboard.