Overview
Crawlith can be configured through command-line options to control crawling behavior, output formats, security boundaries, and export settings. All configuration happens at runtime through the CLI.

Command-Line Options
Core Crawling Options
| Option | Type | Default | Description |
|---|---|---|---|
| -l, --limit <number> | number | 500 | Maximum number of pages to crawl |
| -d, --depth <number> | number | 5 | Maximum click depth from the start URL |
| -c, --concurrency <number> | number | 2 | Maximum concurrent HTTP requests |
| --no-query | boolean | false | Strip query parameters during URL normalization |
| --ignore-robots | boolean | false | Ignore robots.txt rules |
| --incremental | boolean | false | Use ETag/Last-Modified for efficient re-crawling |
Sitemap Integration
Detection & Analysis
| Option | Description |
|---|---|
| --detect-soft404 | Detect soft-404 pages using content heuristics |
| --detect-traps | Identify crawl traps (infinite parameter spaces) |
| --orphans | Enable orphan page detection |
| --orphan-severity | Calculate severity scores for orphan pages |
| --include-soft-orphans | Include near-orphans in detection |
| --min-inbound <number> | Minimum inbound links to avoid near-orphan status (default: 2) |
| --compute-hits | Calculate hub and authority scores using the HITS algorithm |
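The hub and authority scores behind --compute-hits come from HITS power iteration: a page's authority is fed by the hub scores of pages linking to it, and its hub score by the authorities it links out to. A minimal sketch of that iteration (the graph, function, and variable names here are illustrative, not Crawlith's internals):

```python
def hits(pages, links, iterations=50):
    """Compute hub and authority scores for a link graph.

    pages: list of page IDs; links: list of (src, dst) edges.
    """
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of pages linking in.
        auth = {p: sum(hub[s] for s, d in links if d == p) for p in pages}
        # Hub: sum of authority scores of pages linked out to.
        hub = {p: sum(auth[d] for s, d in links if s == p) for p in pages}
        # Normalize so scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

pages = ["a", "b", "c"]
links = [("a", "b"), ("a", "c"), ("b", "c")]
hub, auth = hits(pages, links)
# "a" links to everything, so it scores highest as a hub;
# "c" is linked from both other pages, so it scores highest as an authority.
```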
Content Clustering
- --cluster-threshold <number>: Hamming distance for SimHash clustering (default: 10)
- --min-cluster-size <number>: Minimum pages to form a cluster (default: 3)
- --no-collapse: Prevent collapsing duplicate clusters before PageRank calculation
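The --cluster-threshold value is the maximum number of differing bits (Hamming distance) between two pages' SimHash fingerprints for them to be grouped together. A rough sketch of how such a fingerprint and comparison work (simhash64 and hamming are hypothetical helpers, not Crawlith's implementation):

```python
import hashlib

def simhash64(tokens):
    # Classic SimHash: weighted bit-voting over 64-bit token hashes.
    vector = [0] * 64
    for token in tokens:
        h = int(hashlib.md5(token.encode()).hexdigest()[:16], 16)
        for bit in range(64):
            vector[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if vector[bit] > 0)

def hamming(a, b):
    # Number of differing bits between two 64-bit fingerprints.
    return bin(a ^ b).count("1")

page_a = simhash64("quick brown fox jumps over the lazy dog".split())
page_b = simhash64("quick brown fox jumps over the lazy cat".split())
# Near-duplicate pages tend to yield small Hamming distances, so they
# fall under the clustering threshold and land in the same cluster.
```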
Output & Logging
| Option | Values | Default | Description |
|---|---|---|---|
| --format <type> | pretty, json | pretty | Output format |
| --log-level <level> | normal, verbose, debug | normal | Logging verbosity |
| -o, --output <path> | path | ./crawlith-reports | Export directory |
| --score-breakdown | boolean | false | Display health score component weights |
| --fail-on-critical | boolean | false | Exit with code 1 if critical issues are found |
Export Formats
- json - Full graph and metrics data
- html - Interactive HTML report
- csv - Tabular data export
- markdown - Human-readable documentation
- visualize - D3.js force-directed graph
Configuration Examples
Basic Site Audit
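No command survived here; a plausible invocation, assuming the CLI is run as crawlith <url> (the command shape and URL are illustrative) and using only flags documented above:

```shell
# Crawl up to the default 500 pages and flag orphan pages.
crawlith https://example.com --orphans -o ./crawlith-reports
```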
Fast Shallow Crawl
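A sketch of a quick first pass, assuming the same illustrative crawlith <url> command shape:

```shell
# Cap page count and depth, raise concurrency for a fast pass.
crawlith https://example.com --limit 100 --depth 2 --concurrency 5
```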
Deep Analysis with All Features
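A sketch combining the documented detection flags, again assuming the illustrative crawlith <url> command shape; this is the slowest configuration:

```shell
# Enable every detection and analysis feature for a full audit.
crawlith https://example.com \
  --limit 5000 \
  --orphans --orphan-severity --include-soft-orphans \
  --detect-soft404 --detect-traps \
  --compute-hits \
  --format json -o ./full-audit
```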
Incremental Re-Crawl
With --incremental enabled, Crawlith sends stored ETag and Last-Modified values as If-None-Match and If-Modified-Since headers, receiving HTTP 304 responses for unchanged pages.
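A sketch of the invocation, assuming the illustrative crawlith <url> command shape used throughout:

```shell
# Re-crawl using stored ETag/Last-Modified validators;
# pages answering 304 Not Modified are not re-fetched.
crawlith https://example.com --incremental
```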
Security-Focused Configuration
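A sketch of a tightly scoped crawl, assuming the illustrative crawlith <url> command shape; the --allow value format shown is an assumption:

```shell
# Restrict fetching to an explicit domain whitelist, leave
# robots.txt enforcement on (the default), and fail CI on
# critical issues.
crawlith https://example.com --allow example.com --fail-on-critical
```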
Storage Location
All crawl data is stored in a persistent SQLite database.

User-Agent Configuration
Crawlith identifies itself with the user agent string crawlith/1.0.
Comparison Mode
Compare two previously exported graph JSON files. The comparison reports:
- Added URLs
- Removed URLs
- Status changes
- Metric deltas (PageRank, health scores, etc.)
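A sketch of how such a comparison might be invoked; the subcommand name and file names here are illustrative, since only the mode itself is documented above:

```shell
# Diff two previously exported graph JSON files.
crawlith compare baseline.json current.json
```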
Best Practices
How many pages should I crawl?

Start with --limit 500 for initial exploration. Increase to 1,000-5,000 for comprehensive audits. Very large sites (10,000+ pages) should be crawled in segments or with higher depth limits to prioritize important pages.

What concurrency should I use?
The default --concurrency 2 is safe for most sites. Increase to 5-10 for fast servers you control. Always respect rate limits with --rate (see Rate Limiting).

Should I enable all detection features?
Enable features based on your needs:
- --orphans: Always recommended for SEO audits
- --detect-soft404: If you suspect thin content pages
- --detect-traps: For dynamic sites with URL parameters
- --compute-hits: For advanced link analysis (adds processing time)
How do I crawl only specific domains?

Use --allow to create a whitelist. This ensures only the listed domains are fetched, even if they are linked from crawled pages.

Next Steps
- Configure Rate Limiting for respectful crawling
- Set up Security boundaries for safe operation
- Explore Troubleshooting for common issues