
Usage

crawlith crawl [url] [options]
The crawl command is Crawlith's core workflow. It performs a comprehensive crawl of a website, building an internal link graph, calculating metrics, and analyzing SEO structure. The results are stored in the local database and can be visualized with the ui command.

Arguments

url
string
required
The URL to crawl. Must be a valid HTTP/HTTPS URL.

Core Options

--limit
number
default:"500"
Maximum number of pages to crawl. Use this to control crawl size.
crawlith crawl https://example.com --limit 1000
--depth
number
default:"5"
Maximum click depth from the starting URL. Controls how deep the crawler traverses your site hierarchy.
crawlith crawl https://example.com --depth 3
--concurrency
number
default:"2"
Maximum number of concurrent requests. Increase for faster crawls on robust infrastructure.
crawlith crawl https://example.com --concurrency 5
--rate
number
default:"2"
Requests per second per host. Controls crawl politeness.
crawlith crawl https://example.com --rate 5
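
Concurrency and rate work together: --concurrency caps the number of in-flight requests, while --rate caps requests per second per host. A combined invocation (the values here are illustrative, not recommendations):

```shell
crawlith crawl https://example.com \
  --concurrency 4 \
  --rate 2
```

Raising --concurrency without also raising --rate keeps the crawl polite per host while still parallelizing across hosts.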

Output & Export Options

--output
string
default:"./crawlith-reports"
Output directory for exports. Used in conjunction with --export.
crawlith crawl https://example.com --output ./reports
--export
string
Export formats (comma-separated). Available formats: json, markdown, csv, html, visualize.
crawlith crawl https://example.com --export json,html,markdown
--format
string
default:"pretty"
Output format for terminal display. Options: pretty, json.
crawlith crawl https://example.com --format json
--log-level
string
default:"normal"
Log verbosity level. Options: normal, verbose, debug.
crawlith crawl https://example.com --log-level verbose

Crawl Behavior

--no-query
boolean
Strip query parameters from URLs during crawl. Useful for avoiding duplicate content from URL parameters.
crawlith crawl https://example.com --no-query
--ignore-robots
boolean
Ignore robots.txt rules. Use responsibly and only on sites you own or have permission to crawl.
crawlith crawl https://example.com --ignore-robots
--incremental
boolean
Perform an incremental crawl using the previous snapshot. Only crawls changed/new pages.
crawlith crawl https://example.com --incremental
--sitemap
string
Sitemap URL to use for crawl discovery. If provided without a value, defaults to /sitemap.xml.
# Use default sitemap.xml
crawlith crawl https://example.com --sitemap

# Use custom sitemap URL
crawlith crawl https://example.com --sitemap https://example.com/sitemap_index.xml

Network & Request Options

--max-bytes
number
default:"2000000"
Maximum response size in bytes (default: 2MB). Prevents crawling of extremely large files.
crawlith crawl https://example.com --max-bytes 5000000
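
--max-bytes takes a raw byte count, and the 2000000 default suggests decimal megabytes (1 MB = 1,000,000 bytes). Shell arithmetic can compute the value inline:

```shell
# 5 MB (decimal) expressed in bytes for --max-bytes
echo $((5 * 1000 * 1000))
# prints 5000000
```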
--max-redirects
number
default:"2"
Maximum number of redirect hops to follow.
crawlith crawl https://example.com --max-redirects 5
--proxy
string
Proxy URL for all requests (format: http://user:pass@host:port).
crawlith crawl https://example.com --proxy http://proxy.example.com:8080
--ua
string
Custom User-Agent string for requests.
crawlith crawl https://example.com --ua "MyBot/1.0"

Domain Filtering

--allow
string
Comma-separated list of domains to allow during the crawl.
crawlith crawl https://example.com --allow example.com,cdn.example.com
--deny
string
Comma-separated list of domains to exclude from the crawl.
crawlith crawl https://example.com --deny ads.example.com,tracking.example.com
--include-subdomains
boolean
Include subdomains in the crawl scope.
crawlith crawl https://example.com --include-subdomains
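
These filters can be combined, for example to scope a crawl to a site and its asset subdomain while skipping a tracking host (hostnames here are illustrative):

```shell
crawlith crawl https://example.com \
  --include-subdomains \
  --allow example.com,cdn.example.com \
  --deny tracking.example.com
```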

Analysis Features

--orphans
boolean
Enable orphan page detection. Identifies pages with few or no inbound internal links.
crawlith crawl https://example.com --orphans
--orphan-severity
boolean
Enable orphan severity scoring. Requires --orphans to be enabled.
crawlith crawl https://example.com --orphans --orphan-severity
--include-soft-orphans
boolean
Include soft orphan detection (pages with few inbound links).
crawlith crawl https://example.com --orphans --include-soft-orphans
--min-inbound
number
default:"2"
Near-orphan threshold override. Pages with fewer than this many inbound links are flagged.
crawlith crawl https://example.com --orphans --min-inbound 3
--detect-soft404
boolean
Detect soft 404 pages (pages that return 200 but contain 404-like content).
crawlith crawl https://example.com --detect-soft404
--detect-traps
boolean
Detect and cluster crawl traps (infinite loops, pagination issues, etc.).
crawlith crawl https://example.com --detect-traps
--cluster-threshold
number
default:"10"
Hamming distance threshold for content cluster detection.
crawlith crawl https://example.com --cluster-threshold 15
--min-cluster-size
number
default:"3"
Minimum number of pages required to form a cluster.
crawlith crawl https://example.com --min-cluster-size 5
--compute-hits
boolean
Compute Hub and Authority scores using the HITS algorithm.
crawlith crawl https://example.com --compute-hits
--no-collapse
boolean
Do not collapse duplicate clusters before calculating PageRank.
crawlith crawl https://example.com --no-collapse

Advanced Options

--score-breakdown
boolean
Print health score component weights and breakdown after crawl.
crawlith crawl https://example.com --score-breakdown
--fail-on-critical
boolean
Exit with code 1 if critical issues are detected. Useful for CI/CD pipelines.
crawlith crawl https://example.com --fail-on-critical
--force
boolean
Force run and override existing process lock. Use with caution.
crawlith crawl https://example.com --force
--compare
string[]
Internal: Compare two graph JSON files. Requires exactly two file paths.
crawlith crawl --compare old.json new.json

Examples

Basic Crawl

crawlith crawl https://example.com

Large Site Crawl with Exports

crawlith crawl https://example.com \
  --limit 5000 \
  --depth 10 \
  --concurrency 5 \
  --export json,html,markdown

Comprehensive SEO Audit

crawlith crawl https://example.com \
  --orphans \
  --orphan-severity \
  --detect-soft404 \
  --detect-traps \
  --compute-hits \
  --score-breakdown

Incremental Crawl for Large Sites

crawlith crawl https://example.com \
  --incremental \
  --sitemap \
  --limit 10000

CI/CD Integration

crawlith crawl https://staging.example.com \
  --fail-on-critical \
  --orphans \
  --detect-soft404 \
  --format json > report.json
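Because --fail-on-critical makes the crawl exit non-zero when critical issues are found, a wrapper script can gate a deploy on the exit status. A minimal sketch of the pattern, with `false` standing in for a failing `crawlith crawl ... --fail-on-critical` run:

```shell
# Sketch: gate a CI step on the crawl's exit status.
# `false` stands in for `crawlith crawl ... --fail-on-critical`,
# which exits with code 1 when critical issues are detected.
false
status=$?
if [ "$status" -ne 0 ]; then
  echo "critical issues detected (exit $status); blocking deploy"
else
  echo "crawl clean; proceeding"
fi
```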
After a successful crawl, use crawlith ui https://example.com to explore the results in an interactive dashboard.
Use --ignore-robots responsibly and only on sites you own or have explicit permission to crawl. Respecting robots.txt is an important web crawling convention.
For large sites, start with a smaller --limit to test your configuration before running a full crawl. You can always use --incremental later to complete the crawl.
