
Overview

This guide covers common issues when using Crawlith, debugging techniques, and solutions to frequently encountered problems.

Debug Mode

Enable detailed logging for troubleshooting:
crawlith crawl https://example.com --log-level debug
Debug output includes:
  • Individual URL fetch status
  • Robots.txt blocking decisions
  • Security policy evaluations
  • Rate limiting behavior
  • Error stack traces

Verbose Mode

For output less detailed than debug mode:
crawlith crawl https://example.com --log-level verbose
Verbose mode shows:
  • Session statistics (fetched, cached, skipped pages)
  • Timing information
  • Queue status

Common Issues

No Pages Crawled

Symptom:
✅ Crawl complete.
❌ No pages were crawled.
Causes:
  1. Robots.txt blocking. Check whether your crawler is blocked:
    curl https://example.com/robots.txt
    
    Look for:
    User-agent: crawlith
    Disallow: /
    
    Solution:
    crawlith crawl https://example.com --ignore-robots
    
    Only use --ignore-robots on sites you own or have permission to crawl.
  2. SSRF protection triggered. If you crawl localhost, 127.0.0.1, or an internal IP:
    status: 'blocked_internal_ip'
    
    Solution: SSRF protection cannot be disabled. Deploy Crawlith on the same network or use public URLs.
  3. Domain filter blocking. The start URL doesn’t match the --allow whitelist:
    # Wrong
    crawlith crawl https://www.example.com --allow example.com
    
    Solution:
    crawlith crawl https://www.example.com --allow www.example.com,example.com
    
  4. Network errors. Enable debug mode to see connection errors:
    crawlith crawl https://example.com --log-level debug
    

Lock File Errors

Symptom:
Crawlith: command already running for https://example.com (PID 12345)
Cause: A previous crawl didn’t complete cleanly, leaving a lock file.
Solution 1: Wait for completion
Check whether the process is still running:
ps aux | grep crawlith
If running, let it complete or kill it:
kill 12345
Solution 2: Force override
From lockManager.ts:50:
crawlith crawl https://example.com --force
This overrides the existing lock.
Only use --force if you’re certain no other crawl is running.
Solution 3: Manual cleanup
Lock files are stored at:
~/.crawlith/locks/
List lock files:
ls -lah ~/.crawlith/locks/
Remove stale locks:
rm ~/.crawlith/locks/*.lock

Understanding Lock Files

From lockManager.ts:1, lock files contain:
{
  "pid": 12345,
  "startedAt": 1709251200000,
  "command": "crawl",
  "target": "https://example.com",
  "args": { "limit": 500, "depth": 5 }
}
Locks are:
  • Unique per command + target URL + options (hashed)
  • Automatically released on exit
  • Cleaned up by signal handlers (SIGINT, SIGTERM)
  • Validated by PID liveness checks
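The hashed lock key and PID liveness check described above can be sketched as follows. This is an illustrative model, not Crawlith's actual lockManager.ts code: the hash algorithm, key length, and function names are assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical lock key: unique per command + target URL + options, hashed.
function lockKey(command: string, target: string, args: object): string {
  const payload = JSON.stringify({ command, target, args });
  return createHash("sha256").update(payload).digest("hex").slice(0, 16);
}

// PID liveness check: signal 0 sends no signal, but throws if the PID
// no longer exists, which is how a stale lock can be detected.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch {
    return false;
  }
}

console.log(lockKey("crawl", "https://example.com", { limit: 500, depth: 5 }));
console.log(isPidAlive(process.pid)); // true: this process is running
```

Because the key covers the options hash, two crawls of the same URL with different flags would not collide under this scheme.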

Slow Crawling

Symptom: Crawl takes much longer than expected.
Causes:
  1. Rate limiting too conservative. Default: 2 req/s. Solution:
    crawlith crawl https://example.com --rate 5 --concurrency 5
    
  2. robots.txt crawl-delay. Check robots.txt:
    User-agent: crawlith
    Crawl-delay: 10
    
    This overrides your --rate setting. Solution:
    # Override robots.txt (use responsibly)
    crawlith crawl https://example.com --ignore-robots --rate 5
    
  3. Low concurrency. Default: 2 concurrent requests. Solution:
    crawlith crawl https://example.com --concurrency 10
    
  4. Network latency. High RTT to the target server adds delay. Diagnosis:
    ping example.com
    curl -w "@curl-format.txt" -o /dev/null -s https://example.com
    
    Solution: Increase concurrency to compensate:
    crawlith crawl https://example.com --concurrency 10 --rate 5
    
  5. Retries from failed requests. From retryPolicy.ts:1, failed requests retry up to 3 times with exponential backoff. Diagnosis:
    crawlith crawl https://example.com --log-level verbose
    
    Look for retries: 3 in output. Solution: Address server errors or skip problematic URLs.
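The retry behavior in cause 5 can be sketched as a delay schedule. The 3-retry cap comes from the text above; the base delay and jitter-free doubling are assumptions, not retryPolicy.ts's actual constants:

```typescript
// Exponential backoff sketch: attempt n waits baseMs * 2^n before retrying,
// up to maxRetries attempts (3 per the retry policy described above).
function backoffDelays(maxRetries = 3, baseMs = 1000): number[] {
  return Array.from({ length: maxRetries }, (_, n) => baseMs * 2 ** n);
}

console.log(backoffDelays()); // [1000, 2000, 4000]
```

Three retries with doubling delays means a single persistently failing URL can add several seconds to the crawl, which is why many failing URLs slow things down noticeably.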

Memory Issues

Symptom:
JavaScript heap out of memory
Causes:
  1. Crawling too many pages. Large crawls (10,000+ pages) consume significant memory. Solution:
    # Increase Node.js heap size
    NODE_OPTIONS="--max-old-space-size=4096" crawlith crawl https://example.com --limit 10000
    
  2. Large HTML pages. Pages exceed the --max-bytes limit. Solution:
    crawlith crawl https://example.com --max-bytes 1000000
    
  3. Memory leaks in long-running crawls. Solution: Break the crawl into smaller segments:
    crawlith crawl https://example.com --limit 1000 --depth 5
    crawlith crawl https://example.com --limit 1000 --depth 10
    

Export Errors

Symptom:
Error: Cannot write export files
Causes:
  1. Permission denied. Check the output directory permissions:
    ls -ld ./crawlith-reports
    
    Solution:
    mkdir -p ./crawlith-reports
    chmod 755 ./crawlith-reports
    
  2. Invalid export format. Solution:
    crawlith crawl https://example.com --export json,html,csv,markdown,visualize
    
  3. Disk space full. Diagnosis:
    df -h
    
    Solution: Free disk space or use a different output directory:
    crawlith crawl https://example.com --output /path/with/space
    

Database Errors

Symptom:
Error: database locked
Error: SQLITE_CORRUPT
Location:
~/.crawlith/crawlith.db
Solution 1: Close other Crawlith instances
SQLite supports only limited concurrent writes:
ps aux | grep crawlith
kill <pid>
Solution 2: Repair corrupted database
# Backup first
cp ~/.crawlith/crawlith.db ~/.crawlith/crawlith.db.backup

# Attempt repair
sqlite3 ~/.crawlith/crawlith.db "PRAGMA integrity_check;"
Solution 3: Reset database
This deletes all crawl history.
rm ~/.crawlith/crawlith.db
Crawlith will create a new database on next run.

Redirect Issues

Symptom:
status: 'redirect_limit_exceeded'
status: 'redirect_loop'
Cause: The site has too many redirects or a circular redirect.
Solution 1: Increase the redirect limit
crawlith crawl https://example.com --max-redirects 5
Default: 2, Maximum: 11
Solution 2: Debug the redirect chain
crawlith crawl https://example.com --log-level debug
Look for redirectChain in output:
{
  "redirectChain": [
    { "url": "http://example.com", "status": 301, "target": "https://example.com" },
    { "url": "https://example.com", "status": 301, "target": "https://www.example.com" }
  ]
}
Solution 3: Start from the final URL
crawlith crawl https://www.example.com
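The two redirect statuses can be reproduced with a small check over a redirectChain like the one debug logging prints. The field names mirror the sample output above; the classification logic itself is an illustrative assumption:

```typescript
interface Hop { url: string; status: number; target: string; }

// Sketch: the limit is exceeded when the chain outgrows --max-redirects,
// and a loop exists when a redirect targets a URL already seen in the chain.
function classifyChain(chain: Hop[], maxRedirects: number): string {
  if (chain.length > maxRedirects) return "redirect_limit_exceeded";
  const seen = new Set<string>();
  for (const hop of chain) {
    seen.add(hop.url);
    if (seen.has(hop.target)) return "redirect_loop";
  }
  return "ok";
}

const chain: Hop[] = [
  { url: "http://example.com", status: 301, target: "https://example.com" },
  { url: "https://example.com", status: 301, target: "https://www.example.com" },
];
console.log(classifyChain(chain, 5)); // "ok"
console.log(classifyChain(chain, 1)); // "redirect_limit_exceeded"

const loop: Hop[] = [
  { url: "https://a.example", status: 302, target: "https://b.example" },
  { url: "https://b.example", status: 302, target: "https://a.example" },
];
console.log(classifyChain(loop, 5)); // "redirect_loop"
```

Note how the sample chain above (http → https → www) needs a limit of at least 2, which is exactly the default.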

Proxy Issues

Symptom:
status: 'proxy_connection_failed'
Causes:
  1. Invalid proxy URL. Solution:
    crawlith crawl https://example.com --proxy http://proxy.example.com:8080
    
  2. Proxy authentication required. Solution:
    crawlith crawl https://example.com --proxy http://user:[email protected]:8080
    
  3. Proxy server unreachable. Diagnosis:
    curl -x http://proxy.example.com:8080 https://example.com
    

Soft 404 Detection Issues

Symptom: Pages are incorrectly flagged as soft 404s.
Cause: --detect-soft404 uses heuristics that can produce false positives.
Solution: Review soft404_score in the exported data:
crawlith crawl https://example.com --detect-soft404 --export json
Scores > 0.7 are flagged. Adjust detection logic if needed.
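Reviewing the exported scores can be scripted. The 0.7 threshold matches the flagging rule above; the node shape (url, soft404_score) is an assumption about the JSON export format:

```typescript
interface PageNode { url: string; soft404_score?: number; }

// List URLs whose soft-404 score exceeds the flagging threshold.
function flaggedSoft404s(nodes: PageNode[], threshold = 0.7): string[] {
  return nodes
    .filter((n) => (n.soft404_score ?? 0) > threshold)
    .map((n) => n.url);
}

const nodes: PageNode[] = [
  { url: "https://example.com/", soft404_score: 0.1 },
  { url: "https://example.com/old-page", soft404_score: 0.85 },
];
console.log(flaggedSoft404s(nodes)); // ["https://example.com/old-page"]
```

Sorting by score before reviewing makes it easy to spot borderline pages just above the threshold, which are the most likely false positives.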

Orphan Detection Not Working

Symptom: No orphans are detected when you expect some.
Cause: Orphan detection is disabled by default.
Solution:
crawlith crawl https://example.com --orphans
For severity scoring:
crawlith crawl https://example.com --orphans --orphan-severity
For near-orphans:
crawlith crawl https://example.com --orphans --include-soft-orphans --min-inbound 3
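The distinction between orphans and near-orphans can be sketched as inbound-link counting on the crawl graph. The graph shape and names here are illustrative assumptions, not Crawlith's internal model: an orphan has zero inbound links, and a near-orphan has fewer than --min-inbound.

```typescript
interface Edge { from: string; to: string; }

// Count inbound links for every known page.
function inboundCounts(pages: string[], edges: Edge[]): Map<string, number> {
  const counts = new Map(pages.map((p): [string, number] => [p, 0]));
  for (const e of edges) {
    if (counts.has(e.to)) counts.set(e.to, (counts.get(e.to) ?? 0) + 1);
  }
  return counts;
}

const pages = ["/", "/about", "/hidden"];
const edges: Edge[] = [{ from: "/", to: "/about" }];
const counts = inboundCounts(pages, edges);

// The start page is excluded: it naturally has no inbound links.
const orphans = pages.filter((p) => p !== "/" && counts.get(p) === 0);
console.log(orphans); // ["/hidden"]
```

With --min-inbound 3, a page with one or two inbound links would additionally be reported as a soft orphan under this model.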

Debugging Workflow

Step 1: Enable Debug Logging

crawlith crawl https://example.com --log-level debug > debug.log 2>&1

Step 2: Check robots.txt

curl https://example.com/robots.txt

Step 3: Test a Single URL

crawlith crawl https://example.com --limit 1 --log-level debug

Step 4: Review Lock Files

ls -lah ~/.crawlith/locks/
cat ~/.crawlith/locks/*.lock

Step 5: Check Database

sqlite3 ~/.crawlith/crawlith.db ".tables"
sqlite3 ~/.crawlith/crawlith.db "SELECT * FROM snapshots ORDER BY created_at DESC LIMIT 5;"

Step 6: Inspect Exports

crawlith crawl https://example.com --export json --output ./debug-output
cat ./debug-output/example.com/graph.json | jq '.nodes[] | select(.status != 200)'
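As a scripted alternative to the jq filter above, the same non-200 check can be done in a few lines. The graph.json shape (a nodes array with url and status fields) is assumed from the jq example:

```typescript
interface GraphNode { url: string; status: number; }

// Keep only nodes whose HTTP status is not 200.
function nonOkNodes(nodes: GraphNode[]): GraphNode[] {
  return nodes.filter((n) => n.status !== 200);
}

// Usage against an export (path from the example above):
//   import { readFileSync } from "node:fs";
//   const graph = JSON.parse(readFileSync("./debug-output/example.com/graph.json", "utf8"));
//   console.log(nonOkNodes(graph.nodes));

const sample: GraphNode[] = [
  { url: "https://example.com/", status: 200 },
  { url: "https://example.com/missing", status: 404 },
];
console.log(nonOkNodes(sample)); // only the 404 node
```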

Error Messages Reference

Cause: No URL was provided to the crawl command.
Solution:
crawlith crawl https://example.com
Cause: Malformed --proxy URL.
Solution:
crawlith crawl https://example.com --proxy http://proxy.example.com:8080
Cause: --orphan-severity was enabled without base orphan detection.
Solution:
crawlith crawl https://example.com --orphans --orphan-severity
Cause: SSRF protection blocked an internal IP.
Solution: Use public URLs or deploy Crawlith on the same network.
Cause: robots.txt was unavailable (common and safe to ignore).
Solution: No action needed. Crawlith proceeds without robots.txt.
Cause: A lock file exists from a previous run.
Solution:
crawlith crawl https://example.com --force
Cause: A lock file exists but its PID is dead.
Solution: No action needed. Crawlith auto-cleans stale locks.

Performance Tuning

Fast Crawling (Your Infrastructure)

crawlith crawl https://example.com \
  --rate 10 \
  --concurrency 10 \
  --max-bytes 5000000 \
  --limit 10000

Balanced (Default)

crawlith crawl https://example.com \
  --rate 2 \
  --concurrency 2 \
  --limit 500

Conservative (Public Sites)

crawlith crawl https://example.com \
  --rate 0.5 \
  --concurrency 1 \
  --limit 200
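With the rate limiter as the bottleneck, each profile above implies a lower bound on crawl duration of pages / rate. A quick estimate (ignoring latency, retries, and robots.txt crawl-delay):

```typescript
// Minimum crawl duration in seconds when rate limiting is the bottleneck.
function minCrawlSeconds(pages: number, ratePerSecond: number): number {
  return pages / ratePerSecond;
}

console.log(minCrawlSeconds(10000, 10)); // fast profile: 1000 s (~17 min)
console.log(minCrawlSeconds(500, 2));    // default profile: 250 s (~4 min)
console.log(minCrawlSeconds(200, 0.5));  // conservative profile: 400 s (~7 min)
```

If a crawl takes much longer than this bound, look at the Slow Crawling causes above rather than the rate setting.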

Getting Help

If you’re still experiencing issues:
  1. Check existing issues: search the project’s GitHub Issues
  2. Collect debug logs:
    crawlith crawl https://example.com --log-level debug > debug.log 2>&1
    
  3. Include system info:
    crawlith --version
    node --version
    uname -a
    
  4. Create a minimal reproduction:
    crawlith crawl https://example.com --limit 10 --log-level debug
    

Technical Details

Source Files

  • plugins/core/src/lock/lockManager.ts - Lock file management
  • plugins/core/src/lock/pidCheck.ts - PID liveness checks
  • plugins/core/src/lock/hashKey.ts - Lock file naming
  • plugins/cli/src/commands/crawl.ts - CLI command implementation
  • plugins/cli/src/output/controller.ts - Logging and output formatting
