Overview
This guide covers common issues when using Crawlith, debugging techniques, and solutions to frequently encountered problems.

Debug Mode
Enable detailed logging for troubleshooting. Debug output includes:
- Individual URL fetch status
- Robots.txt blocking decisions
- Security policy evaluations
- Rate limiting behavior
- Error stack traces
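Assuming a `--debug` flag (the exact flag name is not shown in this guide), enabling debug mode looks like:

```shell
# Enable debug logging for a crawl (the --debug flag name is an assumption).
crawlith crawl https://example.com --debug
```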
Verbose Mode
For less detailed output:
- Session statistics (fetched, cached, skipped pages)
- Timing information
- Queue status
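Assuming a `--verbose` flag (again, the flag name is not shown in this guide):

```shell
# Verbose mode: session statistics, timing, and queue status without full debug detail.
crawlith crawl https://example.com --verbose
```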
Common Issues
No Pages Crawled
Symptom: The crawl completes but no pages are fetched. Possible causes:
Robots.txt blocking
Check whether the site's robots.txt disallows your crawler, and look for robots.txt blocking decisions in the debug output. Solution: restrict the crawl to paths the site allows.
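One way to check the rules by hand (example.com is a placeholder for your target):

```shell
# Download the target's robots.txt and show the directives that affect crawling.
curl -s https://example.com/robots.txt | grep -iE '^(user-agent|disallow|allow|crawl-delay):'
```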
SSRF protection triggered
Crawling localhost, 127.0.0.1, or other internal IPs is blocked. Solution: SSRF protection cannot be disabled. Deploy Crawlith on the same network or use public URLs.
Domain filter blocking
The start URL doesn't match the --allow whitelist. Solution: make sure the start URL's domain is included in your --allow list.
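For example (the value format accepted by --allow is an assumption):

```shell
# Include the start URL's domain in the --allow whitelist.
crawlith crawl https://example.com --allow example.com
```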
Network errors
Enable debug mode to see connection errors and stack traces for each failed fetch.
Lock File Errors
Symptom: The command refuses to start because another instance appears to hold the lock (the check lives in lockManager.ts:50).
Understanding Lock Files
From lockManager.ts:1, lock files are:
- Unique per command + target URL + options (hashed)
- Automatically released on exit
- Cleaned up by signal handlers (SIGINT, SIGTERM)
- Validated by PID liveness checks
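The PID liveness check can be reproduced by hand with `kill -0`; a sketch, where 12345 stands in for the PID recorded in the lock file:

```shell
# If the lock's PID is dead, the lock is stale and Crawlith will clean it up itself.
LOCK_PID=12345   # placeholder: read this from the lock file
if kill -0 "$LOCK_PID" 2>/dev/null; then
  echo "process $LOCK_PID is alive: another run is in progress"
else
  echo "process $LOCK_PID is gone: the lock is stale"
fi
```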
Slow Crawling
Symptom: Crawl takes much longer than expected. Causes:
Rate limiting too conservative
Default: 2 req/s. Solution: raise the request rate with --rate on infrastructure that can handle it.
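For example (the value 10 is an arbitrary illustration):

```shell
# Allow up to 10 requests per second instead of the default 2.
crawlith crawl https://example.com --rate 10
```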
robots.txt crawl-delay
Check robots.txt for a Crawl-delay directive; it overrides your --rate setting. Solution: Crawlith honors the directive, so the effective rate is set by the site.
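To see whether the site sets a crawl delay (example.com is a placeholder):

```shell
# A matching line such as "Crawl-delay: 10" caps the crawl at one request per 10 seconds.
curl -s https://example.com/robots.txt | grep -i 'crawl-delay'
```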
Low concurrency
Default: 2 concurrent requests. Solution: raise the concurrency limit when the target can absorb parallel requests.
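Assuming a `--concurrency` flag (the flag name is not shown in this guide):

```shell
# Run 8 requests in parallel instead of the default 2 (flag name is an assumption).
crawlith crawl https://example.com --concurrency 8
```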
Network latency
High RTT to the target server adds delay to every request.
Diagnosis: measure round-trip time to the target host.
Solution: Increase concurrency to compensate; parallel requests hide per-request latency.
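One way to measure, using curl's timing variables (example.com is a placeholder):

```shell
# Report DNS, connect, and total time for a single request.
curl -s -o /dev/null -w 'dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n' https://example.com/
```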
Retries from failed requests
From retryPolicy.ts:1, failed requests retry up to 3 times with exponential backoff.
Diagnosis: Look for retries: 3 in the output.
Solution: Address server errors or skip problematic URLs.
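The backoff schedule can be sketched as follows; the 1-second base delay and doubling factor are assumptions, since the guide only states "3 retries with exponential backoff":

```shell
# Exponential backoff sketch: each retry waits twice as long as the previous one.
delay=1
for attempt in 1 2 3; do
  echo "retry $attempt: waiting ${delay}s"
  delay=$((delay * 2))
done
# waits of 1s, 2s, and 4s
```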
Memory Issues
Symptom: The Crawlith process uses excessive memory. Causes:
Crawling too many pages
Large crawls (10,000+ pages) consume significant memory. Solution: reduce the crawl scope, or break the crawl into smaller segments.
Large HTML pages
Pages exceeding the --max-bytes limit. Solution: lower --max-bytes to cap how much of each page is buffered.
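For example (the byte value and its plain-number format are assumptions):

```shell
# Cap page downloads at 1 MiB (value format is an assumption).
crawlith crawl https://example.com --max-bytes 1048576
```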
Memory leaks in long-running crawls
Solution: Break the crawl into smaller segments and run them sequentially.
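A sketch of segmenting by site section (the paths are placeholders):

```shell
# Crawl one section at a time so each run starts with fresh memory.
for section in docs blog products; do
  crawlith crawl "https://example.com/${section}/"
done
```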
Export Errors
Symptom: The export step fails or writes no files. Causes:
Permission denied
Check output directory permissions. Solution: make the output directory writable by the user running Crawlith.
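A quick check and fix, assuming `./crawl-out` as the output directory:

```shell
# Create the output directory if needed and ensure the current user can write to it.
mkdir -p ./crawl-out && chmod u+rwx ./crawl-out
# Confirm the permissions.
ls -ld ./crawl-out
```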
Invalid export format
Solution: pass one of the export formats supported by your Crawlith version.
Disk space full
Diagnosis: check free space on the filesystem that holds the output directory.
Solution: Free disk space or use a different output directory.
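For example:

```shell
# Show free space for the filesystem containing the current directory.
df -h .
```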
Database Errors
Symptom: Errors while reading or writing the crawl database.

Redirect Issues
Symptom: URLs end up somewhere unexpected. Diagnosis: inspect the redirectChain field in the output.
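If you export to JSON, something like the following lists redirected pages; the file name export.json and the surrounding structure are assumptions, only the redirectChain field is named by this guide:

```shell
# List pages whose redirectChain is non-empty (export layout is an assumption).
jq '.pages[] | select(.redirectChain | length > 0) | {url, redirectChain}' export.json
```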
Proxy Issues
Symptom: Requests fail when a proxy is configured. Causes:
Invalid proxy URL
Solution: pass a complete proxy URL, including scheme and port, to --proxy.
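For example (host and port are placeholders):

```shell
# Route the crawl through an HTTP proxy.
crawlith crawl https://example.com --proxy http://proxy.example.com:8080
```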
Proxy authentication required
Solution: include credentials in the proxy URL (for example, http://user:pass@host:port).
Proxy server unreachable
Diagnosis: verify the proxy is reachable independently of Crawlith.
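One way to test, with placeholder host and port:

```shell
# Fetch a page through the proxy; a 200 status means the proxy works.
curl -s -o /dev/null -w '%{http_code}\n' -x http://proxy.example.com:8080 https://example.com/
```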
Soft 404 Detection Issues
Symptom: Pages incorrectly flagged as soft 404s.
Cause: --detect-soft404 uses heuristics that may produce false positives.
Solution: Review the soft404_score field in the exported data and verify borderline pages by hand.
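For JSON exports, something like this surfaces the borderline pages; the file name, structure, and 0.8 threshold are assumptions, only soft404_score is named by this guide:

```shell
# Show pages whose soft-404 score is suspiciously high (layout/threshold are assumptions).
jq '.pages[] | select(.soft404_score > 0.8) | {url, soft404_score}' export.json
```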
Orphan Detection Not Working
Symptom: No orphans detected when expected.
Cause: Orphan detection is disabled by default.
Solution: enable it with --orphans (and pass --orphan-severity only together with --orphans).

Debugging Workflow
Step 1: Enable Debug Logging
Step 2: Check robots.txt
Step 3: Test Single URL
Step 4: Review Lock Files
Step 5: Check Database
Step 6: Inspect Exports
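The first three steps can be sketched as shell commands; the `--debug` flag name is an assumption, and the URLs are placeholders:

```shell
# Step 1: re-run with debug logging to see per-URL decisions.
crawlith crawl https://example.com --debug
# Step 2: confirm robots.txt isn't blocking you.
curl -s https://example.com/robots.txt
# Step 3: crawl a single known-good URL to isolate the problem.
crawlith crawl https://example.com/index.html
```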
Error Messages Reference
Error: URL argument is required for crawling
Cause: No URL provided to the crawl command.
Solution: pass the start URL as the first argument to crawlith crawl.
Error: Invalid proxy URL
Cause: Malformed --proxy URL.
Solution: use a complete proxy URL with scheme and port.

Error: --orphan-severity requires --orphans
Cause: --orphan-severity was enabled without base orphan detection.
Solution: add --orphans whenever you pass --orphan-severity.
Blocked internal IP: 127.0.0.1
Cause: SSRF protection blocked an internal IP.
Solution: Use public URLs or deploy Crawlith in the same network.
Failed to fetch robots.txt, proceeding...
Cause: robots.txt unavailable (common and safe to ignore).
Solution: No action needed. Crawlith proceeds without robots.txt.
Crawlith: command already running (PID 12345)
Cause: Lock file exists from a previous run.
Solution: wait for the other run to finish; if the process is dead, the stale lock is cleaned automatically.
Detected stale lock. Continuing execution.
Cause: Lock file exists but its PID is dead.
Solution: No action needed. Crawlith auto-cleans stale locks.
Performance Tuning
Fast Crawling (Your Infrastructure)
Balanced (Default)
Conservative (Public Sites)
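The three profiles might look like this; --rate is documented above, while --concurrency and the specific values are assumptions:

```shell
# Fast: your own infrastructure, high rate and parallelism (values are illustrative).
crawlith crawl https://internal.example.com --rate 20 --concurrency 10
# Balanced: the defaults (2 req/s, 2 concurrent requests).
crawlith crawl https://example.com
# Conservative: public sites you don't control.
crawlith crawl https://example.com --rate 1 --concurrency 1
```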
Getting Help
If you’re still experiencing issues:
- Check existing issues: Search GitHub Issues
- Collect debug logs:
- Include system info:
- Create a minimal reproduction:
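Something like the following gathers the basics for a report; the `--version` flag and the assumption that Crawlith runs on Node.js are inferred from the TypeScript sources listed under Technical Details:

```shell
# Collect system details to attach to a GitHub issue.
uname -a
node --version        # Crawlith's sources are TypeScript, so Node is assumed
crawlith --version    # flag name is an assumption
```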
Related Topics
- Configuration - All command-line options
- Rate Limiting - Performance optimization
- Security - Understanding security blocks
Technical Details
Source Files
- plugins/core/src/lock/lockManager.ts - Lock file management
- plugins/core/src/lock/pidCheck.ts - PID liveness checks
- plugins/core/src/lock/hashKey.ts - Lock file naming
- plugins/cli/src/commands/crawl.ts - CLI command implementation
- plugins/cli/src/output/controller.ts - Logging and output formatting