
Crawl Your First Site

Let’s crawl a website and generate a full report with link graphs, issue detection, and exports.

Step 1: Run a basic crawl

Start by crawling a small site with default settings:
crawlith crawl https://example.com
You’ll see real-time progress as Crawlith discovers and fetches pages:
🚀 Starting Crawlith Site Crawler
Target: https://example.com
Limits: Pages: 500 | Depth: 5

🔍 Fetching robots.txt... Done
📄 Crawling pages... [50/500] [Depth: 2/5]
✅ Crawl complete.
🔍 Detecting duplicates... Done
🧩 Clustering content... Done
📊 Calculating final report metrics... Done
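The "Fetching robots.txt" step is a standard politeness check before any page is requested. Python's built-in urllib.robotparser shows what such a check involves (this is an illustration of the general mechanism, not Crawlith's internals):

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt; a polite crawler parses this before fetching pages.
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://example.com/about"))   # True: allowed
print(parser.can_fetch("*", "https://example.com/admin/"))  # False: disallowed
```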

Step 2: Review the results

After the crawl completes, Crawlith displays a comprehensive report:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Crawl Report for https://example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📈 Overview
  Pages Crawled:        247
  Total Links:          1,234
  Avg Links/Page:       5.0
  Max Depth Reached:    4

🏥 Health Score: 87.5

⚠️  Issues Detected
  Broken Links:         3
  Redirect Chains:      2
  Orphan Pages:         0
  Soft 404s:            1

💾 run `crawlith ui` to view the full report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
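The overview figures are simple aggregates over the link graph. For instance, Avg Links/Page is just total links divided by pages crawled, which you can verify from the report above:

```python
# Sanity-checking the overview arithmetic from the sample report.
pages_crawled = 247
total_links = 1234
avg_links_per_page = round(total_links / pages_crawled, 1)
print(avg_links_per_page)  # 5.0
```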

Step 3: Launch the Web Dashboard

For interactive visualization, launch the web UI:
crawlith ui
This opens a React-based dashboard in your browser where you can:
  • Browse all crawl snapshots by site
  • View interactive D3.js link graphs
  • Drill down into individual pages and their connections
  • Compare metrics across snapshots

Common Crawl Options

Customize your crawl with these frequently used options:

Limit Pages and Depth

crawlith crawl https://example.com --limit 100 --depth 3
  • --limit: Maximum number of pages to crawl (default: 500)
  • --depth: Maximum click depth from the starting URL (default: 5)
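The two limits bound a breadth-first traversal: --limit caps how many pages are visited in total, while --depth stops the crawler from following links more than that many clicks from the start URL. A minimal sketch of how the two interact, on an in-memory link graph (pure illustration, not Crawlith's crawler):

```python
from collections import deque

def crawl(graph: dict, start: str, limit: int, depth: int) -> list:
    """Breadth-first traversal that stops at `limit` pages or `depth` clicks."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < limit:
        url, d = queue.popleft()
        order.append(url)
        if d == depth:
            continue  # don't enqueue links beyond the depth limit
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, d + 1))
    return order

graph = {"/": ["/a", "/b"], "/a": ["/c"], "/c": ["/d"]}
print(crawl(graph, "/", limit=100, depth=2))  # ['/', '/a', '/b', '/c'] — /d is 3 clicks deep
```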

Enable Issue Detection

crawlith crawl https://example.com --detect-soft404 --detect-traps --orphans
  • --detect-soft404: Identify pages that return 200 but are actually error pages
  • --detect-traps: Detect infinite URL parameter spaces (e.g., calendars)
  • --orphans: Find pages with no internal links pointing to them
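At its core, orphan detection is an in-degree check: a page known to exist (e.g. from a sitemap) that no crawled page links to. A sketch of that check, with hypothetical data (not Crawlith's implementation):

```python
def find_orphans(pages, links):
    """Pages with zero inbound links from other pages (self-links ignored)."""
    targets = {dst for src, dst in links if src != dst}
    return sorted(p for p in pages if p not in targets)

# Hypothetical site: /old-promo exists but nothing links to it.
pages = ["/", "/about", "/old-promo"]
links = [("/", "/about"), ("/about", "/")]
print(find_orphans(pages, links))  # ['/old-promo']
```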

Export Results

crawlith crawl https://example.com --export json,csv,html,visualize
Exports are saved to ./crawlith-reports/<domain>/:
  • JSON: Complete graph data with nodes and edges
  • CSV: Tabular page data for spreadsheet analysis
  • HTML: Standalone HTML report
  • Visualize: Interactive D3.js link graph
Use --output <path> to customize the export directory:
crawlith crawl https://example.com --export json --output ./reports
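Since the JSON export is documented as graph data with nodes and edges, it is straightforward to post-process. The snippet below assumes a shape like `{"nodes": [...], "edges": [...]}` with `url` and `status` fields; these names are illustrative, so inspect a real file in your export directory for the exact schema:

```python
import json

# Hypothetical export payload; field names are assumptions, not Crawlith's schema.
report = json.loads("""
{
  "nodes": [
    {"url": "https://example.com/", "status": 200},
    {"url": "https://example.com/missing", "status": 404}
  ],
  "edges": [
    {"source": "https://example.com/", "target": "https://example.com/missing"}
  ]
}
""")

broken = [n["url"] for n in report["nodes"] if n["status"] >= 400]
print(f"{len(report['nodes'])} pages, {len(report['edges'])} links")
print("Broken:", broken)
```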

Analyze a Single Page

For quick on-page SEO analysis without a full crawl:
crawlith page https://example.com/about
This analyzes the page from your local crawl database. Add --live to fetch fresh data:
crawlith page https://example.com/about --live
Output includes:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📄 Page Analysis: https://example.com/about
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🏥 Health Score: 92.3

📝 SEO Signals
  Title:              About Us - Example Corp (23 chars) ✓
  Meta Description:   Learn about our mission... (145 chars) ✓
  H1 Count:           1 ✓
  Canonical:          https://example.com/about
  Indexable:          Yes

📊 Content Analysis
  Word Count:         847
  Unique Sentences:   42
  Text/HTML Ratio:    0.65
  Thin Content Score: 15 (Good)

🔗 Link Analysis
  Internal Links:     12
  External Links:     3
  External Ratio:     20.0%

♿ Accessibility
  Total Images:       5
  Missing Alt Text:   0 ✓
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
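Two of these metrics are easy to reason about from first principles. The external ratio above is external links over all links (3 of 15 = 20.0%), and one common definition of the text/HTML ratio is visible text length over total markup length. A sketch of both, using the standard library's HTMLParser (illustrative only, not necessarily how Crawlith computes them):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text nodes to estimate a text-to-HTML ratio."""
    def __init__(self):
        super().__init__()
        self.text = []
    def handle_data(self, data):
        self.text.append(data)

html = "<html><body><h1>About Us</h1><p>We build crawlers.</p></body></html>"
extractor = TextExtractor()
extractor.feed(html)
visible = "".join(extractor.text)

text_html_ratio = len(visible) / len(html)
external_ratio = 3 / (12 + 3)  # numbers from the report above
print(f"Text/HTML: {text_html_ratio:.2f}, External: {external_ratio:.1%}")
```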

Incremental Crawling

For large sites, use incremental crawling to re-crawl efficiently:
crawlith crawl https://example.com --incremental
This uses ETag and Last-Modified headers from the previous snapshot to skip unchanged pages, dramatically reducing crawl time.
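The underlying mechanism is standard HTTP conditional requests: send back the validators from the last fetch, and a `304 Not Modified` response means the page can be skipped. A sketch of building those headers from a stored snapshot (the snapshot dict shape is hypothetical, not Crawlith's storage format):

```python
def conditional_headers(snapshot: dict) -> dict:
    """Build If-None-Match / If-Modified-Since headers from a prior fetch.
    A 304 Not Modified response lets the crawler skip re-parsing the page."""
    headers = {}
    if snapshot.get("etag"):
        headers["If-None-Match"] = snapshot["etag"]
    if snapshot.get("last_modified"):
        headers["If-Modified-Since"] = snapshot["last_modified"]
    return headers

previous = {"etag": '"abc123"', "last_modified": "Tue, 01 Apr 2025 10:00:00 GMT"}
print(conditional_headers(previous))
```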

Infrastructure Auditing

Run deep infrastructure checks:
crawlith audit https://example.com
This performs:
  • TLS Analysis: Certificate validity, protocol versions, cipher suites
  • DNS Checks: Resolution time, record types, DNSSEC
  • Security Headers: HSTS, CSP, X-Frame-Options, etc.
  • Transport Analysis: HTTP/2 support, compression, response times
Example output:
🔒 TLS Certificate
  Valid:              Yes ✓
  Issuer:             Let's Encrypt
  Expires:            2026-05-15
  Protocol:           TLSv1.3 ✓

🛡️  Security Headers
  Strict-Transport-Security: max-age=31536000 ✓
  Content-Security-Policy:   Present ✓
  X-Frame-Options:           DENY ✓
  X-Content-Type-Options:    nosniff ✓

📡 DNS Performance
  Resolution Time:    23ms ✓
  IPv6 Support:       Yes
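The security-headers portion of an audit like this boils down to checking a response's header map against a list of expected names (header names are case-insensitive). A minimal sketch of that check (illustrative, not Crawlith's audit code):

```python
# The four headers shown in the sample audit output above.
REQUIRED = {
    "strict-transport-security",
    "content-security-policy",
    "x-frame-options",
    "x-content-type-options",
}

def missing_security_headers(response_headers: dict) -> list:
    """Return which commonly audited security headers are absent."""
    present = {name.lower() for name in response_headers}
    return sorted(REQUIRED - present)

headers = {
    "Strict-Transport-Security": "max-age=31536000",
    "X-Content-Type-Options": "nosniff",
}
print(missing_security_headers(headers))  # ['content-security-policy', 'x-frame-options']
```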

Next Steps

  • CLI Reference: Explore all crawl options and advanced flags
  • Page Analysis: Deep dive into on-page SEO analysis features
  • Web Dashboard: Learn how to navigate the interactive UI
  • Export Formats: Understand all available export options

All crawl data is persisted in ~/.crawlith/crawlith.db. Use crawlith sites to list all tracked sites and crawlith clean to remove old snapshots.
