Overview
Incremental crawls allow you to efficiently re-crawl a site by only fetching pages that have changed since the last crawl. Crawlith uses HTTP conditional requests (ETag and Last-Modified headers) to detect changes and can compare snapshots to identify what’s new, removed, or modified.
How It Works
Conditional GET Requests
When crawling a page that was previously fetched, Crawlith sends a conditional GET request. The server responds with either:
- 304 Not Modified if the content hasn’t changed → Crawlith uses cached data
- 200 OK with new content if the page changed → Crawlith fetches and processes it
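The validator headers behind these conditional requests can be sketched in Python. This is a generic illustration of conditional GETs (RFC 9110), not Crawlith’s actual client code:

```python
def conditional_headers(etag=None, last_modified=None):
    """Build validator headers for a conditional GET request."""
    headers = {}
    if etag:
        # The server compares this against the resource's current ETag.
        headers["If-None-Match"] = etag
    if last_modified:
        # The server compares this against the resource's modification date.
        headers["If-Modified-Since"] = last_modified
    return headers

# A page first fetched with ETag "abc123" is re-requested like this;
# the server then replies 304 (unchanged) or 200 with a fresh body.
headers = conditional_headers(etag='"abc123"',
                              last_modified="Tue, 01 Oct 2024 00:00:00 GMT")
```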
Handling Cached Responses
When a 304 response is received, Crawlith reuses previous data without re-parsing:
- Reduces server load (no full HTML transfer for unchanged pages)
- Faster crawls (skip HTML parsing and link extraction)
- Lower bandwidth usage
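The reuse decision amounts to branching on the status code. A minimal sketch, where `parse_html` is a stand-in for real parsing and link extraction:

```python
def parse_html(body):
    """Stand-in for real HTML parsing and link extraction."""
    return {"html": body, "links": []}

def resolve_response(status, body, cached_record):
    """Decide what data to keep after a conditional GET."""
    if status == 304:
        # Not Modified: no body was transferred; reuse the previous
        # parsed record and skip parsing entirely.
        return cached_record, "cached"
    # 200 OK: the page changed, so parse the new content.
    return parse_html(body), "fetched"
```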
Snapshot Comparison
After an incremental crawl, you can compare snapshots to see what changed:
- Added URLs: Pages that didn’t exist in the previous snapshot
- Removed URLs: Pages that are no longer found
- Status changes: Pages with different HTTP status codes
- Canonical changes: Modifications to canonical URL declarations
- Duplicate group changes: Pages moving between duplicate clusters
- Metric deltas: Changes in structural entropy, orphan count, and crawl efficiency
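The added/removed/status-change parts of such a comparison reduce to a set-based diff. A sketch, where the snapshot shape (`URL -> {"status": ...}`) is an assumption for illustration:

```python
def diff_snapshots(prev, curr):
    """Diff two snapshots, each mapping URL -> {"status": int, ...}."""
    added = sorted(curr.keys() - prev.keys())      # new in this crawl
    removed = sorted(prev.keys() - curr.keys())    # no longer found
    status_changes = {
        url: (prev[url]["status"], curr[url]["status"])
        for url in prev.keys() & curr.keys()
        if prev[url]["status"] != curr[url]["status"]
    }
    return {"added": added, "removed": removed,
            "status_changes": status_changes}
```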
CLI Usage
Run an Incremental Crawl
When --incremental is used, Crawlith:
- Loads the most recent snapshot from the database
- Sends conditional GET requests using stored ETags and Last-Modified dates
- Reuses cached content for unchanged pages (304 responses)
- Only fully fetches and parses pages that have changed
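The steps above can be sketched as a single loop. The `fetch` callable and the record fields here are illustrative, not Crawlith’s internal API:

```python
def incremental_crawl(urls, previous, fetch):
    """Re-crawl `urls` using validators stored in `previous` (URL -> record).

    `fetch(url, headers)` must return (status, response_headers, body);
    it stands in for the real HTTP client.
    """
    results = {}
    for url in urls:
        prior = previous.get(url) or {}
        headers = {}
        # Send stored validators so the server can answer 304.
        if prior.get("etag"):
            headers["If-None-Match"] = prior["etag"]
        if prior.get("last_modified"):
            headers["If-Modified-Since"] = prior["last_modified"]
        status, resp_headers, body = fetch(url, headers)
        if status == 304:
            # Unchanged: carry the previous record forward.
            results[url] = dict(prior, crawl_status="cached")
        else:
            # Changed: keep the fresh body and its new validators.
            results[url] = {
                "etag": resp_headers.get("ETag"),
                "last_modified": resp_headers.get("Last-Modified"),
                "body": body,
                "crawl_status": "fetched",
            }
    return results
```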
Compare Two Snapshots
When to Use Incremental Crawls
Frequent Monitoring
Monitor site changes daily or weekly by only fetching what’s new
Large Sites
Reduce crawl time on sites with thousands of pages that rarely change
Content Audits
Track which pages are being updated and how often
CI/CD Integration
Detect unintended changes in staging or production deployments
Understanding the Results
Cached vs. Fetched Status
After an incremental crawl, check the crawl status metrics:
- cached: Page returned 304 Not Modified, used previous data
- fetched: Page returned new content (200 OK)
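One way to summarize these, assuming each page record carries a `crawl_status` field of `"cached"` or `"fetched"`; a high cached ratio means little changed since the last crawl:

```python
from collections import Counter

def crawl_status_summary(results):
    """Tally cached vs. fetched pages in a crawl result map."""
    counts = Counter(r["crawl_status"] for r in results.values())
    total = sum(counts.values()) or 1  # avoid division by zero
    return {
        "cached": counts["cached"],
        "fetched": counts["fetched"],
        "cached_ratio": counts["cached"] / total,
    }
```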
Metric Deltas
The comparison provides quantitative metrics:
- structuralEntropy: Increase suggests more diverse link patterns (may indicate new content silos)
- orphanCount: Positive delta means more unreachable pages
- crawlEfficiency: Decrease means pages are buried deeper in the site
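Each delta is simply the current value minus the previous one, so a positive delta means the metric increased since the last crawl. A hypothetical helper:

```python
def metric_deltas(previous, current):
    """Signed change for every metric present in both snapshots."""
    return {name: current[name] - previous[name]
            for name in previous.keys() & current.keys()}
```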
Best Practices
Set appropriate crawl frequency
Don’t crawl more often than content updates. Most sites benefit from daily or weekly incremental crawls.
Keep baseline snapshots
Export and archive baseline snapshots before major site changes (redesigns, migrations) for comparison.
Monitor removed URLs
Large numbers of removed URLs may indicate broken internal links or content archival issues.
Validate status changes
Unexpected status changes (200 → 404 or 200 → 301) often reveal broken links or missing redirects.
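Such regressions can be filtered out of a comparison mechanically. A sketch, assuming status changes are stored as `{url: (old_status, new_status)}`:

```python
def flag_status_regressions(status_changes):
    """Keep transitions from 200 to an error or redirect status."""
    return {
        url: (old, new)
        for url, (old, new) in status_changes.items()
        if old == 200 and new >= 300  # e.g. 200 -> 404 or 200 -> 301
    }
```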
Technical Details
Snapshot Types
Crawlith supports three snapshot types:
- full: Complete crawl with no previous reference
- incremental: Uses previous snapshot for conditional requests
- partial: Custom crawl subset (advanced use)
Database Storage
ETags and Last-Modified dates are persisted in the database.
See Also
Web Crawling
Learn about the core crawling algorithm and options
Graph Analysis
Analyze snapshot data to identify structural changes
Export Data
Export snapshots for long-term archival and comparison