Overview
Resume mode compares your URL collection CSV against your existing data CSV and creates a temporary queue containing only unscraped URLs. This enables efficient recovery from:
- Network interruptions
- Rate limiting
- System crashes
- Manual cancellations
How It Works
1. Queue File Creation
The resume workflow:
- Reads all URLs from the URL collection CSV (e.g., `jiji_urls.csv`)
- Reads all scraped URLs from the data CSV (e.g., `jiji_data.csv`) using the `url` column
- Calculates the difference (URLs in collection but not in data)
- Creates a temporary queue CSV with only pending URLs:
  - Jiji: `outputs/urls/jiji_resume_queue.csv`
  - Meqasa: `outputs/urls/meqasa_resume_queue.csv`
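The core of the workflow is a set difference between the two CSVs. A minimal sketch, assuming both files have a `url` column as described above (the helper name `pending_urls` is illustrative, not the project's API):

```python
import csv

def pending_urls(url_csv: str, data_csv: str) -> list[str]:
    """Return URLs present in the collection CSV but absent from the data CSV."""
    with open(url_csv, newline="", encoding="utf-8") as f:
        collected = [row["url"] for row in csv.DictReader(f)]
    try:
        with open(data_csv, newline="", encoding="utf-8") as f:
            scraped = {row["url"] for row in csv.DictReader(f)}
    except FileNotFoundError:
        scraped = set()  # no data CSV yet: everything is pending
    # Preserve collection order; set membership keeps each lookup O(1)
    return [u for u in collected if u not in scraped]
```

The pending list is what gets written to the temporary queue CSV.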
2. Progress Tracking
Before scraping begins, you’ll see a summary table:

| Source | URL Pool | Already Scraped | Pending |
|---|---|---|---|
| Jiji | 500 | 350 | 150 |
| Meqasa | 300 | 300 | 0 |
- URL Pool: Total unique URLs in the URL CSV
- Already Scraped: URLs found in the data CSV
- Pending: URLs that will be scraped in this run
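The three columns follow directly from the counts the queue builder produces; Pending is simply the pool minus what has already been scraped. A tiny illustrative helper (not the project's actual rendering code):

```python
def summary_row(source: str, total: int, scraped: int) -> str:
    """Format one row of the resume summary table from the two known counts."""
    pending = total - scraped
    return f"| {source} | {total} | {scraped} | {pending} |"
```

For the Jiji row above, `summary_row("Jiji", 500, 350)` yields `| Jiji | 500 | 350 | 150 |`.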
3. Automatic Cleanup
After the scrape completes (successfully or with errors), temporary queue files are automatically deleted. This ensures:
- No stale queue files accumulate
- Fresh queue calculation on next resume
- Clean project state
If all URLs have been scraped (pending = 0), no spider runs and the queue file is not created.
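The "successfully or with errors" guarantee maps naturally onto a `try`/`finally` pattern. A sketch of that shape, with hypothetical names (`run_spider`, `scrape_with_queue`) standing in for the project's real functions:

```python
import os

def scrape_with_queue(queue_path: str, run_spider) -> None:
    """Run a spider against a queue file, deleting the queue afterwards."""
    try:
        run_spider(queue_path)
    finally:
        # Executes on success *and* on error, so no stale queue survives
        if os.path.exists(queue_path):
            os.remove(queue_path)
```

A crash inside `run_spider` still triggers the `finally` block, which is why the next resume always starts from a fresh queue calculation.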
Usage
Interactive Mode (Recommended)
Run `main.py` and select the resume option:
- Choose “Resume listing scrape (missing URLs only)”
- Select source (Jiji, Meqasa, or both)
- Confirm whether to use default CSV paths
- Review the resume queue summary
- Scraping begins automatically for pending URLs
Custom CSV Paths
If your URL or data CSVs are in non-default locations:
- Select “No” when asked about default paths
- Provide custom paths when prompted:
  - URL CSV path (source of all listing URLs)
  - Data CSV path (already scraped listings)
Direct Spider Execution
Resume mode is primarily designed for use through `main.py`, but you can manually create queue files and run the spiders yourself.
When running spiders directly, you’re responsible for creating and cleaning up queue files.
Resume Queue Logic
The queue building process (`main.py:152-206`):
- Missing URL CSV: queue is not created; returns `(0, 0, 0)`
- Empty URL CSV: queue is not created; returns `(0, 0, 0)`
- Missing data CSV: all URLs from the URL CSV are queued
- All scraped: the existing queue file (if any) is deleted; returns `(total, total, 0)`
- Partial completion: queue file is created with the pending URLs
- The URL CSV is deduplicated before comparison; only the first occurrence of each URL is kept
- The data CSV is read into a set for efficient lookup
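The rules above can be sketched as one function. This is an illustration of the documented behavior, not the code from `main.py:152-206`; the name `build_resume_queue` and its `(total, scraped, pending)` return shape are assumptions based on the cases listed:

```python
import csv
import os

def build_resume_queue(url_csv: str, data_csv: str, queue_csv: str) -> tuple[int, int, int]:
    """Return (total, already_scraped, pending), writing the queue CSV if needed."""
    if not os.path.exists(url_csv):
        return (0, 0, 0)                      # missing URL CSV: no queue
    with open(url_csv, newline="", encoding="utf-8") as f:
        seen, urls = set(), []
        for row in csv.DictReader(f):
            if row["url"] not in seen:        # dedupe, keep first occurrence
                seen.add(row["url"])
                urls.append(row["url"])
    if not urls:
        return (0, 0, 0)                      # empty URL CSV: no queue
    scraped = set()
    if os.path.exists(data_csv):              # missing data CSV: queue everything
        with open(data_csv, newline="", encoding="utf-8") as f:
            scraped = {row["url"] for row in csv.DictReader(f)}
    pending = [u for u in urls if u not in scraped]
    if not pending:
        if os.path.exists(queue_csv):
            os.remove(queue_csv)              # all scraped: drop stale queue
        return (len(urls), len(urls), 0)
    with open(queue_csv, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["url"])
        w.writerows([u] for u in pending)
    return (len(urls), len(urls) - len(pending), len(pending))
```

Reading the data CSV into a set rather than a list matters once the collection grows: membership tests drop from O(n) per URL to O(1).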
Jiji Auto-Cleaning
When resuming Jiji scrapes, if any new listings are scraped, the automatic cleaning process runs after completion: Input:outputs/data/jiji_data.csv (all Jiji listings)
Output: outputs/data/raw.csv (cleaned dataset)
The cleaning:
- Processes the entire
jiji_data.csv(not just newly scraped items) - Overwrites the previous
raw.csv - Runs only if
jiji_data.csvexists
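The three bullets above translate to a simple shape: a guard on the input file, a full read, and a truncating write. A sketch under the assumption that cleaning is row-by-row (`clean_jiji` and the `clean_row` callback are hypothetical; the project's actual transformations are not documented here):

```python
import csv
import os

def clean_jiji(data_csv: str, out_csv: str, clean_row) -> bool:
    """Rebuild the cleaned dataset from the full data CSV. Returns False if skipped."""
    if not os.path.exists(data_csv):
        return False                          # runs only if the data CSV exists
    with open(data_csv, newline="", encoding="utf-8") as f:
        rows = [clean_row(r) for r in csv.DictReader(f)]  # entire file, not a delta
    # "w" mode truncates, so the previous raw.csv is fully overwritten
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        if rows:
            w = csv.DictWriter(f, fieldnames=rows[0].keys())
            w.writeheader()
            w.writerows(rows)
    return True
```

Reprocessing the whole file instead of just the new rows keeps `raw.csv` deterministic: its contents depend only on the current `jiji_data.csv`, never on which run produced which row.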
Examples
Basic Resume
After an interrupted Jiji scrape, rerun `main.py`, select the resume option, and choose Jiji; only the pending URLs are scraped.
Resume Both Sources
Select both sources at the resume prompt; a separate queue is built and summarized for each.
Resume with Custom Paths
Select “No” at the default-paths prompt and supply your own URL and data CSV paths, e.g. a data CSV at `outputs/data/jiji_data.csv`.
Resume mode only affects which URLs are queued for scraping. Data is always written to the spider’s configured output CSV path.
Best Practices
- Always use resume mode when re-running listing scrapes to avoid duplicate work
- Review the summary table before confirming to verify pending count
- Don’t manually edit queue files - they’re regenerated on each resume
- Keep URL and data CSVs in sync - don’t delete the data CSV between runs
- Use default paths unless you have a specific reason to customize
Troubleshooting
Queue shows 0 pending but you know listings are missing:
- Check that URLs in the data CSV exactly match URLs in the URL CSV
- Verify the `url` column exists in both files
- Ensure no extra whitespace in URL values
Leftover `*_resume_queue.csv` files:
- Normal if the scrape was interrupted (Ctrl+C)
- Safe to delete manually; they are regenerated on the next resume
"Already Scraped" count looks wrong:
- The data CSV may contain URLs not in the URL CSV (manually added listings)
- Only URLs present in both files affect the “Pending” count
- “URL Pool” is the source of truth for total collection size
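Because URLs are compared by exact string equality, a single trailing space makes two identical-looking URLs miss each other. When diagnosing a suspicious pending count, a quick normalization pass before comparing can confirm whether whitespace is the culprit (the `normalize` helper is illustrative, not part of the project):

```python
def normalize(url: str) -> str:
    """Strip surrounding whitespace so visually identical URLs compare equal."""
    return url.strip()
```

For example, `" https://jiji.com.gh/ad/1 "` and `"https://jiji.com.gh/ad/1"` are unequal as raw strings but equal after `normalize`.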