Resume mode compares your URL collection CSVs against your scraped data CSVs to identify which listings haven’t been scraped yet. It creates temporary queue files containing only the missing URLs, so you can continue scraping without duplicating work.
What Resume Mode Does
Resume mode performs a set-based diff between two CSV files:
- URL CSV (outputs/urls/jiji_urls.csv or meqasa_urls.csv): All discovered listing URLs
- Data CSV (outputs/data/jiji_data.csv or meqasa_data.csv): Already scraped listings
It generates a resume queue CSV containing only URLs that:
- Exist in the URL CSV
- Do not exist in the data CSV
The comparison uses the url column present in both files.
Resume queue files are temporary and are automatically deleted after the scraping run completes.
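At its core, the diff is a set difference on the url column. A minimal sketch with illustrative URLs (the example.com addresses below are placeholders, not real listings):

```python
# Conceptual sketch of the resume-mode diff: a set difference on the "url" column.
url_csv_urls = {
    "https://example.com/listing-1",
    "https://example.com/listing-2",
    "https://example.com/listing-3",
}
data_csv_urls = {
    "https://example.com/listing-1",
    "https://example.com/listing-2",
}

# URLs discovered but not yet scraped
pending_urls = url_csv_urls - data_csv_urls
print(sorted(pending_urls))  # ['https://example.com/listing-3']
```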
How It Works
Read all URLs from URL CSV
Loads all URLs from the source URL CSV file(s).
Read already-scraped URLs from data CSV
Loads the url column from the data CSV to identify which listings have already been scraped.
Calculate diff
Identifies URLs in the URL CSV that are not in the data CSV:
pending_urls = url_csv_urls - data_csv_urls
Generate resume queue CSV
Writes pending URLs to a temporary queue file:
outputs/urls/jiji_resume_queue.csv
outputs/urls/meqasa_resume_queue.csv
Run listing spider on queue
The listing spider reads from the resume queue CSV instead of the full URL CSV.
Clean up queue file
After scraping completes (or fails), the resume queue CSV is automatically deleted.
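Steps 4–6 describe a write/run/clean-up lifecycle. A minimal sketch of that pattern, using a temporary directory and a placeholder in place of the real spider invocation:

```python
import tempfile
from pathlib import Path

# Hypothetical queue path (the real defaults live under outputs/urls/).
queue_csv = Path(tempfile.mkdtemp()) / "jiji_resume_queue.csv"

def run_listing_spider(queue: Path) -> None:
    # Placeholder for the real spider invocation.
    print(f"scraping URLs from {queue}")

# Step 4: write the temporary queue file
queue_csv.write_text("url,page,fetch_date\n", encoding="utf-8")
try:
    # Step 5: run the listing spider against the queue
    run_listing_spider(queue_csv)
finally:
    # Step 6: the queue is temporary and removed even if scraping fails
    queue_csv.unlink(missing_ok=True)

print(queue_csv.exists())  # False
```

The try/finally guarantees the queue is deleted whether the run completes or raises.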
When to Use Resume Mode
Resume mode is ideal for:
- Interrupted scrapes: Your scraper crashed or was stopped mid-run
- Incremental updates: You collected new URLs and only want to scrape the new ones
- Failed requests: Some URLs failed to scrape and you want to retry only those
- Large datasets: You’re scraping thousands of URLs and want to pause/resume
Resume mode is not necessary if you’re starting fresh; just use the regular “Scrape listing details” action.
Using Resume Mode
Via Interactive Mode
Run python main.py and select option 3:
Choose action
1. Collect listing URLs
2. Scrape listing details
3. Resume listing scrape (missing URLs only)
4. Exit
Enter choice [1]: 3
Select sites
Choose Jiji, Meqasa, or both:
Select source
1. Jiji only
2. Meqasa only
3. Both Jiji and Meqasa
Enter choice [3]:
Choose CSV paths
Use default paths or specify custom ones:
Use default URL/data CSV paths for resume mode? [Y/n]:
If you choose Yes (default):
- URL CSV:
outputs/urls/jiji_urls.csv / outputs/urls/meqasa_urls.csv
- Data CSV:
outputs/data/jiji_data.csv / outputs/data/meqasa_data.csv
If you choose No:
- You’ll be prompted for custom paths for each site
View resume queue summary
A table shows the queue analysis:
┌─ Resume Queue Summary ────────────────────────────┐
│ Source │ URL Pool │ Already Scraped │ Pending │
├─────────┼──────────┼─────────────────┼───────────┤
│ Jiji │ 1,245 │ 1,180 │ 65 │
│ Meqasa │ 892 │ 892 │ 0 │
└───────────────────────────────────────────────────┘
- URL Pool: Total unique URLs in the URL CSV
- Already Scraped: URLs found in the data CSV
- Pending: URLs that will be queued for scraping
Spider runs on pending URLs only
Only sites with Pending > 0 are scraped. In the example above, only Jiji would run (65 URLs).
Example Terminal Session
Choose action
1. Collect listing URLs
2. Scrape listing details
3. Resume listing scrape (missing URLs only)
4. Exit
Enter choice [1]: 3
Select source
1. Jiji only
2. Meqasa only
3. Both Jiji and Meqasa
Enter choice [3]: 3
Use default URL/data CSV paths for resume mode? [Y/n]: y
┌─ Resume Queue Summary ────────────────────────────┐
│ Source │ URL Pool │ Already Scraped │ Pending │
├─────────┼──────────┼─────────────────┼───────────┤
│ Jiji │ 1,245 │ 1,180 │ 65 │
│ Meqasa │ 892 │ 892 │ 0 │
└───────────────────────────────────────────────────┘
[Jiji listing spider runs on 65 URLs...]
[Jiji cleaning runs...]
Done.
Resume Queue Format
Resume queue CSVs have the same format as the original URL CSVs:
outputs/urls/jiji_resume_queue.csv
url,page,fetch_date
https://jiji.com.gh/accra-metropolitan/houses-apartments-for-rent/...,5,2026-03-02
https://jiji.com.gh/accra-metropolitan/houses-apartments-for-rent/...,5,2026-03-02
https://jiji.com.gh/accra-metropolitan/houses-apartments-for-rent/...,6,2026-03-02
The queue preserves all columns from the original URL CSV, not just the url column.
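As a sketch of that guarantee, reading a queue file with csv.DictReader yields every original column, not just url (the row below mirrors the sample format above):

```python
import csv
import io

# Illustrative queue contents matching the format shown above.
queue_text = (
    "url,page,fetch_date\n"
    "https://jiji.com.gh/accra-metropolitan/houses-apartments-for-rent/abc,5,2026-03-02\n"
)

reader = csv.DictReader(io.StringIO(queue_text))
rows = list(reader)
# All original columns survive, not just "url".
print(reader.fieldnames)  # ['url', 'page', 'fetch_date']
print(rows[0]["page"], rows[0]["fetch_date"])  # 5 2026-03-02
```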
Implementation Details
From main.py:152-206, the build_resume_queue() function:
- Reads all rows from the URL CSV
- Loads the url column from the data CSV into a set
- Filters out URLs that are already in the data CSV
- Writes remaining rows to the resume queue CSV
- Returns counts: (source_total, already_scraped, pending_total)
def build_resume_queue(
    urls_csv: Path, data_csv: Path, queue_csv: Path
) -> tuple[int, int, int]:
    # Read all URL rows (keeping the header for the queue file)
    with open(urls_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    # Get already-scraped URLs
    scraped_urls = read_url_set(data_csv, "url")

    # Filter out scraped URLs, deduplicating within the URL CSV
    source_seen: set[str] = set()
    pending_rows: list[dict[str, str]] = []
    for row in rows:
        url = (row.get("url") or "").strip()
        if not url or url in source_seen:
            continue
        source_seen.add(url)
        if url in scraped_urls:
            continue
        pending_rows.append(row)

    # Write queue CSV (extra DictWriter options elided)
    with open(queue_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(pending_rows)

    return (len(source_seen), len(source_seen) - len(pending_rows), len(pending_rows))
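A self-contained exercise of the same logic. Since the real function lives in main.py, the block below uses a simplified inline stand-in (the paths and sample URLs are illustrative):

```python
import csv
import tempfile
from pathlib import Path

def build_resume_queue(urls_csv: Path, data_csv: Path, queue_csv: Path) -> tuple[int, int, int]:
    # Simplified stand-in mirroring the logic quoted above.
    with open(urls_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)
    scraped = set()
    if data_csv.exists():
        with open(data_csv, newline="", encoding="utf-8") as f:
            scraped = {(r.get("url") or "").strip() for r in csv.DictReader(f)}
    seen: set[str] = set()
    pending: list[dict[str, str]] = []
    for row in rows:
        url = (row.get("url") or "").strip()
        if not url or url in seen:
            continue
        seen.add(url)
        if url not in scraped:
            pending.append(row)
    with open(queue_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(pending)
    return (len(seen), len(seen) - len(pending), len(pending))

tmp = Path(tempfile.mkdtemp())
# 3 URL rows, one of which is a duplicate; 1 URL already scraped.
(tmp / "urls.csv").write_text("url,page\nhttps://a/1,1\nhttps://a/2,1\nhttps://a/1,2\n", encoding="utf-8")
(tmp / "data.csv").write_text("url\nhttps://a/1\n", encoding="utf-8")
counts = build_resume_queue(tmp / "urls.csv", tmp / "data.csv", tmp / "queue.csv")
print(counts)  # (2, 1, 1): 2 unique URLs, 1 already scraped, 1 pending
```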
Edge Cases
No pending URLs
If all URLs have already been scraped:
┌─ Resume Queue Summary ────────────────────────────┐
│ Source │ URL Pool │ Already Scraped │ Pending │
├─────────┼──────────┼─────────────────┼───────────┤
│ Jiji │ 1,245 │ 1,245 │ 0 │
└───────────────────────────────────────────────────┘
No spiders queued.
Done.
No resume queue file is created, and no spider runs.
URL CSV doesn’t exist
If the URL CSV file is missing:
┌─ Resume Queue Summary ────────────────────────────┐
│ Source │ URL Pool │ Already Scraped │ Pending │
├─────────┼──────────┼─────────────────┼───────────┤
│ Jiji │ 0 │ 0 │ 0 │
└───────────────────────────────────────────────────┘
Run “Collect listing URLs” first to generate the URL CSV.
Data CSV doesn’t exist
If the data CSV is missing, resume mode treats all URLs as pending:
┌─ Resume Queue Summary ────────────────────────────┐
│ Source │ URL Pool │ Already Scraped │ Pending │
├─────────┼──────────┼─────────────────┼───────────┤
│ Jiji │ 1,245 │ 0 │ 1,245 │
└───────────────────────────────────────────────────┘
This is equivalent to scraping all URLs fresh.
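One plausible shape for the read_url_set helper that produces this behavior (an assumption — the real helper lives in main.py): a missing data CSV simply yields an empty set, so no URL counts as already scraped.

```python
import csv
from pathlib import Path

def read_url_set(csv_path: Path, column: str) -> set[str]:
    # Hypothetical sketch of the helper used by build_resume_queue:
    # a missing CSV simply yields an empty set.
    if not csv_path.exists():
        return set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {
            (row.get(column) or "").strip()
            for row in csv.DictReader(f)
            if (row.get(column) or "").strip()
        }

# Missing data CSV: nothing counts as scraped, so every URL is pending.
print(read_url_set(Path("no_such_data.csv"), "url"))  # set()
```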
Example Workflow
Scenario: You’re scraping 5,000 Jiji listings but the script crashes after 3,000.
Initial scrape (crashes mid-run)
python main.py
# Choose: Scrape listing details
# Choose: Jiji only
# [Scrapes 3,000 URLs, then crashes]
Result: outputs/data/jiji_data.csv contains 3,000 rows
Resume the scrape
python main.py
# Choose: Resume listing scrape (missing URLs only)
# Choose: Jiji only
# Use default paths: Yes
Output:
┌─ Resume Queue Summary ──────────────────────────┐
│ Source │ URL Pool │ Already Scraped │ Pending │
├────────┼──────────┼─────────────────┼──────────┤
│ Jiji │ 5,000 │ 3,000 │ 2,000 │
└─────────────────────────────────────────────────┘
The spider scrapes the remaining 2,000 URLs.
Verify completion
python main.py
# Choose: Resume listing scrape (missing URLs only)
# Choose: Jiji only
Output:
┌─ Resume Queue Summary ──────────────────────────┐
│ Source │ URL Pool │ Already Scraped │ Pending │
├────────┼──────────┼─────────────────┼──────────┤
│ Jiji │ 5,000 │ 5,000 │ 0 │
└─────────────────────────────────────────────────┘
No spiders queued.
Done.
URL Deduplication
Resume mode handles duplicate URLs in the source URL CSV:
- Duplicates within the URL CSV are removed (only the first occurrence is kept)
- The comparison against the data CSV uses the deduplicated set
This ensures accurate counts even if your URL collection process created duplicates.
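A minimal sketch of that first-occurrence deduplication over illustrative rows:

```python
# Illustrative rows with a duplicate URL; only the first occurrence survives.
rows = [
    {"url": "https://a/1", "page": "1"},
    {"url": "https://a/2", "page": "1"},
    {"url": "https://a/1", "page": "3"},  # duplicate: dropped
]

seen: set[str] = set()
deduped = []
for row in rows:
    url = row["url"].strip()
    if url in seen:
        continue
    seen.add(url)
    deduped.append(row)

print([r["url"] for r in deduped])  # ['https://a/1', 'https://a/2']
print(deduped[0]["page"])  # '1' — the first occurrence is kept
```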