ScrapeAccraProperties uses a two-phase workflow to efficiently collect and extract rental property data from Jiji Ghana and Meqasa.

Overview

The scraper separates URL discovery from data extraction to enable:
  • Incremental scraping with resume capability
  • URL deduplication across multiple runs
  • Progress tracking and failure recovery
  • Efficient resource usage
1. Phase 1: URL Collection. The spider crawls search result pages and extracts listing URLs, saving them to CSV files in outputs/urls/.
2. Phase 2: Listing Extraction. The spider reads the URL CSV files and visits each listing to extract structured property data, saving to outputs/data/.

Phase 1: URL Collection

How It Works

URL spiders (jiji_urls and meqasa_urls) navigate through paginated search results to collect all available listing URLs.

Jiji URL Collection:
scrapy crawl jiji_urls -a start_page=1 -a max_pages=5
scrapy crawl jiji_urls -a start_page=1 -a total_listing=200

Meqasa URL Collection:
scrapy crawl meqasa_urls -a start_page=1 -a total_pages=5

Auto-Detection Mode

When neither max_pages nor total_pages is specified, spiders automatically detect the total number of results:
# From jiji_urls.py:64-84
if response.meta.get("is_detector") and not self._detected:
    self._detected = True
    if count_text := response.css(
        'div.b-breadcrumb-link--current-url span[property="name"]::text'
    ).get():
        if match := re.search(r"([\d,]+)\s+results", count_text):
            total = int(match.group(1).replace(",", ""))
            self.max_pages = math.ceil(total / LISTINGS_PER_PAGE)
            self.logger.info(
                f"🔍 Jiji: {total:,} results (~{self.max_pages} pages)"
            )
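As a worked example of the arithmetic above (the value of LISTINGS_PER_PAGE is assumed to be 50 here purely for illustration; the real constant is defined in the spider module):

```python
import math
import re

LISTINGS_PER_PAGE = 50  # assumed for illustration; the spider defines the real constant

count_text = "1,234 results"  # sample breadcrumb text
match = re.search(r"([\d,]+)\s+results", count_text)
total = int(match.group(1).replace(",", ""))
max_pages = math.ceil(total / LISTINGS_PER_PAGE)
print(total, max_pages)  # 1234 25
```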

Output Files

| Spider | Output Path | Schema |
| --- | --- | --- |
| jiji_urls | outputs/urls/jiji_urls.csv | url, page, fetch_date |
| meqasa_urls | outputs/urls/meqasa_urls.csv | url, page, fetch_date |
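Because the URL CSVs follow the simple url, page, fetch_date schema, they can be inspected with Python's standard csv module. A quick sketch, using inline sample rows in place of the real file:

```python
import csv
import io

# Sample rows shaped like outputs/urls/jiji_urls.csv
sample = io.StringIO(
    "url,page,fetch_date\n"
    "https://jiji.com.gh/ad/1,1,2024-05-01\n"
    "https://jiji.com.gh/ad/2,1,2024-05-01\n"
)
rows = list(csv.DictReader(sample))
print(len(rows), rows[0]["url"])
```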

Phase 2: Listing Extraction

How It Works

Listing spiders read URLs from the Phase 1 CSV files and visit each listing page to extract detailed property information.

Jiji Listing Scrape:
scrapy crawl jiji_listings -a csv_path=outputs/urls/jiji_urls.csv

Meqasa Listing Scrape:
scrapy crawl meqasa_listings -a csv_path=outputs/urls/meqasa_urls.csv
The csv_path parameter is relative to the project root. You can also provide an absolute path.

Output Files

| Spider | Output Path | Description |
| --- | --- | --- |
| jiji_listings | outputs/data/jiji_data.csv | Raw Jiji listing data |
| meqasa_listings | outputs/data/meqasa_data.csv | Raw Meqasa listing data |
| Post-processing | outputs/data/raw.csv | Cleaned Jiji data (auto-generated) |

Automatic Cleaning

After a Jiji listing scrape completes, the clean.py script runs automatically to normalize and clean the data:
# From main.py:364-365
if jiji_job_scheduled:
    run_jiji_cleaning()
The cleaned dataset is written to outputs/data/raw.csv with standardized fields for downstream analysis.
Cleaning is currently Jiji-only. Meqasa data is not automatically cleaned.

Resume Mode

Resume mode enables you to scrape only missing listings by comparing URL CSVs against data CSVs.

How Resume Works

1. Compare URL sets: resume mode reads URLs from both the URL CSV and the data CSV.
2. Identify missing URLs: it calculates which URLs from the URL CSV haven't been scraped yet.
3. Create temporary queue: missing URLs are written to a temporary resume queue CSV file.
4. Run listing spider: the listing spider reads the queue file instead of the full URL CSV.
5. Cleanup: the temporary queue file is automatically deleted after the run completes.
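At its core, steps 1–3 amount to a set difference over the url columns of the two CSVs. Schematically, with in-memory stand-ins for the real files under outputs/:

```python
import csv
import io

# Stand-ins for the URL CSV (Phase 1) and the data CSV (Phase 2)
urls_csv = io.StringIO(
    "url,page,fetch_date\n"
    "https://example.com/a,1,2024-05-01\n"
    "https://example.com/b,1,2024-05-01\n"
)
data_csv = io.StringIO("url,price\nhttps://example.com/a,500\n")

scraped = {row["url"] for row in csv.DictReader(data_csv)}
pending = [row for row in csv.DictReader(urls_csv) if row["url"] not in scraped]
print([row["url"] for row in pending])  # ['https://example.com/b']
```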

Resume Queue Generation

# From main.py:152-206 (abridged)
def build_resume_queue(
    urls_csv: Path, data_csv: Path, queue_csv: Path
) -> tuple[int, int, int]:
    # Load URLs that have already been scraped
    scraped_urls = read_url_set(data_csv, "url")

    # Deduplicate the URL pool and keep only unscraped entries
    # (rows holds the URL CSV contents, loaded earlier in the full function)
    source_seen: set[str] = set()
    pending_rows: list[dict[str, str]] = []
    for row in rows:
        url = (row.get("url") or "").strip()
        if not url or url in source_seen:
            continue
        source_seen.add(url)
        if url in scraped_urls:
            continue
        pending_rows.append(row)

    # Write a queue file containing only the pending URLs
    with open(queue_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, ...)
        writer.writeheader()
        writer.writerows(pending_rows)

Resume Queue Files

| Site | Queue File | Lifecycle |
| --- | --- | --- |
| Jiji | outputs/urls/jiji_resume_queue.csv | Temporary (auto-deleted) |
| Meqasa | outputs/urls/meqasa_resume_queue.csv | Temporary (auto-deleted) |

Running Resume Mode

Using the interactive CLI:
python main.py
# Select: "Resume listing scrape (missing URLs only)"
The CLI displays a summary before starting:
┌───────────── Resume Queue Summary ─────────────┐
│ Source  │ URL Pool │ Already Scraped │ Pending │
├─────────┼──────────┼─────────────────┼─────────┤
│ Jiji    │    1,250 │           1,180 │      70 │
│ Meqasa  │      876 │             876 │       0 │
└─────────┴──────────┴─────────────────┴─────────┘
Resume mode is ideal for:
  • Recovering from interrupted scrapes
  • Adding newly discovered listings
  • Re-scraping after fixing extraction bugs

Workflow Orchestration

The main.py file provides an interactive CLI that orchestrates the entire workflow:
# From main.py:382-402
def main() -> None:
    print_header()
    action = ask_choice(
        "Choose action",
        [
            ("urls", "Collect listing URLs"),
            ("listings", "Scrape listing details"),
            ("resume", "Resume listing scrape (missing URLs only)"),
            ("exit", "Exit"),
        ],
        default="urls",
    )

    if action == "urls":
        run_url_collection()
    elif action == "listings":
        run_listing_scrape(resume=False)
    elif action == "resume":
        run_listing_scrape(resume=True)

Site Selection

For each action, you can choose which site(s) to scrape:
  • Jiji only — Scrape only Jiji Ghana listings
  • Meqasa only — Scrape only Meqasa listings
  • Both Jiji and Meqasa — Run spiders for both sites in parallel
# From main.py:125-137
def choose_sites() -> list[SiteConfig]:
    choice = ask_choice(
        "Select source",
        [
            ("jiji", "Jiji only"),
            ("meqasa", "Meqasa only"),
            ("both", "Both Jiji and Meqasa"),
        ],
        default="both",
    )
    if choice == "both":
        return [SITE_CONFIGS["jiji"], SITE_CONFIGS["meqasa"]]
    return [SITE_CONFIGS[choice]]
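SITE_CONFIGS maps each site key to a SiteConfig object. A minimal sketch of what that mapping could look like (the field names here are assumptions; the real SiteConfig in main.py may differ):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SiteConfig:
    # Field names are illustrative assumptions, not the project's definition
    name: str
    url_spider: str
    listing_spider: str
    urls_csv: str

SITE_CONFIGS = {
    "jiji": SiteConfig("jiji", "jiji_urls", "jiji_listings", "outputs/urls/jiji_urls.csv"),
    "meqasa": SiteConfig("meqasa", "meqasa_urls", "meqasa_listings", "outputs/urls/meqasa_urls.csv"),
}
print(SITE_CONFIGS["jiji"].listing_spider)  # jiji_listings
```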

Output Directory Structure

All outputs are organized under the outputs/ directory:
outputs/
├── urls/
│   ├── jiji_urls.csv              # Phase 1: Jiji URLs
│   ├── meqasa_urls.csv            # Phase 1: Meqasa URLs
│   ├── jiji_resume_queue.csv      # Temporary (when using resume mode)
│   └── meqasa_resume_queue.csv    # Temporary (when using resume mode)
└── data/
    ├── jiji_data.csv              # Phase 2: Raw Jiji listings
    ├── meqasa_data.csv            # Phase 2: Raw Meqasa listings
    └── raw.csv                    # Cleaned Jiji data (post-processing)
Directories are created automatically when spiders first run. No manual setup is required.

Next Steps

Spider Architecture

Learn about the PropertyBaseSpider class and how spiders work

Data Schema

Explore the CSV schemas and field definitions for each site
