
Quick Start

This guide will walk you through scraping your first rental property listings from Jiji Ghana or Meqasa using the interactive CLI.
Make sure you’ve completed the installation before proceeding.

Two-Phase Workflow

ScrapeAccraProperties uses a two-phase approach:
  1. Collect listing URLs from search/result pages
  2. Scrape listing details by visiting each URL
This separation allows you to validate URLs before scraping and enables efficient resume operations.
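The two phases can be pictured as two small functions with a URL pool handed from one to the other. This is an illustrative sketch, not the project's actual API; the function names and the `fetch` callback are assumptions for the example:

```python
def collect_urls(result_pages):
    """Phase 1: flatten the listing URLs found on each search/result page."""
    return [url for page in result_pages for url in page]

def scrape_details(url_pool, fetch):
    """Phase 2: visit each collected URL and build one record per listing."""
    return [fetch(url) for url in url_pool]
```

Because the pool is persisted between phases (as a CSV in this project), you can inspect or trim it before phase 2, and re-runs only need to diff the pool against what has already been scraped.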

Your First Scrape

Step 1: Launch the interactive CLI

Start the interactive runner:
python main.py
You’ll see the main menu:
╭─────────────────────── main.py ────────────────────────╮
│ Accra Property Scraper                                 │
│ - Interactive multi-spider runner                      │
│ - Listing resume mode queues only missing URLs         │
│ - CSV writes happen item-by-item during crawl          │
│ - Jiji listings are cleaned to outputs/data/raw.csv    │
╰────────────────────────────────────────────────────────╯

Choose action
  1. Collect listing URLs
  2. Scrape listing details
  3. Resume listing scrape (missing URLs only)
  4. Exit
Enter choice [1]:
Step 2: Collect listing URLs

Select option 1 to collect listing URLs, then choose your source:
Select source
  1. Jiji only
  2. Meqasa only
  3. Both Jiji and Meqasa
Enter choice [3]:
For this quickstart, choose 1 (Jiji only). Next, configure pagination:
Jiji start page [1]: 1

Jiji URL mode
  1. Auto detect total pages
  2. Fixed number of pages
  3. Convert expected listings to page count
Enter choice [1]: 2

Jiji max pages [5]: 2
This will collect listing URLs from the first 2 pages of Jiji rental results.
Start with a small number of pages (2-5) for your first run to test the workflow.
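URL mode 3 converts an expected listing count into a page count. A plausible sketch of that conversion, where the listings-per-page value is an assumption for illustration (the real spider derives it from the site):

```python
import math

def listings_to_pages(expected_listings, listings_per_page=20):
    """Round up so the last partial page of results is still visited."""
    return math.ceil(expected_listings / listings_per_page)
```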
The spider will start crawling and display progress:
2024-03-03 10:15:32 [scrapy.core.engine] INFO: Spider opened
2024-03-03 10:15:33 [scrapy.core.engine] INFO: Crawled 1 pages (at 1 pages/min)
...
URLs are saved to: outputs/urls/jiji_urls.csv
Step 3: Scrape listing details

Run the CLI again and select option 2 to scrape listing details:
python main.py
Select option 2 (Scrape listing details), then choose your source:
Select source
  1. Jiji only
  2. Meqasa only
  3. Both Jiji and Meqasa
Enter choice [3]: 1
Specify the URL CSV path:
Jiji URL CSV [outputs/urls/jiji_urls.csv]:
Press Enter to use the default path. The spider will:
  • Read URLs from outputs/urls/jiji_urls.csv
  • Visit each listing page
  • Extract structured data (title, location, price, bedrooms, amenities, etc.)
  • Write incrementally to outputs/data/jiji_data.csv
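Writing item-by-item means an interrupted crawl keeps everything scraped so far. A minimal standard-library sketch of that pattern (a simplified stand-in, not the project's actual pipeline code):

```python
import csv
import os

def append_items(path, items, fields):
    """Append each scraped item to the CSV as it arrives; write the header
    only when the file doesn't exist yet or is empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        if need_header:
            writer.writeheader()
        for item in items:
            writer.writerow(item)
            f.flush()  # each row hits disk right away, preserving progress on a crash
```

Appending with a header-once guard is also what makes resume mode cheap: later runs can keep adding rows to the same file.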
After scraping completes, Jiji listings are automatically cleaned:
Jiji cleaned CSV saved: outputs/data/raw.csv (143 rows)
Done.
The cleaning step (producing raw.csv) is currently Jiji-only.
Step 4: Explore your data

Check the outputs/ directory for your scraped data:
tree outputs/
outputs/
├── urls/
│   └── jiji_urls.csv
└── data/
    ├── jiji_data.csv
    └── raw.csv
View the data:
head -n 5 outputs/data/raw.csv
You’ll see rental listings with fields like:
  • url - Listing URL
  • title - Property title
  • location - Area/neighborhood
  • house_type - Apartment, house, etc.
  • bedrooms - Number of bedrooms
  • bathrooms - Number of bathrooms
  • price - Rental price
  • amenities - List of amenities
  • description - Full listing description
  • fetch_date - When the data was scraped
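If you'd rather explore the data in Python than with `head`, a standard-library loader is enough (this helper is illustrative, not part of the project):

```python
import csv

def load_listings(path):
    """Read a listings CSV into a list of dicts, one per listing."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For example, `load_listings("outputs/data/raw.csv")` gives you dicts keyed by the field names above, ready for filtering or handing to pandas.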

Resume Scraping

If your scrape is interrupted, you can resume without re-scraping existing listings:
Step 1: Select resume mode

Run the CLI and choose option 3, Resume listing scrape (missing URLs only):
python main.py
Step 2: Review the queue summary

The CLI will show which URLs are already scraped vs. pending:
Use default URL/data CSV paths for resume mode? [Y/n]: y

            Resume Queue Summary
┌────────┬──────────┬─────────────────┬─────────┐
│ Source │ URL Pool │ Already Scraped │ Pending │
├────────┼──────────┼─────────────────┼─────────┤
│ Jiji   │      156 │             143 │      13 │
└────────┴──────────┴─────────────────┴─────────┘
Only the 13 pending URLs will be scraped.
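Conceptually, the pending queue is just the URL pool minus the URLs already present in the data CSV. A hedged sketch of that set difference (not the project's actual implementation):

```python
def pending_urls(url_pool, scraped_urls):
    """Return URLs that still need scraping, preserving pool order."""
    seen = set(scraped_urls)
    return [u for u in url_pool if u not in seen]
```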
Step 3: Scrape completes

The spider queues only missing URLs and completes the scrape. Temporary queue files are automatically cleaned up.

Running Spiders Directly

You can bypass the interactive CLI and run spiders directly with scrapy:
scrapy crawl jiji_urls -a start_page=1
When running spiders directly, you’ll need to manually run clean.py for Jiji data cleaning. The interactive CLI handles this automatically.

Next Steps

Understand the Workflow

Learn about the two-phase workflow, resume mode, and data cleaning pipeline

Configure Settings

Customize spider behavior, pagination, concurrency, and Playwright options

Output Schema

Understand the structure of URL CSVs and listing data CSVs

Troubleshooting

Fix common issues like browser failures, empty fields, and resume errors
