The listing scrape phase visits individual property URLs and extracts structured data. This is the second step in the scraping workflow.

Overview

Listing spiders read URLs from CSV files (typically generated by URL collection spiders) and extract detailed property information from each listing page. Data is written incrementally to CSV files in outputs/data/.

CSV Path Requirements

Both spiders require a csv_path argument pointing to a CSV file containing a url column.
csv_path (string, required)
Path to a CSV file containing listing URLs. Can be absolute or relative to the project root.
Expected format:
  • Must have a url column
  • Optional fetch_date column (Jiji only)
Default paths:
  • Jiji: outputs/urls/jiji_urls.csv
  • Meqasa: outputs/urls/meqasa_urls.csv
If the CSV file doesn’t exist or contains no URLs, the spider will log an error and exit without scraping.
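The validation described above can be sketched as a small helper. This is a hypothetical illustration of how a spider might load and validate the CSV, not the project's actual implementation; the function name `load_listing_urls` is an assumption.

```python
import csv
from pathlib import Path

def load_listing_urls(csv_path):
    """Read listing URLs from a CSV file, requiring a 'url' column.

    Hypothetical sketch: the real spiders may handle errors differently
    (e.g. logging and exiting rather than raising).
    """
    path = Path(csv_path)
    if not path.exists():
        raise FileNotFoundError(f"CSV file not found: {csv_path}")
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # The 'url' column is required; fetch_date is optional (Jiji only)
        if reader.fieldnames is None or "url" not in reader.fieldnames:
            raise ValueError("CSV must contain a 'url' column")
        urls = [row["url"] for row in reader if row.get("url")]
    if not urls:
        raise ValueError("CSV contains no URLs")
    return urls
```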

Jiji Listing Spider

Extracts detailed property information from Jiji listing pages.

Output

File: outputs/data/jiji_data.csv
Fields:
  • url - Listing URL
  • fetch_date - Date when listing was scraped
  • title - Property title
  • location - Property location/region
  • house_type - Type of property (apartment, house, etc.)
  • bedrooms - Number of bedrooms
  • bathrooms - Number of bathrooms
  • price - Rental price
  • properties - JSON object with additional attributes
  • amenities - JSON array of available amenities
  • description - Property description text
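Because the properties and amenities fields are serialized JSON, consumers of jiji_data.csv need to decode them before use. A minimal sketch, assuming the fields hold a JSON object and a JSON array respectively (the example values are hypothetical, not real scraped data):

```python
import json

# A hypothetical row as it might appear in jiji_data.csv
row = {
    "title": "2 Bedroom Apartment",
    "properties": '{"Parking Space": "Yes", "Condition": "Newly Built"}',
    "amenities": '["Wi-Fi", "Air Conditioning"]',
}

# Decode the JSON-encoded columns into native Python structures
properties = json.loads(row["properties"])  # dict of extra attributes
amenities = json.loads(row["amenities"])    # list of amenity names
```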

Examples

Scrape from default URL CSV:
scrapy crawl jiji_listings
Scrape from custom CSV:
scrapy crawl jiji_listings -a csv_path=outputs/urls/jiji_urls.csv
Scrape from absolute path:
scrapy crawl jiji_listings -a csv_path=/path/to/custom_urls.csv

Automatic Cleaning

After Jiji listing scrapes complete (when run through main.py), the project automatically runs a cleaning script that processes jiji_data.csv and outputs a cleaned version.
Cleaned output: outputs/data/raw.csv
This cleaning step:
  • Standardizes price formats
  • Normalizes location names
  • Parses structured fields from JSON
  • Removes duplicates and invalid entries
Automatic cleaning only runs when using the interactive main.py runner, not when running spiders directly via scrapy crawl.
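The cleaning steps listed above could look roughly like the following pandas sketch. This is a hypothetical illustration of the kind of transformations involved (price standardization, location normalization, deduplication), not the project's actual cleaning script, and the assumed price format is invented:

```python
import pandas as pd

def clean_listings(df):
    """Hypothetical cleaning pass over scraped listing rows."""
    df = df.copy()
    # Standardize price strings (e.g. "GHS 1,500") to numeric values
    df["price"] = (
        df["price"].astype(str)
        .str.replace(r"[^\d.]", "", regex=True)
        .replace("", pd.NA)
        .astype("Float64")
    )
    # Normalize location names to a consistent casing
    df["location"] = df["location"].str.strip().str.title()
    # Drop duplicate listings and rows with unparseable prices
    df = df.drop_duplicates(subset="url")
    df = df.dropna(subset=["price"])
    return df
```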

Meqasa Listing Spider

Extracts property details from Meqasa listing pages.

Output

File: outputs/data/meqasa_data.csv
Fields:
  • url - Listing URL
  • Title - Property title
  • Price - Rental price
  • Rate - Price period (per month, per year, etc.)
  • Description - Property description
  • fetch_date - Date when listing was scraped
  • Categories - Property categories
  • Lease options - Available lease options
  • Bedrooms - Number of bedrooms
  • Bathrooms - Number of bathrooms
  • Garage - Garage/parking information
  • Furnished - Furnished status
  • Amenities - Available amenities
  • Address - Property address
  • Reference - Listing reference ID
  • details - JSON object with all extracted table data
Meqasa extracts data from dynamic table structures. The details field contains all key-value pairs found in listing detail tables, allowing for flexible schema evolution.
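One way to take advantage of the flexible details field is to flatten it into columns after the fact, letting new table keys appear as new columns. A hedged sketch with invented example rows (the key names shown are illustrative):

```python
import json
import pandas as pd

# Hypothetical rows as they might appear in meqasa_data.csv
rows = [
    {"url": "https://example.com/listing-1",
     "details": '{"Bedrooms": "3", "Furnished": "Yes"}'},
    {"url": "https://example.com/listing-2",
     "details": '{"Bedrooms": "2", "Garage": "1"}'},
]
df = pd.DataFrame(rows)

# Expand the JSON object into one column per key; missing keys become NaN,
# so the schema grows as new detail-table keys are encountered
details = df["details"].apply(json.loads).apply(pd.Series)
flat = pd.concat([df.drop(columns="details"), details], axis=1)
```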

Examples

Scrape from default URL CSV:
scrapy crawl meqasa_listings
Scrape from custom CSV:
scrapy crawl meqasa_listings -a csv_path=outputs/urls/meqasa_urls.csv
Scrape from absolute path:
scrapy crawl meqasa_listings -a csv_path=/path/to/custom_urls.csv

Progress Tracking

Both spiders display real-time progress with:
  • Current item number
  • Total items to scrape
  • Failed requests count
  • Estimated completion percentage
Progress is updated after each successfully scraped listing.
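A progress counter matching the behavior above can be sketched as a small class. This is a hypothetical illustration, not the spiders' actual tracking code; the class and method names are assumptions:

```python
class ProgressTracker:
    """Track scraped/failed counts against a known total of listing URLs."""

    def __init__(self, total):
        self.total = total
        self.done = 0
        self.failed = 0

    def record_success(self):
        self.done += 1

    def record_failure(self):
        self.failed += 1

    def summary(self):
        # Completion counts both successes and failures as processed items
        processed = self.done + self.failed
        pct = 100 * processed / self.total if self.total else 0
        return f"{self.done}/{self.total} scraped, {self.failed} failed ({pct:.0f}%)"
```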
