Overview
Listing spiders read URLs from CSV files (typically generated by URL collection spiders) and extract detailed property information from each listing page. Data is written incrementally to CSV files in outputs/data/.
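Incremental writing can be pictured as appending one row per scraped listing, so that an interrupted run keeps everything collected so far. The following is only a minimal sketch using the Python standard library; the helper name append_row is hypothetical and the project's actual writer may work differently.

```python
import csv
from pathlib import Path


def append_row(csv_path, fieldnames, row):
    """Append one row, writing the header only when the file is new or empty.

    Hypothetical helper for illustration; not taken from the project code.
    """
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    needs_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops keys that are not in fieldnames;
        # missing keys are written as empty strings (the DictWriter default).
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        if needs_header:
            writer.writeheader()
        writer.writerow(row)
```

Opening the file in append mode for each item trades a little speed for durability: every completed listing is on disk before the next request is processed.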
CSV Path Requirements
Both spiders require a csv_path argument pointing to a CSV file containing a url column.
Path to the CSV file containing listing URLs. Can be absolute or relative to the project root.
Expected format:
- Must have a url column
- Optional fetch_date column (Jiji only)
Default paths:
- Jiji: outputs/urls/jiji_urls.csv
- Meqasa: outputs/urls/meqasa_urls.csv
If the CSV file doesn’t exist or contains no URLs, the spider will log an error and exit without scraping.
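The checks described above (file exists, has a url column, contains at least one URL) could look roughly like the sketch below. The function name load_urls and the log messages are assumptions made for illustration; the spiders' real loading code is not shown in this document.

```python
import csv
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_urls(csv_path):
    """Return the non-empty values of the url column, or [] on any problem.

    Hypothetical helper: logs an error instead of raising, so a caller can
    simply stop when the returned list is empty.
    """
    path = Path(csv_path)
    if not path.is_file():
        logger.error("CSV file not found: %s", path)
        return []
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # fieldnames is None for a completely empty file
        if "url" not in (reader.fieldnames or []):
            logger.error("CSV file has no 'url' column: %s", path)
            return []
        urls = [
            row["url"].strip()
            for row in reader
            if (row.get("url") or "").strip()
        ]
    if not urls:
        logger.error("No URLs found in %s", path)
    return urls
```

A spider following this pattern would call load_urls(csv_path) at start-up and schedule no requests when the result is empty.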
Jiji Listing Spider
Extracts detailed property information from Jiji listing pages.
Output
File: outputs/data/jiji_data.csv
Fields:
- url - Listing URL
- fetch_date - Date when listing was scraped
- title - Property title
- location - Property location/region
- house_type - Type of property (apartment, house, etc.)
- bedrooms - Number of bedrooms
- bathrooms - Number of bathrooms
- price - Rental price
- properties - JSON object with additional attributes
- amenities - JSON array of available amenities
- description - Property description text
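Because properties and amenities are stored as JSON text inside the CSV, a consumer has to decode them after reading a row. The sketch below shows one way to do that; decode_row is a hypothetical helper and the sample values are invented placeholders, not real scraped data.

```python
import json


def decode_row(row):
    """Return a copy of a jiji_data.csv row with its JSON columns decoded."""
    decoded = dict(row)
    # Empty or missing cells fall back to an empty object / array.
    decoded["properties"] = json.loads(row.get("properties") or "{}")
    decoded["amenities"] = json.loads(row.get("amenities") or "[]")
    return decoded


# Invented sample row, shaped like the fields listed above.
sample = {
    "url": "https://example.com/listing/1",  # placeholder URL
    "properties": '{"Condition": "Newly built"}',
    "amenities": '["Parking", "Wi-Fi"]',
}
row = decode_row(sample)
```

After decoding, row["properties"] is a dict and row["amenities"] is a list, so they can be filtered or flattened like any other Python data.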
Examples
Scrape from default URL CSV:
Automatic Cleaning
After Jiji listing scrapes complete (when run through main.py), the project automatically runs a cleaning script that processes jiji_data.csv and outputs a cleaned version.
Cleaned output: outputs/data/raw.csv
This cleaning step:
- Standardizes price formats
- Normalizes location names
- Parses structured fields from JSON
- Removes duplicates and invalid entries
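The project's cleaning script is not reproduced here. As a rough illustration of what steps like these can involve, the sketch below standardizes prices, tidies location names, and drops duplicate or invalid rows. Every rule in it (the price regex, title-casing locations, de-duplicating by url) is an assumption, not the project's actual logic.

```python
import re


def parse_price(text):
    """Extract the first number from a price string, ignoring thousands commas."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    return float(match.group().replace(",", "")) if match else None


def clean_rows(rows):
    """Return rows with numeric prices and tidy locations, skipping bad ones."""
    seen = set()
    cleaned = []
    for row in rows:
        url = (row.get("url") or "").strip()
        price = parse_price(row.get("price"))
        # Assumed validity rules: a row needs a url and a parseable price,
        # and only the first row for each url is kept.
        if not url or price is None or url in seen:
            continue
        seen.add(url)
        # Collapse repeated whitespace and title-case the location.
        location = " ".join((row.get("location") or "").split()).title()
        cleaned.append({**row, "url": url, "price": price, "location": location})
    return cleaned
```

Keeping the price parser separate from the row filter makes each rule easy to test and replace independently.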
Automatic cleaning only runs when using the interactive main.py runner, not when running spiders directly via scrapy crawl.
Meqasa Listing Spider
Extracts property details from Meqasa listing pages.
Output
File: outputs/data/meqasa_data.csv
Fields:
- url - Listing URL
- Title - Property title
- Price - Rental price
- Rate - Price period (per month, per year, etc.)
- Description - Property description
- fetch_date - Date when listing was scraped
- Categories - Property categories
- Lease options - Available lease options
- Bedrooms - Number of bedrooms
- Bathrooms - Number of bathrooms
- Garage - Garage/parking information
- Furnished - Furnished status
- Amenities - Available amenities
- Address - Property address
- Reference - Listing reference ID
- details - JSON object with all extracted table data
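To illustrate how a details column like this can be produced, the sketch below collapses (label, value) pairs taken from a listing's table rows into a single JSON string. The helper build_details and its label clean-up rules are hypothetical; the spider's real extraction code is not shown in this document.

```python
import json


def build_details(pairs):
    """Collapse (label, value) pairs from a detail table into one JSON string."""
    details = {}
    for label, value in pairs:
        # Assumed clean-up: trim whitespace and a trailing colon from labels.
        key = label.strip().rstrip(":").strip()
        if key:
            details[key] = value.strip()
    # ensure_ascii=False keeps non-ASCII text readable in the CSV.
    return json.dumps(details, ensure_ascii=False)
```

Since the keys come from whatever labels the page contains, a label the spider has never seen before still ends up in details without any change to the CSV columns.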
Meqasa extracts data from dynamic table structures. The details field contains all key-value pairs found in listing detail tables, allowing for flexible schema evolution.
Examples
Scrape from default URL CSV:
Progress Tracking
Both spiders display real-time progress with:
- Current item number
- Total items to scrape
- Failed requests count
- Estimated completion percentage
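The four values listed above can be derived from three counters. The sketch below is an illustrative stand-in, not the spiders' actual progress code; the class name Progress and the choice to count failed requests as processed items are assumptions.

```python
class Progress:
    """Track processed and failed items against a known total."""

    def __init__(self, total):
        self.total = total
        self.done = 0
        self.failed = 0

    def record(self, ok=True):
        """Count one finished request; failures also count as processed."""
        self.done += 1
        if not ok:
            self.failed += 1

    def percent(self):
        """Completion percentage, or 0.0 when there is nothing to scrape."""
        return 100.0 * self.done / self.total if self.total else 0.0

    def summary(self):
        return (
            f"[{self.done}/{self.total}] "
            f"{self.percent():.1f}% complete, {self.failed} failed"
        )
```

A spider could call record() from its item callback and record(ok=False) from its error callback, then log summary() after each one.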