Overview
All scraper outputs are written to CSV files under the outputs/ directory. The structure separates URL collection from listing detail extraction.
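The two directories described below can be created idempotently at startup. A minimal sketch, assuming plain os.makedirs (the project's actual startup code may differ):

```python
import os

# Create both output trees if they do not already exist (idempotent).
for directory in ("outputs/urls", "outputs/data"):
    os.makedirs(directory, exist_ok=True)
```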
Directory Structure
The scraper automatically creates two output directories at startup: outputs/urls/ for URL collection files and outputs/data/ for listing detail files.
URL Collection Files
outputs/urls/jiji_urls.csv
- url - Full listing URL
- page - Result page number where the URL was found
- fetch_date - ISO timestamp of collection
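For illustration only, a file with this schema would look like the following (the row values are hypothetical):

```
"url","page","fetch_date"
"https://jiji.com.gh/example-listing","2","2024-01-01T12:00:00"
```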
outputs/urls/meqasa_urls.csv
Listing Detail Files
outputs/data/jiji_data.csv
- url - Listing URL
- fetch_date - ISO timestamp
- title - Property title
- location - Geographic location
- house_type - Property type
- bedrooms - Number of bedrooms
- bathrooms - Number of bathrooms
- price - Listed price
- properties - Serialized property attributes
- amenities - Serialized amenities list
- description - Full listing description
outputs/data/meqasa_data.csv
- url
- Title
- Price
- Rate
- Description
- fetch_date
outputs/data/raw.csv
The cleaned final dataset produced by the clean.py script.
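The specific transformations clean.py applies are not documented here. As a hedged sketch, a cleaning pass over jiji_data.csv might look like this (pandas assumed, steps purely illustrative):

```python
import csv

import pandas as pd

# Hypothetical cleaning pass; the real clean.py may do more or different work.
df = pd.read_csv("outputs/data/jiji_data.csv")
df = df.drop_duplicates(subset="url")                            # one row per listing
df["bedrooms"] = pd.to_numeric(df["bedrooms"], errors="coerce")  # bad values become NaN
df.to_csv("outputs/data/raw.csv", index=False, quoting=csv.QUOTE_ALL)
```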
URL Deduplication
All URL collection spiders implement deduplication to prevent duplicate entries: the read_url_set() function loads existing URLs into a set for O(1) duplicate checking during scraping.
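A minimal sketch of what read_url_set() might look like; the function name comes from the project, but this body and signature are assumptions:

```python
import csv
import os


def read_url_set(csv_path):
    """Load previously written URLs into a set for O(1) membership tests."""
    urls = set()
    if not os.path.exists(csv_path):
        return urls  # first run: nothing collected yet
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            urls.add(row["url"])
    return urls


# During collection, skip anything already on disk.
seen = read_url_set("outputs/urls/jiji_urls.csv")
```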
Resume Queue Files
Resume mode generates temporary queue files containing only unscraped URLs:
outputs/urls/jiji_resume_queue.csv
Built by comparing jiji_urls.csv against jiji_data.csv; contains URLs present in the URL file but missing from the data file.
outputs/urls/meqasa_resume_queue.csv
Built the same way from meqasa_urls.csv and meqasa_data.csv.
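A sketch of how such a queue might be generated, assuming the comparison is a simple set difference on the url column and reusing the read_url_set() sketch above (the helper name write_resume_queue is illustrative, not the project's actual API):

```python
import csv


def write_resume_queue(url_csv, data_csv, queue_csv):
    """Write URLs that were collected but never scraped into a queue file."""
    pending = read_url_set(url_csv) - read_url_set(data_csv)  # set difference
    with open(queue_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["url"])  # header row (assumed to match the other files)
        for url in sorted(pending):
            writer.writerow([url])


write_resume_queue(
    "outputs/urls/jiji_urls.csv",
    "outputs/data/jiji_data.csv",
    "outputs/urls/jiji_resume_queue.csv",
)
```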
Queue File Lifecycle
- Created before resume scraping starts
- Used as input to listing spiders
- Automatically deleted after scraping completes
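The deletion step might look like this sketch (a plain guarded os.remove is assumed; the project's cleanup may differ):

```python
import os

queue_path = "outputs/urls/jiji_resume_queue.csv"

# ... run the listing spider with queue_path as its input ...

if os.path.exists(queue_path):
    os.remove(queue_path)  # queue files are temporary; removed once scraping completes
```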
CSV File Format
All CSV files use:
- UTF-8 encoding
- Header row with column names
- QUOTE_ALL quoting strategy for data integrity
- Newline normalization via the newline="" parameter
File Naming Conventions
| File Pattern | Purpose |
|---|---|
| {site}_urls.csv | URL collection output |
| {site}_data.csv | Raw listing detail output |
| {site}_resume_queue.csv | Temporary resume queue |
| raw.csv | Cleaned final dataset |
In these patterns, {site} is either jiji or meqasa.
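Because the patterns are uniform, per-site files can be discovered with a simple glob; a sketch, not project code:

```python
from glob import glob

url_files = glob("outputs/urls/*_urls.csv")    # jiji_urls.csv, meqasa_urls.csv
data_files = glob("outputs/data/*_data.csv")   # jiji_data.csv, meqasa_data.csv
```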