scrapy crawl commands. This is ideal for automation, scripting, and CI/CD pipelines.
URL Collection Spiders
URL spiders collect property listing URLs from search result pages and write them to CSV files.

Jiji URL Spider
Parameters
- start_page: The search results page number to start from. Must be >= 1.
- max_pages: Maximum number of pages to scrape. When set, the spider scrapes exactly this many pages starting from start_page. Example: max_pages=5 scrapes 5 pages total. Cannot be used together with total_listing.
- total_listing: Expected total number of listings to collect. The spider calculates how many pages are needed based on Jiji's listings per page. Example: total_listing=200 might scrape ~9 pages (Jiji shows ~24 listings per page). Cannot be used together with max_pages.

Output
Writes to: outputs/urls/jiji_urls.csv
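The relationship between total_listing and the number of pages scraped is a ceiling division over the per-page listing count. A minimal sketch, assuming the ~24 listings-per-page figure noted above; the function name is illustrative, not the spider's actual code:

```python
import math

# Jiji shows roughly 24 listings per search results page.
LISTINGS_PER_PAGE = 24

def pages_needed(total_listing: int, per_page: int = LISTINGS_PER_PAGE) -> int:
    """Round up so the final, partially filled page is still fetched."""
    return math.ceil(total_listing / per_page)

print(pages_needed(200))  # total_listing=200 -> 9 pages
```

This is why total_listing=200 yields roughly 9 pages rather than an exact listing count: the spider works in whole pages.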
Meqasa URL Spider
Parameters
- start_page: The search results page number to start from. Must be >= 1.
- total_pages: Total number of pages to scrape. When set, the spider scrapes exactly this many pages starting from start_page. Example: total_pages=5 scrapes 5 pages total.

Output
Writes to: outputs/urls/meqasa_urls.csv
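Because total_pages counts pages from start_page, the exact page numbers a run requests follow directly. A small sketch of that arithmetic (the function name is illustrative):

```python
def page_range(start_page: int, total_pages: int) -> list[int]:
    """Page numbers a run with start_page and total_pages would request."""
    if start_page < 1:
        raise ValueError("start_page must be >= 1")
    return list(range(start_page, start_page + total_pages))

print(page_range(3, 5))  # -> [3, 4, 5, 6, 7]
```

So start_page=3 with total_pages=5 requests pages 3 through 7, exactly five pages.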
Listing Detail Spiders
Listing spiders read URLs from CSV files and scrape full property details.

Jiji Listing Spider
Parameters
Path to the CSV file containing URLs to scrape. Must have a url column. Default: outputs/urls/jiji_urls.csv. Can be absolute or relative to project root.

Output
Writes to: outputs/data/jiji_data.csv
Key fields:
- url
- fetch_date
- title
- location
- house_type
- bedrooms
- bathrooms
- price
- properties (serialized mapping)
- amenities (serialized list)
- description
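The properties and amenities fields are serialized so that a mapping and a list fit into single CSV cells. The exact serialization format is not specified here; assuming JSON strings, a round-trip looks like this (field values are made-up examples):

```python
import json

# Assumption: properties/amenities are stored as JSON strings in the CSV.
# If the project uses a different serialization, adjust accordingly.
row = {
    "properties": json.dumps({"Furnishing": "Furnished", "Parking": "Yes"}),
    "amenities": json.dumps(["Pool", "Gym"]),
}

properties = json.loads(row["properties"])  # back to a mapping
amenities = json.loads(row["amenities"])    # back to a list
print(properties["Parking"], amenities)  # -> Yes ['Pool', 'Gym']
```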
After scraping completes, the interactive runner (main.py) automatically runs clean.py to produce a cleaned dataset at outputs/data/raw.csv. This cleaning step does not run when using direct scrapy crawl commands.

Meqasa Listing Spider
Parameters
Path to the CSV file containing URLs to scrape. Must have a url column. Default: outputs/urls/meqasa_urls.csv. Can be absolute or relative to project root.

Output
Writes to: outputs/data/meqasa_data.csv
Base fields:
- url
- Title
- Price
- Rate
- Description
- fetch_date
Custom CSV Paths
You can specify custom paths for both input (URL CSVs) and output (data CSVs) by modifying the spider arguments.

Reading from Custom URL CSV
The CSV must contain a url column. Other columns are ignored.

Example: Full Workflow
Automation Example
Create a bash script to run the full workflow:

scrape.sh
Error Handling
Common errors:

CSV Output Behavior
- Incremental writes: Each scraped item is written immediately (item-by-item)
- URL deduplication: The same URL is not added twice to URL CSVs
- Append mode: Listing spiders append to existing data CSVs by default
- Auto-created directories: outputs/urls/ and outputs/data/ are created automatically if missing
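The directory creation, append mode, and URL deduplication behaviors above can be sketched together. This is a simplified stand-in for the project's actual pipeline code, not the implementation itself:

```python
import csv
import os

def append_unique_urls(path: str, urls: list[str]) -> None:
    """Append URLs to a CSV with a url column, skipping duplicates.

    Mirrors the documented behavior: the parent directory is created
    if missing, the file is opened in append mode, the header is
    written only once, and a URL already present is not added again.
    """
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)

    new_file = not os.path.exists(path)
    seen = set()
    if not new_file:
        with open(path, newline="") as f:
            seen = {row["url"] for row in csv.DictReader(f)}

    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url"])
        if new_file:
            writer.writeheader()
        for url in urls:
            if url not in seen:
                writer.writerow({"url": url})
                seen.add(url)
```

Writing row-by-row in append mode is what makes the incremental, item-by-item output possible: a crashed or interrupted run still leaves every already-scraped URL on disk.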