Overview
URL collection spiders visit paginated search result pages and extract individual listing URLs. The URLs are saved to CSV files in `outputs/urls/` for use in the listing scrape phase.
Each spider supports:
- Custom start page
- Auto-detection of total pages
- Fixed page count modes
- URL deduplication
Jiji URL Spider
Collects listing URLs from Jiji Ghana’s Greater Accra rental search.

Arguments
- `start_page`: The page number to start collecting URLs from. Useful for resuming interrupted crawls.
- `max_pages`: Fixed number of pages to scrape from `start_page`. Mutually exclusive with `total_listing`.
- `total_listing`: Expected number of listings to collect. The spider automatically calculates the required pages (20 listings per page). Mutually exclusive with `max_pages`.

Auto-detection Mode
If neither `max_pages` nor `total_listing` is provided, the spider automatically detects the total number of results from the first page and calculates the required pages.
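The page calculation is plain ceiling division over the 20-listings-per-page figure stated above. A minimal sketch (the variable names and the listing count are illustrative):

```shell
# Pages needed to cover a target listing count at 20 listings per page.
# Ceiling division in integer arithmetic: (n + d - 1) / d.
total_listing=487
per_page=20
pages=$(( (total_listing + per_page - 1) / per_page ))
echo "$pages"   # 487 listings at 20 per page -> 25 pages
```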
Output
File: `outputs/urls/jiji_urls.csv`
Fields:
- `url`: Full listing URL
- `page`: Page number where the URL was found
- `fetch_date`: Date when the URL was collected (YYYY-MM-DD)
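A row in the file then looks like this (the URL and values are illustrative, not real listings):

```
url,page,fetch_date
https://jiji.com.gh/greater-accra/houses-apartments-for-rent/example-listing.html,3,2024-05-01
```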
Examples
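Assuming the spider is registered under the name `jiji_urls` (an assumption; run `scrapy list` to confirm the actual name), the modes above might be invoked like this:

```
# Auto-detect total pages (no page arguments)
scrapy crawl jiji_urls

# Resume from page 10 and scrape a fixed 5 pages
scrapy crawl jiji_urls -a start_page=10 -a max_pages=5

# Collect roughly 200 listings (the spider computes the pages itself)
scrapy crawl jiji_urls -a total_listing=200
```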
Auto-detect total pages:

Meqasa URL Spider
Collects listing URLs from Meqasa’s Greater Accra rental search.

Arguments
- `start_page`: The page number to start collecting URLs from.
- `total_pages`: Fixed number of pages to scrape from `start_page`. If not provided, auto-detection is used.

Auto-detection Mode
If `total_pages` is not provided, the spider reads the total listing count from the first page and calculates the required pages (16 listings per page).
Output
File: `outputs/urls/meqasa_urls.csv`
Fields:
- `url`: Full listing URL
- `page`: Page number where the URL was found
- `fetch_date`: Date when the URL was collected (YYYY-MM-DD)
Examples
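Assuming the spider is registered under the name `meqasa_urls` (an assumption; run `scrapy list` to confirm the actual name), typical invocations might look like this:

```
# Auto-detect total pages
scrapy crawl meqasa_urls

# Scrape exactly 8 pages starting from page 2
scrapy crawl meqasa_urls -a start_page=2 -a total_pages=8
```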
Auto-detect total pages:

URL CSVs grow incrementally during the crawl. Duplicate URLs are automatically filtered based on the `url` column.
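The same first-occurrence-wins filtering can be reproduced outside the spiders, for example when merging CSVs by hand. A sketch that assumes `url` is the first column, as in the files above (the sample rows are illustrative):

```shell
# Build a tiny CSV with a duplicate url, then keep the header plus the
# first occurrence of each url (column 1).
printf 'url,page,fetch_date\na,1,2024-01-01\nb,1,2024-01-01\na,2,2024-01-01\n' > urls.csv
awk -F, 'NR==1 || !seen[$1]++' urls.csv > deduped.csv
wc -l < deduped.csv   # 3 lines: header plus urls a and b
```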