URL CSV Schema
URL spiders output a simple 3-column schema for both Jiji and Meqasa.
Files
- outputs/urls/jiji_urls.csv
- outputs/urls/meqasa_urls.csv
Fields
- url: Full URL of the listing page. Used as the unique identifier for deduplication.
- page: Search result page number where this URL was found. Useful for seeing how listings are distributed across result pages.
- fetch_date: Date when the URL was collected, in YYYY-MM-DD format.
Example
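As an illustration, a file following this schema could be read and checked as sketched below. The rows are made-up sample values (example.com URLs, arbitrary dates), and load_url_rows is a hypothetical helper, not part of the spiders.

```python
import csv
import io
from datetime import datetime

# Made-up sample content in the url,page,fetch_date layout described above.
SAMPLE_CSV = """url,page,fetch_date
https://example.com/listing/1,1,2024-01-15
https://example.com/listing/2,1,2024-01-15
https://example.com/listing/3,2,2024-01-16
"""


def load_url_rows(text):
    """Parse URL CSV text, converting page to int and validating the date format."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # strptime raises ValueError if fetch_date is not YYYY-MM-DD.
        datetime.strptime(row["fetch_date"], "%Y-%m-%d")
        rows.append(
            {
                "url": row["url"],
                "page": int(row["page"]),
                "fetch_date": row["fetch_date"],
            }
        )
    return rows


rows = load_url_rows(SAMPLE_CSV)
# The url column is the deduplication key, so a set gives the unique listings.
unique_urls = {row["url"] for row in rows}
```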
Jiji Listing Schema
Jiji listings are extracted from individual listing pages and saved to outputs/data/jiji_data.csv.
Output File
outputs/data/jiji_data.csv
Fields
- Listing page URL (unique identifier), stored in the url field.
- Date when the listing was scraped, in YYYY-MM-DD format.
- Property listing title as displayed on Jiji. Example: "3 Bedroom Apartment for Rent in East Legon"
- Geographic location extracted from the listing metadata. Example: "East Legon, Accra"
- Type of property (e.g., apartment, house, commercial). Extracted from icon attributes or the Subtype / Type property field. Example: "Apartment"
- Number of bedrooms. Extracted from icon attributes or the Bedrooms property field. Example: "3 Bedrooms"
- Number of bathrooms. Extracted from icon attributes or the Bathrooms / Toilets property field. Example: "2 Bathrooms"
- Rental price as displayed on the listing. Example: "GH₵ 3,500"
- Key-value mapping of all property attributes extracted from the “Advert Attributes” section (the properties field). Serialized as JSON for CSV storage.
- List of amenities/features extracted from the tags section (the amenities field). Scraped using Playwright’s async page methods to ensure dynamic content is loaded. Serialized as JSON for CSV storage.
- Full text description of the property provided by the seller. Whitespace is normalized before saving.
Extraction Code Example
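The sketch below illustrates only the fallback logic described in the field list: prefer the value read from an icon attribute, otherwise fall back to a named property field, and normalize whitespace before saving. It deliberately leaves out the Playwright page access. The function names, and the assumption that the icon value arrives as an optional string, are hypothetical rather than taken from the spider.

```python
import re
from typing import Dict, Iterable, Optional


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def first_available(
    icon_value: Optional[str],
    properties: Dict[str, str],
    property_keys: Iterable[str],
) -> Optional[str]:
    """Return the icon value if present, else the first matching property field."""
    if icon_value and icon_value.strip():
        return normalize_whitespace(icon_value)
    for key in property_keys:
        value = properties.get(key)
        if value and value.strip():
            return normalize_whitespace(value)
    return None


# Hypothetical values standing in for what a listing page might yield.
properties = {"Subtype": "Apartment", "Bedrooms": "3  Bedrooms"}

property_type = first_available(None, properties, ["Subtype", "Type"])
bedrooms = first_available(None, properties, ["Bedrooms"])
bathrooms = first_available(" 2 Bathrooms\n", properties, ["Bathrooms", "Toilets"])
```

Keeping the fallback in one small function means each field only has to list its candidate property keys.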
Meqasa Listing Schema
Meqasa listings are extracted from individual property pages and saved to outputs/data/meqasa_data.csv.
Output File
outputs/data/meqasa_data.csv
Fields
Meqasa uses a hybrid schema: some fields are fixed, while others are dynamically discovered from the listing’s detail table.
Fixed Fields
- Listing page URL (unique identifier), stored in the url field.
- Property title extracted from the <h1> tag. Example: "4 Bedroom House to Let in Cantonments"
- Rental price displayed in the price wrapper. Example: "$2,500"
- Billing frequency (e.g., per month, per year). Example: "per month"
- Full property description text.
- Date when the listing was scraped, in YYYY-MM-DD format.
Common Dynamic Fields
These fields are extracted from the listing’s detail table when available:
- Property category (e.g., “Apartment”, “House”, “Commercial”).
- Available lease terms (e.g., “Daily, Monthly, Yearly”).
- Number of bedrooms. Example: "4"
- Number of bathrooms. Example: "3"
- Garage or parking information. Example: "2 Car Garage"
- Furnishing status. Example: "Fully Furnished"
- Comma-separated list of amenities. Example: "Air Conditioning, Swimming Pool, Generator"
- Full address of the property.
- Meqasa’s internal reference ID for the listing.
- Complete key-value mapping of all fields from the detail table (the details field). Serialized as JSON for CSV storage. This captures any additional fields not explicitly mapped.
Dynamic Field Extraction
Meqasa listings use a flexible table structure. The spider extracts all available fields.
Helper Method: Case-Insensitive Lookup
Cleaned Data Schema (Jiji Only)
After a Jiji listing scrape completes, the clean.py script automatically processes jiji_data.csv and outputs a cleaned version to outputs/data/raw.csv.
Output File
outputs/data/raw.csv
Purpose
The cleaning script:
- Normalizes field values (e.g., extracts numeric values from “3 Bedrooms”)
- Standardizes currency formats
- Parses JSON fields from raw data
- Removes duplicates and invalid records
- Prepares data for downstream analysis or machine learning
Cleaning is currently Jiji-only. Meqasa data does not have an automated cleaning pipeline.
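The steps listed above can be sketched roughly as follows. This is not the code in clean.py: the function names are hypothetical, and the way currency is split from the amount is only one plausible interpretation of "standardizes currency formats".

```python
import re
from typing import Dict, List, Optional, Tuple


def extract_count(text: Optional[str]) -> Optional[int]:
    """Pull the first integer out of a string such as '3 Bedrooms'."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def parse_price(text: Optional[str]) -> Optional[Tuple[Optional[str], float]]:
    """Split a displayed price such as 'GH₵ 3,500' into (currency, amount)."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    amount = float(match.group().replace(",", ""))
    # Whatever precedes the number is treated as the currency marker.
    currency = text[: match.start()].strip() or None
    return currency, amount


def drop_duplicates(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first row per url and discard rows without a url."""
    seen = set()
    unique = []
    for row in rows:
        url = row.get("url")
        if not url or url in seen:
            continue
        seen.add(url)
        unique.append(row)
    return unique
```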
Automatic Execution
Cleaning runs automatically after Jiji listing scrapes.
Data Serialization
Complex Python types (dicts, lists) are serialized to JSON before writing to CSV, so that:
- Dictionaries and lists can be stored in CSV cells
- Non-ASCII characters (e.g., Ghanaian cedi symbol ₵) are preserved
- Whitespace in strings is normalized
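One way to get these three properties with the standard library is sketched below. The to_cell helper and the sample record are assumptions for illustration; the price key in particular is a placeholder column name.

```python
import csv
import io
import json
import re


def to_cell(value):
    """Convert a Python value into something safe to store in one CSV cell."""
    if isinstance(value, (dict, list)):
        # ensure_ascii=False keeps characters such as ₵ readable in the file.
        return json.dumps(value, ensure_ascii=False)
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value).strip()
    return value


# Hypothetical record; real listings carry more fields.
record = {
    "url": "https://example.com/listing/1",
    "price": "GH₵  3,500",
    "properties": {"Bedrooms": "3", "Furnishing": "Furnished"},
    "amenities": ["Air Conditioning", "Generator"],
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow({key: to_cell(value) for key, value in record.items()})
csv_text = buffer.getvalue()
```

The csv module quotes cells that contain commas or double quotes, so the JSON text survives a round trip through the file.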
Reading Serialized Fields
To parse JSON fields in Python, decode each cell with json.loads().
Schema Comparison Table
Both Jiji and Meqasa use the same URL schema:
| Field | Type | Description |
|---|---|---|
| url | string | Listing URL (unique) |
| page | integer | Search result page number |
| fetch_date | string | Date collected (YYYY-MM-DD) |
Best Practices
Always preserve the url field
The url field is used for deduplication across runs. Never modify or remove URLs from existing CSV files if you plan to use resume mode.
Parse JSON fields carefully
Fields like properties, amenities, and details contain JSON. Always validate and handle JSON parsing errors when reading these fields.
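A defensive reader might look like the sketch below. parse_json_cell is a hypothetical helper; it falls back to a caller-supplied default when the cell is empty, malformed, or holds a different JSON type than expected.

```python
import json


def parse_json_cell(cell, default):
    """Decode a JSON cell, returning default on empty, invalid, or mistyped data."""
    if cell is None or not str(cell).strip():
        return default
    try:
        value = json.loads(cell)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return default
    # Require the decoded value to have the same type as the default.
    return value if isinstance(value, type(default)) else default


properties = parse_json_cell('{"Bedrooms": "3"}', {})
amenities = parse_json_cell("not valid json", [])
```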
Handle missing fields gracefully
Not all listings have all fields. Use pd.notna() or .get() with defaults when accessing fields in pandas or dictionaries.
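For the dictionary case, a small accessor along these lines treats absent keys, None, and blank cells the same way. The field helper and the title and price column names are illustrative assumptions; in pandas the analogous check would go through pd.notna().

```python
import csv
import io


def field(row, name, default=""):
    """Return a stripped cell value, or default if it is missing or blank."""
    value = row.get(name)
    if value is None:
        return default
    value = str(value).strip()
    return value if value else default


# Made-up rows: the first has blank cells, the second is missing a column.
text = (
    "url,title,price\n"
    "https://example.com/listing/1,,\n"
    "https://example.com/listing/2,Flat\n"
)
rows = list(csv.DictReader(io.StringIO(text)))

first_title = field(rows[0], "title", "unknown")
second_price = field(rows[1], "price", "unknown")
```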
Normalize field names for Meqasa
Meqasa’s dynamic fields may have slight variations (e.g., “Bedroom” vs “Bedrooms”). Use case-insensitive matching or the _get_detail() helper pattern.
Next Steps
Workflow
Learn about the two-phase scraping workflow
Spider Architecture
Understand how spiders collect and write data