ScrapeAccraProperties outputs data to CSV files with well-defined schemas. This page documents all fields for URL CSVs, listing CSVs, and cleaned data.

URL CSV Schema

URL spiders output a simple 3-column schema for both Jiji and Meqasa.

Files

  • outputs/urls/jiji_urls.csv
  • outputs/urls/meqasa_urls.csv

Fields

url
string
required
Full URL of the listing page. Used as the unique identifier for deduplication.
page
integer
required
Search result page number where this URL was found. Useful for understanding data distribution.
fetch_date
string
required
Date when the URL was collected, in YYYY-MM-DD format.

Example

url,page,fetch_date
https://jiji.com.gh/greater-accra/houses-apartments-for-rent/property-123,1,2024-01-15
https://jiji.com.gh/greater-accra/houses-apartments-for-rent/property-456,1,2024-01-15
https://jiji.com.gh/greater-accra/houses-apartments-for-rent/property-789,2,2024-01-15
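Since the url column is the unique identifier, a typical first step when consuming these files is to deduplicate on it. A minimal sketch with pandas, using inline sample data in place of the real file (in practice you would pass outputs/urls/jiji_urls.csv to read_csv):

```python
import io

import pandas as pd

# Sample rows in the 3-column URL schema shown above; the second row is a
# duplicate URL fetched a day later.
csv_text = """url,page,fetch_date
https://jiji.com.gh/greater-accra/houses-apartments-for-rent/property-123,1,2024-01-15
https://jiji.com.gh/greater-accra/houses-apartments-for-rent/property-123,1,2024-01-16
https://jiji.com.gh/greater-accra/houses-apartments-for-rent/property-456,2,2024-01-15
"""

df = pd.read_csv(io.StringIO(csv_text))

# Deduplicate on the unique identifier, keeping the earliest fetch.
df = df.sort_values("fetch_date").drop_duplicates(subset="url", keep="first")
```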

Jiji Listing Schema

Jiji listings are extracted from individual listing pages and saved to outputs/data/jiji_data.csv.

Output File

  • outputs/data/jiji_data.csv

Fields

url
string
required
Listing page URL (unique identifier).
fetch_date
string
required
Date when the listing was scraped, in YYYY-MM-DD format.
title
string
Property listing title as displayed on Jiji. Example: "3 Bedroom Apartment for Rent in East Legon"
location
string
Geographic location extracted from the listing metadata. Example: "East Legon, Accra"
house_type
string
Type of property (e.g., apartment, house, commercial). Extracted from icon attributes or the Subtype / Type property field. Example: "Apartment"
bedrooms
string
Number of bedrooms. Extracted from icon attributes or the Bedrooms property field. Example: "3 Bedrooms"
bathrooms
string
Number of bathrooms. Extracted from icon attributes or the Bathrooms / Toilets property field. Example: "2 Bathrooms"
price
string
Rental price as displayed on the listing. Example: "GH₵ 3,500"
properties
JSON object
Key-value mapping of all property attributes extracted from the “Advert Attributes” section. Serialized as JSON for CSV storage. Example:
{
  "Bedrooms": "3",
  "Bathrooms": "2",
  "Type": "Apartment",
  "Subtype": "Flat & Apartment",
  "Condition": "Newly Built"
}
amenities
JSON array
List of amenities/features extracted from the tags section. Scraped using Playwright’s async page methods to ensure dynamic content is loaded. Serialized as JSON for CSV storage. Example:
["Air Conditioning", "Parking Space", "24/7 Security", "Water Supply"]
description
string
Full text description of the property provided by the seller. Whitespace is normalized before saving.

Extraction Code Example

# From jiji_listing.py:72-113
properties = {
    prop.css(".b-advert-attribute__key::text")
    .get("")
    .strip()
    .rstrip(":"): prop.css(".b-advert-attribute__value::text")
    .get("")
    .strip()
    for prop in response.css(".b-advert-attribute")
    if prop.css(".b-advert-attribute__key::text").get()
}

house_type, bathrooms, bedrooms = None, None, None
for detail in response.css(
    ".b-advert-icon-attribute span::text, .b-advert-icon-attribute__value::text"
).getall():
    dl = detail.lower()
    if "bed" in dl:
        bedrooms = detail.strip()
    elif "bath" in dl:
        bathrooms = detail.strip()
    elif not house_type:
        house_type = detail.strip()

amenities = []
if page:
    try:
        if await asyncio.wait_for(
            page.query_selector(".b-advert-attributes--tags"), timeout=1.5
        ):
            elements = await page.query_selector_all(
                ".b-advert-attributes__tag"
            )
            texts = await asyncio.gather(
                *[el.text_content() for el in elements],
                return_exceptions=True,
            )
            amenities = [
                t.strip() for t in texts if isinstance(t, str) and t.strip()
            ]
    except Exception:
        pass

Meqasa Listing Schema

Meqasa listings are extracted from individual property pages and saved to outputs/data/meqasa_data.csv.

Output File

  • outputs/data/meqasa_data.csv

Fields

Meqasa uses a hybrid schema: some fields are fixed, while others are dynamically discovered from the listing’s detail table.

Fixed Fields

url
string
required
Listing page URL (unique identifier).
Title
string
Property title extracted from the <h1> tag. Example: "4 Bedroom House to Let in Cantonments"
Price
string
Rental price displayed in the price wrapper. Example: "$2,500"
Rate
string
Billing frequency (e.g., per month, per year). Example: "per month"
Description
string
Full property description text.
fetch_date
string
required
Date when the listing was scraped, in YYYY-MM-DD format.

Common Dynamic Fields

These fields are extracted from the listing’s detail table when available:
Categories
string
Property category (e.g., “Apartment”, “House”, “Commercial”).
Lease options
string
Available lease terms (e.g., “Daily, Monthly, Yearly”).
Bedrooms
string
Number of bedrooms. Example: "4"
Bathrooms
string
Number of bathrooms. Example: "3"
Garage
string
Garage or parking information. Example: "2 Car Garage"
Furnished
string
Furnishing status. Example: "Fully Furnished"
Amenities
string
Comma-separated list of amenities. Example: "Air Conditioning, Swimming Pool, Generator"
Address
string
Full address of the property.
Reference
string
Meqasa’s internal reference ID for the listing.
details
JSON object
Complete key-value mapping of all fields from the detail table. Serialized as JSON for CSV storage. This captures any additional fields not explicitly mapped.

Dynamic Field Extraction

Meqasa listings use a flexible table structure. The spider extracts all available fields:
# From meqasa_listing.py:77-89
details: dict[str, str] = {}
for row in response.css("table.table tr"):
    if header := row.css(
        "td[style*='font-weight: bold']::text, th::text"
    ).get():
        values = row.css(
            "td:nth-child(2) ::text, td:nth-child(2) li::text"
        ).getall()
        details[header.strip().rstrip(":")] = ", ".join(
            v.strip() for v in values if v.strip()
        )

Helper Method: Case-Insensitive Lookup

# From meqasa_listing.py:49-55
@staticmethod
def _get_detail(details: dict[str, str], *candidates: str) -> str | None:
    lowered = {key.casefold(): value for key, value in details.items()}
    for key in candidates:
        if value := lowered.get(key.casefold()):
            return value
    return None
This allows the spider to handle variations in field naming (e.g., “Bedroom” vs “Bedrooms”).

Cleaned Data Schema (Jiji Only)

After a Jiji listing scrape completes, the clean.py script automatically processes jiji_data.csv and outputs a cleaned version to outputs/data/raw.csv.

Output File

  • outputs/data/raw.csv

Purpose

The cleaning script:
  • Normalizes field values (e.g., extract numeric values from “3 Bedrooms”)
  • Standardizes currency formats
  • Parses JSON fields from raw data
  • Removes duplicates and invalid records
  • Prepares data for downstream analysis or machine learning
Cleaning is currently Jiji-only. Meqasa data does not have an automated cleaning pipeline.

Automatic Execution

Cleaning runs automatically after Jiji listing scrapes:
# From main.py:220-243
def run_jiji_cleaning() -> None:
    try:
        from clean import clean_jiji_csv
    except Exception as exc:
        console.print(f"[red]Skipping Jiji cleaning:[/] unable to load clean.py ({exc})")
        return

    jiji_data_csv = SITE_CONFIGS["jiji"].data_csv
    if not jiji_data_csv.exists():
        console.print(
            f"[yellow]Skipping Jiji cleaning: {relpath(jiji_data_csv)} not found.[/]"
        )
        return

    try:
        cleaned_df = clean_jiji_csv(jiji_data_csv, JIJI_RAW_OUTPUT_CSV)
    except Exception as exc:
        console.print(f"[red]Jiji cleaning failed:[/] {exc}")
        return

    console.print(
        f"[green]Jiji cleaned CSV saved:[/] {relpath(JIJI_RAW_OUTPUT_CSV)}"
        f" ({len(cleaned_df):,} rows)"
    )

Data Serialization

Complex Python types (dicts, lists) are serialized to JSON before writing to CSV:
# From base_spider.py:106-116
@staticmethod
def _serialize_for_csv(value: Any) -> Any:
    if value is None:
        return ""
    if isinstance(value, dict):
        return json.dumps(value, ensure_ascii=False, sort_keys=True)
    if isinstance(value, (list, tuple, set)):
        return json.dumps(list(value), ensure_ascii=False)
    if isinstance(value, str):
        return " ".join(value.split())  # Normalize whitespace
    return value
This ensures that:
  • Dictionaries and lists can be stored in CSV cells
  • Non-ASCII characters (e.g., Ghanaian cedi symbol ₵) are preserved
  • Whitespace in strings is normalized
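A quick demonstration of those rules, with the static method copied out as a standalone function for illustration:

```python
import json


def serialize_for_csv(value):
    # Mirrors _serialize_for_csv above.
    if value is None:
        return ""
    if isinstance(value, dict):
        return json.dumps(value, ensure_ascii=False, sort_keys=True)
    if isinstance(value, (list, tuple, set)):
        return json.dumps(list(value), ensure_ascii=False)
    if isinstance(value, str):
        return " ".join(value.split())
    return value


serialize_for_csv({"Type": "Apartment", "Bedrooms": "3"})
# → '{"Bedrooms": "3", "Type": "Apartment"}'  (keys sorted, ₵ etc. preserved)
serialize_for_csv(["Parking", "Security"])  # → '["Parking", "Security"]'
serialize_for_csv("GH₵  3,500\n")           # → 'GH₵ 3,500' (whitespace normalized)
```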

Reading Serialized Fields

To parse JSON fields in Python:
import json
import pandas as pd

df = pd.read_csv("outputs/data/jiji_data.csv")

# Parse properties field
df['properties_dict'] = df['properties'].apply(
    lambda x: json.loads(x) if pd.notna(x) and x else {}
)

# Parse amenities field
df['amenities_list'] = df['amenities'].apply(
    lambda x: json.loads(x) if pd.notna(x) and x else []
)

Schema Comparison Table

Both Jiji and Meqasa use the same URL schema:
Field        Type      Description
url          string    Listing URL (unique)
page         integer   Search result page number
fetch_date   string    Date collected (YYYY-MM-DD)

Best Practices

The url field is used for deduplication across runs. Never modify or remove URLs from existing CSV files if you plan to use resume mode.
Fields like properties, amenities, and details contain JSON. Always validate and handle JSON parsing errors when reading these fields.
Not all listings have all fields. Use pd.notna() or .get() with defaults when accessing fields in pandas or dictionaries.
Meqasa’s dynamic fields may have slight variations (e.g., “Bedroom” vs “Bedrooms”). Use case-insensitive matching or the _get_detail() helper pattern.
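Following the JSON-parsing advice above, a defensive reader for the properties, amenities, and details columns might look like this (the helper name is ours, not part of the codebase):

```python
import json


def parse_json_cell(cell, default=None):
    """Safely parse a JSON-serialized CSV cell, returning default on failure."""
    if not isinstance(cell, str) or not cell.strip():
        return default
    try:
        return json.loads(cell)
    except json.JSONDecodeError:
        return default


parse_json_cell('{"Bedrooms": "3"}', default={})  # {'Bedrooms': '3'}
parse_json_cell("", default=[])                   # [] (empty cell falls back)
parse_json_cell("not json", default={})           # {} (malformed cell falls back)
```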

Next Steps

Workflow

Learn about the two-phase scraping workflow

Spider Architecture

Understand how spiders collect and write data
