Overview

The clean.py module processes raw Jiji listing data to produce analysis-ready datasets. It handles:
  • Locality extraction and normalization
  • Sale and short-term rental filtering
  • Property attribute expansion
  • Standardized output formatting
Data cleaning is Jiji-only. Meqasa data (meqasa_data.csv) is not currently processed by the cleaning pipeline.
Location: clean.py
Input: outputs/data/jiji_data.csv
Output: outputs/data/raw.csv

Quick Start

Automatic Cleaning

The cleaning script runs automatically after Jiji listing scrapes complete:
python main.py
# Select: "Scrape listing details" → "Jiji only"
# After scrape completes, clean.py runs automatically

Manual Cleaning

python -c "from clean import clean_jiji_csv; clean_jiji_csv()"

Custom Paths

from clean import clean_jiji_csv

clean_jiji_csv(
    input_csv="outputs/data/jiji_data.csv",
    output_csv="outputs/data/custom_output.csv"
)

Cleaning Pipeline

1. Locality Extraction

Extracts neighborhood from full location string:
def extract_locality(location):
    parts = str(location).split(",")
    return parts[1].strip() if len(parts) >= 3 else parts[0].strip()
Examples:
Input Location                           Extracted Locality
Accra Metropolitan, Osu, Greater Accra   Osu
Accra Metropolitan, Greater Accra        Accra Metropolitan
East Legon                               East Legon
Implementation:
df["locality"] = df["location"].apply(extract_locality)
df = df.drop(columns=["location"])
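The rule can be exercised end-to-end on a tiny illustrative DataFrame (the sample locations come from the examples table above; the frame itself is hypothetical):

```python
import pandas as pd

def extract_locality(location):
    # 2nd comma-separated segment when there are 3+ parts, else the 1st
    parts = str(location).split(",")
    return parts[1].strip() if len(parts) >= 3 else parts[0].strip()

# Illustrative rows only; real data comes from outputs/data/jiji_data.csv
df = pd.DataFrame({"location": [
    "Accra Metropolitan, Osu, Greater Accra",
    "Accra Metropolitan, Greater Accra",
    "East Legon",
]})
df["locality"] = df["location"].apply(extract_locality)
df = df.drop(columns=["location"])
print(df["locality"].tolist())  # ['Osu', 'Accra Metropolitan', 'East Legon']
```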

2. House Type Normalization

Fills missing house types with default value:
df["house_type"] = df["house_type"].fillna("Bedsitter")

3. Bedroom/Bathroom Cleaning

Extracts numeric values and removes listings with missing bed/bath counts:
df = df.dropna(subset=["bathrooms", "bedrooms"])
for col in ["bathrooms", "bedrooms"]:
    df[col] = df[col].str.split().str[0]
Examples:
Raw Value       Cleaned Value
3 Bedrooms      3
2 Bathrooms     2
NaN (studio)    Dropped
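As a sketch, the same two lines applied to a few hypothetical raw values (a listing with a missing bedroom count is removed by the dropna step):

```python
import pandas as pd

df = pd.DataFrame({
    "bedrooms": ["3 Bedrooms", "2 Bedrooms", None],   # None ~ studio with no count
    "bathrooms": ["2 Bathrooms", "1 Bathroom", "1 Bathroom"],
})
df = df.dropna(subset=["bathrooms", "bedrooms"])
for col in ["bathrooms", "bedrooms"]:
    df[col] = df[col].str.split().str[0]   # keep the leading number as a string
print(df["bedrooms"].tolist())   # ['3', '2']
print(df["bathrooms"].tolist())  # ['2', '1']
```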

4. Price Normalization

Converts price strings to numeric values:
df["price"] = (
    df["price"]
    .str.replace("GH₵ ", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
Examples:
Raw Price     Cleaned Price
GH₵ 3,500     3500.0
GH₵ 12,000    12000.0
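The conversion can be checked in isolation on the two example prices (hypothetical frame):

```python
import pandas as pd

df = pd.DataFrame({"price": ["GH₵ 3,500", "GH₵ 12,000"]})
df["price"] = (
    df["price"]
    .str.replace("GH₵ ", "", regex=False)  # drop currency prefix
    .str.replace(",", "", regex=False)     # drop thousands separator
    .astype(float)
)
print(df["price"].tolist())  # [3500.0, 12000.0]
```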

5. Property Expansion

Expands serialized property dictionary into separate columns:
import ast

import pandas as pd

def expand_column(df, col, sep=None):
    if sep:
        # comma-separated string -> one binary column per token
        encoded = df[col].str.strip().str.get_dummies(sep=sep)
        encoded.columns = encoded.columns.str.strip()
    else:
        # serialized dict literal -> one value column per key
        encoded = df[col].apply(ast.literal_eval).apply(pd.Series)
    return df.join(encoded).drop(columns=[col])

df = expand_column(df, "properties")
Example: Before:
properties = '{"Condition": "Newly Built", "Furnishing": "Furnished"}'
After (new columns):
Condition       Furnishing
Newly Built     Furnished

6. Amenity Expansion

Expands comma-separated amenities into binary columns:
df = expand_column(df, "amenities", sep=",")
Example: Before:
amenities = "Wi-Fi, 24-hour Electricity, Hot Water"
After (new columns):
Wi-Fi    24-hour Electricity    Hot Water
1        1                      1

7. Facilities Merging

Merges Facilities column with existing amenity columns (if present):
if "Facilities" in df.columns:
    facilities = df["Facilities"].str.strip().str.get_dummies(sep=",")
    facilities.columns = facilities.columns.str.strip()
    for col in facilities.columns:
        if col in df.columns:
            df[col] = df[[col]].join(facilities[[col]], rsuffix="_new").max(axis=1)
        else:
            df[col] = facilities[col]
    df = df.drop(columns=["Facilities"])
The maximum value is kept when the same amenity appears in both sources.
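A minimal sketch of the max-merge, using a hypothetical frame in which "Wi-Fi" exists both as an expanded amenity column and inside Facilities:

```python
import pandas as pd

df = pd.DataFrame({
    "Wi-Fi": [1, 0, 0],                                 # from amenity expansion
    "Facilities": ["Hot Water", "Wi-Fi", "Hot Water"],  # raw Facilities strings
})
facilities = df["Facilities"].str.strip().str.get_dummies(sep=",")
facilities.columns = facilities.columns.str.strip()
for col in facilities.columns:
    if col in df.columns:
        # same amenity in both sources: keep the max (1 wins over 0)
        df[col] = df[[col]].join(facilities[[col]], rsuffix="_new").max(axis=1)
    else:
        df[col] = facilities[col]
df = df.drop(columns=["Facilities"])
print(df[["Wi-Fi", "Hot Water"]].values.tolist())  # [[1, 1], [1, 0], [0, 1]]
```

Note how row 2's Wi-Fi comes from Facilities even though the original amenity column had 0.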

8. Sale/Period Filtering

Removes listings containing sale or short-term rental keywords in title or description.

Sale Patterns

SALE_PATTERNS = [
    "for sale", "on sale", "sale", "sale only", "selling", "sell", "sold",
    "property for sale", "house for sale", "apartment for sale",
    "buy", "buyer", "title deed", "deed", "closing", "escrow", "transfer",
    "down payment", "mortgage", "loan", "financing", "cash only",
    "investment", "roi", "yield", "capital gain",
    "airbnb", "air bnb", "booking.com", "vrbo"
]

Period Patterns

PERIOD_PATTERNS = [
    "per night", "nightly", "per day", "daily", "by the day",
    "short term", "short-term", "weekly", "per week", "weekend",
    "airbnb", "air bnb", "business short", "holidays", "short stay",
    "holiday rental", "vacation rental", "tourist rental",
    "guesthouse", "guest house"
]

Implementation

regex = "|".join(re.escape(p) for p in SALE_PATTERNS + PERIOD_PATTERNS)
text = (df["title"].fillna("") + " " + df["description"].fillna("")).str.lower()
df = df[~text.str.contains(regex, na=False)]
Removes listings like:
  • “3 Bedroom Apartment for Sale”
  • “Luxury Flat - Short Term Available”
  • “Airbnb Investment Property”
  • “Daily/Nightly Rental - GH₵ 200 per night”
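The filter can be sketched self-contained with a trimmed-down pattern list (the full lists are shown above):

```python
import re

import pandas as pd

# Subset of the full SALE_PATTERNS / PERIOD_PATTERNS lists, for illustration
SALE_PATTERNS = ["for sale", "investment"]
PERIOD_PATTERNS = ["per night", "short term"]

df = pd.DataFrame({
    "title": ["3 Bedroom Apartment for Sale", "2 Bedroom Flat in Osu"],
    "description": ["Serious buyers only", "Long lease available"],
})
# Escape each pattern, OR them together, and match against lowercased title+description
regex = "|".join(re.escape(p) for p in SALE_PATTERNS + PERIOD_PATTERNS)
text = (df["title"].fillna("") + " " + df["description"].fillna("")).str.lower()
df = df[~text.str.contains(regex, na=False)]
print(df["title"].tolist())  # ['2 Bedroom Flat in Osu']
```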

9. Locality Normalization

Standardizes neighborhood names using replacement mappings:
LOCALITY_REPLACEMENTS = {
    "Dworwulu": "Dzorwulu",
    "Lartebiokoshie": "Lartebiokorshie",
    "Ashongman Estate": "Ashongman",
    "Ashoman Estate": "Ashongman",
    "Old Ashongman": "Ashongman",
    "Old Ashoman": "Ashongman",
    "Ashoman": "Ashongman",
    "Greater Accra": "Other",
    "Ledzokuku-Krowor": "Teshie",
    "South Shiashie": "East Legon",
    "Okponglo": "East Legon",
    "Little Legon": "East Legon",
    "Abofu": "Achimota",
    "Akweteyman": "Achimota",
    "Banana Inn": "Dansoman",
    "South La": "Labadi",
    "Bubuashie": "Kaneshie"
}

df["locality"] = df["locality"].replace(LOCALITY_REPLACEMENTS)
Purpose: standardizes spelling variations and consolidates sub-neighborhoods.
Examples:
Raw Locality      Normalized Locality
Dworwulu          Dzorwulu
Old Ashongman     Ashongman
South Shiashie    East Legon
Greater Accra     Other
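The replacement itself is a plain pandas Series.replace; a quick check with a subset of the mapping:

```python
import pandas as pd

# Subset of LOCALITY_REPLACEMENTS, for illustration
replacements = {"Dworwulu": "Dzorwulu", "Greater Accra": "Other"}

localities = pd.Series(["Dworwulu", "Osu", "Greater Accra"])
# Unmapped values ("Osu") pass through unchanged
print(localities.replace(replacements).tolist())  # ['Dzorwulu', 'Osu', 'Other']
```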

10. Location Code Assignment

Creates broader location groupings:
LOC_REPLACEMENTS = {
    "Circle": "Nkrumah Circle",
    "Anyaa": "Anyaa Market",
    "Ridge": "North Ridge",
    "Dome": "Dome Market",
    "Abokobi": "Abokobi Station",
    "Nungua": "Nungua Central",
    "Other": "Accra Metropolitan"
}

df["loc"] = df["locality"].replace(LOC_REPLACEMENTS)
Use case: creates a loc field for broader geographic grouping while preserving the granular locality field.

11. Column Ordering

Selects and orders final output columns:
FINAL_COLUMNS = [
    "url", "fetch_date", "house_type", "bathrooms", "bedrooms", "price",
    "locality", "Condition", "Furnishing", "Property Size",
    "24-hour Electricity", "Air Conditioning", "Apartment", "Balcony",
    "Chandelier", "Dining Area", "Dishwasher", "Hot Water",
    "Kitchen Cabinets", "Kitchen Shelf", "Microwave", "Pop Ceiling",
    "Pre-Paid Meter", "Refrigerator", "TV", "Tiled Floor",
    "Wardrobe", "Wi-Fi", "loc"
]

df = df[[col for col in FINAL_COLUMNS if col in df.columns]].copy()
Only includes columns that exist in the dataset.

Output Schema

Core Fields

url          str     Original listing URL
fetch_date   str     Date the listing was scraped (YYYY-MM-DD format)
house_type   str     Property type (e.g., “Apartment”, “Self Contain”, “Bedsitter”)
bathrooms    str     Number of bathrooms (numeric string)
bedrooms     str     Number of bedrooms (numeric string)
price        float   Monthly rental price in GH₵
locality     str     Normalized neighborhood name
loc          str     Broader location grouping

Property Attributes

Condition       str   Property condition (e.g., “Newly Built”, “Renovated”)
Furnishing      str   Furnishing status (e.g., “Furnished”, “Semi-Furnished”, “Unfurnished”)
Property Size   str   Size specification (if provided)

Amenity Columns (Binary)

All amenity columns are binary (0/1 or NaN):
  • 24-hour Electricity
  • Air Conditioning
  • Apartment
  • Balcony
  • Chandelier
  • Dining Area
  • Dishwasher
  • Hot Water
  • Kitchen Cabinets
  • Kitchen Shelf
  • Microwave
  • Pop Ceiling
  • Pre-Paid Meter
  • Refrigerator
  • TV
  • Tiled Floor
  • Wardrobe
  • Wi-Fi

API Reference

clean()

def clean(df: pd.DataFrame) -> pd.DataFrame
Applies full cleaning pipeline to raw Jiji DataFrame. Parameters:
  • df - Raw DataFrame from jiji_data.csv
Returns:
  • Cleaned DataFrame with normalized localities, expanded amenities, and filtered listings
Example:
import pandas as pd
from clean import clean

raw_df = pd.read_csv("outputs/data/jiji_data.csv")
cleaned_df = clean(raw_df)

clean_jiji_csv()

def clean_jiji_csv(
    input_csv: str | Path = DEFAULT_JIJI_INPUT,
    output_csv: str | Path = DEFAULT_RAW_OUTPUT,
) -> pd.DataFrame
End-to-end cleaning function: reads input CSV, applies cleaning, writes output CSV. Parameters:
  • input_csv - Path to raw Jiji data (default: outputs/data/jiji_data.csv)
  • output_csv - Path to write cleaned data (default: outputs/data/raw.csv)
Returns:
  • Cleaned DataFrame (also written to output_csv)
Example:
from clean import clean_jiji_csv

cleaned_df = clean_jiji_csv(
    input_csv="data/jiji_raw.csv",
    output_csv="data/jiji_clean.csv"
)
print(f"Cleaned {len(cleaned_df)} listings")

extract_locality()

def extract_locality(location: str) -> str
Extracts neighborhood from comma-separated location string. Parameters:
  • location - Full location string (e.g., “Accra Metropolitan, Osu, Greater Accra”)
Returns:
  • Extracted locality (2nd segment if 3+ parts, otherwise 1st segment)
Example:
from clean import extract_locality

locality = extract_locality("Accra Metropolitan, East Legon, Greater Accra")
print(locality)  # "East Legon"

expand_column()

def expand_column(
    df: pd.DataFrame,
    col: str,
    sep: str | None = None
) -> pd.DataFrame
Expands serialized column into separate binary/value columns. Parameters:
  • df - DataFrame containing column to expand
  • col - Column name to expand
  • sep - Separator for string splitting (if None, values are parsed as Python dict literals with ast.literal_eval)
Returns:
  • DataFrame with expanded columns (original column dropped)
Example:
import pandas as pd
from clean import expand_column

# Expand JSON dict column
df = pd.DataFrame({"props": ['{"Bedrooms": "3", "Condition": "New"}']})
df = expand_column(df, "props")
# New columns: Bedrooms, Condition

# Expand comma-separated column
df = pd.DataFrame({"amenities": ["Wi-Fi, Hot Water, Parking"]})
df = expand_column(df, "amenities", sep=",")
# New columns: Wi-Fi, Hot Water, Parking (all = 1)

Constants Reference

File Paths

PROJECT_ROOT = Path(__file__).resolve().parent
DEFAULT_JIJI_INPUT = PROJECT_ROOT / "outputs" / "data" / "jiji_data.csv"
DEFAULT_RAW_OUTPUT = PROJECT_ROOT / "outputs" / "data" / "raw.csv"

Filtering Patterns

SALE_PATTERNS = [
    "for sale", "on sale", "sale", "sale only", "selling", "sell", "sold",
    "property for sale", "house for sale", "apartment for sale",
    "buy", "buyer", "title deed", "deed", "closing", "escrow", "transfer",
    "down payment", "mortgage", "loan", "financing", "cash only",
    "investment", "roi", "yield", "capital gain",
    "airbnb", "air bnb", "booking.com", "vrbo"
]
PERIOD_PATTERNS = [
    "per night", "nightly", "per day", "daily", "by the day",
    "short term", "short-term", "weekly", "per week", "weekend",
    "airbnb", "air bnb", "business short", "holidays", "short stay",
    "holiday rental", "vacation rental", "tourist rental",
    "guesthouse", "guest house"
]

Normalization Mappings

LOCALITY_REPLACEMENTS = {
    "Dworwulu": "Dzorwulu",
    "Lartebiokoshie": "Lartebiokorshie",
    "Ashongman Estate": "Ashongman",
    "Ashoman Estate": "Ashongman",
    "Old Ashongman": "Ashongman",
    "Old Ashoman": "Ashongman",
    "Ashoman": "Ashongman",
    "Greater Accra": "Other",
    "Ledzokuku-Krowor": "Teshie",
    "South Shiashie": "East Legon",
    "Okponglo": "East Legon",
    "Little Legon": "East Legon",
    "Abofu": "Achimota",
    "Akweteyman": "Achimota",
    "Banana Inn": "Dansoman",
    "South La": "Labadi",
    "Bubuashie": "Kaneshie"
}
LOC_REPLACEMENTS = {
    "Circle": "Nkrumah Circle",
    "Anyaa": "Anyaa Market",
    "Ridge": "North Ridge",
    "Dome": "Dome Market",
    "Abokobi": "Abokobi Station",
    "Nungua": "Nungua Central",
    "Other": "Accra Metropolitan"
}
FINAL_COLUMNS = [
    "url", "fetch_date", "house_type", "bathrooms", "bedrooms", "price",
    "locality", "Condition", "Furnishing", "Property Size",
    "24-hour Electricity", "Air Conditioning", "Apartment", "Balcony",
    "Chandelier", "Dining Area", "Dishwasher", "Hot Water",
    "Kitchen Cabinets", "Kitchen Shelf", "Microwave", "Pop Ceiling",
    "Pre-Paid Meter", "Refrigerator", "TV", "Tiled Floor",
    "Wardrobe", "Wi-Fi", "loc"
]

Usage Examples

Example 1: Basic Cleaning

from clean import clean_jiji_csv

# Clean with defaults
df = clean_jiji_csv()
print(f"Cleaned {len(df)} listings")
print(f"Localities: {df['locality'].unique()}")

Example 2: Custom Filtering

import pandas as pd
from clean import clean

# Load and clean
raw_df = pd.read_csv("outputs/data/jiji_data.csv")
cleaned_df = clean(raw_df)

# Additional filtering
expensive = cleaned_df[cleaned_df["price"] > 5000]
furnished = cleaned_df[cleaned_df["Furnishing"] == "Furnished"]

Example 3: Analyzing Localities

from clean import clean_jiji_csv

df = clean_jiji_csv()

# Count by locality
locality_counts = df["locality"].value_counts()
print(locality_counts.head(10))

# Average price by locality
avg_prices = df.groupby("locality")["price"].mean().sort_values(ascending=False)
print(avg_prices.head(10))

Example 4: Amenity Analysis

from clean import clean_jiji_csv

df = clean_jiji_csv()

# Listings with Wi-Fi
wifi_listings = df[df["Wi-Fi"] == 1]

# Count amenity prevalence
amenity_cols = [
    "Wi-Fi", "24-hour Electricity", "Hot Water",
    "Air Conditioning", "Parking", "Security"
]
for col in amenity_cols:
    if col in df.columns:
        pct = (df[col] == 1).sum() / len(df) * 100
        print(f"{col}: {pct:.1f}%")
