
Overview

This project uses the Boston Housing Dataset from Kaggle to predict median home values based on various property and neighborhood characteristics.

Dataset Statistics

  • Samples: 506 housing records
  • Features: 13 independent variables
  • Target: medv (Median home value in $1000s)
  • Missing Values: 5 missing values in rm feature (0.99%)

Real-Life Applications

House price prediction models are actively used by:
  • Zillow - Automated home valuation (Zestimate)
  • MagicBricks - Property price estimation in India
  • Redfin - Real estate market analysis
  • Real estate agencies - Property appraisal and investment analysis

Target Variable

medv

Median value of owner-occupied homes in $1000s
  • Mean: $22,533
  • Range: $5,000 - $50,000
  • Median: $21,200

Distribution

The target shows moderate variance, with outliers concentrated at the upper end (40 outliers by the IQR method, 7.91% of samples).

Dataset Features

The dataset contains 13 features describing various aspects of housing and neighborhood characteristics:
Feature  Description                                                          Type
crim     Per capita crime rate by town                                        Continuous
zn       Proportion of residential land zoned for lots over 25,000 sq. ft.    Continuous
indus    Proportion of non-retail business acres per town                     Continuous
chas     Charles River dummy variable (1 if tract bounds river; 0 otherwise)  Binary
nox      Nitric oxides concentration (parts per 10 million)                   Continuous
rm       Average number of rooms per dwelling                                 Continuous
age      Proportion of owner-occupied units built prior to 1940               Continuous
dis      Weighted distances to five Boston employment centers                 Continuous
rad      Index of accessibility to radial highways                            Discrete
tax      Full-value property-tax rate per $10,000                             Continuous
ptratio  Pupil-teacher ratio by town                                          Continuous
b        1000(Bk - 0.63)^2, where Bk is the proportion of Black residents     Continuous
lstat    Percentage of lower status of the population                         Continuous
The b feature captures historical racial demographics in Boston housing markets and reflects socioeconomic patterns from that era.
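
As a small worked example of the formula in the b row, the sketch below evaluates 1000(Bk - 0.63)^2 for a few values of Bk. The function name `b_feature` is hypothetical and only used here for illustration. It also shows that the transform is not one-to-one: two different proportions can map to the same b value, so the original proportion cannot be recovered from b alone.

```python
def b_feature(bk: float) -> float:
    """Evaluate 1000 * (Bk - 0.63)^2 for a proportion Bk in [0, 1].

    Hypothetical helper for illustration; it only restates the
    formula from the feature table.
    """
    return 1000.0 * (bk - 0.63) ** 2


# The value is zero at Bk = 0.63 and grows on both sides of it.
for bk in (0.0, 0.26, 0.63, 1.0):
    print(f"Bk = {bk:.2f} -> b = {b_feature(bk):.1f}")

# Bk = 0.26 and Bk = 1.00 are both 0.37 away from 0.63,
# so they produce the same b value.
```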

Data Quality

Missing values:
  • rm (Average rooms): 5 missing values (0.99%)
  • Strategy: Missing values can be imputed using median or mean
  • Impact: Minimal due to low percentage
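
The median imputation strategy mentioned above could look roughly like the following. This is a minimal sketch, not the project's actual preprocessing code: `impute_median` is a hypothetical helper, and the toy values stand in for the real rm column.

```python
import numpy as np
import pandas as pd


def impute_median(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with NaNs in `column` replaced by the column median.

    Hypothetical helper for illustration; the median is computed
    from the non-missing values only (pandas skips NaN by default).
    """
    out = df.copy()
    out[column] = out[column].fillna(out[column].median())
    return out


# Toy stand-in for the rm column (illustrative values, not real data)
toy = pd.DataFrame({"rm": [6.5, np.nan, 5.9, 7.1, np.nan, 6.2]})
filled = impute_median(toy, "rm")

print("Missing before:", toy["rm"].isna().sum())
print("Missing after: ", filled["rm"].isna().sum())
```

With the real data, the same call would be `impute_median(df, "rm")`. Swapping `.median()` for `.mean()` gives mean imputation; the median is less sensitive to the unusually large homes flagged as outliers.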
Outliers, detected using the IQR (interquartile range) method:
  • crim: 66 outliers (13.04%) - High crime rate areas
  • zn: 68 outliers (13.44%) - Large residential lots
  • b: 77 outliers (15.22%) - Demographic distribution
  • medv: 40 outliers (7.91%) - High-value properties
  • rm: 30 outliers (5.93%) - Unusually large homes
Data types:
  • Float64: 11 features (crim, zn, indus, nox, rm, age, dis, ptratio, b, lstat, medv)
  • Int64: 3 features (chas, rad, tax)
  • All features are numerical, no categorical encoding required
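
The IQR outlier counts listed above can be computed along these lines. This is a sketch under one assumption: it uses the conventional 1.5 × IQR fences, which the page does not state explicitly. `iqr_outlier_mask` is a hypothetical helper and the toy series is illustrative, not taken from the dataset.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional multiplier and is an assumption here.
    NaN values compare as False, so they are never flagged.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy series with one obvious high value (illustrative only)
toy = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0], name="crim")
mask = iqr_outlier_mask(toy)

count = int(mask.sum())
share = 100 * count / toy.notna().sum()
print(f"{toy.name}: {count} outliers ({share:.2f}%)")
```

Applied to the real data, looping over `df.columns` and calling `iqr_outlier_mask(df[col])` would give a per-feature count to compare against the figures above.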

Key Statistics

# Example: Loading the dataset
import pandas as pd
import kagglehub

# Download dataset
path = kagglehub.dataset_download("arunjangir245/boston-housing-dataset")
df = pd.read_csv(f"{path}/BostonHousing.csv")

print(f"Dataset shape: {df.shape}")
print("\nTarget statistics:")
print(df['medv'].describe())
Historical Context: This dataset represents Boston housing data from the 1970s and contains features that reflect the socioeconomic patterns of that era.

Next Steps

Feature Analysis

Explore feature correlations and their impact on house prices

Evaluation Metrics

Learn about the metrics used to evaluate model performance
