Feature Overview
The Boston Housing dataset contains 13 features that capture different aspects of property characteristics and neighborhood quality. Understanding how these features correlate with house prices is crucial for building effective prediction models.
Feature Descriptions
Property Features
- rm: Average number of rooms
- age: Proportion built before 1940
- chas: Near Charles River
Location Features
- crim: Crime rate per capita
- zn: Residential land zoning
- dis: Distance to employment centers
- rad: Highway accessibility index
Environmental Features
- nox: Nitrogen oxide pollution
- indus: Industrial area proportion
Economic Features
- tax: Property tax rate
- ptratio: Student-teacher ratio
- lstat: % lower status population
- b: Demographic index
Correlation Analysis
Correlation analysis reveals which features have the strongest relationship with house prices (medv).
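The coefficients quoted in this section are presumably Pearson correlation coefficients; that is an assumption, since the page does not name the method. For a feature x and the target y = medv observed over n rows, with sample means x̄ and ȳ, the coefficient would be:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value always lies between -1 and 1. The sign gives the direction of the linear relationship and the magnitude gives its strength, which is why features are ranked below by absolute value.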
Strong Positive Correlations
Features associated with higher house prices:
rm (0.696) - Average Number of Rooms
Strongest positive correlation with price.
Why it matters: Size is one of the strongest drivers of housing price in any market.
- More rooms → Larger homes → Higher value
- Indicates space, comfort, and higher socioeconomic status
- Used as the single feature in univariate linear regression
zn (0.360) - Residential Land Proportion
- Higher proportion of large residential lots (>25,000 sq.ft)
- Indicates suburban or affluent areas with more space
- Provides privacy, greenery, and lower congestion
b (0.333) - Demographic Index
- Historically linked to racial and socioeconomic demographics
- Reflects structural socioeconomic patterns in Boston
- Captures historical housing segregation effects
This feature reflects socioeconomic patterns from the 1970s dataset era rather than physical housing characteristics.
dis (0.250) - Distance to Employment
- Slight suburban premium effect
- Some buyers prefer quieter neighborhoods further from urban core
- Balances accessibility with residential quality
Strong Negative Correlations
Features associated with lower house prices:
lstat (-0.738) - % Lower Status Population
Strongest negative correlation with price.
Insight: lstat, a proxy for neighborhood income level, has the largest absolute correlation with house price in this dataset.
- Indicates economic disadvantage in the area
- Lower purchasing power reduces property demand
- Often correlated with crime, school quality, and infrastructure
ptratio (-0.508) - Pupil-Teacher Ratio
- Higher ratio suggests overcrowded schools
- Lower school quality reduces demand from families
- School performance heavily influences housing markets
indus (-0.484) - Industrial Area Proportion
- More industry means noise, pollution, and traffic
- Lower aesthetic and residential appeal
- Reduces quality of life
tax (-0.469) - Property Tax Rate
- Higher taxes increase cost of ownership
- Reduces affordability and buyer demand
- Ongoing financial burden
nox (-0.427) - Air Pollution
- Poor air quality (nitrogen oxides) decreases livability
- Health concerns reduce neighborhood desirability
crim (-0.388) - Crime Rate
- Higher crime lowers perceived safety
- Reduces demand and increases risk premium
Correlation Matrix Visualization
Feature Selection Strategy
Different models in this project use different feature selection approaches:
1. Univariate Regression
Single Feature: rm (average rooms)
- Correlation: 0.696
- Performance: R² = 0.458 (test)
- Use case: Demonstrates simple linear relationship
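As a rough illustration of the univariate case, the sketch below fits a line with the closed-form least-squares solution and scores it with R². The helper names `fit_line` and `r_squared` and the toy numbers are assumptions made for this example; they are not the project's code or data.

```python
def fit_line(x, y):
    """Closed-form least-squares fit of y ≈ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only; not taken from the dataset.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [12.0, 17.0, 21.0, 28.0, 32.0]

slope, intercept = fit_line(rooms, price)
predicted = [slope * r + intercept for r in rooms]
score = r_squared(price, predicted)
```

The sketch scores the fit on the same rows it was trained on. The R² reported above is a test-set figure, so it would be computed on held-out rows instead.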
2. Feature Selection (Top 6)
Selected Features: Top 6 features by absolute correlation
lstat (-0.738), rm (0.696), ptratio (-0.508), indus (-0.484), tax (-0.469), nox (-0.427)
- Performance: R² = 0.651 (test)
- Use case: Balance between simplicity and performance
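The top-6 selection can be sketched as a sort on absolute correlation. The dictionary below holds only the ten coefficients quoted on this page; age, chas, and rad are left out because no values are given for them here. The function name `top_k_by_abs_correlation` is a placeholder for this example.

```python
# Correlations with medv as quoted in this section.
correlations = {
    "rm": 0.696, "zn": 0.360, "b": 0.333, "dis": 0.250,
    "lstat": -0.738, "ptratio": -0.508, "indus": -0.484,
    "tax": -0.469, "nox": -0.427, "crim": -0.388,
}


def top_k_by_abs_correlation(corr, k):
    """Return the k feature names with the largest |correlation|."""
    ranked = sorted(corr, key=lambda name: abs(corr[name]), reverse=True)
    return ranked[:k]


selected = top_k_by_abs_correlation(correlations, 6)
print(selected)  # ['lstat', 'rm', 'ptratio', 'indus', 'tax', 'nox']
```

Ranking by absolute value matters here: a plain descending sort would push lstat, the feature with the largest magnitude, to the bottom.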
3. Multivariate Regression
All Features: All 13 features included
- Performance: R² = 0.710 (test) - Best model
- Use case: Maximum information utilization
- Trade-off: Slightly higher complexity
Feature Importance Insights
The Big Five Drivers
Housing prices in this dataset are primarily driven by:
- Socioeconomic Status (lstat, b) - Income and demographics
- Property Size (rm, zn) - Space and land
- Education Quality (ptratio) - School performance
- Environmental Quality (nox, indus) - Pollution and industry
- Safety & Costs (crim, tax) - Crime and taxes
Core Pricing Formula
The dataset reflects classic urban economic patterns where income, safety, and livability dominate valuation.
Feature Engineering Opportunities
Polynomial Features
Create higher-order terms (e.g., rm² for a non-linear room effect)
Used in: Polynomial Regression (degree 2 & 3)
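A minimal sketch of the idea for a single feature: expand each value into its powers up to the chosen degree. The helper `polynomial_terms` and the sample values are hypothetical.

```python
def polynomial_terms(values, degree):
    """Return one row [x, x**2, ..., x**degree] per input value."""
    return [[x ** p for p in range(1, degree + 1)] for x in values]


# Toy rm values for illustration only.
rm_values = [5.0, 6.0, 7.0]
rm_degree_3 = polynomial_terms(rm_values, 3)
print(rm_degree_3)  # [[5.0, 25.0, 125.0], [6.0, 36.0, 216.0], [7.0, 49.0, 343.0]]
```

A full polynomial expansion over several features would also include cross-products between features, so the column count grows quickly with the degree.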
Feature Scaling
Normalize features for gradient descent algorithms
Used in: SGDRegressor models
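One common way to do this is z-score standardization, sketched below. Whether the project uses this scaler or another is not stated here, so treat the choice as an assumption; `standardize` and the sample values are illustrative.

```python
import math


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((v - mean) ** 2 for v in column) / n
    std = math.sqrt(variance)
    if std == 0.0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


# Toy tax values for illustration only.
tax_values = [200.0, 300.0, 400.0, 700.0]
scaled = standardize(tax_values)
```

Gradient descent benefits because features on very different scales, such as tax and nox, otherwise produce gradients of very different magnitudes. In practice the mean and standard deviation would be computed on the training split only and reused for the test split.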
Interaction Terms
Combine features (e.g., rm × lstat for size-income interaction)
Potential improvement opportunity
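As a sketch, an interaction term is just the product of two existing columns added as a new column. The row layout (a list of dicts) and the name `add_interaction` are assumptions for this example.

```python
def add_interaction(rows, first, second):
    """Return copies of the rows with an extra `first_x_second` product column."""
    name = f"{first}_x_{second}"
    return [{**row, name: row[first] * row[second]} for row in rows]


# Toy rows for illustration only.
rows = [{"rm": 6.0, "lstat": 5.0}, {"rm": 5.0, "lstat": 20.0}]
with_interaction = add_interaction(rows, "rm", "lstat")
print(with_interaction[0]["rm_x_lstat"])  # 30.0
```

The product lets a linear model fit a room effect that varies with lstat, instead of one fixed slope for rm.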
Log Transforms
Apply log to skewed features like crim, zn
Reduces impact of outliers
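A minimal sketch of the transform, using log(1 + x) rather than a plain log so that zero values stay defined. The choice of `log1p` and the sample values are assumptions made for illustration.

```python
import math


def log1p_transform(column):
    """Apply log(1 + x) element-wise; defined for any x > -1, including 0."""
    return [math.log1p(v) for v in column]


# Toy crim values for illustration only.
crim_values = [0.0, 0.5, 9.0, 99.0]
transformed = log1p_transform(crim_values)
```

The transform preserves ordering while compressing the long right tail: the toy values span 0 to 99 before the transform and roughly 0 to 4.6 after it.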
Model-Specific Feature Usage
| Model | Features Used | Reasoning |
|---|---|---|
| Linear (Univariate) | rm only | Demonstrate the strongest positively correlated feature |
| Linear (Feature Selection) | Top 6 by correlation | Balance simplicity and performance |
| Linear (Multivariate) | All 13 features | Maximum information - best performance |
| Polynomial Regression | All 13 + interactions | Capture non-linear relationships |
| Gradient Descent | All 13 (scaled) | Iterative optimization approach |
Correlation Heatmap Code
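The project's own heatmap code is not reproduced on this page. The following is a hedged sketch of how such a plot could be produced, assuming Pearson correlation and matplotlib. The function names and toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return (names, matrix) where matrix[i][j] = r(names[i], names[j])."""
    names = list(columns)
    matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    return names, matrix


def plot_heatmap(names, matrix):
    """Draw the matrix as a heatmap; needs matplotlib installed."""
    # Imported here so the helpers above work without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 5))
    image = ax.imshow(matrix, cmap="coolwarm", vmin=-1.0, vmax=1.0)
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=90)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    fig.colorbar(image, ax=ax, label="Pearson r")
    fig.tight_layout()
    return fig


# Toy columns for illustration only; not taken from the dataset.
toy_columns = {
    "rm": [5.0, 6.0, 7.0, 8.0],
    "lstat": [20.0, 12.0, 9.0, 4.0],
    "medv": [15.0, 21.0, 27.0, 36.0],
}
names, matrix = correlation_matrix(toy_columns)
```

Calling `plot_heatmap(names, matrix)` and then `fig.savefig(...)` or `plt.show()` renders the figure. Fixing the color scale to [-1, 1] keeps zero correlation at the midpoint of the diverging colormap.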
Next Steps
Dataset Overview
Learn about the dataset structure and statistics
Evaluation Metrics
Understand how model performance is measured