Feature Analysis

Feature Overview

The Boston Housing dataset contains 13 features that capture different aspects of property characteristics and neighborhood quality. Understanding how these features correlate with house prices is crucial for building effective prediction models.

Feature Descriptions

Property Features

rm: Average number of rooms per dwelling
age: Proportion of owner-occupied units built before 1940
chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)

Location Features

crim: Per-capita crime rate by town
zn: Proportion of residential land zoned for lots over 25,000 sq.ft
dis: Weighted distance to five Boston employment centers
rad: Index of accessibility to radial highways

Environmental Features

nox: Nitric oxide concentration (parts per 10 million)
indus: Proportion of non-retail business acres per town

Economic Features

tax: Full-value property tax rate per $10,000
ptratio: Pupil-teacher ratio by town
lstat: % lower status population
b: Demographic index

Correlation Analysis

Correlation analysis reveals which features have the strongest relationship with house prices (medv).

Strong Positive Correlations

Features associated with higher house prices:

rm (0.696) - Average Number of Rooms

Strongest positive correlation with price

More rooms → Larger homes → Higher value
Indicates space, comfort, and higher socioeconomic status
Used as the single feature in univariate linear regression

# Correlation: 0.696
# Interpretation: One additional room is associated with a ~$9,100 higher price
# (this figure is a regression slope, not the correlation coefficient itself)

Why it matters: Size is typically one of the strongest drivers of housing price.
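The correlation (0.696) and the price change per room (~$9,100) are two different quantities. A minimal sketch of how they relate in a single-feature linear fit follows; the helper name `pearson_and_slope` and the toy values are illustrative assumptions, not part of the project code or the dataset.

```python
import numpy as np

def pearson_and_slope(x, y):
    """Return (Pearson r, OLS slope of y on x) for two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    # In simple linear regression: slope = r * std(y) / std(x)
    slope = r * y.std(ddof=1) / x.std(ddof=1)
    return r, slope

# Toy values for illustration only (NOT taken from the dataset)
rooms = [5.0, 6.0, 6.5, 7.0, 8.0]
price = [15.0, 21.0, 24.0, 30.0, 42.0]  # in $1000s, the same unit as medv

r, slope = pearson_and_slope(rooms, price)
print(f"r = {r:.3f}, slope = {slope:.2f} ($1000s per extra room)")
```

The correlation is unitless and bounded by 1 in magnitude, while the slope carries the units of the target per unit of the feature.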

zn (0.360) - Residential Land Proportion

Higher proportion of large residential lots (>25,000 sq.ft)
Indicates suburban or affluent areas with more space
Provides privacy, greenery, and lower congestion

Interpretation: Land size creates a location-based premium

b (0.333) - Demographic Index

Historically linked to racial and socioeconomic demographics
Reflects structural socioeconomic patterns in Boston
Captures historical housing segregation effects

This feature reflects socioeconomic patterns from the 1970s dataset era rather than physical housing characteristics.

dis (0.250) - Distance to Employment

Slight suburban premium effect
Some buyers prefer quieter neighborhoods further from urban core
Balances accessibility with residential quality

Strong Negative Correlations

Features associated with lower house prices:

lstat (-0.738) - % Lower Status Population

Strongest negative correlation with price

Indicates economic disadvantage in the area
Lower purchasing power reduces property demand
Often correlated with crime, school quality, and infrastructure

# Correlation: -0.738
# Interpretation: Strong predictor - income level dominates valuation

Insight: lstat, a proxy for income level, has the strongest correlation with house price of any single feature in this dataset.

ptratio (-0.508) - Pupil-Teacher Ratio

Higher ratio suggests overcrowded schools
Lower school quality reduces demand from families
School performance heavily influences housing markets

Interpretation: Education quality drives neighborhood desirability

indus (-0.484) - Industrial Area Proportion

More industry means noise, pollution, and traffic
Lower aesthetic and residential appeal
Reduces quality of life

Interpretation: Environmental quality significantly impacts price

tax (-0.469) - Property Tax Rate

Higher taxes increase cost of ownership
Reduces affordability and buyer demand
Ongoing financial burden

Interpretation: Ownership cost directly affects valuation

nox (-0.427) - Air Pollution

Poor air quality (nitrogen oxides) decreases livability
Health concerns reduce neighborhood desirability

Interpretation: Clean environment carries a price premium

crim (-0.388) - Crime Rate

Higher crime lowers perceived safety
Reduces demand and increases risk premium

Interpretation: Safety is a major component of housing value

Correlation Matrix Visualization

import pandas as pd

# Correlations of every feature with the target (medv), as reported above
correlations = {
    'rm': 0.696,        # Positive: More rooms
    'zn': 0.360,        # Positive: Large lots
    'b': 0.333,         # Positive: Demographics
    'dis': 0.250,       # Positive: Distance to work
    'chas': 0.175,      # Positive: River proximity
    'age': -0.377,      # Negative: Older homes
    'rad': -0.382,      # Negative: Highway access
    'crim': -0.388,     # Negative: Crime
    'nox': -0.427,      # Negative: Pollution
    'tax': -0.469,      # Negative: Property tax
    'indus': -0.484,    # Negative: Industrial area
    'ptratio': -0.508,  # Negative: School quality
    'lstat': -0.738     # Negative: Low income %
}

# Display from most positive to most negative
corr_series = pd.Series(correlations, name='corr_with_medv')
print(corr_series.sort_values(ascending=False))

Feature Selection Strategy

Different models in this project use different feature selection approaches:

1. Univariate Regression

Single Feature: rm (average rooms)

Correlation: 0.696
Performance: R² = 0.458 (test)
Use case: Demonstrates simple linear relationship

2. Feature Selection (Top 6)

Selected Features: Top 6 features by absolute correlation

lstat (-0.738)
rm (0.696)
ptratio (-0.508)
indus (-0.484)
tax (-0.469)
nox (-0.427)

Performance: R² = 0.651 (test)
Use case: Balance between simplicity and performance
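The top-6 list above can be reproduced by ranking features on the absolute value of their correlation with medv. The sketch below uses the correlation values reported in this document; the helper name `top_k_by_abs_corr` is an illustrative assumption, not a function from the project.

```python
# Correlations with medv, as listed in this document
correlations = {
    'rm': 0.696, 'zn': 0.360, 'b': 0.333, 'dis': 0.250, 'chas': 0.175,
    'age': -0.377, 'rad': -0.382, 'crim': -0.388, 'nox': -0.427,
    'tax': -0.469, 'indus': -0.484, 'ptratio': -0.508, 'lstat': -0.738,
}

def top_k_by_abs_corr(corrs, k):
    """Return the k feature names with the largest |correlation|."""
    return sorted(corrs, key=lambda name: abs(corrs[name]), reverse=True)[:k]

top6 = top_k_by_abs_corr(correlations, 6)
print(top6)  # ['lstat', 'rm', 'ptratio', 'indus', 'tax', 'nox']
```

Ranking by absolute value matters here: sorting on the signed value would push lstat, the strongest single predictor, to the bottom of the list.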

3. Multivariate Regression

All Features: All 13 features included

Performance: R² = 0.710 (test) - Best model
Use case: Maximum information utilization
Trade-off: Slightly higher complexity
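Comparing the three strategies amounts to fitting the same linear model on different column subsets and scoring each on held-out data. The following is a rough sketch using plain NumPy least squares; the function `fit_and_score` and the synthetic data are assumptions for illustration and do not reproduce the R² values quoted above.

```python
import numpy as np

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares with an intercept; return test R²."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_test - A_test @ coef
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (NOT the Boston values): 3 features, 200 rows
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# Same model, two feature subsets: first column only vs. all columns
r2_one = fit_and_score(X_train[:, :1], y_train, X_test[:, :1], y_test)
r2_all = fit_and_score(X_train, y_train, X_test, y_test)
print(f"one feature: R² = {r2_one:.3f}, all features: R² = {r2_all:.3f}")
```

Scoring on the test split rather than the training split keeps the comparison fair, since adding features can never lower training R² for least squares.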

Feature Importance Insights

The Big Five Drivers

Housing prices in this dataset are primarily driven by:

Socioeconomic Status (lstat, b) - Income and demographics
Property Size (rm, zn) - Space and land
Education Quality (ptratio) - School performance
Environmental Quality (nox, indus) - Pollution and industry
Safety & Costs (crim, tax) - Crime and taxes

Core Pricing Formula

House Price ≈ 
  Space + Income Level + School Quality 
  + Environmental Quality + Safety 
  - Cost Burden

The dataset reflects classic urban economic patterns where income, safety, and livability dominate valuation.

Feature Engineering Opportunities

Polynomial Features

Create polynomial terms (e.g., rm² for a non-linear room effect)
Used in: Polynomial Regression (degree 2 & 3)
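As a minimal sketch of a degree-2 term, the snippet below appends the square of one column to a feature array. The helper `add_squared_term` and the toy rows are illustrative assumptions; the project's polynomial models may build these terms differently.

```python
import numpy as np

def add_squared_term(X, col):
    """Append the square of column `col` to a 2-D feature array."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([X, X[:, col] ** 2])

# Toy rows for illustration: columns are [rm, lstat]
X = np.array([[6.0, 5.0], [7.0, 4.0], [5.5, 12.0]])
X_poly = add_squared_term(X, col=0)  # adds an rm² column
print(X_poly)
```

A linear model fitted on the extended array can then bend with rm while remaining linear in its coefficients.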

Feature Scaling

Normalize features for gradient descent algorithms
Used in: SGDRegressor models
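One common way to normalize is standardization: subtract the mean and divide by the standard deviation, using statistics from the training split only. The sketch below assumes that approach; the function name `standardize` and the toy values are hypothetical, and the project may use a library scaler instead.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale both splits using the training mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (X_train - mean) / std, (X_test - mean) / std

# Toy rows for illustration: columns are [tax, nox], on very different scales
X_train = np.array([[300.0, 0.45], [400.0, 0.55], [650.0, 0.70]])
X_test = np.array([[350.0, 0.50]])

X_train_s, X_test_s = standardize(X_train, X_test)
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```

Without scaling, a feature such as tax (values in the hundreds) would dominate the gradient relative to nox (values below 1), which tends to slow or destabilize convergence.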

Interaction Terms

Combine features (e.g., rm × lstat for size-income interaction)
Potential improvement opportunity
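An interaction term is simply the element-wise product of two columns. A small sketch with pandas follows; the column name `rm_x_lstat` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only (NOT taken from the dataset)
df = pd.DataFrame({
    'rm': [6.0, 7.0, 5.5],
    'lstat': [5.0, 4.0, 12.0],
})

# Product of the two features; lets the effect of rm vary with lstat
df['rm_x_lstat'] = df['rm'] * df['lstat']
print(df)
```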

Log Transforms

Apply log to skewed features like crim, zn
Reduces impact of outliers
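Because zn contains zeros, a plain logarithm would produce negative infinity for those rows. The sketch below assumes `log1p` (the log of 1 + x) as the transform; the new column names and toy values are illustrative, not from the project.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only (NOT taken from the dataset)
df = pd.DataFrame({
    'crim': [0.01, 0.5, 9.0, 80.0],
    'zn': [0.0, 0.0, 25.0, 90.0],
})

# log1p(x) = log(1 + x): defined at zero and compresses large values
for col in ['crim', 'zn']:
    df[f'log_{col}'] = np.log1p(df[col])

print(df)
```

After the transform, the largest crim value sits much closer to the rest of the column, so a single extreme tract has less leverage on the fit.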

Model-Specific Feature Usage

Model	Features Used	Reasoning
Linear (Univariate)	rm only	Demonstrate single strongest feature
Linear (Feature Selection)	Top 6 by correlation	Balance simplicity and performance
Linear (Multivariate)	All 13 features	Maximum information - best performance
Polynomial Regression	All 13 + interactions	Capture non-linear relationships
Gradient Descent	All 13 (scaled)	Iterative optimization approach

Correlation Heatmap Code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('BostonHousing.csv')

# Calculate correlation matrix
corr_matrix = df.corr()

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Boston Housing - Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Show top correlations with target
target_corr = corr_matrix['medv'].drop('medv').sort_values(ascending=False)
print("\nTop correlations with medv:")
print(target_corr)

Next Steps

Dataset Overview

Evaluation Metrics
