Overview

Linear regression is the foundation of this project, implemented in three variants to demonstrate different modeling strategies. The multivariate model achieves strong baseline performance with a Test R² of 0.710 and RMSE of 4.650, ranking 4th among all 9 models tested.
Performance: Linear Regression (Multivariate) provides a solid interpretable baseline, though Decision Tree (R² = 0.850) and Neural Network (R² = 0.806) models achieve superior accuracy.

Three Variants

Univariate Linear Regression

Uses only the rm (average number of rooms) feature, which has the strongest correlation with house prices.
Performance:
  • Train R²: 0.489
  • Test R²: 0.458
  • Test RMSE: 6.355
  • CV R² (mean±std): 0.452 ± 0.177
This model suffers from underfitting due to using only one feature. It’s too simple to capture the complexity of house pricing.
Code Implementation:
from sklearn.linear_model import LinearRegression

# Select only 'rm' feature
X_train_uni = X_train[['rm']]
X_test_uni = X_test[['rm']]

# Train univariate model
lr_uni = LinearRegression()
lr_uni.fit(X_train_uni, y_train)
When to use: Educational purposes to understand single-feature relationships. Not recommended for production.

Performance Comparison
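The comparison below reuses the univariate model above together with two models, lr_multi and lr_fs, whose training is not shown in this section. A minimal sketch of how they might be fit, assuming the same train/test split as above (the synthetic data and the three-feature subset here are stand-ins for illustration, not the project's actual selection):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's 13-feature housing data;
# in the real notebook X_train/y_train come from the earlier split.
rng = np.random.default_rng(0)
cols = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age',
        'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']
X = pd.DataFrame(rng.normal(size=(200, 13)), columns=cols)
y = 5 * X['rm'] - 2 * X['lstat'] + rng.normal(size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Multivariate: fit on all 13 features
lr_multi = LinearRegression().fit(X_train, y_train)

# Feature selection: fit on a reduced subset (hypothetical choice here;
# the project's actual selection method is described elsewhere)
selected = ['rm', 'lstat', 'ptratio']
X_train_fs, X_test_fs = X_train[selected], X_test[selected]
lr_fs = LinearRegression().fit(X_train_fs, y_train)
```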

from sklearn.metrics import r2_score

# Univariate
y_pred_uni = lr_uni.predict(X_test_uni)
print(f"Univariate - R²: {r2_score(y_test, y_pred_uni):.3f}")
# Output: Univariate - R²: 0.458

# Multivariate (BEST)
y_pred_multi = lr_multi.predict(X_test)
print(f"Multivariate - R²: {r2_score(y_test, y_pred_multi):.3f}")
# Output: Multivariate - R²: 0.710

# Feature Selection
y_pred_fs = lr_fs.predict(X_test_fs)
print(f"Feature Selection - R²: {r2_score(y_test, y_pred_fs):.3f}")
# Output: Feature Selection - R²: 0.651

Key Insights

Why Multivariate Wins

  1. More information: Uses all 13 features to capture complex relationships
  2. Balanced fit: Train-test R² gap of only 0.033 indicates minimal overfitting
  3. Stable performance: Low cross-validation standard deviation (0.092)
  4. Practical accuracy: RMSE of 4.65 means predictions are typically within $4,650 of actual prices
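The stability claim above (a low cross-validation standard deviation) can be checked directly with cross_val_score; a minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the project this would be the full X, y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = X @ rng.normal(size=13) + rng.normal(size=200)

# 5-fold cross-validated R²: the mean gauges accuracy, the std gauges stability
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
```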

Trade-offs

Variant           | Pros                  | Cons
Univariate        | Simple, interpretable | Underfits, poor accuracy
Multivariate      | Best accuracy, stable | Uses all features
Feature Selection | Reduced complexity    | Slightly lower accuracy

Mathematical Foundation

Linear regression models the relationship as:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Where:
  • y = predicted house price (medv)
  • β₀ = intercept
  • β₁...βₙ = coefficients for each feature
  • x₁...xₙ = feature values
  • ε = error term
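In scikit-learn terms, β₀ is the fitted model's intercept_ and β₁...βₙ are its coef_, so a prediction is just the dot product of coefficients and features plus the intercept. A minimal check on toy data (the values below are illustrative, not from the housing dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated exactly by y = 3 + 2·x₁ − 1·x₂ (no noise)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

lr = LinearRegression().fit(X, y)

# Manual prediction: ŷ = β₀ + β₁x₁ + β₂x₂
x_new = np.array([2.0, 1.0])
manual = lr.intercept_ + lr.coef_ @ x_new
assert np.isclose(manual, lr.predict(x_new.reshape(1, -1))[0])
print(f"β₀={lr.intercept_:.1f}, β={lr.coef_.round(1)}, prediction={manual:.1f}")
```

Because the toy data is exactly linear, OLS recovers the coefficients exactly and the manual dot product matches model.predict.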

Next Steps

Polynomial Regression

Explore non-linear relationships using polynomial features
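A minimal sketch of the idea, assuming the usual scikit-learn pipeline style: PolynomialFeatures expands the inputs (e.g. adding x²), and a plain linear fit on the expanded features captures the curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic data that a straight line cannot fit
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(f"R² on quadratic data: {poly_model.score(X, y):.3f}")
```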

Gradient Descent

Learn about iterative optimization with SGDRegressor
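SGDRegressor fits the same linear model by iterative gradient updates rather than a closed-form solve. A minimal sketch on synthetic stand-in data (scaling is included because SGD is scale-sensitive):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the project this would be the housing features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = X @ rng.normal(size=13) + rng.normal(size=200)

# Standardize, then fit by stochastic gradient descent
sgd = make_pipeline(StandardScaler(),
                    SGDRegressor(max_iter=1000, tol=1e-3, random_state=42))
sgd.fit(X, y)
print(f"SGD R²: {sgd.score(X, y):.3f}")
```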
