Get started with house price prediction using machine learning on the Boston Housing dataset. This guide will help you run your first analysis and train your first models in under 5 minutes.

Prerequisites

Before you begin, ensure you have:
  • Python 3.12 or higher installed
  • Basic familiarity with Jupyter notebooks
  • Internet connection (for downloading the dataset)

Installation

1. Clone or download the project

Navigate to your project directory:
cd ~/workspace/source
2. Install dependencies

uv sync
uv is faster and more reliable than pip for dependency management, and it automatically creates and manages virtual environments.
3. Verify installation

Check that all packages are installed correctly:
python -c "import pandas, numpy, sklearn, kagglehub; print('All dependencies installed!')"

Running Your First Analysis

1. Data Exploration

Start Jupyter and open the analysis notebook:
jupyter notebook notebooks/analyze.ipynb
The dataset (Boston Housing) will be automatically downloaded from Kaggle on first run via kagglehub. No manual download needed!
What you’ll discover:
  • 506 samples with 13 features and 1 target variable
  • Target: medv (median home value in $1000s)
  • rm: average number of rooms (strongest positive correlation: 0.696)
  • lstat: % lower-status population (strongest negative correlation: -0.738)
  • ptratio: pupil-teacher ratio (-0.508 correlation)
  • nox: nitric oxides concentration (-0.427 correlation)
  • 5 missing values in the rm column (0.98%, automatically handled)
  • Outliers detected in several columns (handled during training)
  • No categorical variables - all features are numeric
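As a rough sketch of that cleaning step, assuming the missing rm values are filled with the column median (the notebook's actual strategy may differ):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Boston data: rm has missing entries.
df = pd.DataFrame({
    "rm": [6.5, np.nan, 5.9, np.nan, 7.1],
    "medv": [24.0, 21.6, 34.7, 33.4, 36.2],
})

print("Missing before:", df["rm"].isna().sum())
df["rm"] = df["rm"].fillna(df["rm"].median())
print("Missing after:", df["rm"].isna().sum())
```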
Expected output from analyze.ipynb:
Path to dataset files: ~/.cache/kagglehub/datasets/arunjangir245/boston-housing-dataset/versions/2

Dataset shape: (506, 14)
Number of rows: 506
Number of columns: 14

Missing values after cleaning: 0
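The correlation figures above come from something like df.corr() on the real dataset. A self-contained sketch on synthetic stand-in data (the numbers it prints are illustrative, not the Boston values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.3, 0.7, n)       # rooms: pushes price up
lstat = rng.uniform(2, 35, n)      # % lower status: pushes price down
medv = 5 * rm - 0.5 * lstat + rng.normal(0, 2, n)
df = pd.DataFrame({"rm": rm, "lstat": lstat, "medv": medv})

# Correlation of every feature with the target, strongest first
corr = df.corr()["medv"].drop("medv").sort_values(ascending=False)
print(corr)
```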

2. Model Training

Open the training notebook to build and evaluate multiple regression models:
jupyter notebook notebooks/train.ipynb
The training pipeline runs 9 different models:
1. Linear Regression Models

  • Univariate: Using only rm (rooms) feature
  • Multivariate: Using all 13 features
  • Feature Selection: Top 6 correlated features
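A sketch of the feature-selection variant, assuming "top 6 correlated features" means ranking by absolute correlation with the target (synthetic data and generic column names, not the actual Boston features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 13)),
                 columns=[f"f{i}" for i in range(13)])
# Only f0 and f1 actually drive the target here
y = 3 * X["f0"] - 2 * X["f1"] + rng.normal(0, 0.5, 300)

# Rank features by |correlation| with the target and keep the top 6
corrs = X.apply(lambda col: np.corrcoef(col, y)[0, 1]).abs()
top6 = corrs.sort_values(ascending=False).head(6).index.tolist()

model = LinearRegression().fit(X[top6], y)
print("Selected:", top6)
```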
2. Polynomial Regression

  • Degree 2 polynomial features
  • Degree 3 polynomial features
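The polynomial models can be reproduced with scikit-learn's PolynomialFeatures in a pipeline; a minimal degree-2 sketch on synthetic data (the notebook's exact pipeline may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 2))
# A genuinely quadratic target: x0^2 plus an interaction term
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 200)

poly2 = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly2.fit(X, y)
print("Degree-2 training R²:", round(poly2.score(X, y), 3))
```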
3. Advanced Models

  • SGD Regressor: Gradient descent with constant and adaptive learning rates
  • Decision Tree: With max depth of 5
  • Neural Network: MLP with (100, 50) hidden layers
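A sketch of how the three advanced models might be configured with the hyperparameters listed above (synthetic data; the scaling steps and other settings are assumptions, not taken from train.ipynb):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 13))
y = X @ rng.normal(size=13) + rng.normal(0, 0.5, 300)

models = {
    "SGD (adaptive)": make_pipeline(
        StandardScaler(),
        SGDRegressor(learning_rate="adaptive", eta0=0.01,
                     max_iter=2000, random_state=0)),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(100, 50),
                     max_iter=500, random_state=0)),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: training R² = {model.score(X, y):.3f}")
```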
4. Evaluation & Comparison

All models evaluated with:
  • R² (coefficient of determination)
  • RMSE (root mean squared error)
  • MAE (mean absolute error)
  • 5-fold cross-validation
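These metrics map directly onto scikit-learn functions; a self-contained sketch on synthetic data, using the same 70/30 split implied by the 354/152 sample counts shown below:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("R²:  ", round(r2_score(y_test, pred), 4))
print("RMSE:", round(np.sqrt(mean_squared_error(y_test, pred)), 4))
print("MAE: ", round(mean_absolute_error(y_test, pred), 4))

cv = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
print(f"CV (mean±std): {cv.mean():.4f} ± {cv.std():.4f}")
```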
Expected training output:
Setup complete!
Path to dataset files: ~/.cache/kagglehub/datasets/...

Missing values after cleaning: 0
Training set: 354 samples
Test set: 152 samples
Data saved to results/

==================================================
Model: Linear Regression (Multivariate)
==================================================
Train MSE: 22.5704 | Test MSE: 21.6188
Train RMSE: 4.7508 | Test RMSE: 4.6496
Train MAE: 3.3590 | Test MAE: 3.1761
Train R²: 0.7432 | Test R²: 0.7099
CV (mean±std): 0.6880 ± 0.0923
 Good fit

 Model and predictions saved!
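The save step at the end of each model cell is consistent with the .joblib files described under "Understanding the Results"; a minimal save-and-reload round trip (the filename here is an assumption):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 13))
y = X.sum(axis=1)
model = LinearRegression().fit(X, y)

# The notebook writes into results/; here we use the working directory
joblib.dump(model, "linear_multivariate.joblib")
reloaded = joblib.load("linear_multivariate.joblib")
print("Round trip OK:", np.allclose(model.predict(X), reloaded.predict(X)))
```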

3. Comparing Models

After training completes, the notebook automatically generates a comparison table:
Model Performance Ranking (by Test):
                           model_name  train_r2  test_r2  test_rmse  cv_r2_mean
             Decision Tree Regression  0.9277   0.8495   3.3487     0.7239
            Neural Network Regression  0.8456   0.8058   3.8044     0.7853
     Linear Regression (Multivariate)  0.7432   0.7100   4.6496     0.6880
                       SGD (adaptive)  0.7421   0.7102   4.6466     0.6901
     Polynomial Regression (degree=3)  0.5491   0.5825   5.5773     0.4908

🏆 Best Model: Decision Tree Regression
   Test R²: 0.8495
While Decision Tree shows the highest test R², it may be overfitting (train R² = 0.93 vs test R² = 0.85). Linear Regression (Multivariate) offers better generalization.

Understanding the Results

All trained models and outputs are saved in the results/ directory:
results/
├── *.joblib                # Trained models (loadable with joblib)
├── *.csv                   # Train/test splits
├── *.json                  # Evaluation metrics
├── *.npy                   # Model predictions
└── model_comparison.csv    # Full comparison table

Load and Use a Trained Model

import joblib
import numpy as np

# Load a trained model (here the multivariate linear regression)
model = joblib.load('results/linear_multivariate.joblib')

# Make predictions on new data; features must follow the training order:
# crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
sample_house = np.array([[0.00632, 18.0, 2.31, 0, 0.538,
                          6.575, 65.2, 4.09, 1, 296,
                          15.3, 396.90, 4.98]])
predicted_price = model.predict(sample_house)
print(f"Predicted price: ${predicted_price[0] * 1000:.2f}")

Next Steps

  • Installation Guide: detailed setup instructions and troubleshooting
  • Understanding Models: learn how each algorithm works
  • Feature Engineering: improve model performance with better features
  • Model Evaluation: understand R², RMSE, and cross-validation

Common Issues

Dataset download fails

The project uses kagglehub to automatically download the Boston Housing dataset. If the download fails:
  1. Check your internet connection
  2. Ensure kagglehub is installed: pip install kagglehub
  3. The dataset is cached in ~/.cache/kagglehub/ after the first download

Jupyter can't find the Python kernel

Register the project's virtual environment as a Jupyter kernel:
python -m ipykernel install --user --name=.venv
Then select the .venv kernel in Jupyter.

Neural Network training is slow

The Neural Network model can be slow to train. To speed it up:
  • Reduce max_iter in the MLPRegressor (line 723 in train.ipynb)
  • Skip the neural network cell if you want quick results
With the defaults, the entire training process should complete in under 2 minutes.

What You’ve Learned

  • ✅ Installed project dependencies using UV or pip
  • ✅ Explored the Boston Housing dataset with 506 samples and 13 features
  • ✅ Trained 9 different regression models
  • ✅ Evaluated models using R², RMSE, and cross-validation
  • ✅ Identified the best model (Decision Tree with test R² = 0.85)
  • ✅ Saved and loaded trained models for future use
Ready to dive deeper? Check out the Installation Guide for advanced setup options, or explore Linear Regression to understand the model with the best generalization.
