Get started with house price prediction using machine learning on the Boston Housing dataset. This guide will help you run your first analysis and train your first models in under 5 minutes.

Prerequisites

Before you begin, ensure you have:
  • Python 3.12 or higher installed
  • Basic familiarity with Jupyter notebooks
  • Internet connection (for downloading the dataset)

Installation

1. Clone or download the project

Navigate to your project directory:
cd ~/workspace/source
2. Install dependencies

uv sync
uv is faster and more reliable than pip for dependency management, and it automatically creates and manages virtual environments.
3. Verify installation

Check that all packages are installed correctly:
python -c "import pandas, numpy, sklearn, kagglehub; print('All dependencies installed!')"

Running Your First Analysis

1. Data Exploration

Start Jupyter and open the analysis notebook:
jupyter notebook notebooks/analyze.ipynb
The dataset (Boston Housing) will be automatically downloaded from Kaggle on first run via kagglehub. No manual download needed!
What you’ll discover:
  • 506 samples with 13 features and 1 target variable
  • Target: medv (median home value in $1000s)
  • rm: average number of rooms (strongest positive correlation: 0.696)
  • lstat: % lower-status population (strongest negative correlation: -0.738)
  • ptratio: pupil-teacher ratio (-0.508 correlation)
  • nox: nitric oxides concentration (-0.427 correlation)
  • 5 missing values in the rm column (0.98%, automatically handled)
  • Outliers detected in several columns (handled during training)
  • No categorical variables - all features are numeric
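As a rough sketch of that cleaning step, assuming the missing rm values are filled with the column median (the notebook's actual strategy may differ):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Boston data: rm has missing entries.
df = pd.DataFrame({
    "rm": [6.5, np.nan, 5.9, np.nan, 7.1],
    "medv": [24.0, 21.6, 34.7, 33.4, 36.2],
})

print("Missing before:", df["rm"].isna().sum())
df["rm"] = df["rm"].fillna(df["rm"].median())
print("Missing after:", df["rm"].isna().sum())
```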
Expected output from analyze.ipynb:
Path to dataset files: ~/.cache/kagglehub/datasets/arunjangir245/boston-housing-dataset/versions/2

Dataset shape: (506, 14)
Number of rows: 506
Number of columns: 14

Missing values after cleaning: 0
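The correlation figures above come from something like df.corr() on the real dataset. A self-contained sketch on synthetic stand-in data (the numbers it prints are illustrative, not the Boston values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
rm = rng.normal(6.3, 0.7, n)       # rooms: pushes price up
lstat = rng.uniform(2, 35, n)      # % lower status: pushes price down
medv = 5 * rm - 0.5 * lstat + rng.normal(0, 2, n)
df = pd.DataFrame({"rm": rm, "lstat": lstat, "medv": medv})

# Correlation of every feature with the target, strongest first
corr = df.corr()["medv"].drop("medv").sort_values(ascending=False)
print(corr)
```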

2. Model Training

Open the training notebook to build and evaluate multiple regression models:
jupyter notebook notebooks/train.ipynb
The training pipeline runs 9 different models:
1. Linear Regression Models

  • Univariate: Using only rm (rooms) feature
  • Multivariate: Using all 13 features
  • Feature Selection: Top 6 correlated features
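A sketch of the feature-selection variant, assuming "top 6 correlated features" means ranking by absolute correlation with the target (synthetic data and generic column names, not the actual Boston features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 13)),
                 columns=[f"f{i}" for i in range(13)])
# Only f0 and f1 actually drive the target here
y = 3 * X["f0"] - 2 * X["f1"] + rng.normal(0, 0.5, 300)

# Rank features by |correlation| with the target and keep the top 6
corrs = X.apply(lambda col: np.corrcoef(col, y)[0, 1]).abs()
top6 = corrs.sort_values(ascending=False).head(6).index.tolist()

model = LinearRegression().fit(X[top6], y)
print("Selected:", top6)
```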
2. Polynomial Regression

  • Degree 2 polynomial features
  • Degree 3 polynomial features
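The polynomial models can be reproduced with scikit-learn's PolynomialFeatures in a pipeline; a minimal degree-2 sketch on synthetic data (the notebook's exact pipeline may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 2))
# A genuinely quadratic target: x0^2 plus an interaction term
y = X[:, 0] ** 2 + X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 200)

poly2 = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly2.fit(X, y)
print("Degree-2 training R²:", round(poly2.score(X, y), 3))
```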
3. Advanced Models

  • SGD Regressor: Gradient descent with constant and adaptive learning rates
  • Decision Tree: With max depth of 5
  • Neural Network: MLP with (100, 50) hidden layers
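A sketch of how the three advanced models might be configured with the hyperparameters listed above (synthetic data; the scaling steps and other settings are assumptions, not taken from train.ipynb):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 13))
y = X @ rng.normal(size=13) + rng.normal(0, 0.5, 300)

models = {
    "SGD (adaptive)": make_pipeline(
        StandardScaler(),
        SGDRegressor(learning_rate="adaptive", eta0=0.01,
                     max_iter=2000, random_state=0)),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(100, 50),
                     max_iter=500, random_state=0)),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: training R² = {model.score(X, y):.3f}")
```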
4. Evaluation & Comparison

All models evaluated with:
  • R² (coefficient of determination)
  • RMSE (root mean squared error)
  • MAE (mean absolute error)
  • 5-fold cross-validation
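These metrics map directly onto scikit-learn functions; a self-contained sketch on synthetic data, using the same 70/30 split implied by the 354/152 sample counts shown below:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("R²:  ", round(r2_score(y_test, pred), 4))
print("RMSE:", round(np.sqrt(mean_squared_error(y_test, pred)), 4))
print("MAE: ", round(mean_absolute_error(y_test, pred), 4))

cv = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
print(f"CV (mean±std): {cv.mean():.4f} ± {cv.std():.4f}")
```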
Expected training output:
Setup complete!
Path to dataset files: ~/.cache/kagglehub/datasets/...

Missing values after cleaning: 0
Training set: 354 samples
Test set: 152 samples
Data saved to results/

==================================================
Model: Linear Regression (Multivariate)
==================================================
Train MSE: 22.5704 | Test MSE: 21.6188
Train RMSE: 4.7508 | Test RMSE: 4.6496
Train MAE: 3.3590 | Test MAE: 3.1761
Train R²: 0.7432 | Test R²: 0.7099
CV (mean±std): 0.6880 ± 0.0923
 Good fit

 Model and predictions saved!
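The save step at the end of each model cell is consistent with the .joblib files described under "Understanding the Results"; a minimal save-and-reload round trip (the filename here is an assumption):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 13))
y = X.sum(axis=1)
model = LinearRegression().fit(X, y)

# The notebook writes into results/; here we use the working directory
joblib.dump(model, "linear_multivariate.joblib")
reloaded = joblib.load("linear_multivariate.joblib")
print("Round trip OK:", np.allclose(model.predict(X), reloaded.predict(X)))
```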

3. Comparing Models

After training completes, the notebook automatically generates a comparison table:
Model Performance Ranking (by Test):
                           model_name  train_r2  test_r2  test_rmse  cv_r2_mean
             Decision Tree Regression  0.9277   0.8495   3.3487     0.7239
            Neural Network Regression  0.8456   0.8058   3.8044     0.7853
     Linear Regression (Multivariate)  0.7432   0.7100   4.6496     0.6880
                       SGD (adaptive)  0.7421   0.7102   4.6466     0.6901
     Polynomial Regression (degree=3)  0.5491   0.5825   5.5773     0.4908

🏆 Best Model: Decision Tree Regression
   Test R²: 0.8495
While Decision Tree shows the highest test R², it may be overfitting (train R² = 0.93 vs test R² = 0.85). Linear Regression (Multivariate) offers better generalization.

Understanding the Results

All trained models and outputs are saved in the results/ directory:
results/
├── *.joblib                # Trained models (loadable with joblib)
├── *.csv                   # Train/test splits
├── *.json                  # Evaluation metrics
├── *.npy                   # Model predictions
└── model_comparison.csv    # Full comparison table

Load and Use a Trained Model

import joblib
import numpy as np

# Load a trained model (here the multivariate linear regression)
model = joblib.load('results/linear_multivariate.joblib')

# Make predictions on new data; features must follow the training order:
# crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat
sample_house = np.array([[0.00632, 18.0, 2.31, 0, 0.538,
                          6.575, 65.2, 4.09, 1, 296,
                          15.3, 396.90, 4.98]])
predicted_price = model.predict(sample_house)
print(f"Predicted price: ${predicted_price[0] * 1000:.2f}")

Next Steps

  • Installation Guide: detailed setup instructions and troubleshooting
  • Understanding Models: learn how each algorithm works
  • Feature Engineering: improve model performance with better features
  • Model Evaluation: understand R², RMSE, and cross-validation

Common Issues

Dataset download fails

The project uses kagglehub to automatically download the Boston Housing dataset. If the download fails:
  1. Check your internet connection
  2. Ensure kagglehub is installed: pip install kagglehub
  3. The dataset is cached in ~/.cache/kagglehub/ after the first download

Jupyter can't find the Python kernel

Register the project's virtual environment as a Jupyter kernel:
python -m ipykernel install --user --name=.venv
Then select the .venv kernel in Jupyter.

Neural Network training is slow

The Neural Network model can be slow to train. To speed it up:
  • Reduce max_iter in the MLPRegressor (line 723 in train.ipynb)
  • Skip the neural network cell if you want quick results
With the defaults, the entire training process should complete in under 2 minutes.

What You’ve Learned

  • ✅ Installed project dependencies using UV or pip
  • ✅ Explored the Boston Housing dataset with 506 samples and 13 features
  • ✅ Trained 9 different regression models
  • ✅ Evaluated models using R², RMSE, and cross-validation
  • ✅ Identified the best model (Decision Tree with test R² = 0.85)
  • ✅ Saved and loaded trained models for future use
Ready to dive deeper? Check out the Installation Guide for advanced setup options, or explore Linear Regression to understand the model with the best generalization.
