Prerequisites
Before you begin, ensure you have:
- Python 3.12 or higher installed
- Basic familiarity with Jupyter notebooks
- Internet connection (for downloading the dataset)
Installation
Running Your First Analysis
1. Data Exploration
Start Jupyter and open the analysis notebook.
The Boston Housing dataset is downloaded automatically from Kaggle on first run via kagglehub, so no manual download is needed.
Dataset Overview
- 506 samples with 13 features and 1 target variable
- Target: medv (median home value in $1000s)
- 5 missing values in the rm column (automatically handled)
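The page does not say how the missing rm values are handled. One common option is median imputation, and the sketch below assumes that choice. The frame is a small made-up stand-in for the real dataset; only the column names rm and medv come from the text above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: the column names match the dataset,
# the values are invented for illustration.
df = pd.DataFrame({
    "rm": [6.5, np.nan, 5.9, 7.1, np.nan, 6.2],
    "medv": [24.0, 21.6, 18.9, 34.7, 20.1, 22.8],
})

# Count missing values per column before changing anything.
missing_before = df.isna().sum()
print(missing_before)

# Assumption: fill gaps with the column median, which is less sensitive
# to outliers than the mean.
rm_median = df["rm"].median()
df["rm"] = df["rm"].fillna(rm_median)

missing_after = df.isna().sum()
print(missing_after)
```

The same two checks (df.shape and df.isna().sum()) are a quick way to confirm the 506-row, 14-column layout once the real data is loaded.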
Key Features
- rm: Average number of rooms (strongest positive correlation: 0.696)
- lstat: % lower status population (strongest negative correlation: -0.738)
- ptratio: Pupil-teacher ratio (-0.508 correlation)
- nox: Nitric oxides concentration (-0.427 correlation)
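Correlations like the ones listed above are typically Pearson coefficients between each feature and the target. The following is a minimal sketch of that calculation on synthetic data. The generated values are assumptions, so the printed numbers will not match the figures quoted for the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: column names mirror the dataset, values do not.
rm = rng.normal(6.3, 0.7, n)
lstat = rng.normal(12.0, 7.0, n)
medv = 9.0 * rm - 0.9 * lstat + rng.normal(0.0, 3.0, n)
df = pd.DataFrame({"rm": rm, "lstat": lstat, "medv": medv})

# Pearson correlation of every feature with the target.
corr = df.corr()["medv"].drop("medv")

# Order by absolute strength so strong negative features are not buried.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked.round(3))
```

Sorting by absolute value matters here because a strongly negative feature such as lstat is just as informative as a strongly positive one.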
Data Quality
- Only 5 missing values (about 0.99% of rows, all in the rm column)
- Outliers detected in several columns (handled during training)
- No categorical variables - all numeric features
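The page does not state which rule the notebook uses to detect outliers. A widely used convention is the 1.5 × IQR rule, sketched below as an assumption. The helper name iqr_outlier_mask and the sample values are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    # nanpercentile ignores missing entries when computing the quartiles.
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up column with one obviously extreme value at the end.
sample = [5.9, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 12.0]
mask = iqr_outlier_mask(sample)
print(int(mask.sum()), "outlier(s) flagged")
```

Applied per column, a mask like this gives a count of flagged rows without removing anything, which leaves the decision about how to handle them to the training step.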
2. Model Training
Open the training notebook to build and evaluate multiple regression models.
Linear Regression Models
- Univariate: Using only the rm (rooms) feature
- Multivariate: Using all 13 features
- Feature Selection: Top 6 correlated features
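The three linear setups above differ only in which columns are passed to the model. The sketch below illustrates that idea with scikit-learn on synthetic data. The column names f0 to f12, the split ratio, and the correlation-based ranking are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n, n_features = 300, 13

# Synthetic stand-in for the 13-feature dataset; "f0" plays the role of rm.
X = pd.DataFrame(rng.normal(size=(n, n_features)),
                 columns=[f"f{i}" for i in range(n_features)])
weights = np.linspace(3.0, 0.1, n_features)  # earlier features matter more
y = pd.Series(X.to_numpy() @ weights + rng.normal(0.0, 1.0, n), name="medv")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Rank features by absolute correlation with the target (training data only,
# so the test split does not leak into feature selection).
top6 = (X_train.corrwith(y_train).abs()
        .sort_values(ascending=False).head(6).index.tolist())

feature_sets = {
    "univariate": ["f0"],
    "multivariate": list(X.columns),
    "top6": top6,
}

scores = {}
for name, cols in feature_sets.items():
    model = LinearRegression().fit(X_train[cols], y_train)
    scores[name] = r2_score(y_test, model.predict(X_test[cols]))
    print(f"{name:>12}: R2 = {scores[name]:.3f}")
```

Keeping the feature sets in one dictionary makes it easy to add or drop a variant without duplicating the fit-and-score loop.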
Advanced Models
- SGD Regressor: Gradient descent with constant and adaptive learning rates
- Decision Tree: With max depth of 5
- Neural Network: MLP with (100, 50) hidden layers
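The settings listed above map onto scikit-learn estimators roughly as follows. This is a sketch under assumptions: the data is synthetic, and the scaling step, eta0, max_iter, and random_state values are illustrative choices that the page does not specify.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 0.5, 200)

models = {
    # Constant learning rate: the step size stays at eta0 throughout.
    "sgd_constant": make_pipeline(
        StandardScaler(),
        SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0),
    ),
    # Adaptive learning rate: eta0 is reduced when the loss stops improving.
    "sgd_adaptive": make_pipeline(
        StandardScaler(),
        SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0),
    ),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    # Two hidden layers with 100 and 50 units.
    "mlp": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=500, random_state=0),
    ),
}

train_scores = {}
for name, model in models.items():
    model.fit(X, y)
    train_scores[name] = model.score(X, y)  # R2 on the training data
    print(f"{name:>14}: {train_scores[name]:.3f}")
```

Gradient-based models (SGD and the MLP) are wrapped in a scaling pipeline here because they tend to be sensitive to feature scale. The tree does not need it.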
3. Comparing Models
After training completes, the notebook automatically generates a comparison table.
Understanding the Results
All trained models and outputs are saved in the results/ directory.
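A comparison table of this kind usually has one row per model with R², RMSE, and a cross-validated score. The sketch below shows one way to assemble such a table. The column names, the 5-fold setting, and the synthetic data are assumptions, and the notebook's actual table may be laid out differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 1.0, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

models = {
    "linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=1),
}

rows = []
for name, model in models.items():
    # Cross-validated R2 on the training split only.
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows.append({
        "model": name,
        "r2": r2_score(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "cv_r2_mean": float(cv_r2.mean()),
    })

# Best model first.
table = (pd.DataFrame(rows)
         .sort_values("r2", ascending=False)
         .reset_index(drop=True))
print(table.round(3))
```

Higher R² and lower RMSE are better. A large gap between the test R² and the cross-validated mean is a hint that a model's score depends heavily on the particular split.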
Load and Use a Trained Model
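The page does not show which serialization format the notebook uses. The sketch below assumes joblib, which is commonly used with scikit-learn estimators. The file name decision_tree.joblib is hypothetical, and a temporary directory stands in for the project's results/ directory so the example does not write into a real project.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Train a small model on synthetic data so there is something to save.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 100)
model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical path: swap in the real file under results/ in the project.
    model_path = Path(tmp) / "results" / "decision_tree.joblib"
    model_path.parent.mkdir(parents=True, exist_ok=True)

    joblib.dump(model, model_path)    # save
    loaded = joblib.load(model_path)  # load

# New samples must have the same 13 features, in the same order, as training.
new_sample = X[:1]
prediction = loaded.predict(new_sample)
print(prediction)
```

Only load model files from sources you trust, since joblib and pickle files can execute code when loaded.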
Next Steps
Installation Guide
Detailed setup instructions and troubleshooting
Understanding Models
Learn how each algorithm works
Feature Engineering
Improve model performance with better features
Model Evaluation
Understand R², RMSE, and cross-validation
Common Issues
Dataset download fails
The project uses kagglehub to automatically download the Boston Housing dataset. If the download fails:
- Check your internet connection
- Ensure kagglehub is installed: pip install kagglehub
- The dataset will be cached in ~/.cache/kagglehub/ after first download
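Two of the checks above can be scripted with the standard library alone. The helper below is a hypothetical diagnostic and not part of the project: it reports whether kagglehub is importable in the current environment and whether the cache directory mentioned above exists yet.

```python
import importlib.util
from pathlib import Path

def kagglehub_status(cache_dir=None):
    """Summarize whether kagglehub is importable and whether its cache exists."""
    # Default to the cache location named in the troubleshooting notes.
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "kagglehub"
    return {
        "installed": importlib.util.find_spec("kagglehub") is not None,
        "cache_dir": cache_dir,
        "cache_exists": cache_dir.is_dir(),
    }

status = kagglehub_status()
for key, value in status.items():
    print(f"{key}: {value}")
```

If installed is False, the notebook is probably running in a different environment from the one where the dependencies were installed.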
Jupyter kernel not found
If Jupyter can’t find the Python kernel, register the project's virtual environment as a Jupyter kernel, then select the .venv kernel in Jupyter.
Models take too long to train
The Neural Network model can be slow. To speed up training:
- Reduce max_iter in the MLPRegressor (line 723 in train.ipynb)
- Skip the neural network cell if you want quick results
- The entire training process should complete in under 2 minutes
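As a rough illustration of the first suggestion, the sketch below caps max_iter at a small value. The value 50 and the use of early_stopping are assumptions and are not settings taken from train.ipynb. The data is synthetic.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.5, 200)

fast_mlp = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    max_iter=50,          # hard cap on training epochs
    early_stopping=True,  # stop sooner if the validation score stalls
    random_state=0,
)

# A low max_iter can trigger a ConvergenceWarning; silence it for this demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    fast_mlp.fit(X, y)

print("epochs used:", fast_mlp.n_iter_)
```

A lower cap trades accuracy for speed, so it is worth comparing the resulting score against the full run before relying on the faster model.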
What You’ve Learned
- ✅ Installed project dependencies using uv or pip
- ✅ Explored the Boston Housing dataset with 506 samples and 13 features
- ✅ Trained 9 different regression models
- ✅ Evaluated models using R², RMSE, and cross-validation
- ✅ Identified the best model (Decision Tree with R² = 0.85)
- ✅ Saved and loaded trained models for future use