Model Training Workflow
This guide provides a complete walkthrough of the model training process using `train.ipynb`. You'll learn how to train 9 different regression models, from simple linear regression to neural networks.
Overview
The training notebook implements:

- 70/30 train-test split for model evaluation
- 5-fold cross-validation to assess model stability
- Multiple model types: Linear, Polynomial, Gradient Descent, Decision Tree, Neural Network
- Comprehensive metrics: MSE, RMSE, MAE, R², Cross-validation scores
Setup and Data Preparation
All split data is automatically saved to the `results/` directory: `X_train.csv`, `X_test.csv`, `y_train.csv`, `y_test.csv`.
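The split step can be sketched as follows. This is a minimal illustration with synthetic stand-in data, assuming the notebook uses scikit-learn's `train_test_split`; the actual column names and random seed may differ.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 13-feature housing data
X_arr, y_arr = make_regression(n_samples=100, n_features=13, random_state=0)
X = pd.DataFrame(X_arr)
y = pd.Series(y_arr, name="target")

# 70/30 split; a fixed random_state pins the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The notebook then persists each piece, e.g.:
# X_train.to_csv("results/X_train.csv", index=False)
```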
Models Trained
1. Univariate Linear Regression
Trains on only the strongest feature (`rm`, number of rooms). Outputs:

- `results/linear_univariate.joblib`
- `results/pred_linear_univariate.npy`
- `results/metrics_linear_univariate.json`
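A minimal sketch of the univariate model, fitting on a single feature column. The data here is synthetic and the save calls (commented) mirror the file names listed above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the rm feature and the price target
rng = np.random.default_rng(0)
rm = rng.uniform(4, 9, size=(200, 1))                 # 2-D: (n_samples, 1)
price = 9.1 * rm[:, 0] - 34 + rng.normal(0, 2, 200)

model = LinearRegression().fit(rm, price)
pred = model.predict(rm)

# Persisting model and predictions, as in the file list above:
# joblib.dump(model, "results/linear_univariate.joblib")
# np.save("results/pred_linear_univariate.npy", pred)
```

Note that scikit-learn expects a 2-D feature array even for a single feature, hence the `(n_samples, 1)` shape.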
2. Multivariate Linear Regression
Uses all 13 features for prediction. Outputs:

- `results/linear_multivariate.joblib`
- `results/pred_linear_multivariate.npy`
- `results/metrics_linear_multivariate.json`
3. Feature Selection Model
Selects the top 6 features by correlation with the target. Outputs:

- `results/linear_feature_selection.joblib`
- `results/pred_linear_feature_selection.npy`
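One common way to pick the top correlated features is to rank columns by absolute correlation with the target. This is a sketch with synthetic data, not the notebook's exact code:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 13)),
                  columns=[f"f{i}" for i in range(13)])
# Target driven mostly by f0, f1, f2
target = df["f0"] * 2 + df["f1"] - df["f2"] + rng.normal(0, 0.1, 100)

# Rank features by |correlation| with the target, keep the top 6
corr = df.corrwith(target).abs().sort_values(ascending=False)
top6 = corr.head(6).index.tolist()
```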
4. Polynomial Regression (Degree 2 & 3)
Creates polynomial features from `rm` (rooms). Outputs:

- `results/polynomial_degree2.joblib`
- `results/polynomial_transformer_degree2.joblib`
- `results/pred_polynomial_degree2.npy`
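The degree-2 case can be sketched like this (synthetic data; the notebook's exact parameters may differ). The key point is that the fitted transformer is saved alongside the model, because new inputs must pass through the same transform before prediction:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rm = np.linspace(4, 9, 50).reshape(-1, 1)
y = 2 * rm[:, 0] ** 2 - 5 * rm[:, 0] + 3   # an exact quadratic

poly = PolynomialFeatures(degree=2, include_bias=False)
rm_poly = poly.fit_transform(rm)           # columns: rm, rm^2
model = LinearRegression().fit(rm_poly, y)

# Both artifacts are saved, matching the file list above:
# joblib.dump(model, "results/polynomial_degree2.joblib")
# joblib.dump(poly, "results/polynomial_transformer_degree2.joblib")
```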
5. Stochastic Gradient Descent (SGD)
Implements gradient descent optimization with two learning rate strategies: constant and adaptive. The adaptive learning rate performs slightly better, automatically adjusting the step size during optimization. Outputs:

- `results/SGD_constant.joblib`
- `results/SGD_adaptive.joblib`
- `results/scaler.joblib` (important for prediction!)
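A sketch of the adaptive-rate variant with synthetic data. SGD is sensitive to feature scale, which is why the fitted `StandardScaler` is saved as an artifact too:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 13)) * 100       # large, unscaled features hurt SGD
y = X[:, 0] * 0.5 + rng.normal(0, 1, 300)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# "adaptive": keeps eta0 while the loss improves, shrinks it when it stalls
sgd = SGDRegressor(learning_rate="adaptive", eta0=0.01,
                   max_iter=1000, random_state=0)
sgd.fit(X_scaled, y)

# The scaler must be saved with the model; predictions on raw,
# unscaled inputs would be meaningless:
# joblib.dump(scaler, "results/scaler.joblib")
```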
6. Decision Tree Regression
- `results/decision_tree.joblib`
- `results/pred_decision_tree.npy`
7. Neural Network (MLP Regressor)
- `results/neural_network.joblib`
- `results/pred_neural_network.npy`
Results Directory Structure
After training completes, your `results/` directory contains:
File formats:
- `.joblib`: Serialized scikit-learn models (use `joblib.load()` to restore)
- `.npy`: NumPy arrays with predictions (use `np.load()`)
- `.json`: Metrics in JSON format
- `.csv`: Data tables
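A round-trip sketch of how these formats are written and read back (using a temporary directory here instead of `results/`):

```python
import json
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[0.0], [1.0]], [0.0, 2.0])
tmp = tempfile.mkdtemp()

# Write each artifact type
joblib.dump(model, os.path.join(tmp, "model.joblib"))
np.save(os.path.join(tmp, "pred.npy"), np.array([1.5]))
with open(os.path.join(tmp, "metrics.json"), "w") as f:
    json.dump({"r2": 1.0}, f)

# Restore them
restored = joblib.load(os.path.join(tmp, "model.joblib"))
preds = np.load(os.path.join(tmp, "pred.npy"))
with open(os.path.join(tmp, "metrics.json")) as f:
    metrics = json.load(f)
```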
Understanding the Output Metrics
Each model is evaluated with multiple metrics:
R² (R-squared) Score
Measures how well the model explains variance in the target variable.
- Range: -∞ to 1.0 (higher is better)
- 1.0 = Perfect predictions
- 0.0 = Model performs like predicting the mean
- Negative = Model performs worse than baseline
RMSE (Root Mean Squared Error)
Average prediction error in the same units as the target (thousands of dollars).
- Lower is better
- RMSE of 4.65 means predictions are off by ~$4,650 on average
- More sensitive to large errors than MAE
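These metrics can all be computed with scikit-learn. A small worked example with hypothetical true/predicted values (not results from the notebook):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([24.0, 21.6, 34.7, 33.4])   # prices in $1000s
y_pred = np.array([25.0, 20.0, 33.0, 35.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                            # back in $1000s
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```

RMSE squares errors before averaging, so a single large miss inflates RMSE more than MAE, which is the sensitivity difference noted above.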
Cross-Validation (CV) Score
Average performance across 5 different train/test splits.
- More reliable than single test set score
- Standard deviation shows model stability
- High std (>0.15) indicates inconsistent performance
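The 5-fold CV score and its standard deviation can be obtained with `cross_val_score`. A sketch with synthetic data, assuming R² scoring as in the notebook's metrics:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 13))
y = X @ rng.normal(size=13) + rng.normal(0, 0.1, 100)

# One R² score per fold; mean summarizes performance, std shows stability
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
mean, std = scores.mean(), scores.std()
```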
Next Steps
After training all models:

Compare Models
Run the comparison script to identify the best model
Make Predictions
Use trained models for new predictions
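Prediction with the SGD model illustrates why `scaler.joblib` matters: new inputs must be scaled with the *fitted* scaler before calling `predict`. A sketch (the `joblib.load` lines, commented, show the pattern for the saved artifacts; the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# In practice the artifacts are restored from disk:
# scaler = joblib.load("results/scaler.joblib")
# model = joblib.load("results/SGD_adaptive.joblib")
rng = np.random.default_rng(4)
X_train = rng.normal(size=(200, 13)) * 10
y_train = X_train[:, 0]
scaler = StandardScaler().fit(X_train)
model = SGDRegressor(max_iter=1000, random_state=0).fit(
    scaler.transform(X_train), y_train)

new_house = rng.normal(size=(1, 13)) * 10
price = model.predict(scaler.transform(new_house))   # scale FIRST, then predict
```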
Troubleshooting
ConvergenceWarning from SGD or Neural Network
This means the model didn't fully converge. Solutions:

- Increase the `max_iter` parameter (e.g., from 1000 to 5000)
- Adjust the learning rate (`eta0`)
- Ensure features are properly scaled
High train/test R² gap (overfitting)
If train R² >> test R², the model is overfitting:

- For Decision Trees: Reduce `max_depth` or increase `min_samples_split`
- For Neural Networks: Increase the `alpha` regularization term or reduce layer sizes
- For Polynomial: Use a lower degree
Results directory not created
Ensure you have write permissions in the working directory so the notebook can create `results/`.
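A quick check can be sketched as follows: create the directory if it is missing, then probe it with a throwaway file (the probe filename is arbitrary):

```python
from pathlib import Path

results = Path("results")
results.mkdir(exist_ok=True)       # no-op if the directory already exists

probe = results / ".write_test"
probe.touch()                      # raises PermissionError if not writable
probe.unlink()                     # clean up the probe file
writable = True
```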