Overview
This project uses 5 key metrics to evaluate regression model performance. Each metric provides different insights into prediction accuracy and model quality. All metrics are calculated on both the training and test sets to detect overfitting or underfitting.
Regression Metrics
1. Mean Squared Error (MSE)
What is MSE?
MSE measures the average squared difference between predicted and actual values.
Formula: MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Implementation: eval_model_src.py:18
Characteristics:
- Squares errors → penalizes large errors heavily
- Always positive (0 = perfect prediction)
- Unit: squared units (e.g., $1000²)
- Lower is better
When to use MSE
- When you want to penalize large errors more than small ones
- When outliers should have significant impact on model evaluation
- When working with normally distributed residuals
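The averaging in the definition above can be sketched in a few lines of NumPy (the values here are illustrative, not from the project; scikit-learn's `mean_squared_error` produces the same result):

```python
import numpy as np

# Hypothetical target values and predictions, in $1000s
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.0, 36.0, 30.0])

# Average of the squared errors
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 4.2025 (in squared units, $1000²)
```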
2. Root Mean Squared Error (RMSE)
What is RMSE?
RMSE is the square root of MSE, bringing the metric back to the original units.
Formula: RMSE = √MSE = √[(1/n) Σᵢ (yᵢ − ŷᵢ)²]
Implementation: eval_model_src.py:19-20
Characteristics:
- Same unit as target variable ($1000s)
- Easier to interpret than MSE
- Still penalizes large errors
- Lower is better
When to use RMSE
- When you need interpretable error magnitude in original units
- When comparing models on the same dataset
- Standard metric for regression competitions
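Taking the square root returns the error to the target's original units, which is what makes RMSE interpretable. A quick numeric sketch (illustrative values, not from the project):

```python
import numpy as np

# Hypothetical target values and predictions, in $1000s
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.0, 36.0, 30.0])

mse = np.mean((y_true - y_pred) ** 2)  # squared units
rmse = np.sqrt(mse)                    # back in $1000s
print(rmse)  # 2.05 → "typical error of about $2,050"
```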
3. Mean Absolute Error (MAE)
What is MAE?
MAE measures the average absolute difference between predictions and actuals.
Formula: MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
Implementation: eval_model_src.py:21-22
Characteristics:
- Uses absolute values (no squaring)
- Same unit as target variable
- Does not penalize large errors as heavily as MSE/RMSE
- More robust to outliers
- Lower is better
MSE vs MAE: When to use each
| Metric | Use When | Sensitivity to Outliers |
|---|---|---|
| MAE | Outliers should not dominate | Low - treats all errors equally |
| MSE/RMSE | Large errors are critical | High - squares amplify large errors |
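The difference in outlier sensitivity is easy to demonstrate numerically. In this sketch (hypothetical values), one prediction is off by 9 while the rest are off by 1: MAE grows linearly with the outlier, while RMSE is pulled up much harder because of the squaring:

```python
import numpy as np

y_true = np.array([10.0, 10.0, 10.0, 10.0])
clean = np.array([11.0, 9.0, 11.0, 9.0])     # every error is 1
outlier = np.array([11.0, 9.0, 11.0, 19.0])  # one large error of 9

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

print(mae(y_true, clean), rmse(y_true, clean))      # 1.0 1.0
print(mae(y_true, outlier), rmse(y_true, outlier))  # 3.0 ~4.58
```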
4. R² Score (Coefficient of Determination)
What is R²?
R² measures the proportion of variance in the target variable explained by the model.
Formula: R² = 1 − SS_res / SS_tot = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)²
Implementation: eval_model_src.py:23-24
Interpretation:
- R² = 1.0: Perfect predictions (100% of variance explained)
- R² = 0.7: Model explains 70% of variance
- R² = 0.0: Model no better than predicting the mean
- R² < 0: Model worse than baseline (predicting mean)
- Higher is better
Understanding R² Values
| R² Range | Interpretation | Variance Explained |
|---|---|---|
| 0.9 - 1.0 | Excellent fit | 90-100% variance explained |
| 0.7 - 0.9 | Good fit | 70-90% variance explained |
| 0.5 - 0.7 | Moderate fit | 50-70% variance explained |
| 0.3 - 0.5 | Weak fit | 30-50% variance explained |
| < 0.3 | Poor fit | Less than 30% variance explained |
Why R² is the Primary Metric
- Scale-independent: Works across different datasets
- Intuitive: "% of variance explained"
- Comparative: Easy to compare models
- Standard: Widely used in statistics and ML
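The definition, including the negative case, takes only a few lines. This sketch uses illustrative values: a perfect model scores 1.0, predicting the mean scores 0.0, and a model worse than the mean baseline goes negative:

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)            # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1 - ss_res / ss_tot

print(r2(y_true, y_true))                            # 1.0  (perfect)
print(r2(y_true, np.full(4, np.mean(y_true))))       # 0.0  (mean baseline)
print(r2(y_true, np.array([40.0, 30.0, 20.0, 10.0])))  # -3.0 (worse than mean)
```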
5. Cross-Validation (5-Fold CV)
What is Cross-Validation?
Cross-validation evaluates model performance across multiple train-test splits to ensure robustness.
5-Fold CV Process:
1. Split training data into 5 equal parts (folds)
2. Train on 4 folds, test on the remaining fold
3. Repeat 5 times (each fold used as the test set once)
4. Calculate the mean and standard deviation of the scores
Output Format (from eval_model_src.py:27-29): 0.688 ± 0.092
- Mean R² across 5 folds: 0.688
- Standard deviation: 0.092 (indicates consistency)
Why Use Cross-Validation?
Benefits:
- Detects overfitting: a CV score well below the train score signals overfitting
- Measures stability: Low std → consistent performance
- Better estimate: More reliable than single train-test split
- Reduces variance: Averages over multiple splits
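The 5-fold loop above can be sketched without any framework. Here, plain NumPy least squares stands in for the project's model, on synthetic data (the data and resulting scores are illustrative, not the project's results); the final line prints in the same `mean ± std` format the project uses:

```python
import numpy as np

# Synthetic regression data: 100 samples, 3 features, small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Manual 5-fold cross-validation
k = 5
folds = np.array_split(np.arange(len(y)), k)
scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit ordinary least squares on the 4 training folds
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    scores.append(r2(y[test_idx], X[test_idx] @ coef))

print(f"{np.mean(scores):.3f} ± {np.std(scores):.3f}")
```

scikit-learn's `cross_val_score(model, X, y, cv=5, scoring="r2")` wraps this same loop in one call.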
Model Evaluation Function
Here’s the complete evaluation function used in this project (see eval_model_src.py).
Detecting Overfitting and Underfitting
The evaluation function includes automatic detection (from eval_model_src.py:44-51):
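A minimal sketch of what that detection flow can look like, using the thresholds described in this section; the actual code in eval_model_src.py may differ in names and details:

```python
import numpy as np

def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Score a fitted model on train and test sets and flag fit problems.

    Sketch only: thresholds (0.1 gap, 0.5 floor) follow this page's rules.
    """
    def r2(y, yhat):
        ss_res = np.sum((y - yhat) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

    train_r2 = r2(y_train, model.predict(X_train))
    test_r2 = r2(y_test, model.predict(X_test))

    if train_r2 - test_r2 > 0.1:
        status = "overfitting"          # memorizes train, fails to generalize
    elif train_r2 < 0.5 and test_r2 < 0.5:
        status = "underfitting"         # too simple to capture the patterns
    else:
        status = "good fit"
    return train_r2, test_r2, status
```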
Overfitting Detection
Condition: train_r2 - test_r2 > 0.1
Interpretation:
- Model memorizes training data but fails to generalize
- Large gap between train and test performance
- Solution: regularization, a simpler model, or more data
Example (Polynomial, degree=3):
- Train R² = 0.549
- Test R² = 0.583
- Gap is small, but the high CV std (0.205) indicates instability
Underfitting Detection
Condition: train_r2 < 0.5 AND test_r2 < 0.5
Interpretation:
- Model too simple to capture the patterns
- Poor performance on both train and test
- Solution: more features, a more complex model, or feature engineering
Example (Linear Univariate):
- Train R² = 0.489
- Test R² = 0.458
- ⚠️ Underfitting: only one feature (rm) is used
Good Fit
Condition: neither overfitting nor underfitting
Characteristics:
- Train and test R² are close (gap less than 0.1)
- Both scores are reasonably high (above 0.5)
- Low CV standard deviation (less than 0.1)
Example (Linear Multivariate):
- Train R² = 0.743
- Test R² = 0.710
- Gap = 0.033 ✅
- CV R² = 0.688 ± 0.092 ✅
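The three conditions can be collapsed into one small helper. `fit_status` is a hypothetical function for illustration, not the project's code; the numbers in the usage lines come from the comparison table on this page:

```python
def fit_status(train_r2: float, test_r2: float) -> str:
    """Classify fit quality from train/test R², per this page's thresholds."""
    if train_r2 - test_r2 > 0.1:
        return "overfitting"
    if train_r2 < 0.5 and test_r2 < 0.5:
        return "underfitting"
    return "good fit"

print(fit_status(0.743, 0.710))  # Linear (Multivariate) → good fit
print(fit_status(0.489, 0.458))  # Linear (Univariate)   → underfitting
```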
Model Performance Comparison
From actual model comparison results (all 9 models trained):
| Model | Train R² | Test R² | Test RMSE | CV R² (mean±std) | Status |
|---|---|---|---|---|---|
| Decision Tree | 0.928 | 0.850 | 3.349 | 0.724 ± 0.114 | 🏆 Best Model |
| Neural Network (MLP) | 0.846 | 0.806 | 3.804 | 0.785 ± 0.067 | ✅ Excellent |
| SGD (adaptive) | 0.742 | 0.710 | 4.647 | 0.690 ± 0.095 | ✅ Good Fit |
| Linear (Multivariate) | 0.743 | 0.710 | 4.650 | 0.688 ± 0.092 | ✅ Good Fit |
| SGD (constant) | 0.735 | 0.694 | 4.775 | 0.666 ± 0.101 | ✅ Good Fit |
| Linear (Feature Selection) | 0.687 | 0.651 | 5.099 | 0.651 ± 0.090 | ✅ Moderate |
| Polynomial (degree=3) | 0.549 | 0.583 | 5.577 | 0.491 ± 0.205 | ⚠️ Unstable |
| Polynomial (degree=2) | 0.536 | 0.567 | 5.679 | 0.483 ± 0.224 | ⚠️ Unstable |
| Linear (Univariate) | 0.489 | 0.458 | 6.355 | 0.452 ± 0.177 | ⚠️ Underfitting |
Winner: Decision Tree Regression achieves the highest Test R² (0.850) and lowest RMSE (3.349), outperforming all other models including Neural Networks and Linear Regression.
Metrics Output Example
Here’s what the evaluation output looks like (from eval_model_src.py:35-42):
Quick Reference
MSE
Mean Squared Error
- Squares errors
- Penalizes outliers heavily
- Unit: squared ($1000²)
- Lower is better
RMSE
Root Mean Squared Error
- Square root of MSE
- Original units ($1000s)
- Interpretable magnitude
- Lower is better
MAE
Mean Absolute Error
- Absolute differences
- Robust to outliers
- Original units ($1000s)
- Lower is better
R²
R-Squared Score
- % variance explained
- Range: up to 1 (can be negative)
- Scale-independent
- Higher is better
CV
Cross-Validation
- 5-fold splits
- Measures stability
- Mean ± std format
- Detects overfitting
Train vs Test
Overfitting Detection
- Compare train & test R²
- Gap > 0.1 → overfitting
- Both < 0.5 → underfitting
- Small gap → good fit
Next Steps
Dataset Overview
Learn about the Boston Housing dataset structure
Feature Analysis
Understand feature correlations and importance