Overview

This project uses 5 key metrics to evaluate regression model performance. Each metric provides different insights into prediction accuracy and model quality.
All metrics are calculated on both training and test sets to detect overfitting or underfitting.

Regression Metrics

1. Mean Squared Error (MSE)

What is MSE?

MSE measures the average squared difference between predicted and actual values.

Formula:
MSE = (1/n) × Σ(y_actual - y_predicted)²
Implementation (from eval_model_src.py:18):
from sklearn.metrics import mean_squared_error

train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
Characteristics:
  • Squares errors → penalizes large errors heavily
  • Always positive (0 = perfect prediction)
  • Unit: squared units (e.g., $1000²)
  • Lower is better
When to use MSE:
  • When you want to penalize large errors more than small ones
  • When outliers should have significant impact on model evaluation
  • When working with normally distributed residuals
Example: In house price prediction, a $10,000 error counts four times as heavily as a $5,000 error, because errors are squared.
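As a quick sanity check, the formula can be verified against scikit-learn on toy numbers (the values below are illustrative, not from the project):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative actual vs. predicted values (not project data)
y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MSE by hand: mean of squared differences
mse_manual = np.mean((y_actual - y_pred) ** 2)  # (0.25 + 0 + 2.25 + 1.0) / 4 = 0.875

# sklearn computes the same quantity
mse_sklearn = mean_squared_error(y_actual, y_pred)
assert np.isclose(mse_manual, mse_sklearn)
```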

2. Root Mean Squared Error (RMSE)

What is RMSE?

RMSE is the square root of MSE, bringing the metric back to the original units.

Formula:
RMSE = √MSE = √[(1/n) × Σ(y_actual - y_predicted)²]
Implementation (from eval_model_src.py:19-20):
import numpy as np

train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
Characteristics:
  • Same unit as target variable ($1000s)
  • Easier to interpret than MSE
  • Still penalizes large errors
  • Lower is better
Example: RMSE = 4.650 means a typical prediction error of about $4,650 (target measured in $1000s).
When to use RMSE:
  • When you need interpretable error magnitude in original units
  • When comparing models on the same dataset
  • Standard metric for regression competitions
Best Model in Project: Decision Tree Regression with Test RMSE = 3.349 (a typical error of about $3,349)
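To see the unit conversion concretely, here is a minimal sketch using illustrative prices in $1000s (made-up numbers, not project data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative house prices in $1000s (not project data)
y_actual = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.0, 33.0, 36.0])

mse = mean_squared_error(y_actual, y_pred)  # in ($1000s)² — hard to interpret
rmse = np.sqrt(mse)                         # back in $1000s
dollar_error = rmse * 1000                  # typical error expressed in dollars
```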

3. Mean Absolute Error (MAE)

What is MAE?

MAE measures the average absolute difference between predictions and actuals.

Formula:
MAE = (1/n) × Σ|y_actual - y_predicted|
Implementation (from eval_model_src.py:21-22):
from sklearn.metrics import mean_absolute_error

train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
Characteristics:
  • Uses absolute values (no squaring)
  • Same unit as target variable
  • Does not penalize large errors as heavily as MSE/RMSE
  • More robust to outliers
  • Lower is better
| Metric   | Use When                     | Sensitivity to Outliers                |
|----------|------------------------------|----------------------------------------|
| MAE      | Outliers should not dominate | Low - treats all errors equally        |
| MSE/RMSE | Large errors are critical    | High - squaring amplifies large errors |
In this project: Both are used to get a complete picture of model performance.
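The difference in outlier sensitivity is easy to demonstrate. In this sketch (illustrative numbers), both prediction sets have the same total absolute error, but one concentrates it in a single outlier:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_actual = np.array([10.0, 10.0, 10.0, 10.0])

# Same total absolute error (8), distributed differently
y_even    = np.array([12.0, 8.0, 12.0, 8.0])    # four errors of 2
y_outlier = np.array([10.0, 10.0, 10.0, 18.0])  # one error of 8

# MAE treats both cases identically...
mae_even = mean_absolute_error(y_actual, y_even)     # 2.0
mae_out  = mean_absolute_error(y_actual, y_outlier)  # 2.0

# ...but MSE amplifies the single large error
mse_even = mean_squared_error(y_actual, y_even)      # 4.0
mse_out  = mean_squared_error(y_actual, y_outlier)   # 16.0
```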

4. R² Score (Coefficient of Determination)

What is R²?

R² measures the proportion of variance in the target variable explained by the model.

Formula:
R² = 1 - (SS_residual / SS_total)
   = 1 - [Σ(y_actual - y_predicted)² / Σ(y_actual - y_mean)²]
Implementation (from eval_model_src.py:23-24):
from sklearn.metrics import r2_score

train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)
Interpretation:
  • R² = 1.0: Perfect predictions (100% variance explained)
  • R² = 0.7: Model explains 70% of variance
  • R² = 0.0: Model no better than predicting the mean
  • R² < 0: Model worse than baseline (predicting mean)
  • Higher is better
| R² Range  | Interpretation | Quality                          |
|-----------|----------------|----------------------------------|
| 0.9 - 1.0 | Excellent fit  | 90-100% variance explained       |
| 0.7 - 0.9 | Good fit       | 70-90% variance explained        |
| 0.5 - 0.7 | Moderate fit   | 50-70% variance explained        |
| 0.3 - 0.5 | Weak fit       | 30-50% variance explained        |
| < 0.3     | Poor fit       | Less than 30% variance explained |
Best Model in Project: Decision Tree Regression with Test R² = 0.850 (85% of variance explained, a good fit per the table above)
Why R² is useful:
  • Scale-independent: Works across different datasets
  • Intuitive: "% of variance explained"
  • Comparative: Easy to compare models
  • Standard: Widely used in statistics and ML
In this project, models are primarily ranked by Test R².
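The formula can be reproduced directly from its definition (toy numbers, for illustration only):

```python
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([3.0, 5.0, 2.5, 7.0])
y_pred   = np.array([2.5, 5.0, 4.0, 8.0])

ss_res = np.sum((y_actual - y_pred) ** 2)           # SS_residual
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # SS_total
r2_manual = 1 - ss_res / ss_tot

# sklearn's r2_score implements the same definition
assert np.isclose(r2_manual, r2_score(y_actual, y_pred))
```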

5. Cross-Validation (5-Fold CV)

What is Cross-Validation?

Cross-validation evaluates model performance across multiple train-test splits to ensure robustness.

5-Fold CV Process:
  1. Split training data into 5 equal parts (folds)
  2. Train on 4 folds, test on 1 fold
  3. Repeat 5 times (each fold used as test once)
  4. Calculate mean and standard deviation of scores
Implementation (from eval_model_src.py:27-29):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on training set
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
cv_r2_mean = cv_scores.mean()
cv_r2_std = cv_scores.std()
Output Format: 0.688 ± 0.092
  • Mean R² across 5 folds: 0.688
  • Standard deviation: 0.092 (indicates consistency)
Benefits:
  • Detects overfitting: a CV score far below the train score indicates overfitting
  • Measures stability: Low std → consistent performance
  • Better estimate: More reliable than single train-test split
  • Reduces variance: Averages over multiple splits
Example (from project results):
Model: Linear Regression (Multivariate)
Train R²: 0.743
Test R²: 0.710
CV R²: 0.688 ± 0.092

✅ Good fit - consistent across splits
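The 4-step process above is exactly what cross_val_score automates. A sketch on synthetic data (not the project's dataset) shows that the manual fold loop and the one-liner agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative, not the project's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Manual 5-fold loop: train on 4 folds, score R² on the held-out fold
scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # .score() is R² for regressors

# cross_val_score runs the same procedure in one call
cv = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

print(f"CV R²: {cv.mean():.3f} ± {cv.std():.3f}")  # the mean ± std format used above
```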

Model Evaluation Function

Here’s the complete evaluation function used in this project:
# From eval_model_src.py
from typing import Any, Dict, Tuple
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score
import numpy as np

def evaluate_model(
    model: Any,
    X_train: Any,
    X_test: Any,
    y_train: Any,
    y_test: Any,
    model_name: str
) -> Tuple[Dict[str, Any], Any]:
    """Evaluate model and return metrics"""
    
    # Generate predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Calculate all metrics
    metrics: Dict[str, Any] = {
        'model_name': model_name,
        'train_mse': mean_squared_error(y_train, y_train_pred),
        'test_mse': mean_squared_error(y_test, y_test_pred),
        'train_rmse': np.sqrt(mean_squared_error(y_train, y_train_pred)),
        'test_rmse': np.sqrt(mean_squared_error(y_test, y_test_pred)),
        'train_mae': mean_absolute_error(y_train, y_train_pred),
        'test_mae': mean_absolute_error(y_test, y_test_pred),
        'train_r2': r2_score(y_train, y_train_pred),
        'test_r2': r2_score(y_test, y_test_pred),
    }
    
    # 5-fold cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    metrics['cv_r2_mean'] = cv_scores.mean()
    metrics['cv_r2_std'] = cv_scores.std()
    
    return metrics, y_test_pred
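A minimal way to exercise this evaluation flow is sketched below on synthetic stand-in data (the project itself uses the Boston Housing dataset; the call to evaluate_model is shown as a comment, and the key train-vs-test comparison is computed directly):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (illustrative, not the project's dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# metrics, y_test_pred = evaluate_model(model, X_train, X_test, y_train, y_test, "Decision Tree")
# The core comparison inside that metrics dict is train vs. test R²:
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

# An unpruned tree memorizes the training set, so expect train R² ≈ 1.0 and a lower test R²
```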

Detecting Overfitting and Underfitting

The evaluation function includes automatic detection (from eval_model_src.py:44-51):

Overfitting Detection

Condition: train_r2 - test_r2 > 0.1
diff = metrics['train_r2'] - metrics['test_r2']
if diff > 0.1:
    print(f"⚠️  Overfitting detected (train-test R² gap: {diff:.4f})")
Interpretation:
  • Model memorizes training data but fails to generalize
  • Large gap between train and test performance
  • Solution: Regularization, simpler model, more data
Example: Polynomial degree 3 is not flagged by this rule
  • Train R² = 0.549
  • Test R² = 0.583 (no train-test gap)
  • However, its high CV std (0.205) signals instability
Underfitting Detection

Condition: train_r2 < 0.5 AND test_r2 < 0.5
elif metrics['train_r2'] < 0.5 and metrics['test_r2'] < 0.5:
    print(f"⚠️  Underfitting detected (low R² on both sets)")
Interpretation:
  • Model too simple to capture patterns
  • Poor performance on both train and test
  • Solution: More features, more complex model, feature engineering
Example: Univariate Linear Regression
  • Train R² = 0.489
  • Test R² = 0.458
  • ⚠️ Underfitting - only using 1 feature (rm)
Good Fit

Condition: Neither overfitting nor underfitting
else:
    print(f"✅ Good fit")
Characteristics:
  • Train and test R² are close (gap less than 0.1)
  • Both scores are reasonably high (above 0.5)
  • Low CV standard deviation (less than 0.1)
Example: Multivariate Linear Regression
  • Train R² = 0.743
  • Test R² = 0.710
  • Gap = 0.033 ✅
  • CV R² = 0.688 ± 0.092 ✅
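The three conditions can be collected into a single helper. The name diagnose_fit below is hypothetical (it is not part of eval_model_src.py), but the thresholds match the checks shown above:

```python
def diagnose_fit(train_r2: float, test_r2: float) -> str:
    """Classify fit quality using the project's thresholds.

    Note: diagnose_fit is a hypothetical helper, not from eval_model_src.py.
    """
    if train_r2 - test_r2 > 0.1:
        return "overfitting"
    if train_r2 < 0.5 and test_r2 < 0.5:
        return "underfitting"
    return "good fit"

# Checked against the examples above
assert diagnose_fit(0.743, 0.710) == "good fit"      # Multivariate Linear
assert diagnose_fit(0.489, 0.458) == "underfitting"  # Univariate Linear
assert diagnose_fit(0.95, 0.70) == "overfitting"     # hypothetical large gap
```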

Model Performance Comparison

From actual model comparison results (all 9 models trained):
| Model                      | Train R² | Test R² | Test RMSE | CV R² (mean±std) | Status           |
|----------------------------|----------|---------|-----------|------------------|------------------|
| Decision Tree              | 0.928    | 0.850   | 3.349     | 0.724 ± 0.114    | 🏆 Best Model    |
| Neural Network (MLP)       | 0.846    | 0.806   | 3.804     | 0.785 ± 0.067    | ✅ Excellent     |
| SGD (adaptive)             | 0.742    | 0.710   | 4.647     | 0.690 ± 0.095    | ✅ Good Fit      |
| Linear (Multivariate)      | 0.743    | 0.710   | 4.650     | 0.688 ± 0.092    | ✅ Good Fit      |
| SGD (constant)             | 0.735    | 0.694   | 4.775     | 0.666 ± 0.101    | ✅ Good Fit      |
| Linear (Feature Selection) | 0.687    | 0.651   | 5.099     | 0.651 ± 0.090    | ✅ Moderate      |
| Polynomial (degree=3)      | 0.549    | 0.583   | 5.577     | 0.491 ± 0.205    | ⚠️ Unstable      |
| Polynomial (degree=2)      | 0.536    | 0.567   | 5.679     | 0.483 ± 0.224    | ⚠️ Unstable      |
| Linear (Univariate)        | 0.489    | 0.458   | 6.355     | 0.452 ± 0.177    | ⚠️ Underfitting  |
Winner: Decision Tree Regression achieves the highest Test R² (0.850) and lowest RMSE (3.349), outperforming all other models including Neural Networks and Linear Regression.

Metrics Output Example

Here’s what the evaluation output looks like (from eval_model_src.py:35-42):
def print_metrics(metrics: Dict[str, Any]) -> None:
    """Print model metrics"""
    print(f"\n{'='*50}")
    print(f"Model: {metrics['model_name']}")
    print(f"{'='*50}")
    print(f"Train MSE: {metrics['train_mse']:.4f} | Test MSE: {metrics['test_mse']:.4f}")
    print(f"Train RMSE: {metrics['train_rmse']:.4f} | Test RMSE: {metrics['test_rmse']:.4f}")
    print(f"Train MAE: {metrics['train_mae']:.4f} | Test MAE: {metrics['test_mae']:.4f}")
    print(f"Train R²: {metrics['train_r2']:.4f} | Test R²: {metrics['test_r2']:.4f}")
    print(f"CV R² (mean±std): {metrics['cv_r2_mean']:.4f} ± {metrics['cv_r2_std']:.4f}")
Example Output:
==================================================
Model: Linear Regression (Multivariate)
==================================================
Train MSE: 21.8946 | Test MSE: 21.6249
Train RMSE: 4.6792 | Test RMSE: 4.6504
Train MAE: 3.2707 | Test MAE: 3.1881
Train R²: 0.7435 | Test R²: 0.7100
CV R² (mean±std): 0.6876 ± 0.0923
✅ Good fit

Quick Reference

MSE

Mean Squared Error
  • Squares errors
  • Penalizes outliers heavily
  • Unit: squared ($1000²)
  • Lower is better

RMSE

Root Mean Squared Error
  • Square root of MSE
  • Original units ($1000s)
  • Interpretable magnitude
  • Lower is better

MAE

Mean Absolute Error
  • Absolute differences
  • Robust to outliers
  • Original units ($1000s)
  • Lower is better

R²

R-Squared Score
  • % variance explained
  • Range: ≤ 1 (can be negative, usually 0 to 1)
  • Scale-independent
  • Higher is better

CV

Cross-Validation
  • 5-fold splits
  • Measures stability
  • Mean ± std format
  • Detects overfitting

Train vs Test

Overfitting Detection
  • Compare train & test R²
  • Gap > 0.1 → overfitting
  • Both < 0.5 → underfitting
  • Small gap → good fit

Next Steps

Dataset Overview

Learn about the Boston Housing dataset structure

Feature Analysis

Understand feature correlations and importance
