Overview
Beyond linear models, this project explores two advanced machine learning approaches: Decision Tree Regression and Neural Network Regression (MLP). These models can capture complex non-linear patterns that linear models cannot.

Top Performer: Decision Tree Regression achieves the best test performance with a Test R² of 0.850, significantly outperforming all linear models.
Model Comparison
| Model | Test R² | Test RMSE | Complexity |
|---|---|---|---|
| Decision Tree | 0.850 🏆 | 3.349 🏆 | Medium |
| Neural Network (MLP) | 0.806 | 3.804 | High |
| Linear Regression (Multivariate) | 0.710 | 4.650 | Low |
| SGD (adaptive) | 0.710 | 4.647 | Low |
| Polynomial (degree=3) | 0.583 | 5.577 | Low |
Decision Tree Regression
Overview
Decision trees make predictions by learning a series of if-then-else decision rules from the features. Each leaf node contains a prediction value.

Implementation
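A minimal sketch of how the tree could be fit and scored with scikit-learn. The data here is a synthetic stand-in with 13 features mirroring the project's feature count; the real dataset and exact settings may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in data: 13 features, non-linear target
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 13))
y = X[:, 0] * 3 + np.sin(X[:, 1]) * 5 + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# max_depth=5 matches the depth limit described below
tree = DecisionTreeRegressor(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

print("Test R²:  ", round(r2_score(y_test, tree.predict(X_test)), 3))
print("Test RMSE:", round(mean_squared_error(y_test, tree.predict(X_test)) ** 0.5, 3))
print("Feature importances:", tree.feature_importances_.round(2))
```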
Performance Metrics
- Train R²: 0.928
- Test R²: 0.850 🏆 BEST MODEL
- Train RMSE: 2.521
- Test RMSE: 3.349
- CV R² (mean±std): 0.724 ± 0.143
The Decision Tree achieves a roughly 20% higher R² than linear regression (0.850 vs 0.710), demonstrating the value of capturing non-linear relationships.
How Decision Trees Work
- Splits: Tree chooses feature and threshold that best separates data
- Depth: max_depth=5 limits tree to 5 levels (prevents overfitting)
- Leaves: Final prediction is the average of training samples in that leaf
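The mechanics above can be made visible with `export_text`, which prints the learned if-then-else rules: each internal node is a feature/threshold split, and each leaf shows the mean target of the training samples that reach it. Tiny synthetic data and placeholder feature names, for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Target jumps from ~10 to ~20 when feat_0 crosses 5, so the tree
# should learn a split near that threshold.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 20.0, 10.0) + rng.normal(scale=1.0, size=200)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Prints the if-then-else rules; leaves show the averaged prediction
print(export_text(tree, feature_names=["feat_0", "feat_1"]))
```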
Feature Importance
Decision trees provide natural feature importance scores.

Advantages
Pros
- Best test performance (R² = 0.850)
- Captures non-linear relationships
- No feature scaling needed
- Interpretable structure
- Fast prediction
- Handles feature interactions
Cons
- Some overfitting (train R² = 0.928)
- Can be unstable (high variance)
- Higher CV standard deviation (0.143)
- May not generalize to very different data
Hyperparameter Tuning
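A hedged sketch of how the tree tuning could be run with `GridSearchCV`; the grid values below are illustrative assumptions, not necessarily the ones used in the project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 13))
y = X[:, 0] * 2 + rng.normal(scale=0.3, size=300)

# Example grid: depth controls model capacity, min_samples_leaf
# smooths leaf predictions
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(DecisionTreeRegressor(random_state=42),
                      param_grid, cv=5, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```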
Neural Network Regression (MLP)
Overview
Multi-Layer Perceptron (MLP) is a feedforward neural network with hidden layers that can learn complex non-linear patterns through backpropagation.

Architecture
Implementation
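A sketch of the MLP setup matching the architecture notes below ((100, 50) hidden layers, ReLU, early stopping). Synthetic data stands in for the project's dataset, so the scores will not match the report's numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Synthetic stand-in data: 13 features, non-linear target
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 13))
y = X[:, 0] * 3 + np.tanh(X[:, 1]) * 4 + rng.normal(scale=0.5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

mlp = make_pipeline(
    StandardScaler(),  # MLPs are sensitive to feature scale
    MLPRegressor(hidden_layer_sizes=(100, 50), activation="relu",
                 early_stopping=True, max_iter=1000, random_state=42),
)
mlp.fit(X_train, y_train)
print("Test R²:", round(r2_score(y_test, mlp.predict(X_test)), 3))
```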
Performance Metrics
- Train R²: 0.846
- Test R²: 0.806
- Train RMSE: 3.684
- Test RMSE: 3.804
- CV R² (mean±std): 0.785 ± 0.109
The Neural Network achieves a roughly 14% higher R² than linear regression (0.806 vs 0.710) and shows excellent generalization with minimal overfitting (train-test gap = 0.040).
Why Feature Scaling Matters
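Gradient-based training is sensitive to feature scale: features with large magnitudes dominate the weight updates and can slow or destabilize convergence. A sketch comparing the same MLP with and without `StandardScaler`, on synthetic data with deliberately mismatched scales; the numbers are illustrative only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Two features on wildly different scales (~1 vs ~1000)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2)) * np.array([1.0, 1000.0])
y = X[:, 0] + X[:, 1] / 1000.0 + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same network, fit on raw vs standardized inputs
raw = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500,
                   random_state=0).fit(X_tr, y_tr)

scaler = StandardScaler().fit(X_tr)
scaled = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
scaled.fit(scaler.transform(X_tr), y_tr)

print("unscaled R²:", round(r2_score(y_te, raw.predict(X_te)), 3))
print("scaled   R²:", round(r2_score(y_te, scaled.predict(scaler.transform(X_te))), 3))
```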
Architecture Details
Hidden Layers: (100, 50)
- Layer 1: 13 inputs → 100 neurons → ReLU activation
- Layer 2: 100 inputs → 50 neurons → ReLU activation
- Output: 50 inputs → 1 neuron → Linear (predicted price)
- Layer 1: (13 × 100) + 100 bias = 1,400 parameters
- Layer 2: (100 × 50) + 50 bias = 5,050 parameters
- Output: (50 × 1) + 1 bias = 51 parameters
- Total: 6,501 trainable parameters
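The totals above follow directly from (inputs × neurons) + one bias per neuron for each layer; a quick arithmetic check:

```python
# Verify the parameter counts for the 13 → 100 → 50 → 1 architecture
layers = [13, 100, 50, 1]  # inputs, hidden 1, hidden 2, output
params = [n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:])]
print(params, sum(params))  # → [1400, 5050, 51] 6501
```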
Activation Functions
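As noted in the architecture details, the hidden layers use ReLU, defined as max(0, x), while the output neuron is linear (identity), the standard choice for regression outputs. A minimal sketch of ReLU:

```python
import numpy as np

def relu(x):
    """ReLU activation: passes positives through, clips negatives to zero."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))  # negatives and zero map to 0.0; 1.5 passes through
```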
Early Stopping
Early stopping prevents overfitting by monitoring validation loss and stopping training when performance plateaus.
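In scikit-learn's `MLPRegressor` this is the `early_stopping` flag: a `validation_fraction` of the training data is held out, and training halts after `n_iter_no_change` iterations without improvement on it. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))
y = X[:, 0] * 2 + rng.normal(scale=0.3, size=500)

mlp = MLPRegressor(hidden_layer_sizes=(100, 50),
                   early_stopping=True,      # hold out a validation split
                   validation_fraction=0.1,  # 10% of the training data
                   n_iter_no_change=10,      # patience before stopping
                   max_iter=1000, random_state=0)
mlp.fit(X, y)
print("stopped after", mlp.n_iter_, "iterations (cap was 1000)")
```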
Advantages
Pros
- Strong performance (R² = 0.806)
- Minimal overfitting (train-test gap = 0.040)
- Learns complex patterns
- Good generalization
- Low CV standard deviation (0.109)
Cons
- Requires feature scaling
- Longer training time (~1 second vs 0.1s for linear)
- Less interpretable (“black box”)
- More hyperparameters to tune
- Can get stuck in local minima
Hyperparameter Tuning
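A sketch of how the MLP tuning could be wired up with `GridSearchCV`, with the scaler inside the pipeline so each fold is scaled on its own training split. The grid values are illustrative assumptions, not the project's actual search space.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 13))
y = X[:, 0] * 2 + rng.normal(scale=0.3, size=200)

pipe = Pipeline([("scale", StandardScaler()),
                 ("mlp", MLPRegressor(max_iter=300, random_state=42))])

# Example grid: architecture width/depth and L2 penalty strength
grid = {"mlp__hidden_layer_sizes": [(25,), (50, 25)],
        "mlp__alpha": [1e-4, 1e-3]}
search = GridSearchCV(pipe, grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_)
```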
Complete Training Pipeline
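One way the end-to-end pipeline could look: train each model family on the same split and rank by test R². Synthetic stand-in data, so the numbers will not match the report's tables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Synthetic stand-in data: 13 features, non-linear target
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 13))
y = X[:, 0] * 3 + np.sin(X[:, 1]) * 5 + rng.normal(scale=0.5, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "MLP": make_pipeline(StandardScaler(),  # only the MLP needs scaling
                         MLPRegressor(hidden_layer_sizes=(100, 50),
                                      early_stopping=True, max_iter=1000,
                                      random_state=42)),
}

# Fit every model on the same split, then rank by test R²
results = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
           for name, m in models.items()}
for name, r2 in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} R² = {r2:.3f}")
```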
Model Selection Guide
When to Use Decision Tree
✅ Best for:
- Highest prediction accuracy needed (R² = 0.850)
- Feature importance interpretation
- No feature scaling allowed
- Fast prediction required
- Handling feature interactions
⚠️ Watch out for:
- Slightly higher overfitting risk
- May not generalize to very different data distributions
When to Use Neural Network
✅ Best for:
- Complex non-linear patterns
- Good generalization (minimal overfitting)
- Stable predictions across CV folds
- Larger datasets (scales well)
- Transfer learning potential
⚠️ Watch out for:
- Requires feature scaling
- Longer training time
- Less interpretable
- More hyperparameters to tune
When to Use Linear Models
✅ Best for:
- Interpretability critical
- Simple baseline
- Fast training and prediction
- Small datasets
- Feature coefficients needed
⚠️ Watch out for:
- Lower accuracy (R² = 0.710 vs 0.850)
- Cannot capture complex non-linear patterns
Final Comparison: All Models
| Rank | Model | Test R² | Test RMSE | Train-Test Gap | CV R² | Training Time |
|---|---|---|---|---|---|---|
| 🥇 | Decision Tree | 0.850 | 3.349 | 0.078 | 0.724 ± 0.143 | Fast |
| 🥈 | Neural Network | 0.806 | 3.804 | 0.040 | 0.785 ± 0.109 | Slow |
| 🥉 | Linear (Multi) | 0.710 | 4.650 | 0.033 | 0.688 ± 0.092 | Very Fast |
| 4 | SGD (adaptive) | 0.710 | 4.647 | 0.032 | 0.690 ± 0.090 | Fast |
| 5 | Linear (Feature Sel.) | 0.651 | 5.099 | 0.036 | 0.651 ± 0.090 | Very Fast |
| 6 | Polynomial (deg=3) | 0.583 | 5.577 | -0.034 | 0.491 ± 0.205 | Fast |
| 7 | Polynomial (deg=2) | 0.567 | 5.679 | -0.031 | 0.483 ± 0.224 | Fast |
| 8 | Linear (Univariate) | 0.458 | 6.355 | 0.031 | 0.452 ± 0.177 | Very Fast |
Key Takeaways
- Decision Tree is the winner: Achieves R² of 0.850, significantly better than all other models
- Neural Network is runner-up: R² of 0.806 with excellent generalization
- Non-linear models excel: Both advanced models outperform linear approaches by 14-20%
- Trade-offs exist: Higher accuracy comes with complexity and training time
- Linear models still valuable: Fast, interpretable, and good baseline (R² = 0.710)
Next Steps
Linear Regression
Review the best linear model baseline
Polynomial Regression
Understand polynomial feature transformations