Introduction
Linear regression and logistic regression work well for many tasks, but they can run into a problem called overfitting, which causes poor performance. Understanding and addressing overfitting is crucial for building effective machine learning models.

Overfitting occurs when a model fits the training data too well, including noise and random fluctuations, resulting in poor performance on new, unseen data.
Understanding Overfitting Through Examples
Regression Example: Housing Prices
Let’s revisit predicting housing prices based on size. Three scenarios are possible:
- Underfitting (High Bias)
- Just Right
- Overfitting (High Variance)
Model: Linear function f(x) = w*x + b

A straight line doesn’t capture the pattern in the data well. As house size increases, prices flatten out, but the linear model can’t represent this.

Problem: The model is too simple. It has a strong preconception (bias) that the relationship must be linear, even when the data suggests otherwise.

Technical term: High bias / Underfitting
The goal is to find a model that’s “just right”—neither too simple (underfitting) nor too complex (overfitting).
Bias and Variance
High Bias (Underfitting)
Definition: The model is too simple to capture patterns in the data.

Characteristics:
- Poor performance on training data
- Poor performance on new data
- Model has strong preconceptions that may be wrong
High Variance (Overfitting)
Definition: The model is too complex and fits training data too well, including noise.

Characteristics:
- Excellent performance on training data
- Poor performance on new data
- Model predictions vary greatly with small changes in training data
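The train-versus-test gap described above can be seen in a small numeric sketch (all data here is synthetic and the degree choices are illustrative): a degree-1 polynomial underfits a quadratic pattern, while a degree-15 polynomial drives training error down but test error up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: quadratic ground truth plus noise (illustrative setup)
def make_data(n):
    x = rng.uniform(-3, 3, n)
    y = 0.5 * x**2 + rng.normal(0, 1, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def fit_and_score(degree):
    # Least-squares polynomial fit on the training set only
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 2, 15):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:8.2f}  test MSE={test_mse:8.2f}")
```

Training error can only decrease as the degree grows (the models are nested), so training error alone never reveals overfitting; only the held-out error does.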
Classification Example: Tumor Detection
Overfitting also occurs in classification problems. Consider classifying tumors as malignant or benign using features x₁ (tumor size) and x₂ (patient age):

Underfitting (High Bias)
Model: Simple logistic regression
Decision boundary: Straight line

The linear boundary doesn’t fit the data well: some malignant tumors are classified as benign and vice versa.

Issue: Model is too simple to capture the classification pattern.
Just Right
Model: Logistic regression with quadratic features
Decision boundary: Ellipse or smooth curve

The model fits reasonably well without perfectly classifying every training example. It generalizes well to new patients.

Result: Good balance between bias and variance.
Overfitting (High Variance)
Model: Logistic regression with many high-order polynomial features
Decision boundary: Complex, wiggly curve

The boundary contorts itself to classify every training example perfectly, but this overly complex boundary won’t generalize well.

Issue: Too many features lead to overfitting.
Addressing Overfitting
There are three main strategies to address overfitting:

1. Collect More Training Data
Most Effective Solution
Adding more training examples helps the algorithm learn the true underlying pattern rather than memorizing noise.

Example: With 100+ house price examples instead of 10, even a high-degree polynomial will fit a smoother curve.

Limitation: More data isn’t always available or practical to collect.
2. Feature Selection
Reduce Number of Features
Idea: Use only the most relevant features.

Example: Instead of 100 features (size, bedrooms, floors, age, neighborhood income, distance to coffee shop, etc.), select just the 3-5 most important ones (size, bedrooms, age).

Methods:
- Manual selection based on intuition
- Automated feature selection algorithms
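A minimal sketch of automated filter-style selection, scoring each feature by its correlation with the target (the dataset and feature names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dataset: 6 candidate features, only the first 3 actually matter
n = 200
feature_names = ["size", "bedrooms", "age", "floors",
                 "neighborhood_income", "dist_coffee_shop"]
X = rng.normal(size=(n, len(feature_names)))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(0, 0.5, n)

# Filter method: score each feature by |correlation| with the target,
# then keep the k highest-scoring ones
scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
k = 3
selected = sorted(np.argsort(scores)[::-1][:k])
print("selected:", [feature_names[j] for j in selected])
```

Correlation scoring is only one simple option; wrapper methods that retrain the model on candidate subsets are more expensive but can catch interactions that single-feature scores miss.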
3. Regularization ⭐
Keep All Features, Reduce Parameter Sizes
Idea: Keep all features but prevent any of them from having too large an effect.

How it works: Modify the cost function to penalize large parameter values:

J(w, b) = (1/2m) Σᵢ (f(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + (λ/2m) Σⱼ wⱼ²

where λ (lambda) is the regularization parameter.

Benefits:
- Keeps all features (no information loss)
- Reduces overfitting by shrinking parameter values
- Works well in practice
Regularization Deep Dive
How Regularization Works
Consider an overfit model with large parameters, such as f(x) = w₁x + w₂x² + w₃x³ + w₄x⁴ + b:
- Setting w₄ ≈ 0 effectively eliminates the x⁴ term
- Shrinking w₃ reduces the impact of x³
- Result: Smoother curve that generalizes better
Regularization doesn’t set parameters to exactly zero (unless λ is very large); it just makes them smaller.
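This shrinking effect can be seen directly with the closed-form ridge solution (a sketch on made-up data; the closed form is an alternative to gradient descent):

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy quadratic data fit with a degree-4 polynomial (made-up setup)
x = rng.uniform(-2, 2, 30)
y = x**2 + rng.normal(0, 0.3, 30)
X = np.column_stack([x**p for p in range(1, 5)])   # features x, x^2, x^3, x^4

def ridge_weights(lam):
    # Closed-form regularized least squares: w = (X^T X + lam*I)^(-1) X^T y
    # (the bias term is omitted here; it is normally left unregularized)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 1.0, 100.0):
    w = ridge_weights(lam)
    print(f"lambda={lam:6.1f}  ||w||={np.linalg.norm(w):7.3f}  w4={w[3]: .4f}")
```

As λ grows, the weight vector’s norm shrinks monotonically, but no coefficient is pushed to exactly zero.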
Regularized Cost Function
For linear regression, the regularized cost function is:

J(w, b) = (1/2m) Σᵢ (f(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + (λ/2m) Σⱼ wⱼ²

Choosing λ (Regularization Parameter)
- λ too small: Little regularization effect → still overfits
- λ balanced: Large parameters are penalized just enough → fits well and generalizes
- λ too large: All parameters are driven toward zero → underfits
Implementation Example
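A sketch of regularized linear regression trained by batch gradient descent on synthetic data (variable names, the true weights, and the learning-rate settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic training set (the true weights below are made up for the demo)
m = 50
X = rng.normal(size=(m, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 1.0
y = X @ true_w + true_b + rng.normal(0, 0.1, m)

def cost(w, b, lam):
    # J(w,b) = (1/2m) * sum((f(x)-y)^2) + (lam/2m) * sum(w_j^2)
    err = X @ w + b - y
    return (err @ err) / (2 * m) + lam / (2 * m) * (w @ w)

def fit(lam, alpha=0.1, iters=1000):
    w, b = np.zeros(3), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        grad_w = X.T @ err / m + (lam / m) * w   # extra shrinkage term from lambda
        grad_b = err.mean()                      # the bias b is not regularized
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

w, b = fit(lam=1.0)
print("learned w:", np.round(w, 2), " b:", round(b, 2))
```

The only change from unregularized gradient descent is the `(lam / m) * w` term in the weight gradient, which nudges every weight toward zero on each update.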
Key Takeaways
Recognize overfitting and underfitting:
- Underfitting = too simple (high bias)
- Overfitting = too complex (high variance)
Practical Tips
Start Simple
Begin with a simple model and add complexity only if needed. It’s easier to add complexity than to debug an overly complex model.
Use Validation Sets
Split your data into training, validation, and test sets. Use the validation set to detect overfitting early.
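A minimal shuffle-and-slice split (the 60/20/20 proportions are just a common choice):

```python
import numpy as np

rng = np.random.default_rng(5)

# Shuffle indices, then slice into 60% train / 20% validation / 20% test
n = 100
indices = rng.permutation(n)
n_train, n_val = int(0.6 * n), int(0.2 * n)
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))   # 60 20 20
```

Fit on the training set, pick λ (or the model) on the validation set, and report final performance on the untouched test set.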
Visualize Decision Boundaries
For 2D problems, plot decision boundaries to visually check if they’re reasonable or overly complex.
Monitor Training vs Test Performance
If training performance is much better than test performance, you’re likely overfitting.
What’s Next
Now that you understand overfitting and regularization:
- Learn about cross-validation for better model evaluation
- Explore learning curves to diagnose bias vs variance
- Study advanced regularization techniques like L1 regularization (Lasso)
- Understand early stopping as another regularization approach
