Overview

Polynomial regression extends linear regression by creating polynomial features, allowing the model to capture non-linear relationships. This project implements degree 2 and degree 3 polynomial transformations on the univariate rm (rooms) feature.
Performance: Polynomial degree 3 achieves Test R² of 0.583, outperforming simple univariate linear regression (0.458) but still below multivariate linear regression (0.710).

How Polynomial Features Work

Starting from the single feature rm, the polynomial transformation adds higher-order terms:
  • Degree 2: adds rm²
  • Degree 3: adds rm² and rm³
This allows the model to fit curves instead of straight lines.

Mathematical Formula

Degree 2: y = β₀ + β₁(rm) + β₂(rm²)
Degree 3: y = β₀ + β₁(rm) + β₂(rm²) + β₃(rm³)
Degree 2 fits a parabolic (curved) relationship; degree 3 adds a cubic term for more flexible curves.

Implementation

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Use only 'rm' feature
X_train_uni = X_train[['rm']]
X_test_uni = X_test[['rm']]

# Create polynomial features of degree 2
poly2 = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly2 = poly2.fit_transform(X_train_uni)
X_test_poly2 = poly2.transform(X_test_uni)

# Train linear regression on polynomial features
lr_poly2 = LinearRegression()
lr_poly2.fit(X_train_poly2, y_train)

# Make predictions
predictions = lr_poly2.predict(X_test_poly2)
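The same pattern extends to degree 3, or can be wrapped in a single estimator with a sklearn Pipeline. The sketch below uses synthetic stand-in data (the project's actual train/test split is not reproduced here), so the printed score is illustrative only:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 'rm' column and target (not the project's dataset)
rng = np.random.default_rng(42)
rm = rng.uniform(4.0, 9.0, size=200).reshape(-1, 1)
y = 2.0 * rm.ravel() ** 2 - 15.0 * rm.ravel() + 40 + rng.normal(0, 2.0, size=200)

# Degree-3 transform + regression fused into one estimator
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression(),
)
model.fit(rm, y)
print(f"Train R²: {model.score(rm, y):.3f}")
```

The pipeline form avoids the separate fit_transform/transform calls and prevents accidentally fitting the transformer on test data.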

Performance Results

Degree 2 Performance

  • Train R²: 0.536
  • Test R²: 0.567
  • Train RMSE: 6.385
  • Test RMSE: 5.679
  • CV R² (mean±std): 0.483 ± 0.224
Degree 2 shows slight improvement over univariate linear regression (Test R² 0.567 vs 0.458).

Degree 3 Performance

  • Train R²: 0.549
  • Test R²: 0.583 ⭐
  • Train RMSE: 6.296
  • Test RMSE: 5.577
  • CV R² (mean±std): 0.491 ± 0.205
Degree 3 performs slightly better than degree 2, with Test R² of 0.583.
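The R² and RMSE figures above come from standard scikit-learn metrics. A minimal sketch of the metric calls, using made-up prediction values rather than the project's actual outputs:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical targets and predictions, just to illustrate the metric calls
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_pred = np.array([25.1, 22.0, 33.0, 31.5, 35.0])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
print(f"Test R²: {r2:.3f}, Test RMSE: {rmse:.3f}")
```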

Comparison: Polynomial vs Linear

| Model | Test R² | Test RMSE | CV R² |
| --- | --- | --- | --- |
| Linear (Univariate) | 0.458 | 6.355 | 0.452 ± 0.177 |
| Polynomial (degree=2) | 0.567 | 5.679 | 0.483 ± 0.224 |
| Polynomial (degree=3) | 0.583 | 5.577 | 0.491 ± 0.205 |
| Linear (Multivariate) | 0.710 🏆 | 4.650 🏆 | 0.688 ± 0.092 🏆 |
Key Finding: Polynomial regression improves over univariate linear regression but still underperforms multivariate linear regression (0.583 vs 0.710). This suggests that using more features is more valuable than creating polynomial features from a single variable.

Overfitting Risk

Polynomial regression is prone to overfitting, especially with higher degrees:

Degree Comparison

train_r2_d2, test_r2_d2 = 0.5362, 0.5672
print("Degree 2:")
print(f"  Train R²: {train_r2_d2}, Test R²: {test_r2_d2}")
print(f"  Gap: {train_r2_d2 - test_r2_d2:.4f} (negative = good generalization)")

train_r2_d3, test_r2_d3 = 0.5491, 0.5825
print("\nDegree 3:")
print(f"  Train R²: {train_r2_d3}, Test R²: {test_r2_d3}")
print(f"  Gap: {train_r2_d3 - test_r2_d3:.4f} (negative = good generalization)")
Interestingly, both polynomial models score slightly higher on the test set than on the training set (a negative train−test gap). A gap this direction often just means the test split happened to be slightly easier, but it does confirm that neither degree is overfitting.

Why Not Higher Degrees?

Higher polynomial degrees (4, 5, 6+) were not tested because:
  1. Overfitting risk: Higher degrees fit training data too closely
  2. Diminishing returns: Degree 3 already shows limited improvement
  3. Better alternatives: Multivariate linear regression already outperforms polynomial approaches
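To see the overfitting risk directly, one can sweep the degree and compare cross-validated R². The sketch below uses synthetic data with a quadratic ground truth (again, not the project's dataset), so the exact scores are illustrative; the pattern to look for is CV R² plateauing or degrading as the degree grows:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic 1-D regression problem with quadratic ground truth
rng = np.random.default_rng(0)
x = rng.uniform(4.0, 9.0, size=300).reshape(-1, 1)
y = 0.8 * x.ravel() ** 2 - 6.0 * x.ravel() + rng.normal(0, 1.0, size=300)

for degree in (1, 2, 3, 5, 8):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    scores = cross_val_score(model, x, y, cv=5, scoring="r2")
    print(f"degree={degree}: CV R² = {scores.mean():.3f} ± {scores.std():.3f}")
```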

Feature Transformation Visualization

import numpy as np

# Example: rm = 6.5 rooms
rm_value = 6.5

print("Original feature:")
print(f"  rm = {rm_value}")

print("\nDegree 2 features:")
print(f"  rm = {rm_value}")
print(f"  rm² = {rm_value**2}")

print("\nDegree 3 features:")
print(f"  rm = {rm_value}")
print(f"  rm² = {rm_value**2}")
print(f"  rm³ = {rm_value**3}")
Output:
Original feature:
  rm = 6.5

Degree 2 features:
  rm = 6.5
  rm² = 42.25

Degree 3 features:
  rm = 6.5
  rm² = 42.25
  rm³ = 274.625

When to Use Polynomial Regression

Good Use Cases

  • Single feature with clear non-linear pattern
  • Visualizing curved relationships
  • Educational purposes

Not Recommended

  • When multiple features are available (use multivariate instead)
  • High-dimensional data (risk of overfitting)
  • Need for interpretability

Key Takeaways

  1. Polynomial features improve univariate models: Degree 3 achieves R² of 0.583 vs 0.458 for linear
  2. But don’t beat multivariate linear: Multivariate linear regression (0.710) still performs better
  3. No overfitting observed: Both degrees generalize well to test data
  4. Diminishing returns: Degree 3 only slightly better than degree 2
  5. Use more features: Adding features is more effective than polynomial transformation

Next Steps

Linear Regression

Compare with the best-performing multivariate linear model

Gradient Descent

Explore optimization techniques with SGDRegressor
