
Overview

Gradient descent is an iterative optimization algorithm that finds model parameters by minimizing the loss function. This project uses SGDRegressor from scikit-learn, which implements stochastic gradient descent for linear regression.
SGDRegressor with adaptive learning rate achieves Test R² of 0.710, matching the performance of standard multivariate linear regression while demonstrating iterative optimization.

Why Gradient Descent?

While standard linear regression uses a closed-form solution (normal equation), gradient descent is essential for:
  • Large datasets: More memory-efficient than computing matrix inversions
  • Online learning: Can update model with new data without retraining from scratch
  • Non-linear models: Foundation for neural networks and deep learning
  • Sparse features: Efficient with high-dimensional data

Feature Scaling Required

Critical: Gradient descent requires feature scaling because features with different scales cause the optimization to converge slowly or oscillate.
from sklearn.preprocessing import StandardScaler

# Scale features before gradient descent
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# StandardScaler transforms features to:
# mean = 0, standard deviation = 1
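The effect is easy to verify on synthetic data (the two feature scales below are illustrative stand-ins for mismatched units):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two features on very different scales (e.g. metres vs. dollars)
X_train = np.column_stack([
    rng.normal(1.7, 0.1, 100),
    rng.normal(50000, 10000, 100),
])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# After scaling, every column has mean ~0 and standard deviation ~1
print(X_train_scaled.mean(axis=0))  # ~[0, 0]
print(X_train_scaled.std(axis=0))   # ~[1, 1]
```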

Implementation

SGDRegressor with Constant Learning Rate

Uses a fixed learning rate (eta0) throughout training.
Hyperparameters:
  • loss='squared_error': Ordinary least squares
  • learning_rate='constant': Fixed step size
  • eta0=0.01: Learning rate value
  • max_iter=1000: Maximum training iterations
  • early_stopping=True: Stop if validation score doesn’t improve
Code:
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Scale features (required for SGD)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train SGD with constant learning rate
sgd_constant = SGDRegressor(
    loss='squared_error',
    learning_rate='constant',
    eta0=0.01,
    max_iter=1000,
    random_state=42,
    early_stopping=True,
    validation_fraction=0.1
)

sgd_constant.fit(X_train_scaled, y_train)
predictions = sgd_constant.predict(X_test_scaled)
Performance:
  • Train R²: 0.735
  • Test R²: 0.694
  • Test RMSE: 4.775
  • CV R² (mean±std): 0.666 ± 0.088
Good performance, but slightly below the adaptive learning-rate configuration.

Performance Comparison

Model                            | Train R² | Test R² | Test RMSE | CV R²
SGD (constant)                   | 0.735    | 0.694   | 4.775     | 0.666 ± 0.088
SGD (adaptive)                   | 0.742    | 0.710   | 4.647     | 0.690 ± 0.090
Linear Regression (Multivariate) | 0.743    | 0.710   | 4.650     | 0.688 ± 0.092
Key Finding: SGD with an adaptive learning rate achieves the same Test R² (0.710) as standard linear regression, showing that gradient descent can match the closed-form solution when properly configured.
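The equivalence can be reproduced on synthetic data. A sketch (the R² values in the table come from the project's dataset, not this example; make_regression here just stands in for it):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1000, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Same scaled features for both models
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

lr = LinearRegression().fit(X_train_s, y_train)
sgd = SGDRegressor(learning_rate='adaptive', eta0=0.01, max_iter=1000,
                   random_state=42).fit(X_train_s, y_train)

r2_lr = r2_score(y_test, lr.predict(X_test_s))
r2_sgd = r2_score(y_test, sgd.predict(X_test_s))
print(round(r2_lr, 3), round(r2_sgd, 3))  # the two scores land very close together
```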

Learning Rate Strategies

# Fixed step size throughout training
learning_rate = 'constant'
eta0 = 0.01

# Update rule:
# θ = θ - 0.01 * gradient
# (same step size every iteration)
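The block above shows the constant strategy. For comparison, a sketch of the adaptive strategy used by the best-performing configuration (schedule behavior as documented by scikit-learn):

```python
# Learning rate starts at eta0 and stays there while the loss improves
learning_rate = 'adaptive'
eta0 = 0.01

# Update rule:
# θ = θ - η * gradient
# η begins at 0.01; whenever n_iter_no_change consecutive epochs fail to
# improve the loss by tol, η is divided by 5
```

Shrinking the step size only when progress stalls lets training move fast early on and settle precisely near the minimum.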

How Gradient Descent Works

Algorithm Steps

  1. Initialize: Start with random weights θ
  2. Compute predictions: ŷ = Xθ
  3. Calculate loss: MSE = mean((y - ŷ)²)
  4. Compute gradient: ∇L = -2X^T(y - ŷ) / n
  5. Update weights: θ = θ - η∇L
  6. Repeat: Steps 2-5 until convergence
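The steps above can be sketched directly in NumPy. This is a minimal batch version on a small synthetic problem (variable names and the learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic linear data: y = 3*x1 - 2*x2 + noise
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=200)

theta = rng.normal(size=2)                    # 1. Initialize random weights
eta = 0.1                                     # learning rate

for _ in range(500):
    y_hat = X @ theta                         # 2. Compute predictions
    loss = np.mean((y - y_hat) ** 2)          # 3. Calculate MSE loss
    grad = -2 * X.T @ (y - y_hat) / len(y)    # 4. Compute gradient
    theta = theta - eta * grad                # 5. Update weights
                                              # 6. Repeat until convergence

print(theta)  # approaches the true coefficients [3.0, -2.0]
```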

Stochastic vs Batch

SGDRegressor uses stochastic gradient descent:
  • Batch GD: Uses all training samples to compute each gradient (slow on large data)
  • Stochastic GD: Uses a single sample (or a small mini-batch) per update (fast)
  • Advantage: Much faster on large datasets, and the noisy updates can help escape shallow local minima
# scikit-learn's SGDRegressor updates weights one sample at a time;
# max_iter counts full passes (epochs) over the training data
sgd = SGDRegressor(
    loss='squared_error',
    max_iter=1000,
)

Early Stopping

Both configurations use early stopping to prevent overfitting:
sgd = SGDRegressor(
    early_stopping=True,      # Enable early stopping
    validation_fraction=0.1,  # Use 10% of training data for validation
    n_iter_no_change=5,       # Stop if no improvement for 5 iterations
    tol=1e-3                  # Minimum improvement threshold
)
Early stopping monitors validation loss and stops training when performance plateaus, preventing overfitting and saving computation time.

Full Training Example

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# 1. Feature scaling (required!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Define SGD configurations
sgd_configs = [
    {
        'loss': 'squared_error',
        'learning_rate': 'constant',
        'eta0': 0.01,
        'name': 'SGD (constant)'
    },
    {
        'loss': 'squared_error',
        'learning_rate': 'adaptive',
        'eta0': 0.01,
        'name': 'SGD (adaptive)'
    },
]

# 3. Train and evaluate both configurations
for config in sgd_configs:
    print(f"\nTraining {config['name']}...")
    
    sgd = SGDRegressor(
        loss=config['loss'],
        learning_rate=config['learning_rate'],
        eta0=config['eta0'],
        max_iter=1000,
        random_state=42,
        early_stopping=True,
        validation_fraction=0.1
    )
    
    sgd.fit(X_train_scaled, y_train)
    
    # Evaluate
    y_pred = sgd.predict(X_test_scaled)
    test_r2 = r2_score(y_test, y_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    print(f"Test R²: {test_r2:.4f}")
    print(f"Test RMSE: {test_rmse:.4f}")
    print(f"Converged in {sgd.n_iter_} iterations")

When to Use Gradient Descent

Use SGDRegressor

  • Large datasets (>100K samples)
  • Online learning (streaming data)
  • Memory constraints
  • Sparse features
  • Learning about optimization

Use Linear Regression

  • Small/medium datasets (under 100K samples)
  • Need exact solution
  • Faster training on small data
  • No feature scaling needed
  • Production simplicity

Key Takeaways

  1. Adaptive learning rate wins: Test R² of 0.710 matches standard linear regression
  2. Feature scaling is mandatory: StandardScaler required for convergence
  3. Early stopping prevents overfitting: Automatically stops when validation performance plateaus
  4. Iterative approach works: Proves gradient descent can match closed-form solution
  5. Practical for large datasets: More efficient than matrix operations on big data

Hyperparameter Tuning

Key hyperparameters to experiment with:
sgd = SGDRegressor(
    loss='squared_error',           # 'squared_error', 'huber', 'epsilon_insensitive'
    learning_rate='adaptive',       # 'constant', 'optimal', 'invscaling', 'adaptive'
    eta0=0.01,                      # Initial learning rate (try 0.001, 0.01, 0.1)
    max_iter=1000,                  # Maximum iterations (increase if not converging)
    tol=1e-3,                       # Convergence tolerance
    early_stopping=True,            # Enable/disable early stopping
    validation_fraction=0.1,        # Validation set size (0.1 = 10%)
    n_iter_no_change=5,             # Patience for early stopping
    random_state=42                 # For reproducibility
)
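These settings can also be searched systematically. A minimal sketch with GridSearchCV (the synthetic dataset and grid values are illustrative; the pipeline keeps scaling inside each CV fold to avoid leakage):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the project's dataset
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

pipe = make_pipeline(StandardScaler(),
                     SGDRegressor(max_iter=1000, random_state=42))

# make_pipeline names the estimator step 'sgdregressor'
param_grid = {
    'sgdregressor__learning_rate': ['constant', 'adaptive'],
    'sgdregressor__eta0': [0.001, 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```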

Next Steps

Advanced Models

Explore Decision Trees and Neural Networks

Linear Regression

Compare with standard linear regression approach
