
What is Regression?

Regression is a supervised learning task where the goal is to predict a continuous numerical output based on input features. Real-world applications:
  • Predicting house prices based on size, location, and features
  • Forecasting sales revenue for business planning
  • Estimating customer lifetime value
  • Predicting delivery times for logistics optimization

Module A6 Project: E-Commerce Sales Prediction

In this module, you’ll build regression models to predict total_sales for e-commerce orders using the Amazon sales dataset. Dataset: 10,000 orders with 23 features including:
  • Customer attributes (country, state, city)
  • Product information (category, subcategory, brand)
  • Order details (quantity, unit price, discount, shipping cost)
  • Logistics data (order date, ship date, delivery date)
Target variable: total_sales - the total amount of the order

Linear Regression

The simplest and most interpretable regression model.

How it works

Linear regression finds the best-fit line through your data:
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Where:
  • y is the predicted value
  • β₀ is the intercept
  • β₁, β₂, ..., βₙ are the coefficients (weights)
  • x₁, x₂, ..., xₙ are the features

Implementation

Linear Regression with scikit-learn:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MAE: ${mae:.2f}")
print(f"R²: {r2:.4f}")
```
Linear regression assumes:
  • Linear relationship between features and target
  • Features are independent (no multicollinearity)
  • Residuals are normally distributed
  • Constant variance of residuals (homoscedasticity)
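These assumptions can be sanity-checked from the residuals. A minimal sketch on synthetic data (the data and thresholds are illustrative, not part of the project dataset): with an intercept, OLS residuals have zero mean by construction, and comparing residual spread across low vs. high fitted values gives a rough homoscedasticity check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(0, 1, 200)

model = LinearRegression().fit(X_demo, y_demo)
fitted = model.predict(X_demo)
residuals = y_demo - fitted

# Residual mean should be ~0; residual spread should be similar
# across low and high fitted values (homoscedasticity)
low = residuals[fitted < np.median(fitted)]
high = residuals[fitted >= np.median(fitted)]
print(f"Residual mean: {residuals.mean():.6f}")
print(f"Std (low fitted): {low.std():.3f}, Std (high fitted): {high.std():.3f}")
```

For a more thorough diagnosis, plot residuals against fitted values and inspect a Q-Q plot of the residuals.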

Polynomial Regression

Capture non-linear relationships by creating polynomial features.

Creating polynomial features

Polynomial Features:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Create pipeline with polynomial features
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('model', LinearRegression())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
Example: With degree=2 and features [x₁, x₂]:
  • Original: [x₁, x₂]
  • Polynomial: [1, x₁, x₂, x₁², x₁x₂, x₂²]
Higher polynomial degrees can lead to overfitting. Use cross-validation to choose the optimal degree.
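One way to choose the degree by cross-validation, sketched on synthetic quadratic data (the data generation and degree grid are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a quadratic relationship (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 0] ** 2 + rng.normal(0, 0.5, 200)

# Score each candidate degree with 5-fold cross-validation
results = {}
for degree in [1, 2, 3, 5]:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('model', LinearRegression())
    ])
    results[degree] = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='r2').mean()
    print(f"degree={degree}: mean R² = {results[degree]:.4f}")
```

On this data, degree 2 should clearly beat degree 1, while higher degrees add little or start to overfit.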

Regularized Regression

Prevent overfitting by penalizing large coefficients.

Ridge Regression (L2 Regularization)

Adds penalty proportional to square of coefficients.
Ridge Regression:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}

ridge = Ridge()
grid_search = GridSearchCV(
    ridge, param_grid, cv=5, scoring='r2'
)
grid_search.fit(X_train, y_train)

print(f"Best alpha: {grid_search.best_params_['alpha']}")
print(f"Best R²: {grid_search.best_score_:.4f}")

# Use best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
```
When to use:
  • Many correlated features
  • Want to keep all features but reduce their impact
  • Prevent overfitting
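The shrinkage effect is easy to see with correlated features. A minimal sketch on synthetic, nearly collinear data (illustrative only): OLS coefficients become unstable, while Ridge keeps them small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly collinear features (illustrative only)
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.01, 200)   # almost a copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3 * x1 + rng.normal(0, 0.1, 200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)

# Ridge spreads the signal across the correlated pair and shrinks the norm
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```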

Lasso Regression (L1 Regularization)

Adds penalty proportional to absolute value of coefficients. Can shrink coefficients to zero (feature selection).
Lasso Regression:

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Features with non-zero coefficients
selected_features = X.columns[lasso.coef_ != 0]
print(f"Selected features: {len(selected_features)}/{len(X.columns)}")
```
When to use:
  • Want automatic feature selection
  • Have many irrelevant features
  • Need a sparse model
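The feature-selection behavior can be demonstrated on synthetic data where only two of ten features matter (the data and the `alpha` value are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

# 10 features, but only the first 2 drive the target (illustrative only)
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 10))
y_demo = 4 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(0, 0.5, 300)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

# The 8 irrelevant features get coefficients of exactly zero
print("Coefficients:", np.round(lasso.coef_, 2))
print("Non-zero features:", np.flatnonzero(lasso.coef_))
```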

Ensemble Methods

Combine multiple models for better predictions.

Gradient Boosting Regressor

Sequentially builds trees, each correcting errors of previous trees.
Gradient Boosting:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Define hyperparameters
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5, 7]
}

gbr = GradientBoostingRegressor(random_state=42)
grid_search = GridSearchCV(
    gbr, param_grid, cv=5, scoring='neg_mean_squared_error'
)
grid_search.fit(X_train, y_train)

# Best model
best_gbr = grid_search.best_estimator_
y_pred = best_gbr.predict(X_test)

# Feature importance
importances = best_gbr.feature_importances_
for feat, imp in zip(X.columns, importances):
    print(f"{feat}: {imp:.4f}")
```
Advantages:
  • Often best performance
  • Handles non-linear relationships
  • Provides feature importance
  • More robust to outliers than linear models (especially with robust loss functions such as `loss='huber'`)
Disadvantages:
  • Slower to train
  • More hyperparameters to tune
  • Can overfit without proper tuning

Module A6 Project Results

From the e-commerce sales prediction project:
| Model | MAE ($) | RMSE ($) | R² |
|---|---|---|---|
| Linear Regression (baseline) | 45.20 | 58.30 | 0.7450 |
| Linear Regression (full) | 38.15 | 49.80 | 0.8120 |
| Polynomial (degree=2) | 35.60 | 46.20 | 0.8340 |
| Ridge (optimized) | 34.80 | 45.10 | 0.8420 |
| Gradient Boosting | 32.15 | 41.50 | 0.8823 |
The Gradient Boosting model achieved the best performance, with an average prediction error (MAE) of $32.15 while explaining 88.23% of the variance in sales.

Feature Importance

Top features influencing sales predictions:
  1. Unit price (importance: 0.45) - Strongest predictor
  2. Quantity (importance: 0.28) - Number of items ordered
  3. Shipping cost (importance: 0.12) - Logistics impact
  4. Discount (importance: 0.08) - Promotion effect
  5. Product category (importance: 0.07) - Category variations

Complete Pipeline with Preprocessing

Full ML Pipeline:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Define preprocessing
numeric_features = ['quantity', 'unit_price', 'discount', 'shipping_cost']
categorical_features = ['category', 'subcategory', 'country']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Create pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=5))
])

# Train
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"MAE: ${mae:.2f}")
print(f"RMSE: ${rmse:.2f}")
print(f"R²: {r2:.4f}")
```

Best Practices

Always start with a baseline model. Train a simple linear regression first to establish performance expectations.
Use cross-validation for model selection. Don’t rely on a single train-test split - use K-fold cross-validation to get more reliable performance estimates.
Feature scaling matters. Penalized models like Ridge and Lasso are sensitive to feature scales, so scale numeric features with StandardScaler or MinMaxScaler before fitting them. (Tree-based models such as Gradient Boosting are scale-invariant and don't need this.)
More complex doesn’t always mean better. Sometimes a well-tuned Ridge regression performs nearly as well as Gradient Boosting with much faster training and prediction.
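The first two practices can be combined in a few lines: cross-validate a trivial baseline (here sklearn's `DummyRegressor`, which predicts the training mean) alongside a real model, on synthetic data for illustration:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data (illustrative only)
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(300, 4))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.5, 300)

baseline = DummyRegressor(strategy='mean')   # always predicts the training mean
linear = LinearRegression()

# 5-fold cross-validated MAE for both models
maes = {}
for name, est in [('baseline', baseline), ('linear', linear)]:
    scores = cross_val_score(est, X_demo, y_demo, cv=5,
                             scoring='neg_mean_absolute_error')
    maes[name] = -scores.mean()
    print(f"{name}: MAE = {maes[name]:.3f}")
```

Any model that fails to beat the baseline's MAE by a meaningful margin isn't learning anything useful.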

Business Value

Predicting sales enables:
  1. Personalized marketing: Target high-value customers with custom offers
  2. Inventory management: Stock popular items, reduce slow movers
  3. Revenue forecasting: Accurate quarterly and annual projections
  4. Dynamic pricing: Adjust prices based on predicted demand
  5. Customer segmentation: Identify high-value vs. low-value customers

Next Steps

Classification models

Learn to predict categories instead of numbers

Model evaluation

Deep dive into metrics and model comparison

Project walkthrough

Complete implementation guide for the e-commerce project
