Overview
This project builds supervised regression models to predict the total sales amount per order for an e-commerce company. The goal is to enable personalized campaigns, optimize stock management, and forecast revenue based on customer, product, and logistics data.
Business Objective: Anticipate order value to:
- Personalize marketing campaigns
- Optimize inventory levels
- Estimate expected revenue per customer
- Identify high-value customer segments
- Set dynamic pricing strategies
Project Structure
Dataset
- Source: amazon_sales_dataset.csv
- Records: 10,000 orders
- Features: 23 columns
- Target: total_sales (order amount in dollars)
Variables
Temporal:
- order_date: Order placement date
- ship_date: Shipment date
- delivery_date: Delivery date
Customer:
- customer_id: Customer identifier
- customer_name: Customer name
- country, state, city: Geographic location
Product:
- product_id, product_name: Product identifiers
- category, sub_category: Product classification
- brand: Product brand
Sales:
- quantity: Units ordered
- unit_price: Price per unit
- discount: Discount percentage (0-1)
- shipping_cost: Shipping fee
- total_sales: Target variable (quantity × unit_price - discount + shipping)
Engineered Features
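As an illustration of the kind of features this section refers to, here is a minimal sketch that derives date-based columns with pandas. The feature importance list in this document names `total_delay_days`; its definition as days from order to delivery is an assumption, and `processing_days` and `order_month` are hypothetical additions used only for illustration.

```python
import pandas as pd

# Tiny illustrative sample; the real data would come from amazon_sales_dataset.csv.
orders = pd.DataFrame({
    "order_date": ["2023-01-02", "2023-03-15"],
    "ship_date": ["2023-01-04", "2023-03-16"],
    "delivery_date": ["2023-01-09", "2023-03-20"],
})


def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with date-derived features (definitions are assumptions)."""
    out = df.copy()
    for col in ("order_date", "ship_date", "delivery_date"):
        out[col] = pd.to_datetime(out[col])
    # Assumed definition: days from order placement to delivery.
    out["total_delay_days"] = (out["delivery_date"] - out["order_date"]).dt.days
    # Hypothetical extras, not confirmed by the project.
    out["processing_days"] = (out["ship_date"] - out["order_date"]).dt.days
    out["order_month"] = out["order_date"].dt.month
    return out


features = add_date_features(orders)
print(features[["total_delay_days", "processing_days", "order_month"]])
```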
Data Preparation
1. Handle Missing Values
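One common way to handle missing values is median imputation for numeric columns and most-frequent imputation for categorical ones. The sketch below assumes that strategy; the project may have used a different one, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in numeric and categorical columns.
orders = pd.DataFrame({
    "unit_price": [10.0, np.nan, 30.0, 20.0],
    "discount": [0.1, 0.0, np.nan, 0.2],
    "brand": ["Acme", None, "Acme", "Zenith"],
})


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and other gaps with the mode."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.columns.difference(num_cols)
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


clean = impute_missing(orders)
print(clean)
```

In a real workflow the fill values would normally be computed on the training split only, so that no information from the test set leaks into preprocessing.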
2. Feature Selection
3. Train-Test Split
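A hold-out split could look like the sketch below. The 80/20 ratio, the `random_state`, and the synthetic stand-in data are assumptions; the source does not state the split that was actually used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the order table (not the real dataset).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "quantity": rng.integers(1, 6, size=n),
    "unit_price": rng.uniform(5, 200, size=n).round(2),
    "shipping_cost": rng.uniform(0, 20, size=n).round(2),
})
y = X["quantity"] * X["unit_price"] + X["shipping_cost"]

# Hold out 20% of the orders for final evaluation (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```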
4. Preprocessing Pipeline
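A preprocessing pipeline for mixed numeric and categorical columns is often built with scikit-learn's `ColumnTransformer`. The sketch below is one possible layout; the choice of columns, the imputation strategies, and the use of scaling plus one-hot encoding are assumptions rather than the project's confirmed configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column groups, taken from the variable list in this document.
numeric_features = ["quantity", "unit_price", "discount", "shipping_cost"]
categorical_features = ["category", "brand"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # Categories unseen during fit are encoded as all zeros.
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_features),
    ("cat", categorical_steps, categorical_features),
])

# Made-up rows to show the transformer running end to end.
sample = pd.DataFrame({
    "quantity": [1, 2, 3, 4],
    "unit_price": [10.0, 25.0, 40.0, 15.0],
    "discount": [0.0, 0.1, 0.2, 0.0],
    "shipping_cost": [5.0, 7.5, 0.0, 3.0],
    "category": ["Books", "Toys", "Books", "Home"],
    "brand": ["Acme", "Zenith", "Acme", "Acme"],
})
transformed = preprocessor.fit_transform(sample)
print(transformed.shape)
```

Wrapping the preprocessor and the estimator in a single `Pipeline` keeps the fitted statistics tied to the training data, which helps avoid leakage during cross-validation.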
Modeling Strategy
1. Baseline Model: Simple Linear Regression
Establish baseline performance.
2. Linear Regression with Full Features
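The difference between a single-predictor baseline and a full-feature linear model can be sketched as follows. The data is synthetic, and using `unit_price` as the baseline's only predictor is an assumption; the source does not say which feature the baseline used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic orders: total is roughly quantity * unit_price + shipping_cost.
rng = np.random.default_rng(0)
n = 500
unit_price = rng.uniform(5, 200, size=n)
quantity = rng.integers(1, 6, size=n).astype(float)
shipping_cost = rng.uniform(0, 20, size=n)
total_sales = quantity * unit_price + shipping_cost + rng.normal(0, 5, size=n)

X = np.column_stack([unit_price, quantity, shipping_cost])
X_train, X_test, y_train, y_test = train_test_split(
    X, total_sales, test_size=0.2, random_state=42
)

# Baseline: one predictor only (column 0 = unit_price, an assumed choice).
baseline = LinearRegression().fit(X_train[:, [0]], y_train)
r2_baseline = r2_score(y_test, baseline.predict(X_test[:, [0]]))

# Full model: all available features.
full = LinearRegression().fit(X_train, y_train)
r2_full = r2_score(y_test, full.predict(X_test))

print(f"baseline R2={r2_baseline:.4f}, full R2={r2_full:.4f}")
```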
3. Polynomial Regression
Capture non-linear relationships.
4. Ridge Regression (L2 Regularization)
Prevent overfitting with regularization.
5. Gradient Boosting Regressor
Advanced ensemble method.
Model Comparison
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline | $45.32 | $67.18 | 0.7892 |
| Linear | $38.21 | $54.76 | 0.8456 |
| Polynomial | $35.67 | $51.23 | 0.8612 |
| Ridge | $37.89 | $54.12 | 0.8478 |
| Gradient Boosting | $32.15 | $47.89 | 0.8823 |
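
Metrics like those in the table above could be computed along these lines. This is a sketch on synthetic data, not the project's code: the feature set, the hyperparameters (`degree=2`, `alpha=1.0`), and the 80/20 split are assumptions, so the printed numbers will not match the table.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic orders standing in for the real dataset.
rng = np.random.default_rng(0)
n = 600
unit_price = rng.uniform(5, 200, size=n)
quantity = rng.integers(1, 6, size=n).astype(float)
shipping_cost = rng.uniform(0, 20, size=n)
y = quantity * unit_price + shipping_cost + rng.normal(0, 5, size=n)
X = np.column_stack([unit_price, quantity, shipping_cost])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models with assumed hyperparameters.
models = {
    "Linear": LinearRegression(),
    "Polynomial": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
    ),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

for name, m in results.items():
    print(f"{name:<18} MAE={m['MAE']:.2f} RMSE={m['RMSE']:.2f} R2={m['R2']:.4f}")
```

Because the synthetic target is an exact product of two features plus noise, the ranking of models here says nothing about their ranking on the real data.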
Feature Importance
- unit_price (0.42)
- quantity (0.28)
- shipping_cost (0.11)
- discount (0.08)
- total_delay_days (0.04)
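Importance scores of this kind are typically read from a fitted tree ensemble's `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data with three of the features; the scores it prints are not the ones reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic orders; only three of the project's features are simulated here.
rng = np.random.default_rng(0)
n = 500
feature_names = ["unit_price", "quantity", "shipping_cost"]
unit_price = rng.uniform(5, 200, size=n)
quantity = rng.integers(1, 6, size=n).astype(float)
shipping_cost = rng.uniform(0, 20, size=n)
X = np.column_stack([unit_price, quantity, shipping_cost])
y = quantity * unit_price + shipping_cost + rng.normal(0, 5, size=n)

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:<14} {score:.3f}")
```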
Residual Analysis
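A residual analysis usually starts from simple summaries of the prediction errors. The helper below is a hypothetical sketch (the function name and the chosen statistics are assumptions), run on made-up values.

```python
import numpy as np


def residual_summary(y_true, y_pred):
    """Summarize residuals (actual minus predicted) for a quick sanity check."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return {
        # A mean near zero suggests little systematic over- or under-prediction.
        "mean": float(residuals.mean()),
        "std": float(residuals.std(ddof=1)),
        "max_abs": float(np.abs(residuals).max()),
        # Rough heteroscedasticity check: does error size grow with the prediction?
        "abs_resid_vs_pred_corr": float(
            np.corrcoef(np.abs(residuals), y_pred)[0, 1]
        ),
    }


# Made-up actual and predicted order totals.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [90.0, 155.0, 210.0, 245.0]
summary = residual_summary(y_true, y_pred)
print(summary)
```

A plot of residuals against predicted values is the usual companion to these numbers; it is omitted here to keep the example dependency-free.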
Final Model Selection
Selected Model: Gradient Boosting Regressor
Reasons:
- Lowest error: MAE = $32.15, RMSE = $47.89
- Highest R²: 0.8823 (explains 88.23% of variance)
- Stable cross-validation: Consistent performance across folds
- Feature importance: Provides interpretability
- Generalization: Best balance between bias and variance
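The cross-validation stability mentioned above could be checked with a k-fold run such as the following. The fold count, the scoring metric, and the synthetic data are assumptions; the source does not describe the exact protocol.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic orders standing in for the real dataset.
rng = np.random.default_rng(0)
n = 400
unit_price = rng.uniform(5, 200, size=n)
quantity = rng.integers(1, 6, size=n).astype(float)
shipping_cost = rng.uniform(0, 20, size=n)
X = np.column_stack([unit_price, quantity, shipping_cost])
y = quantity * unit_price + shipping_cost + rng.normal(0, 5, size=n)

# Five shuffled folds (assumed); a small spread across folds suggests stability.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    GradientBoostingRegressor(random_state=42), X, y, cv=cv, scoring="r2"
)
print("R2 per fold:", np.round(scores, 4))
print(f"mean={scores.mean():.4f}, std={scores.std():.4f}")
```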
Model Deployment
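One simple deployment path is to persist the fitted model with `joblib` and reload it wherever predictions are served. The sketch below assumes that approach; the file name `total_sales_model.joblib`, the feature order, and the synthetic training data are all hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Train on synthetic orders (stand-in for the real training set).
rng = np.random.default_rng(0)
n = 300
unit_price = rng.uniform(5, 200, size=n)
quantity = rng.integers(1, 6, size=n).astype(float)
shipping_cost = rng.uniform(0, 20, size=n)
X = np.column_stack([unit_price, quantity, shipping_cost])
y = quantity * unit_price + shipping_cost + rng.normal(0, 5, size=n)
model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Save and reload; a temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "total_sales_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# Score one new order: unit_price, quantity, shipping_cost (assumed feature order).
new_order = np.array([[49.99, 2.0, 5.0]])
predicted = float(loaded.predict(new_order)[0])
print(f"Predicted total_sales: ${predicted:.2f}")
```

Models saved this way should be loaded with the same scikit-learn version that produced them, and only from trusted sources, since `joblib` files are pickle-based.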
Business Value
1. Campaign Personalization
- Predict customer lifetime value
- Target high-value customers with premium offers
- Customize discount strategies by segment
2. Inventory Optimization
- Forecast demand for high-value products
- Reduce stockouts of best-sellers
- Minimize excess inventory costs
3. Revenue Forecasting
- Estimate monthly/quarterly revenue
- Set realistic sales targets
- Allocate marketing budgets effectively
4. Dynamic Pricing
- Adjust prices based on predicted demand
- Optimize discount levels to maximize profit
- Implement surge pricing during peak periods
Limitations and Future Work
Limitations
- Data quality: Model depends on accurate historical data
- Feature engineering: May benefit from additional behavioral features (clicks, time on page)
- Temporal dynamics: Current model doesn’t capture time-series patterns
- External factors: Doesn’t account for seasonality, promotions, or competitor actions
Future Work
- Advanced models: Try XGBoost, LightGBM, neural networks
- Time-series: Implement ARIMA or Prophet for temporal forecasting
- Real-time features: Integrate browsing behavior and session data
- A/B testing: Validate model impact on business metrics
- Model monitoring: Track prediction drift and retrain periodically
- Explainability: Implement SHAP values for individual predictions