Quickstart Guide

This guide will walk you through setting up and running the ecommerce customer spending analysis.

Prerequisites

Before you begin, ensure you have the following installed:
  • Python 3.7+ - Programming language runtime
  • pip - Python package manager
  • Jupyter Notebook - Interactive notebook environment (optional but recommended)

Required Python Packages

pip install pandas matplotlib seaborn scikit-learn ydata-profiling
| Package | Purpose |
| --- | --- |
| pandas | Data manipulation and analysis |
| matplotlib | Data visualization |
| seaborn | Statistical plotting |
| scikit-learn | Machine learning algorithms |
| ydata-profiling | Automated data profiling |
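
To confirm the installation worked before opening the notebook, a small check along these lines may help. It is a sketch only: the `REQUIRED` mapping and the `find_missing` helper are illustrative names, not part of the project. Note that two packages are imported under a different name than the one used with pip.

```python
import importlib.util

# pip name -> import name. Two of them differ:
# scikit-learn is imported as sklearn, ydata-profiling as ydata_profiling.
REQUIRED = {
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "ydata-profiling": "ydata_profiling",
}


def find_missing(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = find_missing()
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```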

Project Setup

1. Clone or Download the Project

Download the project files including:
  • main.ipynb - Main analysis notebook
  • data/ecommerce_customers.csv - Customer dataset
  • README.md - Project documentation

2. Install Dependencies

Open your terminal and install the required packages:
pip install pandas matplotlib seaborn scikit-learn ydata-profiling

3. Navigate to Project Directory

cd path/to/ecommerce-linear-regression

4. Launch Jupyter Notebook

Start the Jupyter Notebook server:
jupyter notebook main.ipynb
This will open the analysis notebook in your browser.

Running the Analysis

1. Import Required Libraries

First, import all necessary Python libraries:
import pandas as pd
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

2. Load the Dataset

Read the ecommerce customer data from the CSV file:
# Read the data into a DataFrame
db_path = "data/"
df = pd.read_csv(db_path + "ecommerce_customers.csv")

# Display first few rows
df.head()
Expected Output: You’ll see customer records with columns:
  • Email
  • Address
  • Avatar
  • Avg. Session Length
  • Time on App
  • Time on Website
  • Length of Membership
  • Yearly Amount Spent
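
If the columns you see differ from this list, later steps that select columns by name will raise a `KeyError`. The sketch below shows one way to check up front; `EXPECTED_COLUMNS` and `missing_columns` are illustrative names, and the demo frame is a stand-in rather than the real dataset.

```python
import pandas as pd

# Column names as listed in this guide.
EXPECTED_COLUMNS = [
    "Email",
    "Address",
    "Avatar",
    "Avg. Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
    "Yearly Amount Spent",
]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [name for name in expected if name not in df.columns]


# Stand-in frame that deliberately lacks the "Avatar" column.
demo = pd.DataFrame(columns=[c for c in EXPECTED_COLUMNS if c != "Avatar"])
print(missing_columns(demo))  # ['Avatar']
```

In the notebook, `missing_columns(df)` should return an empty list.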

3. Explore the Data

Generate summary statistics:
df.describe()
The dataset contains 500 customer records with no missing values. Use the count, mean, min, and max rows of the output to confirm this and to see the scale of each numerical feature.
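
The row count and the absence of missing values can be checked directly rather than read off the summary table. The helper below is a hypothetical sketch (`quick_checks` is not part of the project), and the demo frame uses made-up values with one gap so that the check has something to report.

```python
import pandas as pd


def quick_checks(df, expected_rows=500):
    """Summarize the row count and the total number of missing cells."""
    return {
        "rows": len(df),
        "row_count_ok": len(df) == expected_rows,
        "missing_values": int(df.isna().sum().sum()),
    }


# Made-up stand-in data with one missing cell.
demo = pd.DataFrame({
    "Time on App": [12.1, None, 11.3],
    "Yearly Amount Spent": [587.9, 392.2, 487.5],
})
print(quick_checks(demo, expected_rows=3))
# {'rows': 3, 'row_count_ok': True, 'missing_values': 1}
```

On the real data, `quick_checks(df)` should report 500 rows and 0 missing values if the description above holds.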

4. Prepare Features and Target

Separate the independent variables (features) from the dependent variable (target):
# Select feature columns
X = df[['Avg. Session Length', 'Time on App', 
        'Time on Website', 'Length of Membership']]

# Select target variable
y = df['Yearly Amount Spent']

5. Split the Data

Divide the dataset into training (70%) and testing (30%) sets:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
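
To see what the split produces, the sketch below runs the same call on stand-in arrays with 500 rows and 4 columns, matching the dataset's size. The `*_demo` names are placeholders. It also shows that a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the dataset (500).
X_demo = np.arange(500 * 4).reshape(500, 4)
y_demo = np.arange(500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)
print(X_tr.shape, X_te.shape)  # (350, 4) (150, 4)

# Repeating the call with the same seed selects the same rows.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)
print(np.array_equal(X_te, X_te_again))  # True
```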

6. Train the Model

Create and train a linear regression model:
# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

7. Evaluate Model Performance

Calculate key performance metrics:
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")
Expected Results:
Mean Squared Error: 80.90
R² Score: 0.9885
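
For intuition, both metrics can be computed by hand: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses three made-up values, not the project's data, and `mse_and_r2` is an illustrative helper. Taking the square root of the MSE gives an error in the same units as the target, so an MSE of 80.90 corresponds to a typical error of roughly 9 (dollars, assuming spending is recorded in dollars).

```python
import math

import numpy as np


def mse_and_r2(y_true, y_hat):
    """Compute mean squared error and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y_true - y_hat
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return ss_res / len(y_true), 1.0 - ss_res / ss_tot


# Made-up values for illustration only.
mse_demo, r2_demo = mse_and_r2([400.0, 500.0, 600.0], [410.0, 490.0, 600.0])
print(f"MSE: {mse_demo:.2f}")  # MSE: 66.67
print(f"R²: {r2_demo:.4f}")    # R²: 0.9900

# Root mean squared error for the MSE reported in this guide.
print(f"RMSE: {math.sqrt(80.90):.2f}")  # RMSE: 8.99
```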

8. Analyze Coefficients

Examine the impact of each feature:
# Get model coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})
print(coefficients)

# Get intercept
print(f"\nIntercept: {model.intercept_:.2f}")
Expected Output:
| Feature | Coefficient |
| --- | --- |
| Avg. Session Length | ~25.83 |
| Time on App | ~38.81 |
| Time on Website | ~0.28 |
| Length of Membership | ~61.30 |
Intercept: ~-1048.82

Interpreting the Results

The coefficients reveal how each factor influences yearly spending:

Length of Membership

Coefficient: 61.30
With the other features held constant, each additional year of membership is associated with approximately $61.30 more in yearly spending.

Time on App

Coefficient: 38.81
With the other features held constant, each additional minute on the mobile app is associated with approximately $38.81 more in yearly spending.

Avg. Session Length

Coefficient: 25.83
With the other features held constant, each additional minute of average session length is associated with approximately $25.83 more in yearly spending.

Time on Website

Coefficient: 0.28
Website time has minimal impact: only about $0.28 in yearly spending per additional minute, with the other features held constant.
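
A linear regression prediction is the intercept plus each coefficient multiplied by its feature value. The sketch below works through that arithmetic using the approximate coefficients reported in this guide. The customer values are hypothetical, chosen only for illustration, and `predict_spend` is an illustrative helper rather than project code.

```python
# Approximate values taken from the expected output in this guide.
INTERCEPT = -1048.82
COEFFICIENTS = {
    "Avg. Session Length": 25.83,
    "Time on App": 38.81,
    "Time on Website": 0.28,
    "Length of Membership": 61.30,
}


def predict_spend(customer):
    """Intercept plus the sum of coefficient * feature value."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * customer[name] for name in COEFFICIENTS
    )


# Hypothetical customer, for illustration only.
customer = {
    "Avg. Session Length": 33.0,
    "Time on App": 12.0,
    "Time on Website": 37.0,
    "Length of Membership": 3.5,
}
base = predict_spend(customer)
print(f"Predicted yearly spend: {base:.2f}")  # 494.20

# One more year of membership, everything else unchanged.
one_more_year = predict_spend({**customer, "Length of Membership": 4.5})
print(f"Change: {one_more_year - base:.2f}")  # 61.30
```

The change equals the membership coefficient, which is what "with the other features held constant" means in practice.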

Visualization (Optional)

Visualize the relationship between predicted and actual values:
import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], 
         [y_test.min(), y_test.max()], 
         'r--', lw=2)
plt.xlabel('Actual Yearly Amount Spent')
plt.ylabel('Predicted Yearly Amount Spent')
plt.title('Actual vs Predicted Customer Spending')
plt.tight_layout()
plt.show()
The points should cluster tightly around the diagonal line, indicating accurate predictions.

Business Recommendations

Based on the model results:
  1. Invest in Mobile App - The app coefficient (38.81) is roughly 139 times the website coefficient (0.28), making it the clear priority
  2. Focus on Customer Retention - Length of membership has the strongest impact (61.30), so loyalty programs are critical
  3. Improve Website Experience - The low website coefficient suggests significant untapped potential
  4. Optimize Session Quality - Session length matters, so personalized recommendations and styling advice drive revenue
With an R² score of 0.9885 on the held-out test set, the model explains 98.85% of the variance in customer spending there, which supports using these results for strategic decision-making.

Troubleshooting

Common Issues

Import Errors

ModuleNotFoundError: No module named 'sklearn'
Solution: Install scikit-learn:
pip install scikit-learn

File Not Found

FileNotFoundError: [Errno 2] No such file or directory: 'data/ecommerce_customers.csv'
Solution: Ensure you’re in the correct directory and the data folder exists.

Jupyter Notebook Not Opening

jupyter: command not found
Solution: Install Jupyter:
pip install jupyter notebook
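
For the File Not Found case, it can help to print where Python is actually looking. The sketch below assumes the relative path used in this guide; `locate_dataset` is an illustrative helper, not part of the project.

```python
from pathlib import Path


def locate_dataset(relative_path="data/ecommerce_customers.csv"):
    """Resolve the path against the working directory and report if it exists."""
    path = Path.cwd() / relative_path
    return path, path.is_file()


path, found = locate_dataset()
print(f"Working directory: {Path.cwd()}")
print(f"Looking for:       {path}")
if found:
    print("Dataset found.")
else:
    print("Dataset not found. Change to the project directory or fix the path.")
```

Run this in a notebook cell: the working directory is usually the folder Jupyter was started from, which may differ from the folder containing the notebook.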

Next Steps

Now that you’ve run the analysis, consider:
  • Experimenting with different train/test split ratios
  • Adding polynomial features for non-linear relationships
  • Trying other regression algorithms (Ridge, Lasso, ElasticNet)
  • Performing feature engineering to create new predictive variables
  • Analyzing residuals to identify model improvements
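
As a starting point for trying other regression algorithms, the sketch below compares plain linear regression with Ridge and Lasso using 5-fold cross-validated R². It runs on synthetic data from `make_regression`, not the customer dataset, and the `alpha` values are arbitrary placeholders rather than tuned settings. `compare_models` is an illustrative helper.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score


def compare_models(X, y, cv=5):
    """Return the mean cross-validated R² for each candidate model."""
    models = {
        "LinearRegression": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),   # alpha chosen arbitrarily
        "Lasso": Lasso(alpha=1.0),   # alpha chosen arbitrarily
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        for name, model in models.items()
    }


# Synthetic stand-in: 500 rows, 4 features, like the dataset's shape.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=4, noise=10.0, random_state=1
)
scores = compare_models(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: mean R² = {score:.4f}")
```

In the notebook, `compare_models(X, y)` applies the same comparison to the features and target prepared earlier.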

