Quickstart Guide

This guide walks you through setting up and running the ecommerce customer spending analysis.

Prerequisites

Before you begin, ensure you have the following installed:

Python 3.7+ - Programming language runtime
pip - Python package manager
Jupyter Notebook - Interactive notebook environment (optional but recommended)

Required Python Packages

pip install pandas matplotlib seaborn scikit-learn ydata-profiling

Package	Purpose
pandas	Data manipulation and analysis
matplotlib	Data visualization
seaborn	Statistical plotting
scikit-learn	Machine learning algorithms
ydata-profiling	Automated data profiling

Project Setup

Clone or Download the Project

Download the project files including:

main.ipynb - Main analysis notebook
data/ecommerce_customers.csv - Customer dataset
README.md - Project documentation

Install Dependencies

Open your terminal and install the required packages:

pip install pandas matplotlib seaborn scikit-learn ydata-profiling

Navigate to Project Directory

cd path/to/ecommerce-linear-regression

Launch Jupyter Notebook

Start the Jupyter Notebook server:

jupyter notebook main.ipynb

This will open the analysis notebook in your browser.

Running the Analysis

1. Import Required Libraries

First, import all necessary Python libraries:

import pandas as pd
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

2. Load the Dataset

Read the ecommerce customer data from the CSV file:

# Read the data into a DataFrame
db_path = "data/"
df = pd.read_csv(db_path + "ecommerce_customers.csv")

# Display first few rows
df.head()

Expected Output: You’ll see customer records with columns:

Email
Address
Avatar
Avg. Session Length
Time on App
Time on Website
Length of Membership
Yearly Amount Spent

3. Explore the Data

Generate summary statistics:

df.describe()

The dataset contains 500 customer records with no missing values. All numerical features have been normalized for consistent scaling.
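Note that `df.describe()` summarizes numeric columns only and does not report missing values directly. The sketch below shows one way to check the row count and missing values. The `sample` frame is a hypothetical stand-in with made-up values so the snippet runs without the CSV; in the notebook, the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; values are made up for illustration only.
sample = pd.DataFrame({
    "Time on App": [12.1, 11.8, 13.0],
    "Length of Membership": [3.5, 4.1, 2.9],
    "Yearly Amount Spent": [510.2, 498.7, 470.4],
})

n_rows, n_cols = sample.shape               # (rows, columns)
missing_per_column = sample.isna().sum()    # count of NaN values per column
total_missing = int(missing_per_column.sum())

print(f"Rows: {n_rows}, columns: {n_cols}")
print(missing_per_column)
print(f"Total missing values: {total_missing}")
```

On the real dataset, a total of 0 would confirm the "no missing values" statement above.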

4. Prepare Features and Target

Separate the independent variables (features) from the dependent variable (target):

# Select feature columns
X = df[['Avg. Session Length', 'Time on App', 
        'Time on Website', 'Length of Membership']]

# Select target variable
y = df['Yearly Amount Spent']

5. Split the Data

Divide the dataset into training (70%) and testing (30%) sets:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
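To confirm the split proportions, the sketch below applies the same call to a placeholder frame with 500 rows, matching the row count stated in step 3. The random values and the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with 500 rows; random values, not the real dataset.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(500, 4)),
    columns=["Avg. Session Length", "Time on App",
             "Time on Website", "Length of Membership"],
)
y_demo = pd.Series(rng.normal(size=500), name="Yearly Amount Spent")

# Same arguments as the guide: 30% held out, fixed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

print(len(X_tr), len(X_te))  # 350 150
```

Fixing `random_state` means the same rows land in each set on every run, which is what makes the expected results later in the guide reproducible.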

6. Train the Model

Create and train a linear regression model:

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

7. Evaluate Model Performance

Calculate key performance metrics:

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.4f}")

Expected Results:

Mean Squared Error: 80.90
R² Score: 0.9885
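Because MSE is expressed in squared units of the target, its square root (RMSE) is often easier to read. The arithmetic below uses the MSE value reported above; it is a worked calculation, not additional model output, and `mse_reported` is a name introduced here for illustration.

```python
import math

mse_reported = 80.90              # MSE from the expected results above
rmse = math.sqrt(mse_reported)    # back in the units of the target

print(f"RMSE: {rmse:.2f}")  # RMSE: 8.99
```

Read this way, a typical test-set prediction is off by roughly 9 units of yearly spend.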

8. Analyze Coefficients

Examine the impact of each feature:

# Get model coefficients
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
})
print(coefficients)

# Get intercept
print(f"\nIntercept: {model.intercept_:.2f}")

Expected Output:

Feature	Coefficient
Avg. Session Length	~25.83
Time on App	~38.81
Time on Website	~0.28
Length of Membership	~61.30

Intercept: ~-1048.82

Interpreting the Results

The coefficients reveal how each factor influences yearly spending:

Length of Membership

Coefficient: 61.30
Each additional year of membership is associated with approximately $61.30 more in yearly spending, holding the other features constant.

Time on App

Coefficient: 38.81
Each additional minute on the mobile app is associated with approximately $38.81 more in yearly spending.

Avg. Session Length

Coefficient: 25.83
Each additional minute of average session length is associated with approximately $25.83 more in yearly spending.

Time on Website

Coefficient: 0.28
Website time has minimal impact: only about $0.28 per additional minute.
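These readings follow from the linear form of the model: a prediction is the intercept plus each coefficient times its feature value. The sketch below plugs in the approximate coefficients reported in step 8. The customer values and the helper `predict_spend` are hypothetical, chosen only to illustrate the arithmetic.

```python
# Approximate coefficients and intercept as reported in step 8.
coefficients = {
    "Avg. Session Length": 25.83,
    "Time on App": 38.81,
    "Time on Website": 0.28,
    "Length of Membership": 61.30,
}
intercept = -1048.82


def predict_spend(customer):
    """Return intercept + sum(coefficient * feature value)."""
    return intercept + sum(
        coefficients[name] * value for name, value in customer.items()
    )


# Hypothetical customer; these values are illustrative, not from the dataset.
customer = {
    "Avg. Session Length": 33.0,
    "Time on App": 12.0,
    "Time on Website": 37.0,
    "Length of Membership": 3.5,
}

base = predict_spend(customer)

# One more year of membership, everything else unchanged.
longer = {**customer, "Length of Membership": 4.5}
delta = predict_spend(longer) - base

print(f"Predicted spend: {base:.2f}")       # Predicted spend: 494.20
print(f"Change from +1 year: {delta:.2f}")  # Change from +1 year: 61.30
```

The difference between the two predictions equals the membership coefficient, which is what "per additional year, holding the other features constant" means in practice.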

Visualization (Optional)

Visualize the relationship between predicted and actual values:

import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], 
         [y_test.min(), y_test.max()], 
         'r--', lw=2)
plt.xlabel('Actual Yearly Amount Spent')
plt.ylabel('Predicted Yearly Amount Spent')
plt.title('Actual vs Predicted Customer Spending')
plt.tight_layout()
plt.show()

The points should cluster tightly around the diagonal line, indicating accurate predictions.

Business Recommendations

Based on the model results:

Invest in Mobile App - The app coefficient (38.81) is roughly 139 times larger than the website coefficient (0.28), making it the clear priority
Focus on Customer Retention - Length of membership has the strongest impact (61.30), so loyalty programs are critical
Improve Website Experience - The low website coefficient suggests significant untapped potential
Optimize Session Quality - Session length matters, so personalized recommendations and styling advice drive revenue

With an R² score of 0.9885, the model explains 98.85% of the variance in customer spending on the test set, providing highly reliable insights for strategic decision-making.

Troubleshooting

Common Issues

Import Errors

ModuleNotFoundError: No module named 'sklearn'

Solution: Install scikit-learn:

pip install scikit-learn
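To see which of the required packages are missing from the active environment, a small check with the standard library can help. This is a sketch; the mapping from pip names to import names covers only the packages listed in this guide.

```python
from importlib.util import find_spec

# pip package name -> name used in the import statement
required = {
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "ydata-profiling": "ydata_profiling",
}

# find_spec returns None when a top-level module cannot be found.
missing = [
    pip_name for pip_name, module in required.items()
    if find_spec(module) is None
]

if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

Running it from inside the notebook checks the same environment the kernel uses, which may differ from the one your terminal's `pip` installs into.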

File Not Found

FileNotFoundError: [Errno 2] No such file or directory: 'data/ecommerce_customers.csv'
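This error usually means the notebook's working directory is not the project root. A minimal diagnostic sketch using `pathlib`, assuming the relative path used in step 2:

```python
from pathlib import Path

data_file = Path("data") / "ecommerce_customers.csv"  # path used in step 2

print("Working directory:", Path.cwd())
print("Looking for:", data_file.resolve())

file_found = data_file.is_file()
if file_found:
    print("Dataset found.")
else:
    print("Dataset not found; check that Jupyter was launched from the project folder.")
```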

Solution: Ensure you’re in the correct directory and the data folder exists.

Jupyter Notebook Not Opening

jupyter: command not found

Solution: Install Jupyter:

pip install jupyter notebook

Next Steps

Now that you’ve run the analysis, consider:

Experimenting with different train/test split ratios
Adding polynomial features for non-linear relationships
Trying other regression algorithms (Ridge, Lasso, ElasticNet)
Performing feature engineering to create new predictive variables
Analyzing residuals to identify model improvements
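As a starting point for trying other regression algorithms, the sketch below compares ordinary least squares with Ridge and Lasso using 5-fold cross-validation. It runs on synthetic data from `make_regression` so that it is self-contained; the `alpha` values are arbitrary assumptions, and in the notebook `X` and `y` from step 4 would take the place of `X_syn` and `y_syn`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 500 samples, 4 features (not the real dataset).
X_syn, y_syn = make_regression(
    n_samples=500, n_features=4, noise=10.0, random_state=1
)

# alpha values are arbitrary starting points, not tuned.
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=1.0)": Ridge(alpha=1.0),
    "Lasso(alpha=0.1)": Lasso(alpha=0.1),
}

scores = {}
for name, estimator in candidates.items():
    # Mean R² across 5 folds; higher is better.
    cv_r2 = cross_val_score(estimator, X_syn, y_syn, cv=5, scoring="r2")
    scores[name] = cv_r2.mean()
    print(f"{name}: mean R² = {scores[name]:.4f}")
```

Cross-validation averages over several splits, so it is less sensitive to one particular train/test partition than the single 70/30 split used in the guide.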
