Data Loading

The dataset is loaded using pandas’ read_csv function from a local CSV file:
import pandas as pd
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Read the data into a DataFrame
db_path = "data/"
df = pd.read_csv(db_path+"ecommerce_customers.csv")

Initial Data Exploration

After loading, the dataset is examined to understand its structure:
# Display first few rows
df.head()
This reveals the 8 columns: Email, Address, Avatar, Avg. Session Length, Time on App, Time on Website, Length of Membership, and Yearly Amount Spent.
# Generate descriptive statistics
df.describe()
The describe() function provides statistical summaries for all numeric columns, confirming 500 complete records across all features.
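Since the actual CSV is not bundled here, the completeness check can be sketched on a synthetic stand-in DataFrame (column names assumed to match the real dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for ecommerce_customers.csv (illustrative values only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Avg. Session Length": rng.normal(33, 1, 500),
    "Time on App": rng.normal(12, 1, 500),
    "Time on Website": rng.normal(37, 1, 500),
    "Length of Membership": rng.normal(3.5, 1, 500),
    "Yearly Amount Spent": rng.normal(500, 79, 500),
})

# Confirm 500 rows with no missing values in any column
print(df.shape)                 # (500, 5)
print(df.isnull().sum().sum())  # 0
```

On the real dataset, `df.isnull().sum()` gives a per-column count of missing values, which is a quick complement to `describe()`.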

Feature Selection

For the linear regression model, we separate the dataset into independent variables (features) and the dependent variable (target).

Independent Variables (X)

Four numeric features are selected as predictors:
cols = ['Avg. Session Length', 'Time on App', 'Time on Website', 'Length of Membership']

X = df[cols]
The non-numeric columns (Email, Address, Avatar) are excluded as they don’t provide meaningful predictive value for regression analysis.
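Instead of listing the predictors by hand, the same selection can be sketched with `select_dtypes`, which drops the text columns automatically (shown on a tiny hypothetical frame):

```python
import pandas as pd

# Tiny illustrative frame mixing text and numeric columns
df = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com"],
    "Avatar": ["Blue", "Green"],
    "Time on App": [12.1, 11.4],
    "Length of Membership": [3.2, 4.1],
})

# Keep only numeric columns; Email and Avatar are excluded automatically
X = df.select_dtypes(include="number")
print(list(X.columns))  # ['Time on App', 'Length of Membership']
```

Explicitly listing the columns, as the tutorial does, is equally valid and makes the feature set self-documenting.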

Target Variable (y)

The dependent variable is the annual spending amount:
y = df['Yearly Amount Spent']

Train-Test Split

The dataset is divided into training and testing sets using a 70-30 split ratio:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Split Configuration

  • test_size (float, default 0.3): 30% of the data (150 customers) is reserved for testing
  • random_state (int, default 1): ensures reproducibility of the random split

Resulting Datasets

  • Training Set: 350 customers (70%)
    • Used to train the linear regression model
    • Model learns the relationship between features and yearly spending
  • Testing Set: 150 customers (30%)
    • Used to evaluate model performance
    • Provides unbiased assessment of prediction accuracy
The 70-30 split is a common practice in machine learning that balances having enough data for training while maintaining a substantial test set for validation.
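The resulting set sizes can be verified on a synthetic 500-row array (stand-in data, since the CSV is not available here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 500 rows, 4 features, 1 target
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = rng.normal(size=500)

# Same configuration as the tutorial: 30% held out, fixed seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print(len(X_train), len(X_test))  # 350 150
```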

Data Preparation Summary

The preparation pipeline follows these steps:
  1. Load CSV data into pandas DataFrame
  2. Explore dataset structure and statistics
  3. Select 4 numeric features as independent variables
  4. Define Yearly Amount Spent as the target variable
  5. Split data into 70% training and 30% testing sets
This structured approach ensures the data is properly formatted for linear regression modeling while maintaining data integrity and reproducibility.
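The five steps above can be sketched end-to-end on synthetic data; the coefficients and noise level below are invented purely so the pipeline has a learnable linear signal, and are not claims about the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 1-2. Build and inspect a synthetic stand-in for the customer data,
#      where yearly spending depends linearly on the four features
rng = np.random.default_rng(42)
cols = ["Avg. Session Length", "Time on App", "Time on Website", "Length of Membership"]
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=cols)
df["Yearly Amount Spent"] = (
    25 * df["Avg. Session Length"]
    + 38 * df["Time on App"]
    + 0.5 * df["Time on Website"]
    + 61 * df["Length of Membership"]
    + rng.normal(scale=10, size=500)  # noise term (arbitrary scale)
)

# 3-4. Select features and target
X, y = df[cols], df["Yearly Amount Spent"]

# 5. 70-30 split, then fit and evaluate a linear regression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(round(r2, 3))  # high R² here, since the synthetic target is linear by construction
```

Because the synthetic target is linear in the features, the R² comes out near 1; on real data it reflects how much of the spending variance the four features actually explain.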