Data Loading
The dataset is loaded from a local CSV file using the pandas read_csv function.
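A minimal sketch of this loading step follows. The helper name `load_customers` and the file name in the usage comment are hypothetical, since the actual path is not given here; only `pd.read_csv` itself comes from the text.

```python
import pandas as pd


def load_customers(csv_path):
    """Read a customer CSV file into a pandas DataFrame.

    `csv_path` is whatever local path holds the data; the real
    file name is not stated in this document.
    """
    return pd.read_csv(csv_path)


# Hypothetical usage (file name is an assumption):
# customers = load_customers("customers.csv")
```

Wrapping the call in a small function keeps the path in one place, which may help if the file is later moved or replaced.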
Initial Data Exploration
After loading, the dataset is examined to understand its structure. The describe() function provides statistical summaries for all numeric columns, confirming 500 complete records across all features.
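The exploration step might look roughly like the sketch below. It uses a tiny made-up DataFrame in place of the real 500-row dataset, and the column name "Time on App" is an assumption; `head()` and `info()` are common companions to `describe()` but are not confirmed by the text.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset (values are illustrative only).
customers = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
    "Time on App": [12.0, 11.5, 13.2],  # assumed column name
    "Yearly Amount Spent": [500.0, 450.0, 610.0],
})

print(customers.head())   # first few rows
customers.info()          # column dtypes and non-null counts

# describe() summarizes numeric columns: count, mean, std, min, quartiles, max.
summary = customers.describe()
print(summary)

# If each numeric column's count equals the row count, those columns have no missing values.
complete = (summary.loc["count"] == len(customers)).all()
print("All numeric columns complete:", complete)
```

On the real data, the same check would compare each count against 500.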
Feature Selection
For the linear regression model, we separate the dataset into independent variables (features) and the dependent variable (target).
Independent Variables (X)
Four numeric features are selected as predictors.
The non-numeric columns (Email, Address, Avatar) are excluded because they don’t provide meaningful predictive value for regression analysis.
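One way to express this selection is to keep every numeric column except the target, as sketched below. The four feature column names and the sample values are assumptions made for illustration; only Email, Address, Avatar, and Yearly Amount Spent are named in this document.

```python
import pandas as pd

# Made-up rows; the four numeric feature names are assumed, not confirmed by the text.
customers = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com"],
    "Address": ["1 Main St", "2 Oak Ave"],
    "Avatar": ["Violet", "Green"],
    "Avg. Session Length": [34.5, 31.9],
    "Time on App": [12.7, 11.1],
    "Time on Website": [39.6, 37.3],
    "Length of Membership": [4.1, 2.7],
    "Yearly Amount Spent": [587.9, 392.2],
})

TARGET = "Yearly Amount Spent"

# Keep numeric columns only (drops Email, Address, Avatar), then remove the target.
X = customers.select_dtypes(include="number").drop(columns=[TARGET])
print(list(X.columns))
```

Selecting by dtype avoids hard-coding the feature names, though listing the columns explicitly would work just as well and makes the choice more visible.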
Target Variable (y)
The dependent variable is the annual spending amount, Yearly Amount Spent.
Train-Test Split
The dataset is divided into training and testing sets using a 70-30 split ratio.
Split Configuration
- Test size: 30% of the data (150 customers) is reserved for testing
- Random seed: a fixed seed ensures reproducibility of the random split
Resulting Datasets
- Training Set: 350 customers (70%)
  - Used to train the linear regression model
  - Model learns the relationship between features and yearly spending
- Testing Set: 150 customers (30%)
  - Used to evaluate model performance
  - Provides unbiased assessment of prediction accuracy
A 70-30 split is common practice in machine learning: it leaves enough data for training while keeping a substantial test set for evaluation.
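The split described above can be sketched as follows, assuming scikit-learn's `train_test_split` is the function used (the text does not name it). The data here is randomly generated to match the 500-row, four-feature shape; the placeholder feature names and the seed value 42 are arbitrary assumptions, since the actual seed is not stated.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the real data: 500 rows, 4 features.
rng = np.random.default_rng(0)
n_customers = 500
X = pd.DataFrame(
    rng.normal(size=(n_customers, 4)),
    columns=["feature_1", "feature_2", "feature_3", "feature_4"],  # placeholder names
)
y = pd.Series(rng.normal(size=n_customers), name="Yearly Amount Spent")

# test_size=0.3 reserves 30% of the rows for testing;
# random_state fixes the shuffle so the split is reproducible (42 is arbitrary).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))
```

Rerunning with the same `random_state` yields the same rows in each set, which keeps later evaluation results comparable between runs.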
Data Preparation Summary
The preparation pipeline follows these steps:
- Load CSV data into a pandas DataFrame
- Explore dataset structure and statistics
- Select 4 numeric features as independent variables
- Define Yearly Amount Spent as the target variable
- Split data into 70% training and 30% testing sets