Overview
The data loading module handles CSV ingestion, feature engineering, and train-test splitting. The split is stratified so that the training and test sets keep the same class distribution as the full dataset.
Core Functions
load_dataset()
Loads the raw CSV dataset and applies feature engineering transformations.
Source: src/data.py:26
split_data()
Splits the dataset into training and test sets with stratification.
Source: src/data.py:43
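To make the idea of a stratified split concrete, here is a minimal sketch using only the standard library. It is not the code in src/data.py; the real split_data() may well delegate to a library routine such as scikit-learn's train_test_split with its stratify argument. The function name stratified_split, the toy rows, and the seed value are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, test_size=0.2, seed=42):
    """Split rows into (train, test), sampling each class separately."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, test = [], []
    for label in sorted(by_class):  # sorted for a deterministic order
        group = by_class[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data (made-up counts): 80 non-purchasers and 20 purchasers.
rows = [{"purchased": 0} for _ in range(80)] + [{"purchased": 1} for _ in range(20)]
train, test = stratified_split(rows, target="purchased", test_size=0.2, seed=42)

print(len(train), len(test))                           # 80 20
print(sum(r["purchased"] for r in test) / len(test))   # 0.2
```

Because each class is sampled on its own, the 20% share of purchasers in the toy data carries over to both the training and the test set.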
Dataset Structure
The ML datasource CSV (ml_datasource.csv) contains the following columns:
| Column | Type | Description |
|---|---|---|
| student_country | string | Two-letter country code |
| days_on_platform | numeric | Days since user registration |
| minutes_watched | numeric | Total video minutes consumed |
| courses_started | numeric | Number of courses initiated |
| practice_exams_started | numeric | Practice exams attempted |
| practice_exams_passed | numeric | Practice exams completed successfully |
| minutes_spent_on_exams | numeric | Time spent on practice exams |
| purchased | binary | Target variable (0/1) |
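The column list above can be turned into a small schema check. The sketch below is an illustration rather than part of the module: parse_row and the two sample rows are hypothetical, and only the column names and kinds are taken from the table.

```python
import csv
import io

# Column names and kinds copied from the table above.
NUMERIC_COLUMNS = [
    "days_on_platform", "minutes_watched", "courses_started",
    "practice_exams_started", "practice_exams_passed", "minutes_spent_on_exams",
]
EXPECTED_COLUMNS = ["student_country", *NUMERIC_COLUMNS, "purchased"]


def parse_row(raw):
    """Convert one csv.DictReader row (all strings) into typed values."""
    missing = [c for c in EXPECTED_COLUMNS if c not in raw]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    row = {"student_country": raw["student_country"]}
    for col in NUMERIC_COLUMNS:
        row[col] = float(raw[col])
    purchased = int(raw["purchased"])
    if purchased not in (0, 1):
        raise ValueError(f"purchased must be 0 or 1, got {purchased}")
    row["purchased"] = purchased
    return row


# Made-up sample standing in for ml_datasource.csv.
sample = io.StringIO(
    "student_country,days_on_platform,minutes_watched,courses_started,"
    "practice_exams_started,practice_exams_passed,minutes_spent_on_exams,purchased\n"
    "US,120,35.5,2,3,1,14.2,0\n"
    "IN,30,410.0,4,5,4,62.0,1\n"
)
rows = [parse_row(r) for r in csv.DictReader(sample)]
print(rows[1]["minutes_watched"], rows[1]["purchased"])  # 410.0 1
```

Failing fast on a missing column or an out-of-range target value keeps bad rows from reaching feature engineering or the split.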
Configuration
Data loading is configured in config.yaml.
Parameters
- path: Path to the CSV file
- target: Name of the target column for classification
- test_size: Proportion of data reserved for testing (0.2 = 20%)
- seed: Random seed for reproducibility
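As a rough illustration of how these four parameters might be checked once the YAML has been parsed, the sketch below validates a plain dictionary. The flat layout, the helper name validate_data_config, and the example values (including the path and the seed of 42) are assumptions; the real config.yaml may nest or name things differently.

```python
def validate_data_config(cfg):
    """Check the four documented parameters and return the config unchanged."""
    for key in ("path", "target", "test_size", "seed"):
        if key not in cfg:
            raise KeyError(f"missing config key: {key}")
    if not 0.0 < cfg["test_size"] < 1.0:
        raise ValueError("test_size must be a proportion strictly between 0 and 1")
    if not isinstance(cfg["seed"], int):
        raise TypeError("seed must be an integer")
    return cfg


# Hypothetical values standing in for the parsed contents of config.yaml.
config = validate_data_config({
    "path": "ml_datasource.csv",
    "target": "purchased",
    "test_size": 0.2,
    "seed": 42,
})
print(config["target"], config["test_size"])  # purchased 0.2
```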
Preprocessing Pipeline
The data loading process follows these steps:
- Read CSV: Load raw data from ml_datasource.csv
- Feature Engineering: Apply transformations via add_engineered_features() (see Feature Engineering)
- Train-Test Split: Stratified split that preserves the class distribution
- Stratification: Ensures the proportions of purchased=0 and purchased=1 are the same in both sets
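The first two steps above can be sketched as follows. This is an assumption-heavy illustration: add_engineered_features_sketch is a hypothetical stand-in, and the exam_pass_rate feature is invented for the example. The actual transformations are described on the Feature Engineering page.

```python
import csv
import io


def add_engineered_features_sketch(row):
    """Hypothetical stand-in for add_engineered_features()."""
    started = float(row["practice_exams_started"])
    passed = float(row["practice_exams_passed"])
    out = dict(row)
    # Made-up feature: share of started practice exams that were passed.
    out["exam_pass_rate"] = passed / started if started else 0.0
    return out


def load_rows(csv_file):
    """Read the CSV, then derive features row by row."""
    return [add_engineered_features_sketch(r) for r in csv.DictReader(csv_file)]


# Made-up sample with only the columns this sketch needs.
sample = io.StringIO(
    "practice_exams_started,practice_exams_passed,purchased\n"
    "4,3,1\n"
    "0,0,0\n"
)
rows = load_rows(sample)
print([r["exam_pass_rate"] for r in rows])  # [0.75, 0.0]
```

A feature computed from a single row, like the one here, can safely be derived before the split because it uses no information from other rows.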
Test Mode
When TEST_MODE is enabled, the loader limits rows for faster testing.
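One way such a row limit could work is sketched below. How TEST_MODE is actually set and how many rows are kept is not stated here, so the environment-variable lookup, the helper names is_test_mode and limit_rows, and the cap of 1000 rows are all assumptions.

```python
import os
from itertools import islice

TEST_MODE_MAX_ROWS = 1000  # hypothetical cap; the real limit is not stated here


def is_test_mode(environ=os.environ):
    """Assumption: TEST_MODE is an environment variable set to 1, true, or yes."""
    return environ.get("TEST_MODE", "").strip().lower() in {"1", "true", "yes"}


def limit_rows(rows, enabled, max_rows=TEST_MODE_MAX_ROWS):
    """Return every row normally, or only the first max_rows in test mode."""
    return list(islice(rows, max_rows)) if enabled else list(rows)


rows = range(5000)
print(len(limit_rows(rows, enabled=True)))   # 1000
print(len(limit_rows(rows, enabled=False)))  # 5000
```

Note that taking the first rows of a file sorted by the target could leave one class underrepresented, so a real implementation might sample instead of truncating.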
Related Functions
- load_config(): Loads YAML configuration (src/data.py:16)
- set_global_seed(): Sets random seeds for reproducibility (src/data.py:21)
- add_engineered_features(): Creates derived features (see Feature Engineering)
Next Steps
- Feature Engineering: Learn how engineered features are created from raw data
- Model Selection: Explore model training and cross-validation