Training Pipeline
The training process follows a systematic approach to model selection and evaluation:
Load Processed Data
Read the preprocessed dataset from data/processed/full_dataset.csv, which contains all engineered features and encoded categorical variables.
Train-Test Split
Split the data into training (80%) and testing (20%) sets with stratification to maintain class distribution.
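The stratified split above can be sketched with scikit-learn's `train_test_split`. The column names and the tiny inline frame here are illustrative stand-ins, not the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data/processed/full_dataset.csv:
# a small frame with an assumed binary target column.
df = pd.DataFrame({
    "Price": [10.0, 12.5, 9.9, 20.0, 15.5, 11.1, 18.2, 13.3, 16.4, 14.0],
    "Discount code": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "target": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

X = df.drop(columns="target")
y = df["target"]

# 80/20 split; stratify=y preserves the class ratio in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))      # 8 2
print(y_train.mean(), y_test.mean())  # class ratio preserved: 0.5 0.5
```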
Model Comparison
Evaluate all 12 classification models using 5-fold cross-validation with standardized features.
Best Model Selection
Select the model with the highest cross-validation score for final training and evaluation.
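The comparison-and-selection loop described above can be sketched as follows. Only two of the twelve candidates are shown for brevity, and the synthetic data is a placeholder:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the processed dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Two of the twelve candidates, for brevity.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Score each model with 5-fold CV on standardized features,
# then keep the one with the highest mean score.
scores = {
    name: cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```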
Classification Models Tested
The training pipeline evaluates 12 different classification algorithms:
Model Categories
Ensemble Methods
- Random Forest
- AdaBoost
- Extra Trees
- Bagging Classifier
- Gradient Boosting
Tree-Based
- Decision Tree
Linear Models
- Logistic Regression
- SGD Classifier
Probabilistic
- Naive Bayes
Instance-Based
- K-Nearest Neighbors
Neural Networks
- MLP Classifier
Kernel Methods
- Support Vector Machine
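One plausible way to register the twelve models is a single dictionary mapping display names to scikit-learn estimators. The exact constructor arguments here are assumptions, not the project's actual settings:

```python
from sklearn.ensemble import (
    AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier,
    GradientBoostingClassifier, RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

MODELS = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Extra Trees": ExtraTreesClassifier(random_state=42),
    "Bagging Classifier": BaggingClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "SGD Classifier": SGDClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),                 # no random_state parameter
    "K-Nearest Neighbors": KNeighborsClassifier(),  # no random_state parameter
    "MLP Classifier": MLPClassifier(random_state=42, max_iter=500),
    "Support Vector Machine": SVC(random_state=42),
}

print(len(MODELS))  # 12
```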
Feature Preprocessing
The pipeline applies feature standardization to numerical columns before model training. Standardization is applied only to the Price and Discount code features; all other encoded categorical features are passed through unchanged.
Training Code Structure
The training logic is implemented in the ModelTraining class:
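The page does not show the class body, so the following is a minimal sketch of what ModelTraining might look like given the steps above; the method names, `SCALED_COLUMNS` constant, and toy data are assumptions. It scales only the Price and Discount code columns via a `ColumnTransformer` with `remainder="passthrough"` and compares models by mean cross-validation score:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ModelTraining:
    """Sketch only: the real class interface is not shown in this page."""

    # Only these numerical columns are standardized; everything else
    # (the encoded categoricals) passes through unchanged.
    SCALED_COLUMNS = ["Price", "Discount code"]

    def __init__(self, models):
        self.models = models  # dict: display name -> estimator

    def _pipeline(self, model):
        preprocess = ColumnTransformer(
            [("scale", StandardScaler(), self.SCALED_COLUMNS)],
            remainder="passthrough",
        )
        return Pipeline([("preprocess", preprocess), ("model", model)])

    def compare(self, X, y, cv=5):
        """Return the mean cross-validation score per model name."""
        return {
            name: cross_val_score(self._pipeline(m), X, y, cv=cv).mean()
            for name, m in self.models.items()
        }


# Tiny synthetic usage example (column names assumed).
X = pd.DataFrame({
    "Price": [10.0, 12.0, 9.0, 20.0, 15.0, 11.0, 18.0, 13.0],
    "Discount code": [0, 1, 0, 1, 0, 1, 0, 1],
    "is_weekend": [1, 0, 1, 0, 1, 0, 1, 0],
})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

trainer = ModelTraining({"Logistic Regression": LogisticRegression(max_iter=1000)})
scores = trainer.compare(X, y, cv=2)
print(scores)
```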
All models use random_state=42 for reproducibility where applicable.
Next Steps
Model Selection
Learn how models are compared and the best model is selected
Evaluation Metrics
View detailed performance metrics and classification results