Overview
The training pipeline evaluates five classification algorithms using stratified k-fold cross-validation to select the best-performing model based on ROC AUC score.

Model Types
Five models are trained and compared:
- Logistic Regression: Linear baseline model
- K-Nearest Neighbors (KNN): Distance-based classifier
- Support Vector Machine (SVM): Kernel-based separator
- Decision Tree: Rule-based hierarchical model
- Random Forest: Ensemble of decision trees
build_models()
Creates configured instances of all five models. Implementation: src/train.py:58
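The actual implementation lives in src/train.py:58 and is not reproduced here; a minimal sketch of what build_models() plausibly returns, assuming the hyperparameters documented below, might look like:

```python
# Hypothetical sketch of build_models(); the real code is in src/train.py:58.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def build_models():
    """Return a name -> configured-estimator mapping for the five candidates."""
    return {
        "logistic_regression": LogisticRegression(max_iter=2000),
        "knn": KNeighborsClassifier(n_neighbors=7),
        "svm": SVC(C=1.0, kernel="rbf", gamma="scale", probability=True),
        "decision_tree": DecisionTreeClassifier(max_depth=8, min_samples_leaf=10),
        "random_forest": RandomForestClassifier(
            n_estimators=400, min_samples_leaf=2, n_jobs=-1
        ),
    }
```

The dictionary keys shown are assumptions; only the estimator settings come from the hyperparameter list below.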
Model Configuration
All hyperparameters are defined in config.yaml.
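The exact layout of config.yaml is not shown here; one plausible structure matching the hyperparameters documented below would be:

```yaml
# Illustrative only -- key names are assumptions, values from the docs below.
models:
  logistic_regression:
    max_iter: 2000
  knn:
    n_neighbors: 7
  svm:
    C: 1.0
    kernel: rbf
    gamma: scale
    probability: true
  decision_tree:
    max_depth: 8
    min_samples_leaf: 10
  random_forest:
    n_estimators: 400
    min_samples_leaf: 2
    n_jobs: -1
```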
Hyperparameter Explanations
Logistic Regression
- max_iter: Maximum iterations for convergence (2000)
K-Nearest Neighbors
- n_neighbors: Number of neighbors to consider (7)
Support Vector Machine
- C: Regularization parameter (1.0)
- kernel: Kernel type - rbf (radial basis function)
- gamma: Kernel coefficient - scale (auto-computed)
- probability: Enable probability estimates (required for predict_proba)
Decision Tree
- max_depth: Maximum tree depth (8)
- min_samples_leaf: Minimum samples per leaf node (10)
Random Forest
- n_estimators: Number of trees in forest (400)
- min_samples_leaf: Minimum samples per leaf node (2)
- n_jobs: Parallel jobs (-1 = use all CPUs)
Cross-Validation Setup
Stratified k-fold cross-validation with 5 splits. Implementation: src/train.py:126-142
Stratified K-Fold
Why Stratified?
- Maintains class distribution in each fold
- Critical for imbalanced datasets
- Ensures each fold has representative samples of both classes (purchased=0 and purchased=1)
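The class-preservation property can be demonstrated with a small sketch (the toy data and random_state are illustrative, not taken from src/train.py):

```python
# StratifiedKFold keeps the class ratio identical in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 80 negatives, 20 positives (purely illustrative).
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each 20-sample test fold contains exactly 4 positives -> ratio 0.2,
# matching the overall 20% positive rate.
ratios = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
```

With plain KFold, a fold could by chance contain very few (or zero) positives, which would make per-fold metrics like recall meaningless.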
The fold settings are read from config.yaml.
Evaluation Metrics
Four metrics are computed for each fold:
- ROC AUC: Area under ROC curve (primary metric)
- Precision: True positives / (true positives + false positives)
- Recall: True positives / (true positives + false negatives)
- F1: Harmonic mean of precision and recall
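The four metrics map directly onto scikit-learn scorers; a small worked example (toy labels, not project data):

```python
# Computing the four per-fold metrics with scikit-learn.
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]          # ground truth for one fold
y_pred = [0, 1, 1, 1, 0, 0]          # thresholded predictions
y_prob = [0.2, 0.6, 0.9, 0.7, 0.1, 0.4]  # predicted P(purchased=1)

scores = {
    "roc_auc": roc_auc_score(y_true, y_prob),    # ranks by probability
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),
}
```

Note that ROC AUC consumes the probability estimates (hence probability=True on the SVM), while precision, recall, and F1 consume hard 0/1 predictions.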
Model Selection Process
- Train all models with 5-fold cross-validation
- Compute mean scores across folds for each metric
- Rank by ROC AUC (primary metric)
- Select best model with highest mean ROC AUC
- Retrain on full training set
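The steps above can be sketched as a small helper (the function name and signature are assumptions, not the actual src/train.py code):

```python
# Hypothetical sketch of the selection loop described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate

def select_best(models, X, y):
    """Rank candidate models by mean CV ROC AUC and refit the winner."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    means = {}
    for name, model in models.items():
        res = cross_validate(model, X, y, cv=cv, scoring="roc_auc")
        means[name] = float(np.mean(res["test_score"]))
    best = max(means, key=means.get)
    # Step 5: retrain the winner on the full training set.
    return best, models[best].fit(X, y), means
```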
Implementation: src/train.py:155-157
Preprocessing Pipeline
Each model is wrapped in a scikit-learn Pipeline with preprocessing. Implementation: src/train.py:34-55
Preprocessing Steps
Numeric Features:
- StandardScaler: Zero mean, unit variance normalization

Categorical Features:
- OneHotEncoder: Convert categories to binary vectors
- handle_unknown="ignore": Handle new categories in test data
CV Results Format
Cross-validation results are stored in metrics.json.
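The exact schema of metrics.json is not reproduced here; one plausible layout, with field names and numbers purely illustrative, is:

```json
{
  "random_forest": {
    "roc_auc":   {"mean": 0.91, "std": 0.02},
    "precision": {"mean": 0.84, "std": 0.03},
    "recall":    {"mean": 0.79, "std": 0.04},
    "f1":        {"mean": 0.81, "std": 0.03}
  }
}
```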
Usage Example
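The original usage snippet is not reproduced here; the following self-contained sketch wires the documented pieces together on invented toy data (column names, values, and variable names are all assumptions):

```python
# End-to-end sketch: preprocessing pipeline + 5-fold stratified CV,
# ranked by ROC AUC. Data and column names are illustrative only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 41, 38, 55, 26] * 4,
    "income": [1.2, 2.5, 3.1, 4.0, 1.8, 5.2, 2.9, 3.3, 4.4, 1.5] * 4,
    "channel": ["web", "email", "web", "store", "email",
                "web", "store", "web", "email", "store"] * 4,
})
y = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 4

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
])
candidates = {
    "logistic_regression": LogisticRegression(max_iter=2000),
    "random_forest": RandomForestClassifier(
        n_estimators=400, min_samples_leaf=2, n_jobs=-1
    ),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = {
    name: cross_val_score(
        Pipeline([("preprocess", preprocess), ("model", model)]),
        df, y, cv=cv, scoring="roc_auc",
    ).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)  # winner by mean ROC AUC
```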
Next Steps
- Evaluation: Learn about model evaluation and threshold calibration
- Feature Engineering: Understand features used by models