The training pipeline evaluates multiple classification algorithms to identify the best model for predicting lead conversion probability, comparing 12 candidate models with cross-validation to ensure robust performance.

Training Pipeline

The training process follows a systematic approach to model selection and evaluation:
1. Load Processed Data

Read the preprocessed dataset from data/processed/full_dataset.csv containing all engineered features and encoded categorical variables.
2. Train-Test Split

Split the data into training (80%) and testing (20%) sets with stratification to maintain class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    random_state=42, 
    shuffle=True, 
    test_size=0.2,
    stratify=y  # preserve the class distribution in both splits
)
3. Model Comparison

Evaluate all 12 classification models using 5-fold cross-validation with standardized features.
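A minimal sketch of the comparison loop, using a small synthetic dataset and two stand-in models in place of the real training data and the full list of 12:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed training data.
X_train, y_train = make_classification(n_samples=200, random_state=42)

models = [
    ("Logistic", LogisticRegression(random_state=42)),
    ("DecisionTree", DecisionTreeClassifier(random_state=42)),
]

# Score each (name, model) pair with 5-fold CV and keep the mean score.
results = {}
for name, model in models:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    results[name] = scores.mean()
```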
4. Best Model Selection

Select the model with the highest cross-validation score for final training and evaluation.
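Given a dictionary of mean cross-validation scores (the numbers below are placeholders), the selection reduces to a single `max()`:

```python
# Placeholder scores; in the pipeline these come from cross-validation.
results = {"RandomForest": 0.91, "Logistic": 0.87, "KNN": 0.84}

# Pick the (name, score) pair with the highest mean CV score.
best_name, best_score = max(results.items(), key=lambda kv: kv[1])
```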
5. Final Evaluation

Train the best model on the full training set and evaluate on the held-out test set.
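A minimal sketch of this final step, using a synthetic dataset and RandomForest as a stand-in for whichever model wins the comparison:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed dataset.
X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train the winning model on the full training set,
# then evaluate once on the held-out test set.
best_model = RandomForestClassifier(random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
```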

Classification Models Tested

The training pipeline evaluates 12 different classification algorithms:
models = [
    ("RandomForest", RandomForestClassifier(random_state=42)),
    ("Adaboost", AdaBoostClassifier(random_state=42)),
    ("ExtraTree", ExtraTreesClassifier(random_state=42)),
    ("BaggingClassifier", BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)),
    ("GradientBoosting", GradientBoostingClassifier(random_state=42)),
    ("DecisionTree", DecisionTreeClassifier(random_state=42)),
    ("NaiveBayes", GaussianNB()),
    ("KNN", KNeighborsClassifier()),
    ("Logistic", LogisticRegression(random_state=42)),
    ("SGD Classifier", SGDClassifier(random_state=42)),
    ("MLPClassifier", MLPClassifier(random_state=42)),
    ("SVM", SVC(random_state=42))
]

Model Categories

Ensemble Methods

  • Random Forest
  • AdaBoost
  • Extra Trees
  • Bagging Classifier
  • Gradient Boosting

Tree-Based

  • Decision Tree

Linear Models

  • Logistic Regression
  • SGD Classifier

Probabilistic

  • Naive Bayes

Instance-Based

  • K-Nearest Neighbors

Neural Networks

  • MLP Classifier

Kernel Methods

  • Support Vector Machine

Feature Preprocessing

The pipeline applies feature standardization to numerical columns before model training:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([
    ('se', StandardScaler(), ['Price', 'Discount code'])
], remainder='passthrough')
Standardization is applied only to Price and Discount code features, while other encoded categorical features are passed through unchanged.
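To illustrate the pass-through behavior, the snippet below applies the transformer to a toy frame (the channel_encoded column is a hypothetical encoded feature). Scaled columns come first in the output array, followed by the untouched remainder:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame: two numerical columns plus a hypothetical encoded feature.
df = pd.DataFrame({
    "Price": [10.0, 20.0, 30.0],
    "Discount code": [1, 2, 3],
    "channel_encoded": [0, 1, 0],
})

ct = ColumnTransformer([
    ('se', StandardScaler(), ['Price', 'Discount code'])
], remainder='passthrough')

# Scaled columns first, pass-through columns after.
X_scaled = ct.fit_transform(df)
```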

Training Code Structure

The training logic is implemented in the ModelTraining class:
class ModelTraining:
    def __init__(self) -> None:
        self.logger = CustomLogger(
            name='ModelTraining', 
            log_file='model_training.log'
        ).get_logger()
        self.models = [...] # 12 models defined
    
    def get_training_data(self):
        """Load and split the processed dataset"""
        pass
    
    def compare_classifiers(self, X_train, y_train):
        """Compare models using cross-validation"""
        pass
    
    def run(self):
        """Execute the complete training pipeline"""
        pass
All models use random_state=42 for reproducibility where applicable.
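As an illustration of how these pieces fit together, here is a trimmed, self-contained sketch of the same flow; synthetic data and two stand-in models replace the processed CSV and the full list of 12, and the method bodies are illustrative, not the real implementation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier


class ModelTrainingSketch:
    """Illustrative, trimmed-down version of the training flow."""

    def __init__(self):
        # Two models stand in for the full list of 12.
        self.models = [
            ("Logistic", LogisticRegression(random_state=42)),
            ("DecisionTree", DecisionTreeClassifier(random_state=42)),
        ]

    def get_training_data(self):
        # Synthetic stand-in for the processed CSV.
        X, y = make_classification(n_samples=200, random_state=42)
        return train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )

    def compare_classifiers(self, X_train, y_train):
        # Mean 5-fold CV score per model.
        return {
            name: cross_val_score(model, X_train, y_train, cv=5).mean()
            for name, model in self.models
        }

    def run(self):
        X_train, X_test, y_train, y_test = self.get_training_data()
        results = self.compare_classifiers(X_train, y_train)
        best_name = max(results, key=results.get)
        best_model = dict(self.models)[best_name]
        best_model.fit(X_train, y_train)
        return best_name, best_model.score(X_test, y_test)
```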

Next Steps

Model Selection

Learn how models are compared and the best model is selected

Evaluation Metrics

View detailed performance metrics and classification results
