The training pipeline evaluates multiple classification algorithms to identify the best model for predicting lead conversion probability, comparing 12 candidate models with cross-validation to ensure robust performance.

Training Pipeline

The training process follows a systematic approach to model selection and evaluation:
1. Load Processed Data

Read the preprocessed dataset from data/processed/full_dataset.csv containing all engineered features and encoded categorical variables.
2. Train-Test Split

Split the data into training (80%) and testing (20%) sets with stratification to maintain class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    random_state=42, 
    shuffle=True, 
    test_size=0.2,
    stratify=y  # preserve the class distribution in both splits
)
3. Model Comparison

Evaluate all 12 classification models using 5-fold cross-validation with standardized features.
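A minimal sketch of the comparison loop, using a small synthetic dataset and two stand-in models in place of the real training data and the full list of 12:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed training data.
X_train, y_train = make_classification(n_samples=200, random_state=42)

models = [
    ("Logistic", LogisticRegression(random_state=42)),
    ("DecisionTree", DecisionTreeClassifier(random_state=42)),
]

# Score each (name, model) pair with 5-fold CV and keep the mean score.
results = {}
for name, model in models:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    results[name] = scores.mean()
```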
4. Best Model Selection

Select the model with the highest cross-validation score for final training and evaluation.
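Given a dictionary of mean cross-validation scores (the numbers below are placeholders), the selection reduces to a single `max()`:

```python
# Placeholder scores; in the pipeline these come from cross-validation.
results = {"RandomForest": 0.91, "Logistic": 0.87, "KNN": 0.84}

# Pick the (name, score) pair with the highest mean CV score.
best_name, best_score = max(results.items(), key=lambda kv: kv[1])
```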
5. Final Evaluation

Train the best model on the full training set and evaluate on the held-out test set.
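A minimal sketch of this final step, using a synthetic dataset and RandomForest as a stand-in for whichever model wins the comparison:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed dataset.
X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train the winning model on the full training set,
# then evaluate once on the held-out test set.
best_model = RandomForestClassifier(random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
```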

Classification Models Tested

The training pipeline evaluates 12 different classification algorithms:
models = [
    ("RandomForest", RandomForestClassifier(random_state=42)),
    ("Adaboost", AdaBoostClassifier(random_state=42)),
    ("ExtraTree", ExtraTreesClassifier(random_state=42)),
    ("BaggingClassifier", BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=42)),
    ("GradientBoosting", GradientBoostingClassifier(random_state=42)),
    ("DecisionTree", DecisionTreeClassifier(random_state=42)),
    ("NaiveBayes", GaussianNB()),
    ("KNN", KNeighborsClassifier()),
    ("Logistic", LogisticRegression(random_state=42)),
    ("SGD Classifier", SGDClassifier(random_state=42)),
    ("MLPClassifier", MLPClassifier(random_state=42)),
    ("SVM", SVC(random_state=42))
]

Model Categories

Ensemble Methods

  • Random Forest
  • AdaBoost
  • Extra Trees
  • Bagging Classifier
  • Gradient Boosting

Tree-Based

  • Decision Tree

Linear Models

  • Logistic Regression
  • SGD Classifier

Probabilistic

  • Naive Bayes

Instance-Based

  • K-Nearest Neighbors

Neural Networks

  • MLP Classifier

Kernel Methods

  • Support Vector Machine

Feature Preprocessing

The pipeline applies feature standardization to numerical columns before model training:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([
    ('se', StandardScaler(), ['Price', 'Discount code'])
], remainder='passthrough')
Standardization is applied only to Price and Discount code features, while other encoded categorical features are passed through unchanged.
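To illustrate the pass-through behavior, the snippet below applies the transformer to a toy frame (the channel_encoded column is a hypothetical encoded feature). Scaled columns come first in the output array, followed by the untouched remainder:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame: two numerical columns plus a hypothetical encoded feature.
df = pd.DataFrame({
    "Price": [10.0, 20.0, 30.0],
    "Discount code": [1, 2, 3],
    "channel_encoded": [0, 1, 0],
})

ct = ColumnTransformer([
    ('se', StandardScaler(), ['Price', 'Discount code'])
], remainder='passthrough')

# Scaled columns first, pass-through columns after.
X_scaled = ct.fit_transform(df)
```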

Training Code Structure

The training logic is implemented in the ModelTraining class:
class ModelTraining:
    def __init__(self) -> None:
        self.logger = CustomLogger(
            name='ModelTraining', 
            log_file='model_training.log'
        ).get_logger()
        self.models = [...] # 12 models defined
    
    def get_training_data(self):
        """Load and split the processed dataset"""
        pass
    
    def compare_classifiers(self, X_train, y_train):
        """Compare models using cross-validation"""
        pass
    
    def run(self):
        """Execute the complete training pipeline"""
        pass
All models use random_state=42 for reproducibility where applicable.
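As an illustration of how these pieces fit together, here is a trimmed, self-contained sketch of the same flow; synthetic data and two stand-in models replace the processed CSV and the full list of 12, and the method bodies are illustrative, not the real implementation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier


class ModelTrainingSketch:
    """Illustrative, trimmed-down version of the training flow."""

    def __init__(self):
        # Two models stand in for the full list of 12.
        self.models = [
            ("Logistic", LogisticRegression(random_state=42)),
            ("DecisionTree", DecisionTreeClassifier(random_state=42)),
        ]

    def get_training_data(self):
        # Synthetic stand-in for the processed CSV.
        X, y = make_classification(n_samples=200, random_state=42)
        return train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y
        )

    def compare_classifiers(self, X_train, y_train):
        # Mean 5-fold CV score per model.
        return {
            name: cross_val_score(model, X_train, y_train, cv=5).mean()
            for name, model in self.models
        }

    def run(self):
        X_train, X_test, y_train, y_test = self.get_training_data()
        results = self.compare_classifiers(X_train, y_train)
        best_name = max(results, key=results.get)
        best_model = dict(self.models)[best_name]
        best_model.fit(X_train, y_train)
        return best_name, best_model.score(X_test, y_test)
```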

Next Steps

Model Selection

Learn how models are compared and the best model is selected

Evaluation Metrics

View detailed performance metrics and classification results
