The model selection process evaluates all 12 classification algorithms using 5-fold cross-validation to identify the best performer for lead scoring prediction.

Cross-Validation Strategy

Each model is evaluated using a standardized pipeline that includes feature scaling and cross-validation:
def compare_classifiers(self, X_train, y_train):
    """
    Compare various classification models and returns their scores from cross validation

    Args:
        X_train (pd.Dataframe): X_train data
        y_train (pd.Dataframe): y_train data

    Returns:
        dict: Model names with their scores from cross validation
    """
    # Define column transformation
    ct = ColumnTransformer([
        ('se', StandardScaler(), ['Price', 'Discount code'])
    ], remainder='passthrough')

    # Create pipelines and evaluate models
    scores = {}
    for name, model in self.models:
        pipeline = Pipeline([('transformer', ct), (name, model)])
        cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
        scores[name] = np.mean(cv_scores)

    return scores

Key Features

5-Fold Cross-Validation

Each model is evaluated on 5 different train-validation splits to ensure robust performance estimation

Pipeline Integration

Feature scaling and model training are combined in scikit-learn pipelines to prevent data leakage

Standardized Comparison

All models are evaluated using the same cross-validation strategy and preprocessing steps

Mean Score Calculation

Cross-validation scores are averaged to obtain a single performance metric per model
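The four features above can be seen working together in a minimal, self-contained sketch. It mirrors the `compare_classifiers` pattern (pipeline per model, 5-fold CV, mean score) but uses synthetic data and two illustrative models rather than the project's lead-scoring dataset and full model list:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training data
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Scale only the first two columns, pass the rest through unchanged
ct = ColumnTransformer(
    [('se', StandardScaler(), [0, 1])], remainder='passthrough'
)

models = [
    ('LogisticRegression', LogisticRegression(max_iter=1000)),
    ('GradientBoosting', GradientBoostingClassifier(random_state=42)),
]

# Pipeline per model, 5-fold CV, mean score per model
scores = {}
for name, model in models:
    pipeline = Pipeline([('transformer', ct), (name, model)])
    scores[name] = np.mean(cross_val_score(pipeline, X, y, cv=5))
```

Because scaling happens inside the pipeline, the scaler is re-fit on each training fold, so no information from the validation fold leaks into preprocessing.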

Model Comparison Process

1. Create Pipeline

For each model, create a scikit-learn Pipeline that first applies StandardScaler to the numerical features, then trains the classifier.

2. Run Cross-Validation

Execute 5-fold cross-validation on the training data:
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

3. Calculate Mean Score

Compute the average cross-validation score across all 5 folds:
scores[name] = np.mean(cv_scores)

4. Store Results

Save the mean score for each model in a dictionary for comparison.

Best Model Selection

After comparing all models, the algorithm with the highest cross-validation score is selected:
# Show scores
self.logger.info("Model scores:")
for model, score in model_scores.items():
    self.logger.info(f"{model}: {score}")

best_model_name = max(model_scores, key=model_scores.get)
best_model_index = [name for name, _ in self.models].index(best_model_name)
best_model = self.models[best_model_index][1]
self.logger.info(
    f"Best Model: {best_model_name} with Score: {model_scores[best_model_name]}"
)

Selected Model: GradientBoosting

GradientBoostingClassifier achieved the highest cross-validation score of 0.91 and was selected as the final model.

Why GradientBoosting Excelled

Gradient Boosting was selected based on its superior cross-validation performance:
  • Gradient Boosting builds an ensemble of weak learners (decision trees) sequentially, where each tree corrects the errors of the previous trees. This iterative error correction yields strong predictive performance.
  • The algorithm excels at capturing non-linear relationships and interactions between features such as Price, Discount code, Source, and Use Case.
  • With the Status column grouped into three categories (Closed Won, Closed Lost, Other), Gradient Boosting handles the class distribution effectively.
  • The 0.91 CV score indicates consistent performance across the five splits, suggesting good generalization capability.
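The sequential error-correction idea can be demonstrated in isolation: fit one shallow regression tree, then fit a second tree to the first tree's residuals and add a fraction of its prediction. The training error drops at the second stage. This sketch uses a synthetic regression target, not the project's data, and a hand-rolled two-stage loop rather than scikit-learn's boosting implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Stage 1: a shallow tree fit to the raw target
tree1 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
pred = tree1.predict(X)
mse_stage1 = np.mean((y - pred) ** 2)

# Stage 2: a second tree fit to the residuals of stage 1, added with
# a learning rate of 0.1 (GradientBoosting's default shrinkage)
residuals = y - pred
tree2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
pred = pred + 0.1 * tree2.predict(X)
mse_stage2 = np.mean((y - pred) ** 2)
```

GradientBoostingClassifier repeats this residual-correction step 100 times (once per boosting stage), using the gradient of the classification loss in place of plain residuals.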

GradientBoostingClassifier Configuration

GradientBoostingClassifier(random_state=42)
The model uses default scikit-learn hyperparameters:
  • n_estimators: 100 (number of boosting stages)
  • learning_rate: 0.1 (shrinks the contribution of each tree)
  • max_depth: 3 (maximum depth of individual trees)
  • random_state: 42 (for reproducibility)
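The defaults listed above can be checked directly against scikit-learn via `get_params`:

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=42)
params = model.get_params()

print(params['n_estimators'], params['learning_rate'],
      params['max_depth'], params['random_state'])
# 100 0.1 3 42
```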

Training the Final Model

Once selected, the GradientBoosting model is trained on the complete training set:
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

# Generate probability predictions
y_probabilities = best_model.predict_proba(X_test)
y_predicted = np.argmax(y_probabilities, axis=1)  # index of the most probable class per row
The model generates both class predictions and probability scores, enabling probabilistic lead scoring.

Model Performance Logging

All model scores and the selection process are logged for traceability:
self.logger = CustomLogger(
    name='ModelTraining', 
    log_file='model_training.log'
).get_logger()
Logs are saved to reports/model_training.log for audit and analysis purposes.
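CustomLogger is a project-specific helper; a minimal stand-in built on Python's standard logging module (the name and file path below are illustrative, not the project's actual configuration) looks like:

```python
import logging

def get_logger(name='ModelTraining', log_file='model_training.log'):
    """Minimal stand-in for the project's CustomLogger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_file)
        handler.setFormatter(
            logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s')
        )
        logger.addHandler(handler)
    return logger

logger = get_logger()
logger.info('Model scores logged')
```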

Next Steps

Evaluation Metrics

Review detailed performance metrics on the test set

Training Overview

Return to the training pipeline overview
