Cross-Validation Strategy
Each model is evaluated using a standardized pipeline that includes feature scaling and cross-validation.
Key Features
5-Fold Cross-Validation
Each model is evaluated on 5 different train-validation splits to ensure robust performance estimation
Pipeline Integration
Feature scaling and model training are combined in scikit-learn pipelines to prevent data leakage
Standardized Comparison
All models are evaluated using the same cross-validation strategy and preprocessing steps
Mean Score Calculation
Cross-validation scores are averaged to obtain a single performance metric per model
Model Comparison Process
Create Pipeline
For each model, create a scikit-learn Pipeline that first applies StandardScaler to numerical features, then trains the classifier.
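The comparison loop described above can be sketched as follows. The StandardScaler pipeline, 5-fold cross-validation, and mean-score calculation follow the text; the candidate model list and the synthetic feature data are illustrative assumptions, not the project's actual inputs.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the numerical features (Price, etc.) and the
# three-class Status target (Closed Won / Closed Lost / Other).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3, size=200)

# Hypothetical candidate set; the text only names GradientBoosting explicitly.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

cv_means = {}
for name, clf in models.items():
    # Scaling inside the pipeline keeps the scaler fit on each CV training
    # fold only, which is what prevents data leakage.
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=5)  # 5 train-validation splits
    cv_means[name] = scores.mean()              # one mean score per model
```

Because the scaler lives inside the pipeline, `cross_val_score` refits it per fold; fitting a scaler on the full dataset before splitting would leak validation statistics into training.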
Best Model Selection
After comparing all models, the algorithm with the highest cross-validation score is selected.
Selected Model: GradientBoosting
GradientBoostingClassifier achieved the highest cross-validation score of 0.91 and was selected as the final model.
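The selection step reduces to an argmax over mean CV scores. In this sketch, only the 0.91 GradientBoosting figure comes from the text; the other scores are hypothetical placeholders.

```python
# Mean cross-validation scores per model. Only the GradientBoosting value
# (0.91) is from the source; the others are illustrative.
cv_means = {
    "LogisticRegression": 0.84,
    "RandomForest": 0.89,
    "GradientBoosting": 0.91,
}

# Pick the model name with the highest mean CV score.
best_name = max(cv_means, key=cv_means.get)
best_score = cv_means[best_name]
```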
Why GradientBoosting Excelled
Gradient Boosting was selected based on its superior cross-validation performance.
Ensemble Learning Power
Gradient Boosting builds an ensemble of weak learners (decision trees) sequentially, where each tree corrects the errors of previous trees. This iterative error correction leads to strong predictive performance.
Handling Complex Patterns
The algorithm excels at capturing non-linear relationships and interactions between features like Price, Discount code, Source, and Use Case.
Robust to Imbalanced Data
With the Status column grouped into three categories (Closed Won, Closed Lost, Other), Gradient Boosting handles the class distribution effectively.
Cross-Validation Stability
The 0.91 CV score indicates consistent performance across different data splits, suggesting good generalization capability.
GradientBoostingClassifier Configuration
- n_estimators: 100 (number of boosting stages)
- learning_rate: 0.1 (shrinks the contribution of each tree)
- max_depth: 3 (maximum depth of individual trees)
- random_state: 42 (for reproducibility)
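The configuration above maps directly onto the `GradientBoostingClassifier` constructor (these values also happen to be scikit-learn's defaults for the first three parameters):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Configuration as listed above: 100 boosting stages, each tree's
# contribution shrunk by 0.1, shallow depth-3 trees, fixed seed.
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
)
```

The shallow `max_depth=3` trees are deliberate: each weak learner underfits on its own, and the sequential boosting stages combine them into a strong model.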
Training the Final Model
Once selected, the GradientBoosting model is trained on the complete training set. The model generates both class predictions and probability scores, enabling probabilistic lead scoring.
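A minimal sketch of the final training step, using synthetic data in place of the real training set. `predict` returns class labels, while `predict_proba` returns the per-class probabilities used for lead scoring.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the full training set (three Status classes).
rng = np.random.default_rng(42)
X_train = rng.normal(size=(150, 4))
y_train = rng.integers(0, 3, size=150)

model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
model.fit(X_train, y_train)

preds = model.predict(X_train[:5])        # hard class labels
probs = model.predict_proba(X_train[:5])  # one probability per class, rows sum to 1
```

The probability output is what turns a classifier into a scoring tool: a lead predicted "Closed Won" at 0.92 can be prioritized over one at 0.55.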
Model Performance Logging
All model scores and the selection process are logged to reports/model_training.log for audit and analysis purposes.
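One way to produce such a log with the standard `logging` module is sketched below. The reports/model_training.log path follows the text; the logger name, format, and all scores other than the 0.91 GradientBoosting figure are illustrative assumptions.

```python
import logging
from pathlib import Path

# Log file path taken from the text; create the directory if missing.
log_path = Path("reports") / "model_training.log"
log_path.parent.mkdir(exist_ok=True)

logger = logging.getLogger("model_training")  # hypothetical logger name
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# Illustrative scores: only the GradientBoosting 0.91 comes from the source.
cv_means = {"LogisticRegression": 0.84, "RandomForest": 0.89, "GradientBoosting": 0.91}
for name, score in cv_means.items():
    logger.info("Mean CV score for %s: %.2f", name, score)
logger.info("Selected model: %s", max(cv_means, key=cv_means.get))
handler.flush()
```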
Next Steps
Evaluation Metrics
Review detailed performance metrics on the test set
Training Overview
Return to the training pipeline overview