ModelTraining Class
The ModelTraining class handles the complete machine learning pipeline for lead scoring, including data loading, model comparison, training, and prediction generation.
Class Initialization
__init__()
Initializes the ModelTraining class with a custom logger and defines 12 classification models to compare.
self: Instance reference
CustomLogger instance configured with name ‘ModelTraining’ and log file ‘model_training.log’
List of tuples containing model names and their sklearn estimator instances:
- RandomForest (RandomForestClassifier)
- Adaboost (AdaBoostClassifier)
- ExtraTree (ExtraTreesClassifier)
- BaggingClassifier (with DecisionTreeClassifier base estimator)
- GradientBoosting (GradientBoostingClassifier)
- DecisionTree (DecisionTreeClassifier)
- NaiveBayes (GaussianNB)
- KNN (KNeighborsClassifier)
- Logistic (LogisticRegression)
- SGD Classifier (SGDClassifier)
- MLPClassifier (Multi-layer Perceptron)
- SVM (Support Vector Machine)
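As a rough illustration, the list of (name, estimator) tuples described above might be built as follows. This is a sketch under assumptions: the variable name `models` is hypothetical, every estimator uses scikit-learn defaults (the actual hyperparameters are not documented here), and `SVC` is assumed to be the estimator behind the SVM entry.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical reconstruction: default hyperparameters throughout, which
# may differ from the ones the class actually configures.
models = [
    ("RandomForest", RandomForestClassifier()),
    ("Adaboost", AdaBoostClassifier()),
    ("ExtraTree", ExtraTreesClassifier()),
    # The base estimator is passed positionally because its keyword name
    # changed between scikit-learn releases.
    ("BaggingClassifier", BaggingClassifier(DecisionTreeClassifier())),
    ("GradientBoosting", GradientBoostingClassifier()),
    ("DecisionTree", DecisionTreeClassifier()),
    ("NaiveBayes", GaussianNB()),
    ("KNN", KNeighborsClassifier()),
    ("Logistic", LogisticRegression()),
    ("SGD Classifier", SGDClassifier()),
    ("MLPClassifier", MLPClassifier()),
    # Assumption: the SVM entry is a support vector classifier (SVC).
    ("SVM", SVC()),
]
```

Keeping names and estimators paired in one list lets the comparison step iterate over them uniformly and report scores by name.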
Methods
get_training_data()
Reads the processed dataset and splits it into training and testing sets with an 80/20 split.
self: Instance reference
Training features (80% of data)
Testing features (20% of data)
Training labels - Status column (Closed Won, Closed Lost, Other)
Testing labels - Status column (Closed Won, Closed Lost, Other)
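An 80/20 split of this kind is typically done with scikit-learn's `train_test_split`. The sketch below uses a small synthetic DataFrame in place of the processed dataset, since the real file path and feature columns are not given here; the `random_state` value is likewise an assumption added for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed dataset (hypothetical values).
data = pd.DataFrame({
    "Price": [float(i) for i in range(100)],
    "Discount code": [i % 2 for i in range(100)],
    "Status": ["Closed Won", "Closed Lost", "Other", "Closed Won"] * 25,
})

# Separate the features from the Status target column.
X = data.drop(columns=["Status"])
y = data["Status"]

# test_size=0.2 yields the 80/20 split; random_state is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Because the inputs are pandas objects, the returned splits keep their original row index, which is what later allows predictions to be mapped back to the source rows.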
compare_classifiers(X_train, y_train)
Compares all 12 classification models using 5-fold cross-validation and returns their mean scores.
X_train: Training feature data
y_train: Training target labels
Returns: Dictionary mapping model names to their mean cross-validation scores. Each model is evaluated using a pipeline with StandardScaler applied to ‘Price’ and ‘Discount code’ columns.
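One way such a comparison could be written is sketched below, assuming a `ColumnTransformer` that scales only the two named columns and passes the rest through unchanged. The synthetic data, the extra `Source` column, and the two-model list are illustrative assumptions, not the class's actual inputs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data (hypothetical), with 30 rows per class.
rng = np.random.default_rng(0)
n = 90
X_train = pd.DataFrame({
    "Price": rng.normal(100.0, 20.0, n),
    "Discount code": rng.integers(0, 2, n),
    "Source": rng.integers(0, 4, n),
})
y_train = pd.Series(np.arange(n) % 3)

# Scale only 'Price' and 'Discount code'; other columns pass through.
preprocessor = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["Price", "Discount code"])],
    remainder="passthrough",
)

# Shortened model list to keep the example quick.
models = [
    ("Logistic", LogisticRegression(max_iter=1000)),
    ("DecisionTree", DecisionTreeClassifier(random_state=0)),
]

scores = {}
for name, estimator in models:
    pipeline = Pipeline([("preprocess", preprocessor), ("model", estimator)])
    # cv=5 runs 5-fold cross-validation; the scaler is refit on each fold.
    fold_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
    scores[name] = float(fold_scores.mean())
```

Placing the scaler inside the pipeline means it is fitted on the training folds only, so the held-out fold does not leak into the scaling statistics.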
get_lead_distribution(X_test, y_predicted, y_probabilities)
Generates lead distribution analysis by mapping predictions to original data and categorizing leads by conversion probability.
X_test: Test feature data with original index preserved
y_predicted: Predicted class labels as integers (0: Closed Lost, 1: Closed Won, 2: Other)
y_probabilities: Prediction probabilities array with shape (n_samples, n_classes)
DataFrame containing:
- Observation: Sequential observation number
- Use Case: Original use case from data
- Discount code: Discount code value
- Loss Reason: Reason for loss (if applicable)
- Source: Lead source
- City: City location
- Predicted Class: Mapped prediction (“Closed Won”, “Closed Lost”, “Other”)
Count of observations in each probability range:
- 25%: Probability 0-0.25
- 50%: Probability 0.25-0.5
- 75%: Probability 0.5-0.75
- 100%: Probability 0.75-1.0
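The label mapping and the probability bucketing described above can be sketched with pandas. Two assumptions are made here: the bucketed value is the probability of the Closed Won class (column 1), and values on a boundary fall into the lower range; the reference does not state either detail. The input arrays are made up for illustration.

```python
import numpy as np
import pandas as pd

# Integer-to-label mapping as listed for y_predicted.
class_names = {0: "Closed Lost", 1: "Closed Won", 2: "Other"}

# Hypothetical model outputs for six leads.
y_predicted = np.array([0, 0, 1, 1, 1, 2])
y_probabilities = np.array([
    [0.70, 0.10, 0.20],
    [0.50, 0.30, 0.20],
    [0.30, 0.60, 0.10],
    [0.05, 0.90, 0.05],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
])

# Map integer predictions to their class names.
predicted_class = pd.Series(y_predicted).map(class_names)

# Assumption: "conversion probability" is the Closed Won column (index 1).
won_probability = pd.Series(y_probabilities[:, 1])

# Bucket into the four ranges; intervals are closed on the right, and
# include_lowest=True makes the first one include 0.
buckets = pd.cut(
    won_probability,
    bins=[0.0, 0.25, 0.5, 0.75, 1.0],
    labels=["25%", "50%", "75%", "100%"],
    include_lowest=True,
)

# Count observations per range, keeping the range order.
distribution = buckets.value_counts(sort=False)
```

If the class instead buckets the highest probability per row, replacing the column selection with `y_probabilities.max(axis=1)` would give that variant.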
run()
Executes the complete training pipeline: loads data, compares classifiers, trains the best model, and generates predictions.
self: Instance reference
Dictionary containing training results:
- predictions_df (pd.DataFrame): Lead predictions with factors
- probability_distribution (pd.Series): Distribution of prediction probabilities
- accuracy_score (float): Model accuracy on test set
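The overall flow of run() might look roughly like the sketch below: score the candidates, pick the name with the highest mean score, refit that model on the training set, and evaluate on the held-out set. The synthetic data from `make_classification`, the two-model dictionary, and the selection rule (highest mean cross-validation score) are assumptions; only the accuracy_score key of the result is reproduced, since the other two entries come from the lead distribution step.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the lead dataset.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Shortened candidate set (hypothetical).
models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validation score per candidate.
scores = {
    name: cross_val_score(est, X_train, y_train, cv=5).mean()
    for name, est in models.items()
}

# Assumption: "best" means the highest mean cross-validation score.
best_name = max(scores, key=scores.get)
best_model = models[best_name].fit(X_train, y_train)

# Predictions and class probabilities on the held-out test set.
y_predicted = best_model.predict(X_test)
y_probabilities = best_model.predict_proba(X_test)

results = {"accuracy_score": float(accuracy_score(y_test, y_predicted))}
```

Selecting on cross-validation scores rather than test accuracy keeps the test set untouched until the final evaluation.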