
ModelTraining Class

The ModelTraining class handles the complete machine learning pipeline for lead scoring, including data loading, model comparison, training, and prediction generation.

Class Initialization

__init__()

Initializes the ModelTraining class with a custom logger and defines 12 classification models to compare.
Parameters:
• self (ModelTraining): Instance reference

Attributes:
• logger (logging.Logger): CustomLogger instance configured with name 'ModelTraining' and log file 'model_training.log'
• models (list[tuple]): List of tuples pairing each model name with its sklearn estimator instance:
  • RandomForest (RandomForestClassifier)
  • Adaboost (AdaBoostClassifier)
  • ExtraTree (ExtraTreesClassifier)
  • BaggingClassifier (with DecisionTreeClassifier base estimator)
  • GradientBoosting (GradientBoostingClassifier)
  • DecisionTree (DecisionTreeClassifier)
  • NaiveBayes (GaussianNB)
  • KNN (KNeighborsClassifier)
  • Logistic (LogisticRegression)
  • SGD Classifier (SGDClassifier)
  • MLPClassifier (Multi-layer Perceptron)
  • SVM (Support Vector Machine)
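For reference, the models list could be assembled roughly as below. This is a sketch, not the actual ModelTraining source: all constructor arguments (and the use of defaults) are assumptions, and the real class may configure each estimator differently.

```python
# Hypothetical reconstruction of the 12-model comparison list; the actual
# hyperparameters used by ModelTraining may differ.
from sklearn.ensemble import (
    RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier,
    BaggingClassifier, GradientBoostingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

models = [
    ("RandomForest", RandomForestClassifier()),
    ("Adaboost", AdaBoostClassifier()),
    ("ExtraTree", ExtraTreesClassifier()),
    # Base estimator passed positionally (parameter name changed across
    # sklearn versions: base_estimator -> estimator)
    ("BaggingClassifier", BaggingClassifier(DecisionTreeClassifier())),
    ("GradientBoosting", GradientBoostingClassifier()),
    ("DecisionTree", DecisionTreeClassifier()),
    ("NaiveBayes", GaussianNB()),
    ("KNN", KNeighborsClassifier()),
    ("Logistic", LogisticRegression()),
    ("SGD Classifier", SGDClassifier()),
    ("MLPClassifier", MLPClassifier()),
    # probability=True enables predict_proba on SVC
    ("SVM", SVC(probability=True)),
]
```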
from src.models.train_model import ModelTraining

# Initialize the trainer
trainer = ModelTraining()

Methods

get_training_data()

Reads the processed dataset and splits it into training and testing sets with an 80/20 split.
Parameters:
• self (ModelTraining): Instance reference

Returns:
• X_train (pd.DataFrame): Training features (80% of the data)
• X_test (pd.DataFrame): Testing features (20% of the data)
• y_train (pd.Series): Training labels from the Status column (Closed Won, Closed Lost, Other)
• y_test (pd.Series): Testing labels from the Status column (Closed Won, Closed Lost, Other)
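The 80/20 split can be reproduced with sklearn's train_test_split. The snippet below uses a tiny synthetic frame for illustration; the real method reads the processed dataset instead, and the column names other than Status are placeholders.

```python
# Sketch of the documented 80/20 split on a synthetic frame; the real
# get_training_data loads the processed dataset rather than building one.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Price": range(10),
    "Discount code": range(10),
    "Status": ["Closed Won", "Closed Lost", "Other", "Closed Won", "Other"] * 2,
})
X = df.drop(columns=["Status"])  # features
y = df["Status"]                 # labels: Closed Won / Closed Lost / Other

# test_size=0.2 gives the documented 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2
```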
trainer = ModelTraining()
X_train, X_test, y_train, y_test = trainer.get_training_data()

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

compare_classifiers(X_train, y_train)

Compares all 12 classification models using 5-fold cross-validation and returns their mean scores.
Parameters:
• X_train (pd.DataFrame, required): Training feature data
• y_train (pd.Series, required): Training target labels

Returns:
• scores (dict): Dictionary mapping model names to their mean cross-validation scores. Each model is evaluated in a pipeline that applies StandardScaler to the 'Price' and 'Discount code' columns.
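The per-model evaluation described above (StandardScaler on 'Price' and 'Discount code', then 5-fold cross-validation) can be sketched with a ColumnTransformer pipeline. The column names come from this page; the helper function and everything else below are assumptions about the internals, not the actual compare_classifiers code.

```python
# Hedged sketch of one model's evaluation inside compare_classifiers.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

def score_model(estimator, X_train, y_train):
    # Scale only the two numeric columns named in the docs; pass the rest through
    scaler = ColumnTransformer(
        [("scale", StandardScaler(), ["Price", "Discount code"])],
        remainder="passthrough",
    )
    pipe = Pipeline([("preprocess", scaler), ("model", estimator)])
    # 5-fold CV, as documented; returns the mean fold score
    return cross_val_score(pipe, X_train, y_train, cv=5).mean()
```

compare_classifiers would then loop this over the models list and collect the results into the scores dict.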
trainer = ModelTraining()
X_train, X_test, y_train, y_test = trainer.get_training_data()

# Compare all classifiers
model_scores = trainer.compare_classifiers(X_train, y_train)

# Display results
for model_name, score in model_scores.items():
    print(f"{model_name}: {score:.4f}")

# Find best model
best_model = max(model_scores, key=model_scores.get)
print(f"\nBest Model: {best_model} with score {model_scores[best_model]:.4f}")

get_lead_distribution(X_test, y_predicted, y_probabilities)

Generates lead distribution analysis by mapping predictions to original data and categorizing leads by conversion probability.
Parameters:
• X_test (pd.DataFrame, required): Test feature data with original index preserved
• y_predicted (np.ndarray, required): Predicted class labels as integers (0: Closed Lost, 1: Closed Won, 2: Other)
• y_probabilities (np.ndarray, required): Prediction probabilities array with shape (n_samples, n_classes)

Returns:
• impact_df (pd.DataFrame): DataFrame containing:
  • Observation: Sequential observation number
  • Use Case: Original use case from the data
  • Discount code: Discount code value
  • Loss Reason: Reason for loss (if applicable)
  • Source: Lead source
  • City: City location
  • Predicted Class: Mapped prediction ("Closed Won", "Closed Lost", "Other")
• probability_distribution (pd.Series): Count of observations in each probability range:
  • 25%: Probability 0-0.25
  • 50%: Probability 0.25-0.5
  • 75%: Probability 0.5-0.75
  • 100%: Probability 0.75-1.0
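The bucketing above maps each lead's probability into one of four quartile ranges, which can be sketched with pd.cut. Note the assumptions: that the bucketed value is the probability of the predicted (top) class, and that the bin edges behave as shown; the actual get_lead_distribution implementation may differ.

```python
# Sketch of the quartile bucketing described above (assumptions noted in the lead-in).
import numpy as np
import pandas as pd

# Toy probabilities for three leads (columns: Closed Lost, Closed Won, Other)
y_probabilities = np.array([
    [0.10, 0.70, 0.20],
    [0.40, 0.35, 0.25],
    [0.05, 0.90, 0.05],
])

# Assumption: bucket the probability of each lead's predicted (highest) class
top_prob = pd.Series(y_probabilities.max(axis=1))

probability_distribution = pd.cut(
    top_prob,
    bins=[0, 0.25, 0.5, 0.75, 1.0],
    labels=["25%", "50%", "75%", "100%"],
).value_counts()
print(probability_distribution)
```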
import numpy as np

trainer = ModelTraining()
X_train, X_test, y_train, y_test = trainer.get_training_data()

# Train a model (example with GradientBoosting)
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Get probabilities and predictions
y_probabilities = model.predict_proba(X_test)
y_predicted = np.argmax(y_probabilities, axis=1)

# Get distribution analysis
impact_df, prob_dist = trainer.get_lead_distribution(X_test, y_predicted, y_probabilities)

print("Probability Distribution:")
print(prob_dist)
print("\nSample predictions:")
print(impact_df.head())

run()

Executes the complete training pipeline: loads data, compares classifiers, trains the best model, and generates predictions.
Parameters:
• self (ModelTraining): Instance reference

Returns:
• data (dict): Dictionary containing training results:
  • predictions_df (pd.DataFrame): Lead predictions with factors
  • probability_distribution (pd.Series): Distribution of prediction probabilities
  • accuracy_score (float): Model accuracy on the test set
from src.models.train_model import ModelTraining

# Initialize and run complete training pipeline
trainer = ModelTraining()
results = trainer.run()

# Access results
print(f"Model Accuracy: {results['accuracy_score']:.4f}")
print("\nProbability Distribution:")
print(results['probability_distribution'])
print("\nSample Predictions:")
print(results['predictions_df'].head(10))

# Save predictions to file
results['predictions_df'].to_csv('lead_predictions.csv', index=False)

Example: Complete Workflow

from src.models.train_model import ModelTraining
import pandas as pd

# Initialize the trainer
trainer = ModelTraining()

# Option 1: Run the complete pipeline
results = trainer.run()
print(f"Accuracy: {results['accuracy_score']:.4f}")

# Option 2: Step-by-step execution for more control
X_train, X_test, y_train, y_test = trainer.get_training_data()

# Compare models
scores = trainer.compare_classifiers(X_train, y_train)
for model, score in sorted(scores.items(), key=lambda x: x[1], reverse=True):
    print(f"{model}: {score:.4f}")

# Train best model manually
best_model_name = max(scores, key=scores.get)
print(f"\nTraining {best_model_name}...")

# Access the best model from the models list
best_model_index = [name for name, _ in trainer.models].index(best_model_name)
best_model = trainer.models[best_model_index][1]

# Fit the best model and generate class probabilities
# (note: some estimators, e.g. SGDClassifier, do not expose predict_proba
# without calibration, and SVC requires probability=True)
best_model.fit(X_train, y_train)
y_probabilities = best_model.predict_proba(X_test)

# Convert probabilities to the integer class labels expected by get_lead_distribution
import numpy as np
y_predicted = np.argmax(y_probabilities, axis=1)

# Get detailed analysis
predictions_df, prob_dist = trainer.get_lead_distribution(X_test, y_predicted, y_probabilities)

print("\nPrediction Analysis:")
print(predictions_df.head())
