Prerequisites
Before you begin, ensure you have:
- Python 3.7 or higher installed
- Completed the installation steps
- Raw data files (leads.csv and offers.csv) in the data/raw/ directory
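The checklist above can be verified with a few lines of standard-library Python. This is a minimal sketch, not part of the project: the function name check_prerequisites is hypothetical, and it only checks the Python version and the two raw files, not the installation steps.

```python
import sys
from pathlib import Path

# Minimum interpreter version stated in the prerequisites.
MIN_PYTHON = (3, 7)
# Raw input files expected by the preprocessing step.
REQUIRED_FILES = ("leads.csv", "offers.csv")


def check_prerequisites(raw_dir="data/raw"):
    """Return a list of problems found; an empty list means ready to go."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python 3.7 or higher is required")
    for name in REQUIRED_FILES:
        path = Path(raw_dir) / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
    return problems


if __name__ == "__main__":
    # Print one line per unmet prerequisite (prints nothing when all is well).
    for problem in check_prerequisites():
        print(problem)
```

Run it from the project root so that the relative data/raw/ path resolves correctly.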
Data Preprocessing
The first step is to preprocess your raw lead and offer data. This process handles data fusion, missing values, feature engineering, and encoding.
Run the preprocessing script
Execute the data preprocessing module to prepare your datasets.
This script performs the following operations:
- Data Fusion: Merges leads.csv and offers.csv using unique identifiers
- Missing Data Handling:
  - Removes rows with null IDs
  - Imputes categorical columns with mode values
  - Imputes numerical columns with mean values
  - Fills Loss Reason based on Status values
- Feature Engineering:
  - Extracts year and month from date columns
  - Drops irrelevant columns (First Name, duplicate Use Case, etc.)
- Label Encoding: Converts categorical features to numerical format
- Target Mapping: Groups minority classes into ‘Other’ category
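The operations listed above could look roughly like the following pandas sketch. It is an illustration under several assumptions, not the project's actual script: the column names Id and Created Date, the function name preprocess, and the rule used to fill Loss Reason are all hypothetical, and pandas category codes stand in for a dedicated label encoder. The Status target is left as text so that target mapping can be applied separately.

```python
import pandas as pd


def preprocess(leads, offers, id_col="Id", date_col="Created Date"):
    """Sketch of the listed steps; id_col and date_col are assumed names."""
    # Missing data: rows without an ID cannot be matched, so drop them first.
    leads = leads.dropna(subset=[id_col])
    offers = offers.dropna(subset=[id_col])

    # Data fusion: keep every lead and attach offer columns on the shared ID.
    df = leads.merge(offers, on=id_col, how="left", suffixes=("", "_offer"))

    # Assumed rule for Loss Reason: deals that are not lost have no loss
    # reason, and lost deals with no recorded reason are marked as unknown.
    lost = df["Status"].eq("Closed Lost")
    df.loc[df["Loss Reason"].isna() & ~lost, "Loss Reason"] = "No Loss"
    df.loc[df["Loss Reason"].isna() & lost, "Loss Reason"] = "Unknown"

    # Imputation: mean for numerical columns, mode for categorical ones.
    for col in df.columns:
        if df[col].notna().sum() == 0:
            continue  # no observed values to impute from
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # Feature engineering: year and month from the date, then drop columns.
    dates = pd.to_datetime(df[date_col])
    df["Year"] = dates.dt.year
    df["Month"] = dates.dt.month
    df = df.drop(
        columns=[date_col, "First Name", "Use Case_offer"], errors="ignore"
    )

    # Label encoding: integer codes for the remaining text columns. The ID
    # and the Status target are left untouched here.
    for col in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        if col not in (id_col, "Status") and not is_numeric:
            df[col] = df[col].astype("category").cat.codes
    return df
```

Loss Reason is filled before the general imputation step on purpose. Otherwise the mode of the column would overwrite the status-dependent values.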
The preprocessing script automatically handles class imbalance by grouping minority status classes into a new ‘Other’ category, creating three balanced target classes: ‘Closed Won’, ‘Closed Lost’, and ‘Other’.
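The grouping described above can be sketched in a few lines of pandas. The helper name map_target and the example statuses Nurturing and Negotiation are hypothetical. Only ‘Closed Won’, ‘Closed Lost’, and ‘Other’ come from this guide.

```python
import pandas as pd

# Classes kept as-is; every other status is folded into "Other".
MAJORITY_CLASSES = ["Closed Won", "Closed Lost"]


def map_target(status):
    """Collapse all statuses outside MAJORITY_CLASSES into 'Other'."""
    return status.where(status.isin(MAJORITY_CLASSES), "Other")


# Example statuses; "Nurturing" and "Negotiation" are made-up minority classes.
status = pd.Series(
    ["Closed Won", "Nurturing", "Closed Lost", "Negotiation", "Closed Won"]
)
print(map_target(status).value_counts())
```

Missing statuses also fall into ‘Other’ under this sketch, because isin returns False for them.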
Model Training
Now train and evaluate multiple classification models to find the best performer.
Train the models
Run the training script to compare 12 different classifiers.
The training process:
- Loads the processed dataset from data/processed/full_dataset.csv
- Splits data into 80% training and 20% testing sets
- Applies StandardScaler to numerical features (Price and Discount code)
- Trains 12 models using 5-fold cross-validation
- Selects the best model based on cross-validation scores
- Evaluates performance on the test set
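The training steps above map onto scikit-learn roughly as follows. This is a hedged sketch rather than the project's training script: it uses synthetic data in place of data/processed/full_dataset.csv, compares three candidate models instead of the full 12, and the Month column, the candidate names, and the hyperparameters are assumptions made for illustration. Because the labels are random, the scores are meaningless. Only the workflow is the point.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed dataset (values are random).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "Price": rng.normal(1000, 250, n),
    "Discount code": rng.integers(0, 5, n).astype(float),
    "Month": rng.integers(1, 13, n),
})
y = rng.integers(0, 3, n)  # three encoded target classes

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale only the two numerical columns; pass the rest through unchanged.
scaler = ColumnTransformer(
    [("scale", StandardScaler(), ["Price", "Discount code"])],
    remainder="passthrough",
)

# Three illustrative candidates (the guide compares 12).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold cross-validation accuracy on the training set for each model.
cv_scores = {}
for name, model in candidates.items():
    pipeline = Pipeline([("scaler", scaler), ("model", model)])
    cv_scores[name] = cross_val_score(pipeline, X_train, y_train, cv=5).mean()

# Refit the best-scoring model on all training data, then score the test set.
best_name = max(cv_scores, key=cv_scores.get)
best_pipeline = Pipeline([("scaler", scaler), ("model", candidates[best_name])])
best_pipeline.fit(X_train, y_train)
test_accuracy = best_pipeline.score(X_test, y_test)
print(best_name, round(test_accuracy, 3))
```

Wrapping the scaler and the model in one Pipeline means the scaler is refit inside each cross-validation fold, so no statistics from the validation fold leak into training.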
Understanding the Model Pipeline
Here’s how the training pipeline works under the hood:
Model Configuration
The Gradient Boosting Classifier uses these hyperparameters:
random_state=42 is set to ensure reproducible results across training runs.
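The effect of a fixed random_state can be checked directly. The sketch below is an illustration only: the dataset is synthetic, the helper fit_predict is hypothetical, and subsample=0.8 is an assumption added so that training involves random row sampling. It is not one of the project's documented hyperparameters, which are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic three-class problem (stand-in data).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)


def fit_predict(seed):
    """Train a fresh model with the given seed and return class probabilities."""
    # subsample < 1.0 makes each boosting stage draw a random subset of rows.
    model = GradientBoostingClassifier(subsample=0.8, random_state=seed)
    return model.fit(X, y).predict_proba(X)


# Two independent runs with the same seed produce identical predictions.
same_seed_identical = np.array_equal(fit_predict(42), fit_predict(42))
print(same_seed_identical)
```

With a different seed, the sampled rows differ and the predictions may differ as well, which is why a fixed value is used for comparisons between runs.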
Deploy with Shimoku Dashboard
Visualize predictions and model performance using the Shimoku API integration.
Next Steps
- Model Details: Learn about the 12 classification algorithms evaluated
- Data Schema: Understand the input features and target variables