
Prerequisites

Before you begin, ensure you have:
  • Python 3.7 or higher installed
  • Completed the installation steps
  • Raw data files (leads.csv and offers.csv) in the data/raw/ directory

Data Preprocessing

The first step is to preprocess your raw lead and offer data. This process handles data fusion, missing values, feature engineering, and encoding.
1. Run the preprocessing script

Execute the data preprocessing module to prepare your datasets:
python3 -m src.eda.data_preprocessing
This script performs the following operations:
  • Data Fusion: Merges leads.csv and offers.csv using unique identifiers
  • Missing Data Handling:
    • Removes rows with null IDs
    • Imputes categorical columns with mode values
    • Imputes numerical columns with mean values
    • Fills Loss Reason based on Status values
  • Feature Engineering:
    • Extracts year and month from date columns
    • Drops irrelevant columns (First Name, duplicate Use Case, etc.)
  • Label Encoding: Converts categorical features to numerical format
  • Target Mapping: Groups minority classes into ‘Other’ category
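The fusion, imputation, and date-extraction steps above can be sketched as follows. This is a minimal illustration with made-up miniature data, not the actual script; the column names (`Acquisition Campaign`, `Created Date`) are assumptions for the example:

```python
import pandas as pd

# Hypothetical miniature versions of leads.csv and offers.csv
leads = pd.DataFrame({
    "Id": [1, 2, 3, None],
    "Acquisition Campaign": ["Ads", None, "Referral", "Ads"],
    "Created Date": ["2023-01-15", "2023-02-20", "2023-03-05", "2023-01-10"],
})
offers = pd.DataFrame({
    "Id": [1, 2, 3],
    "Price": [100.0, None, 300.0],
    "Status": ["Closed Won", "Closed Lost", "Closed Won"],
})

# Data fusion: merge on the shared identifier (rows with null IDs drop out)
df = leads.merge(offers, on="Id", how="inner")

# Missing data handling: impute categorical with mode, numerical with mean
df["Acquisition Campaign"] = df["Acquisition Campaign"].fillna(
    df["Acquisition Campaign"].mode()[0]
)
df["Price"] = df["Price"].fillna(df["Price"].mean())

# Feature engineering: extract year and month, then drop the raw date
df["Created Date"] = pd.to_datetime(df["Created Date"])
df["Created Year"] = df["Created Date"].dt.year
df["Created Month"] = df["Created Date"].dt.month
df = df.drop(columns=["Created Date"])
```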
2. Verify processed data

After preprocessing, verify that the processed dataset was created:
ls data/processed/
You should see full_dataset.csv containing the cleaned and encoded data.
The preprocessing script automatically handles class imbalance by grouping minority status classes into a new ‘Other’ category, creating three balanced target classes: ‘Closed Won’, ‘Closed Lost’, and ‘Other’.
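The grouping described above can be sketched in a few lines. The minority status names below (`Qualified`, `Negotiation`, `Contacted`) are invented for illustration; only the three final target classes come from the docs:

```python
import pandas as pd

# Hypothetical status distribution before grouping
status = pd.Series([
    "Closed Won", "Closed Lost", "Closed Won", "Closed Lost",
    "Qualified", "Negotiation", "Contacted",
])

# Keep the majority classes; fold every minority class into 'Other'
majority = {"Closed Won", "Closed Lost"}
grouped = status.where(status.isin(majority), other="Other")

print(grouped.value_counts())
```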

Model Training

Now train and evaluate multiple classification models to find the best performer.
1. Train the models

Run the training script to compare 12 different classifiers:
python3 -m src.models.train_model
The training process:
  1. Loads the processed dataset from data/processed/full_dataset.csv
  2. Splits data into 80% training and 20% testing sets
  3. Applies StandardScaler to numerical features (Price and Discount code)
  4. Trains 12 models using 5-fold cross-validation
  5. Selects the best model based on cross-validation scores
  6. Evaluates performance on the test set
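The model-comparison loop (steps 4 and 5) can be sketched as below. This uses synthetic stand-in data and only three of the twelve candidates; the real script loads `data/processed/full_dataset.csv` and its exact model list and parameters may differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data with three target classes, like the processed dataset
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=42)

# Compare candidates with 5-fold cross-validation, as the script does
models = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "Logistic": LogisticRegression(max_iter=1000),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

# Select the best model by mean cross-validation score
best = max(scores, key=scores.get)
print(f"Best Model: {best} with Score: {scores[best]:.2f}")
```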
2. Review model performance

The script outputs cross-validation scores for all models:
Model scores:
RandomForest: 0.87
Adaboost: 0.85
ExtraTree: 0.88
BaggingClassifier: 0.84
GradientBoosting: 0.91
DecisionTree: 0.78
NaiveBayes: 0.72
KNN: 0.81
Logistic: 0.83
SGD Classifier: 0.80
MLPClassifier: 0.86
SVM: 0.88

Best Model: GradientBoosting with Score: 0.91
Accuracy Score: 0.904
The Gradient Boosting model achieves the highest performance with 90.4% accuracy.

Understanding the Model Pipeline

Here’s how the training pipeline works under the hood:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load processed data
data = pd.read_csv("data/processed/full_dataset.csv")

# Split features and target
class_label = 'Status'
X = data.drop([class_label], axis=1)
y = data[class_label]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    random_state=42, 
    shuffle=True, 
    test_size=0.2
)
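Step 3 of the training process scales only the numerical features (Price and Discount code) while leaving the label-encoded columns untouched. One way to express that, sketched here with a hypothetical frame rather than the project's actual code, is a `ColumnTransformer`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical training frame; 'Source' stands in for an encoded feature
X_train = pd.DataFrame({
    "Price": [100.0, 250.0, 400.0],
    "Discount code": [5.0, 10.0, 0.0],
    "Source": [1, 2, 1],  # already label-encoded, passed through unscaled
})

# Scale only the two numerical columns; pass everything else through
scaler = ColumnTransformer(
    [("num", StandardScaler(), ["Price", "Discount code"])],
    remainder="passthrough",
)
X_scaled = scaler.fit_transform(X_train)
```

Scaled columns come first in the transformed output, followed by the passthrough columns.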

Model Configuration

The Gradient Boosting Classifier uses scikit-learn's default hyperparameters with a fixed random seed:
GradientBoostingClassifier(
    random_state=42  # For reproducibility
)
All models are initialized with random_state=42 to ensure reproducible results across training runs.

Deploy with Shimoku Dashboard

Visualize predictions and model performance using the Shimoku API integration.
1. Set environment variables

Configure your Shimoku credentials:
export SHIMOKU_TOKEN="your_access_token"
export SHIMOKU_UNIVERSE_ID="your_universe_id"
export SHIMOKU_WORKSPACE_ID="your_workspace_id"
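A quick way to confirm the credentials are visible to Python before launching the app is a small sanity check. This snippet is a convenience sketch, not part of the project:

```python
import os

# The three variables exported above must be set for the dashboard to start
required = ("SHIMOKU_TOKEN", "SHIMOKU_UNIVERSE_ID", "SHIMOKU_WORKSPACE_ID")
missing = [name for name in required if not os.environ.get(name)]
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All Shimoku credentials are set.")
```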
2. Run the application

Launch the dashboard application:
python3 src/app.py
This creates an interactive dashboard displaying:
  • Lead conversion predictions with probability scores
  • Predicted class distribution across test data
  • Model accuracy metrics
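The probability scores and class distribution the dashboard displays come from the trained classifier. A minimal sketch of producing them, using synthetic stand-in data instead of the real processed dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; the app uses the trained model and the real test split
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.2)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Per-lead probability scores feed the conversion-prediction charts
proba = model.predict_proba(X_test)

# Predicted classes feed the class-distribution chart
preds = model.predict(X_test)

# Overall accuracy feeds the metrics panel
accuracy = model.score(X_test, y_test)
```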

Next Steps

Model Details

Learn about the 12 classification algorithms evaluated

Data Schema

Understand the input features and target variables
