
Prerequisites

Before you begin, ensure you have:
  • Python 3.7 or higher installed
  • Completed the installation steps
  • Raw data files (leads.csv and offers.csv) in the data/raw/ directory

Data Preprocessing

The first step is to preprocess your raw lead and offer data. This process handles data fusion, missing values, feature engineering, and encoding.
1. Run the preprocessing script

Execute the data preprocessing module to prepare your datasets:
python3 -m src.eda.data_preprocessing
This script performs the following operations:
  • Data Fusion: Merges leads.csv and offers.csv using unique identifiers
  • Missing Data Handling:
    • Removes rows with null IDs
    • Imputes categorical columns with mode values
    • Imputes numerical columns with mean values
    • Fills Loss Reason based on Status values
  • Feature Engineering:
    • Extracts year and month from date columns
    • Drops irrelevant columns (First Name, duplicate Use Case, etc.)
  • Label Encoding: Converts categorical features to numerical format
  • Target Mapping: Groups minority classes into ‘Other’ category
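The fusion, imputation, and date-extraction steps above can be sketched as follows. This is a minimal illustration with made-up miniature data, not the actual script; the column names (`Acquisition Campaign`, `Created Date`) are assumptions for the example:

```python
import pandas as pd

# Hypothetical miniature versions of leads.csv and offers.csv
leads = pd.DataFrame({
    "Id": [1, 2, 3, None],
    "Acquisition Campaign": ["Ads", None, "Referral", "Ads"],
    "Created Date": ["2023-01-15", "2023-02-20", "2023-03-05", "2023-01-10"],
})
offers = pd.DataFrame({
    "Id": [1, 2, 3],
    "Price": [100.0, None, 300.0],
    "Status": ["Closed Won", "Closed Lost", "Closed Won"],
})

# Data fusion: merge on the shared identifier (rows with null IDs drop out)
df = leads.merge(offers, on="Id", how="inner")

# Missing data handling: impute categorical with mode, numerical with mean
df["Acquisition Campaign"] = df["Acquisition Campaign"].fillna(
    df["Acquisition Campaign"].mode()[0]
)
df["Price"] = df["Price"].fillna(df["Price"].mean())

# Feature engineering: extract year and month, then drop the raw date
df["Created Date"] = pd.to_datetime(df["Created Date"])
df["Created Year"] = df["Created Date"].dt.year
df["Created Month"] = df["Created Date"].dt.month
df = df.drop(columns=["Created Date"])
```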
2. Verify processed data

After preprocessing, verify that the processed dataset was created:
ls data/processed/
You should see full_dataset.csv containing the cleaned and encoded data.
The preprocessing script automatically handles class imbalance by grouping minority status classes into a new ‘Other’ category, creating three balanced target classes: ‘Closed Won’, ‘Closed Lost’, and ‘Other’.
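The grouping described above can be sketched in a few lines. The minority status names below (`Qualified`, `Negotiation`, `Contacted`) are invented for illustration; only the three final target classes come from the docs:

```python
import pandas as pd

# Hypothetical status distribution before grouping
status = pd.Series([
    "Closed Won", "Closed Lost", "Closed Won", "Closed Lost",
    "Qualified", "Negotiation", "Contacted",
])

# Keep the majority classes; fold every minority class into 'Other'
majority = {"Closed Won", "Closed Lost"}
grouped = status.where(status.isin(majority), other="Other")

print(grouped.value_counts())
```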

Model Training

Now train and evaluate multiple classification models to find the best performer.
1. Train the models

Run the training script to compare 12 different classifiers:
python3 -m src.models.train_model
The training process:
  1. Loads the processed dataset from data/processed/full_dataset.csv
  2. Splits data into 80% training and 20% testing sets
  3. Applies StandardScaler to numerical features (Price and Discount code)
  4. Trains 12 models using 5-fold cross-validation
  5. Selects the best model based on cross-validation scores
  6. Evaluates performance on the test set
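The model-comparison loop (steps 4 and 5) can be sketched as below. This uses synthetic stand-in data and only three of the twelve candidates; the real script loads `data/processed/full_dataset.csv` and its exact model list and parameters may differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data with three target classes, like the processed dataset
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=42)

# Compare candidates with 5-fold cross-validation, as the script does
models = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "Logistic": LogisticRegression(max_iter=1000),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

# Select the best model by mean cross-validation score
best = max(scores, key=scores.get)
print(f"Best Model: {best} with Score: {scores[best]:.2f}")
```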
2. Review model performance

The script outputs cross-validation scores for all models:
Model scores:
RandomForest: 0.87
Adaboost: 0.85
ExtraTree: 0.88
BaggingClassifier: 0.84
GradientBoosting: 0.91
DecisionTree: 0.78
NaiveBayes: 0.72
KNN: 0.81
Logistic: 0.83
SGD Classifier: 0.80
MLPClassifier: 0.86
SVM: 0.88

Best Model: GradientBoosting with Score: 0.91
Accuracy Score: 0.904
The Gradient Boosting model achieves the highest performance with 90.4% accuracy.

Understanding the Model Pipeline

Here’s how the training pipeline works under the hood:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load processed data
data = pd.read_csv("data/processed/full_dataset.csv")

# Split features and target
class_label = 'Status'
X = data.drop([class_label], axis=1)
y = data[class_label]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    random_state=42, 
    shuffle=True, 
    test_size=0.2
)
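Step 3 of the training process scales only the numerical features (Price and Discount code) while leaving the label-encoded columns untouched. One way to express that, sketched here with a hypothetical frame rather than the project's actual code, is a `ColumnTransformer`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical training frame; 'Source' stands in for an encoded feature
X_train = pd.DataFrame({
    "Price": [100.0, 250.0, 400.0],
    "Discount code": [5.0, 10.0, 0.0],
    "Source": [1, 2, 1],  # already label-encoded, passed through unscaled
})

# Scale only the two numerical columns; pass everything else through
scaler = ColumnTransformer(
    [("num", StandardScaler(), ["Price", "Discount code"])],
    remainder="passthrough",
)
X_scaled = scaler.fit_transform(X_train)
```

Scaled columns come first in the transformed output, followed by the passthrough columns.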

Model Configuration

The Gradient Boosting Classifier uses scikit-learn's default hyperparameters with a fixed random seed:
GradientBoostingClassifier(
    random_state=42  # For reproducibility
)
All models are initialized with random_state=42 to ensure reproducible results across training runs.

Deploy with Shimoku Dashboard

Visualize predictions and model performance using the Shimoku API integration.
1. Set environment variables

Configure your Shimoku credentials:
export SHIMOKU_TOKEN="your_access_token"
export SHIMOKU_UNIVERSE_ID="your_universe_id"
export SHIMOKU_WORKSPACE_ID="your_workspace_id"
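A quick way to confirm the credentials are visible to Python before launching the app is a small sanity check. This snippet is a convenience sketch, not part of the project:

```python
import os

# The three variables exported above must be set for the dashboard to start
required = ("SHIMOKU_TOKEN", "SHIMOKU_UNIVERSE_ID", "SHIMOKU_WORKSPACE_ID")
missing = [name for name in required if not os.environ.get(name)]
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All Shimoku credentials are set.")
```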
2. Run the application

Launch the dashboard application:
python3 src/app.py
This creates an interactive dashboard displaying:
  • Lead conversion predictions with probability scores
  • Predicted class distribution across test data
  • Model accuracy metrics
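The probability scores and class distribution the dashboard displays come from the trained classifier. A minimal sketch of producing them, using synthetic stand-in data instead of the real processed dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; the app uses the trained model and the real test split
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.2)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Per-lead probability scores feed the conversion-prediction charts
proba = model.predict_proba(X_test)

# Predicted classes feed the class-distribution chart
preds = model.predict(X_test)

# Overall accuracy feeds the metrics panel
accuracy = model.score(X_test, y_test)
```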

Next Steps

Model Details

Learn about the 12 classification algorithms evaluated

Data Schema

Understand the input features and target variables
