
Pipeline Overview

The preprocessing pipeline transforms raw sales data into a clean, model-ready dataset through a series of systematic transformations. The complete process is implemented in src/data/data_preprocessing.py.
  1. Data Loading: Load raw CSV files from the data/raw/ directory
  2. Data Cleaning: Remove null values and drop irrelevant columns
  3. Data Fusion: Merge the leads and offers datasets
  4. Feature Engineering: Extract temporal features and handle missing values
  5. Target Mapping: Consolidate minority classes
  6. Encoding: Transform categorical variables to numerical format
  7. Output: Save the processed dataset to data/processed/full_dataset.csv

Step 1: Data Loading

The pipeline begins by loading the two raw datasets:
import pandas as pd

# Load raw datasets
leads_data = pd.read_csv("data/raw/leads.csv")
offers_data = pd.read_csv("data/raw/offers.csv")
File locations:
  • data/raw/leads.csv - All potential clients
  • data/raw/offers.csv - Clients who reached demo stage

Step 2: Data Cleaning

Leads Dataset Cleaning

The leads dataset undergoes initial cleaning to ensure data quality and remove redundancy:
# Delete rows with null values in the 'Id' column
leads_data_cleaned = leads_data.dropna(subset=['Id'])

# Drop multiple columns
leads_data_cleaned = leads_data_cleaned.drop([
    'First Name',      # Irrelevant PII
    'Use Case',        # Duplicates offers.csv field
    'Created Date',    # Duplicates offers.csv field
    'Status',          # Refers to offers.csv Status
    'Converted'        # Less granular than offers.csv Status
], axis=1)
The cleaned leads dataset is saved to data/interim/leads_data_cleaned.csv for inspection and debugging.
Rationale for dropped columns:
  • First Name: personally identifiable information (PII) with no predictive value; individual names don't generalize to predict conversion patterns.
  • Use Case and Created Date: duplicate fields already present in offers.csv.
  • Status and Converted: refer to (or are less granular than) the Status field in offers.csv, which later becomes the target variable.

Step 3: Data Fusion

The cleaned datasets are merged using a left join:
# Merge the datasets using the 'Id' column as a key
full_dataset = pd.merge(offers_data, leads_data_cleaned, on='Id', how='left')
full_dataset.to_csv("data/interim/full_dataset.csv", index=False)
Left join strategy: All records from offers_data are retained. This ensures every offer is included in the model, even if lead information is missing (handled through imputation).
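One way to verify the join coverage is pandas' indicator=True option, which tags each row with its origin. A minimal sketch on toy frames (the Ids and values here are hypothetical, not the real data):

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
offers_data = pd.DataFrame({"Id": [1, 2, 3], "Status": ["Closed Won", "Closed Lost", "Open"]})
leads_data_cleaned = pd.DataFrame({"Id": [1, 2], "City": ["Madrid", "Lisbon"]})

# indicator=True adds a '_merge' column marking where each row came from
full_dataset = pd.merge(offers_data, leads_data_cleaned, on="Id", how="left", indicator=True)

# Every offer survives the left join; some simply lack lead information
missing_lead_info = (full_dataset["_merge"] == "left_only").sum()
print(missing_lead_info)  # 1 offer (Id 3) has no matching lead
full_dataset = full_dataset.drop("_merge", axis=1)
```

Rows counted as left_only are exactly those whose lead fields will need imputation later.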

Post-Merge Cleanup

After merging, additional columns are removed:
# Drop columns with excessive missing data or no predictive value
full_dataset_preprocessed = full_dataset.drop([
    'Id',                        # Not predictive
    'Discarded/Nurturing Reason', # >80% null values
    'Acquisition Campaign'        # >80% null values
], axis=1)
Columns with more than 80% null values are dropped as the missing data is too significant to provide reliable information for modeling.
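The threshold can be checked programmatically before deciding which columns to drop. A sketch on a toy frame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame; 'Acquisition Campaign' is entirely null in this example
df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Acquisition Campaign": [np.nan] * 5,
    "Price": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Fraction of null values per column
null_ratio = df.isnull().mean()

# Columns exceeding the 80% threshold are candidates for removal
cols_to_drop = null_ratio[null_ratio > 0.8].index.tolist()
df = df.drop(cols_to_drop, axis=1)
print(cols_to_drop)  # ['Acquisition Campaign']
```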

Step 4: Feature Engineering

Temporal Feature Extraction

Date fields are parsed and decomposed into temporal components:
# Parse dates
full_dataset_preprocessed['Created Date'] = pd.to_datetime(
    full_dataset_preprocessed['Created Date'], 
    format="%Y-%m-%d"
)
full_dataset_preprocessed['Close Date'] = pd.to_datetime(
    full_dataset_preprocessed['Close Date'], 
    format="%Y-%m-%d"
)

# Extract temporal features
full_dataset_preprocessed['Created Year'] = full_dataset_preprocessed['Created Date'].dt.year
full_dataset_preprocessed['Created Month'] = full_dataset_preprocessed['Created Date'].dt.month
full_dataset_preprocessed['Close Year'] = full_dataset_preprocessed['Close Date'].dt.year
full_dataset_preprocessed['Close Month'] = full_dataset_preprocessed['Close Date'].dt.month

# Drop original date columns
full_dataset_preprocessed = full_dataset_preprocessed.drop(
    ['Created Date', 'Close Date'], 
    axis=1
)
Benefits of temporal decomposition:
  • Captures seasonality patterns (monthly trends)
  • Captures yearly trends
  • Converts dates to numerical format for ML algorithms
  • Maintains temporal information without high cardinality

Missing Value Imputation

Loss Reason - Conditional Imputation

Special logic handles the Loss Reason field based on Status:
import numpy as np

# If 'Status' is 'Closed Lost' and Loss Reason is null, fill with 'no response'
full_dataset_preprocessed['Loss Reason'] = np.where(
    (full_dataset_preprocessed['Status'] == 'Closed Lost') & 
    (full_dataset_preprocessed['Loss Reason'].isnull()),
    'no response',
    full_dataset_preprocessed['Loss Reason']
)

# Fill remaining nulls (e.g. 'Closed Won' rows) with the mode among 'Closed Won' cases
mode_closed_won = full_dataset_preprocessed.loc[
    full_dataset_preprocessed['Status'] == 'Closed Won', 
    'Loss Reason'
].mode()[0]
full_dataset_preprocessed['Loss Reason'] = full_dataset_preprocessed['Loss Reason'].fillna(mode_closed_won)
Loss Reason has different semantics based on the outcome:
  • Closed Lost: Missing Loss Reason likely means “no response” from the client
  • Closed Won: Use the most common Loss Reason value from other won cases (though this is somewhat paradoxical, it maintains data completeness)
This context-aware imputation preserves the relationship between Status and Loss Reason.
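The same two-step logic can be exercised on a toy frame to confirm the behavior (the values here are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Status": ["Closed Lost", "Closed Lost", "Closed Won", "Closed Won"],
    "Loss Reason": [np.nan, "price", "other", np.nan],
})

# Step 1: Closed Lost with a null reason becomes 'no response'
df["Loss Reason"] = np.where(
    (df["Status"] == "Closed Lost") & (df["Loss Reason"].isnull()),
    "no response",
    df["Loss Reason"],
)

# Step 2: remaining nulls take the mode among 'Closed Won' rows
mode_closed_won = df.loc[df["Status"] == "Closed Won", "Loss Reason"].mode()[0]
df["Loss Reason"] = df["Loss Reason"].fillna(mode_closed_won)
print(df["Loss Reason"].tolist())  # ['no response', 'price', 'other', 'other']
```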

General Imputation Strategy

Remaining missing values are handled automatically by data type:
for col in full_dataset_preprocessed.columns:
    # Categorical and datetime columns → mode
    if full_dataset_preprocessed[col].dtype in ['object', 'datetime64[ns]']:
        full_dataset_preprocessed[col] = full_dataset_preprocessed[col].fillna(
            full_dataset_preprocessed[col].mode()[0]
        )
    # Numerical columns → mean
    elif full_dataset_preprocessed[col].dtype in ['int64', 'float64', 'int32', 'float32']:
        full_dataset_preprocessed[col] = full_dataset_preprocessed[col].fillna(
            full_dataset_preprocessed[col].mean()
        )

Categorical Features

Imputation: Mode (most frequent value). Preserves the distribution of categorical variables.

Numerical Features

Imputation: Mean (average value). Maintains the central tendency of continuous variables.
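A condensed sketch of this type-dispatched imputation on toy data (the column names and values are assumed for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Madrid", "Madrid", None, "Lisbon"],   # categorical → mode
    "Price": [10.0, np.nan, 30.0, 20.0],            # numerical → mean
})

for col in df.columns:
    if df[col].dtype == "object":
        # Most frequent value for categorical columns
        df[col] = df[col].fillna(df[col].mode()[0])
    elif df[col].dtype in ["int64", "float64"]:
        # Column average for numerical columns
        df[col] = df[col].fillna(df[col].mean())

print(df["City"].tolist())   # None replaced by 'Madrid'
print(df["Price"].tolist())  # NaN replaced by 20.0 (the mean of 10, 30, 20)
```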

Step 5: Target Variable Mapping

To address class imbalance, minority classes in the Status field are consolidated:
# Define mapping for main classes
clase_mapping = {
    'Closed Won': 'Closed Won',
    'Closed Lost': 'Closed Lost'
}

# Assign 'Other' to all classes not in the mapping
full_dataset_preprocessed['Status'] = full_dataset_preprocessed['Status'].map(
    clase_mapping
).fillna('Other')
Result: Three-class target variable:
  • Closed Won - Successful conversions
  • Closed Lost - Failed conversions
  • Other - All minority status categories
This transformation addresses class imbalance while preserving the critical distinction between won and lost opportunities.
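The map-then-fillna idiom is easy to verify on a toy Series (the minority labels shown here are hypothetical):

```python
import pandas as pd

status = pd.Series(["Closed Won", "Closed Lost", "Nurturing", "Demo 1", "Closed Won"])

# Only the two main classes are kept; everything else maps to NaN → 'Other'
clase_mapping = {"Closed Won": "Closed Won", "Closed Lost": "Closed Lost"}
mapped = status.map(clase_mapping).fillna("Other")
print(mapped.tolist())  # ['Closed Won', 'Closed Lost', 'Other', 'Other', 'Closed Won']
```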

Step 6: Label Encoding

All categorical variables are transformed to numerical format using Label Encoding:
from sklearn.preprocessing import LabelEncoder

# Create encoder instance
label_encoder = LabelEncoder()

# Select categorical columns
categorical_columns = full_dataset_preprocessed.select_dtypes(
    ['object', 'datetime64[ns]']
).columns

# Apply LabelEncoder to each categorical column
for column in categorical_columns:
    if column in full_dataset_preprocessed.columns:
        full_dataset_preprocessed[column] = label_encoder.fit_transform(
            full_dataset_preprocessed[column]
        )
Encoded columns:
  • Source
  • City
  • Use Case
  • Pain
  • Loss Reason
  • Status (target variable)
  • Discount code (treated as categorical)
Important: A single LabelEncoder instance is refit on each column via fit_transform(), so each column's encoding is independent of the others. This is acceptable for tree-based models such as Gradient Boosting, which make no assumptions about the ordering of the integer labels. For production deployment, a fitted encoder per column should be saved and reused so that new data is encoded consistently.
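One way to keep a reusable encoder per column is a dict of fitted LabelEncoder objects, which could then be persisted (e.g. with joblib.dump). A sketch on illustrative data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({"City": ["Madrid", "Lisbon", "Madrid"], "Source": ["web", "ad", "web"]})

# Fit one encoder per column and keep it for later reuse
encoders = {}
for column in ["City", "Source"]:
    enc = LabelEncoder()
    train[column] = enc.fit_transform(train[column])
    encoders[column] = enc

# At inference time, reuse the stored encoders so labels map identically
new_data = pd.DataFrame({"City": ["Lisbon"], "Source": ["ad"]})
for column, enc in encoders.items():
    new_data[column] = enc.transform(new_data[column])
print(new_data.to_dict("list"))
```

Calling transform (not fit_transform) on new data also surfaces unseen categories as an explicit error instead of silently re-mapping them.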

Step 7: Final Output

The fully preprocessed dataset is saved:
full_dataset_preprocessed.to_csv("data/processed/full_dataset.csv", index=False)
Output location: data/processed/full_dataset.csv
This file is ready for model training with:
  • No missing values
  • All categorical variables encoded
  • Temporal features extracted
  • Target variable properly formatted
  • No irrelevant or redundant columns

Model Training Pipeline Integration

The preprocessed data feeds into the model training pipeline:
# From src/models/train_model.py
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load processed dataset
data = pd.read_csv("data/processed/full_dataset.csv")

# Split features and target
class_label = 'Status'
X = data.drop([class_label], axis=1)
y = data[class_label]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    random_state=42, 
    shuffle=True, 
    test_size=0.2
)

# Additional scaling in pipeline
ct = ColumnTransformer([
    ('se', StandardScaler(), ['Price', 'Discount code'])
], remainder='passthrough')

pipeline = Pipeline([
    ('transformer', ct),
    ('model', GradientBoostingClassifier(random_state=42))
])
Additional preprocessing at training time: Numerical features (Price, Discount code) are scaled using StandardScaler within the model pipeline. This ensures proper scaling is applied during both training and inference.
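A self-contained sketch of fitting such a pipeline on synthetic stand-in data (the shapes and values are made up; real training uses the processed CSV):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the processed dataset
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "Price": rng.normal(100, 20, 60),
    "Discount code": rng.integers(0, 5, 60),
    "City": rng.integers(0, 3, 60),
})
y = rng.integers(0, 3, 60)

ct = ColumnTransformer(
    [("se", StandardScaler(), ["Price", "Discount code"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("transformer", ct),
    ("model", GradientBoostingClassifier(random_state=42)),
])

# fit() scales and trains in one step; predict() applies the same fitted scaler
pipeline.fit(X, y)
preds = pipeline.predict(X)
print(preds.shape)
```

Because the scaler lives inside the pipeline, serializing the pipeline object captures both the scaling parameters and the model.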

Complete Preprocessing Summary

  • Null Handling: Drop rows with null Id; impute other nulls
  • Column Removal: Drop 8 columns (PII, duplicates, high nulls)
  • Data Fusion: Left join offers → leads on Id
  • Temporal Features: Extract year/month from dates
  • Target Mapping: 3-class Status (Closed Won/Lost/Other)
  • Encoding: Label encode all categorical features
  • Scaling: StandardScaler for Price, Discount code

Data Cleaning Utilities

The src/data/data_cleaning.py module provides a reusable DataCleaning class with methods for common preprocessing tasks:
def load_dataset(self, dataset_path):
    """Load dataset and save them as dataframes."""
    dataframe = pd.read_csv(f"{self.main_path}/{dataset_path}")
    return dataframe
Loads CSV files from specified paths.
def inspect_dataset(self, dataframe):
    """Inspect information from dataset"""
    print(dataframe.shape)
    print(dataframe.head())
    dataframe.info()
    print(dataframe.isnull().sum())
Comprehensive dataset inspection including shape, preview, types, and null counts.
def handle_missing_values(self, dataframe, columns, method='drop'):
    """Handle null values in a DataFrame.
    
    Parameters:
        method: 'drop', 'mean', 'median', or 'mode'
    """
Flexible missing value handling with multiple strategies.
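The method body is elided above; a plausible implementation consistent with the docstring, written here as a standalone function for illustration (not the project's actual code):

```python
import pandas as pd

# Hypothetical standalone version of DataCleaning.handle_missing_values
def handle_missing_values(dataframe, columns, method='drop'):
    """Handle null values in selected columns ('drop', 'mean', 'median', or 'mode')."""
    if method == 'drop':
        return dataframe.dropna(subset=columns)
    for col in columns:
        if method == 'mean':
            fill_value = dataframe[col].mean()
        elif method == 'median':
            fill_value = dataframe[col].median()
        elif method == 'mode':
            fill_value = dataframe[col].mode()[0]
        else:
            raise ValueError(f"Unknown method: {method}")
        dataframe[col] = dataframe[col].fillna(fill_value)
    return dataframe

df = pd.DataFrame({"Price": [10.0, None, 30.0]})
filled = handle_missing_values(df, ["Price"], method="mean")
print(filled["Price"].tolist())  # [10.0, 20.0, 30.0]
```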
def drop_columns(self, dataframe, columns_to_drop):
    """Remove specific columns from a DataFrame."""
    return dataframe.drop(columns=columns_to_drop, errors='ignore')
Safely remove specified columns.

Running the Preprocessing Pipeline

To execute the preprocessing pipeline:
python3 -m src.data.data_preprocessing
This generates:
  • data/interim/leads_data_cleaned.csv
  • data/interim/full_dataset.csv
  • data/processed/full_dataset.csv (final output)

Next: Model Training

Learn how to train and evaluate the lead scoring models
