Overview

The data preprocessing module (data_preprocessing.py) performs feature engineering, data cleaning, and transformation on the raw lead and offer datasets. It runs as a standalone script and writes both cleaned intermediate files and a final processed dataset.

Input Data

The preprocessing pipeline expects two CSV files in the data/raw/ directory:
leads.csv
CSV file
required
Raw leads dataset containing:
  • Id: Lead identifier
  • First Name: Lead first name
  • Use Case: Lead use case
  • Created Date: Lead creation date
  • Status: Lead status
  • Converted: Conversion flag
  • Additional lead attributes
offers.csv
CSV file
required
Raw offers dataset containing:
  • Id: Offer identifier (matches lead Id)
  • Use Case: Offer use case
  • Created Date: Offer creation date
  • Close Date: Offer close date
  • Status: Offer status (Closed Won, Closed Lost, etc.)
  • Price: Offer price
  • Discount code: Discount applied
  • Loss Reason: Reason for lost deals
  • Source: Lead source
  • City: City location
  • Discarded/Nurturing Reason: Reason for discarding
  • Acquisition Campaign: Marketing campaign
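
Before running the pipeline, it can help to check that the raw files actually contain the columns listed above. The sketch below is not part of the script; the column subsets and the `validate_columns` helper are illustrative.

```python
import pandas as pd

# A subset of the required columns per raw file, taken from the lists above.
REQUIRED_COLUMNS = {
    "leads": {"Id", "First Name", "Use Case", "Created Date", "Status", "Converted"},
    "offers": {"Id", "Use Case", "Created Date", "Close Date", "Status", "Price"},
}

def validate_columns(df: pd.DataFrame, dataset: str) -> None:
    """Raise a descriptive error if a required column is missing."""
    missing = REQUIRED_COLUMNS[dataset] - set(df.columns)
    if missing:
        raise ValueError(f"{dataset}: missing columns {sorted(missing)}")

# A frame with every required lead column passes silently.
leads = pd.DataFrame(columns=["Id", "First Name", "Use Case",
                              "Created Date", "Status", "Converted"])
validate_columns(leads, "leads")
```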

Processing Steps

1. Data Loading

import pandas as pd

leads_data = pd.read_csv("data/raw/leads.csv")
offers_data = pd.read_csv("data/raw/offers.csv")

2. Initial Cleaning (Leads Dataset)

Null Value Removal:
# Delete rows with null values in the 'Id' column
leads_data_cleaned = leads_data.dropna(subset=['Id'])
Column Removal: Drops redundant and irrelevant columns from the leads dataset:
  • First Name - Irrelevant for prediction
  • Use Case - Duplicate of offers dataset column
  • Created Date - Duplicate of offers dataset column
  • Status - Superseded by offers status
  • Converted - Superseded by offers status
leads_data_cleaned = leads_data_cleaned.drop(
    ['First Name', 'Use Case', 'Created Date', 'Status', 'Converted'], 
    axis=1
)
Output: data/interim/leads_data_cleaned.csv

3. Dataset Merging

Merges offers and cleaned leads data using left join on the Id column:
full_dataset = pd.merge(offers_data, leads_data_cleaned, on='Id', how='left')
Output: data/interim/full_dataset.csv
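
Because the join is a left join on offers, any offer whose Id has no matching lead will carry nulls in the lead columns. A quick way to audit the match rate (not part of the script, shown here on toy stand-in frames) is pandas' `indicator=True` option:

```python
import pandas as pd

# Toy stand-ins for offers_data and leads_data_cleaned.
offers_data = pd.DataFrame({"Id": [1, 2, 3], "Price": [100, 200, 300]})
leads_data_cleaned = pd.DataFrame({"Id": [1, 2], "Source": ["web", "ads"]})

# indicator=True adds a '_merge' column marking each row's origin.
audit = pd.merge(offers_data, leads_data_cleaned, on="Id",
                 how="left", indicator=True)
unmatched = (audit["_merge"] == "left_only").sum()
print(f"{unmatched} offer(s) without a matching lead")  # 1 offer here
```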

4. Feature Engineering

Remove High-Missing Columns: Drops columns with >80% missing values and the Id column:
  • Id - Identifier not needed for training
  • Discarded/Nurturing Reason - >80% null values
  • Acquisition Campaign - >80% null values
full_dataset_preprocessed = full_dataset.drop(
    ['Id', 'Discarded/Nurturing Reason', 'Acquisition Campaign'], 
    axis=1
)
Date Parsing: Converts date strings to datetime objects:
full_dataset_preprocessed['Created Date'] = pd.to_datetime(
    full_dataset_preprocessed['Created Date'], 
    format="%Y-%m-%d"
)
full_dataset_preprocessed['Close Date'] = pd.to_datetime(
    full_dataset_preprocessed['Close Date'], 
    format="%Y-%m-%d"
)
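
Note that with a strict `format=`, a single malformed date string will raise and abort the run. If your raw exports may contain bad dates, a defensive variant (an adaptation, not what the script does) is to pass `errors="coerce"`, which turns unparseable values into NaT so they can be imputed later:

```python
import pandas as pd

dates = pd.Series(["2023-01-15", "not a date", "2023-06-30"])

# errors="coerce" yields NaT for unparseable strings instead of raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().sum())  # 1
```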

5. Missing Value Imputation

Loss Reason Column: Context-aware filling based on Status:
import numpy as np

# For 'Closed Lost' with null Loss Reason, fill with 'no response'
full_dataset_preprocessed['Loss Reason'] = np.where(
    (full_dataset_preprocessed['Status'] == 'Closed Lost') & 
    (full_dataset_preprocessed['Loss Reason'].isnull()), 
    'no response', 
    full_dataset_preprocessed['Loss Reason']
)

# Fill all remaining null Loss Reasons (Closed Won and other statuses)
# with the most frequent Loss Reason among 'Closed Won' rows
mode_closed_won = full_dataset_preprocessed.loc[
    full_dataset_preprocessed['Status'] == 'Closed Won', 
    'Loss Reason'
].mode()[0]
full_dataset_preprocessed['Loss Reason'] = full_dataset_preprocessed['Loss Reason'].fillna(mode_closed_won)
General Imputation: Fills remaining missing values by data type:
for col in full_dataset_preprocessed.columns:
    if full_dataset_preprocessed[col].dtype in ['object', 'datetime64[ns]']:
        # Fill categorical/datetime with mode
        full_dataset_preprocessed[col] = full_dataset_preprocessed[col].fillna(
            full_dataset_preprocessed[col].mode()[0]
        )
    elif full_dataset_preprocessed[col].dtype in ['int64', 'float64', 'int32', 'float32']:
        # Fill numeric with mean
        full_dataset_preprocessed[col] = full_dataset_preprocessed[col].fillna(
            full_dataset_preprocessed[col].mean()
        )
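
The mode/mean imputation above can be verified on a small toy frame; the sanity check below is illustrative and not part of the script:

```python
import pandas as pd

# Toy frame with one categorical gap and one numeric gap.
df = pd.DataFrame({"Source": ["web", None, "web"],
                   "Price": [10.0, None, 30.0]})

for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])  # mode for categoricals
    else:
        df[col] = df[col].fillna(df[col].mean())     # mean for numerics

assert df.isnull().sum().sum() == 0  # imputation left no gaps
print(df["Price"].tolist())  # [10.0, 20.0, 30.0]
```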

6. Date Feature Extraction

Extracts year and month from date columns:
full_dataset_preprocessed['Created Year'] = full_dataset_preprocessed['Created Date'].dt.year
full_dataset_preprocessed['Created Month'] = full_dataset_preprocessed['Created Date'].dt.month
full_dataset_preprocessed['Close Year'] = full_dataset_preprocessed['Close Date'].dt.year
full_dataset_preprocessed['Close Month'] = full_dataset_preprocessed['Close Date'].dt.month

# Drop original date columns
full_dataset_preprocessed = full_dataset_preprocessed.drop(
    ['Created Date', 'Close Date'], 
    axis=1
)

7. Target Class Mapping

Simplifies the Status column to three classes:
clase_mapping = {'Closed Won': 'Closed Won', 'Closed Lost': 'Closed Lost'}
full_dataset_preprocessed['Status'] = full_dataset_preprocessed['Status'].map(
    clase_mapping
).fillna('Other')  # All other statuses mapped to 'Other'

8. Label Encoding

Encodes all categorical columns to numeric values:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Get all categorical columns
categorical_columns = full_dataset_preprocessed.select_dtypes(
    ['object', 'datetime64[ns]']
).columns

# Apply LabelEncoder to each categorical column
for column in categorical_columns:
    if column in full_dataset_preprocessed.columns:
        full_dataset_preprocessed[column] = label_encoder.fit_transform(
            full_dataset_preprocessed[column]
        )
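
One caveat: reusing a single LabelEncoder overwrites its fitted classes on each column, so the original labels cannot be recovered afterwards. If you need to invert the encoding, a common variant (an adaptation, not what the script does) keeps one fitted encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Source": ["web", "ads", "web"],
                   "City": ["Madrid", "Paris", "Paris"]})

# Keep one fitted encoder per column so encodings can be inverted later.
encoders = {}
for column in df.select_dtypes(["object"]).columns:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

# Recover the original labels for 'Source'.
print(list(encoders["Source"].inverse_transform(df["Source"])))
# ['web', 'ads', 'web']
```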

9. Final Output

Output: data/processed/full_dataset.csv

The final processed dataset contains:
  • All numeric features (encoded categorical variables)
  • Extracted temporal features (Created Year/Month, Close Year/Month)
  • Cleaned and imputed values
  • Simplified target variable (Status)

Output Files

data/interim/leads_data_cleaned.csv
CSV
Cleaned leads data with null Id rows removed and redundant columns dropped
data/interim/full_dataset.csv
CSV
Merged dataset combining offers and leads data with original values preserved
data/processed/full_dataset.csv
CSV
Fully processed dataset ready for model training:
  • No missing values
  • All categorical features encoded
  • Date features extracted
  • Target variable mapped to 3 classes

Running the Preprocessing

The preprocessing script is designed to be executed directly:
cd ~/workspace/source
python src/data/data_preprocessing.py
Prerequisites:
  • Raw data files must exist in data/raw/ directory
  • Required Python packages: pandas, numpy, scikit-learn, matplotlib, seaborn, missingno
Expected Output:
data/interim/leads_data_cleaned.csv (created)
data/interim/full_dataset.csv (created)
data/processed/full_dataset.csv (created)

Example: Custom Preprocessing

If you need to adapt the preprocessing for your own data:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Load your data
leads = pd.read_csv("data/raw/leads.csv")
offers = pd.read_csv("data/raw/offers.csv")

# Clean leads
leads_clean = leads.dropna(subset=['Id'])
leads_clean = leads_clean.drop(['First Name', 'Use Case', 'Created Date', 'Status', 'Converted'], axis=1)

# Merge datasets
full_data = pd.merge(offers, leads_clean, on='Id', how='left')

# Drop high-missing and irrelevant columns
full_data = full_data.drop(['Id', 'Discarded/Nurturing Reason', 'Acquisition Campaign'], axis=1)

# Convert dates
full_data['Created Date'] = pd.to_datetime(full_data['Created Date'], format="%Y-%m-%d")
full_data['Close Date'] = pd.to_datetime(full_data['Close Date'], format="%Y-%m-%d")

# Extract date features
full_data['Created Year'] = full_data['Created Date'].dt.year
full_data['Created Month'] = full_data['Created Date'].dt.month
full_data['Close Year'] = full_data['Close Date'].dt.year
full_data['Close Month'] = full_data['Close Date'].dt.month
full_data = full_data.drop(['Created Date', 'Close Date'], axis=1)

# Map target classes
clase_mapping = {'Closed Won': 'Closed Won', 'Closed Lost': 'Closed Lost'}
full_data['Status'] = full_data['Status'].map(clase_mapping).fillna('Other')

# Encode categorical variables
le = LabelEncoder()
for col in full_data.select_dtypes(['object']).columns:
    full_data[col] = le.fit_transform(full_data[col])

# Save processed data
full_data.to_csv("data/processed/custom_dataset.csv", index=False)
print(f"Processed dataset shape: {full_data.shape}")
