The preprocessing pipeline transforms raw sales data into a clean, model-ready dataset through a series of systematic transformations. The complete process is implemented in src/data/data_preprocessing.py.
1. **Data Loading**: Load raw CSV files from the data/raw/ directory
2. **Data Cleaning**: Remove null values, drop irrelevant columns
3. **Data Fusion**: Merge the leads and offers datasets
4. **Feature Engineering**: Extract temporal features, handle missing values
5. **Target Mapping**: Consolidate minority classes
6. **Encoding**: Transform categorical variables to numerical format
7. **Output**: Save the processed dataset to data/processed/full_dataset.csv
The leads dataset undergoes initial cleaning to ensure data quality and remove redundancy:
```python
# Delete rows with null values in the 'Id' column
leads_data_cleaned = leads_data.dropna(subset=['Id'])

# Drop multiple columns
leads_data_cleaned = leads_data_cleaned.drop([
    'First Name',    # Irrelevant PII
    'Use Case',      # Duplicates offers.csv field
    'Created Date',  # Duplicates offers.csv field
    'Status',        # Refers to offers.csv Status
    'Converted'      # Less granular than offers.csv Status
], axis=1)
```
The cleaned leads dataset is saved to data/interim/leads_data_cleaned.csv for inspection and debugging.
Rationale for dropped columns:

- **First Name**: Personally identifiable information (PII) with no predictive value. Individual names do not generalize to predict conversion patterns.
- **Use Case / Created Date**: Duplicate information. These fields exist in both datasets; the offers.csv versions are kept because they represent offer-stage data, which is closer to the conversion decision.
- **Status / Converted**: Target variable redundancy. The Status field from offers.csv provides more detailed outcome information (Closed Won, Closed Lost, Other) than the binary Converted field.
The cleaned datasets are merged using a left join:
```python
# Merge the datasets using the 'Id' column as the key
full_dataset = pd.merge(offers_data, leads_data_cleaned, on='Id', how='left')
full_dataset.to_csv("data/interim/full_dataset.csv", index=False)
```
Left join strategy: All records from offers_data are retained. This ensures every offer is included in the model, even if lead information is missing (handled through imputation).
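The effect of the left join can be checked with pandas' `indicator` option, which flags rows that found no matching lead. The miniature `offers_data` and `leads_data_cleaned` frames below are hypothetical stand-ins for the real files, used only to illustrate the behavior:

```python
import pandas as pd

# Toy stand-ins for the real datasets (hypothetical values)
offers_data = pd.DataFrame({'Id': [1, 2, 3], 'Price': [100, 200, 300]})
leads_data_cleaned = pd.DataFrame({'Id': [1, 3], 'Source': ['web', 'ads']})

# indicator=True adds a '_merge' column flagging rows with no lead match
merged = pd.merge(offers_data, leads_data_cleaned,
                  on='Id', how='left', indicator=True)

# Offers without lead information; their lead fields are NaN
# and are later handled by imputation
unmatched = (merged['_merge'] == 'left_only').sum()
```

All three offers survive the join; the one without a matching lead simply carries NaN in the lead columns.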
```python
# Drop columns with excessive missing data or no predictive value
full_dataset_preprocessed = full_dataset.drop([
    'Id',                          # Not predictive
    'Discarded/Nurturing Reason',  # >80% null values
    'Acquisition Campaign'         # >80% null values
], axis=1)
```
Columns with more than 80% null values are dropped: with so little observed data, they cannot provide reliable information for modeling.
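The 80% threshold can also be applied programmatically rather than by hard-coding column names. The sketch below assumes a toy frame in which a hypothetical `Acquisition Campaign` column is 90% null:

```python
import pandas as pd
import numpy as np

# Hypothetical frame: 'Acquisition Campaign' is 90% null
df = pd.DataFrame({
    'Price': range(10),
    'Acquisition Campaign': ['email'] + [np.nan] * 9,
})

# Fraction of nulls per column; drop anything above the 0.8 threshold
null_fraction = df.isnull().mean()
to_drop = null_fraction[null_fraction > 0.8].index.tolist()
df_reduced = df.drop(columns=to_drop)
```

This keeps the threshold in one place and automatically picks up any future column that crosses it.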
Special logic handles the Loss Reason field based on Status:
```python
# If 'Status' is 'Closed Lost' and Loss Reason is null, fill with 'no response'
full_dataset_preprocessed['Loss Reason'] = np.where(
    (full_dataset_preprocessed['Status'] == 'Closed Lost') &
    (full_dataset_preprocessed['Loss Reason'].isnull()),
    'no response',
    full_dataset_preprocessed['Loss Reason']
)

# Fill remaining nulls ('Closed Won' and other statuses) with the mode
# among 'Closed Won' cases
mode_closed_won = full_dataset_preprocessed.loc[
    full_dataset_preprocessed['Status'] == 'Closed Won', 'Loss Reason'
].mode()[0]
full_dataset_preprocessed['Loss Reason'] = (
    full_dataset_preprocessed['Loss Reason'].fillna(mode_closed_won)
)
```
Why conditional imputation?
Loss Reason has different semantics based on the outcome:
Closed Lost: Missing Loss Reason likely means “no response” from the client
Closed Won: Use the most common Loss Reason value from other won cases (though this is somewhat paradoxical, it maintains data completeness)
This context-aware imputation preserves the relationship between Status and Loss Reason.
Remaining missing values are handled automatically by data type:
```python
for col in full_dataset_preprocessed.columns:
    # Categorical and datetime columns → mode
    if full_dataset_preprocessed[col].dtype in ['object', 'datetime64[ns]']:
        full_dataset_preprocessed[col] = full_dataset_preprocessed[col].fillna(
            full_dataset_preprocessed[col].mode()[0]
        )
    # Numerical columns → mean
    elif full_dataset_preprocessed[col].dtype in ['int64', 'float64', 'int32', 'float32']:
        full_dataset_preprocessed[col] = full_dataset_preprocessed[col].fillna(
            full_dataset_preprocessed[col].mean()
        )
```
- **Categorical features**: imputed with the mode (most frequent value), preserving the distribution of categorical variables.
- **Numerical features**: imputed with the mean, maintaining the central tendency of continuous variables.
To address class imbalance, minority classes in the Status field are consolidated:
```python
# Define mapping for the main classes
clase_mapping = {
    'Closed Won': 'Closed Won',
    'Closed Lost': 'Closed Lost'
}

# Assign 'Other' to all classes not in the mapping
full_dataset_preprocessed['Status'] = full_dataset_preprocessed['Status'].map(
    clase_mapping
).fillna('Other')
```
Result: Three-class target variable:
Closed Won - Successful conversions
Closed Lost - Failed conversions
Other - All minority status categories
This transformation addresses class imbalance while preserving the critical distinction between won and lost opportunities.
All categorical variables are transformed to numerical format using Label Encoding:
```python
from sklearn.preprocessing import LabelEncoder

# Create encoder instance
label_encoder = LabelEncoder()

# Select categorical columns
categorical_columns = full_dataset_preprocessed.select_dtypes(
    ['object', 'datetime64[ns]']
).columns
categorical_columns = list(set(categorical_columns))

# Apply LabelEncoder to each categorical column
for column in categorical_columns:
    if column in full_dataset_preprocessed.columns:
        full_dataset_preprocessed[column] = label_encoder.fit_transform(
            full_dataset_preprocessed[column]
        )
```
Encoded columns:
Source
City
Use Case
Pain
Loss Reason
Status (target variable)
Discount code (treated as categorical)
Important: The same `LabelEncoder` instance is refit for each column via `fit_transform()`, so the encoding of each column is independent of the others, which is suitable for tree-based models like Gradient Boosting. For production deployment, a separate fitted encoder should be kept per column, saved, and reused to ensure consistent encoding of new data.
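One way to make the encoding reusable at inference time is to keep one fitted encoder per column in a dictionary and persist it with `joblib` (which ships alongside scikit-learn). The `Source`/`City` toy frame, the file name, and the values below are hypothetical:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data standing in for the categorical columns
df = pd.DataFrame({'Source': ['web', 'ads', 'web'], 'City': ['NY', 'LA', 'NY']})

# Fit one encoder per column and keep them in a dict
encoders = {}
for column in ['Source', 'City']:
    enc = LabelEncoder()
    df[column] = enc.fit_transform(df[column])
    encoders[column] = enc

# Persist the fitted encoders for reuse at inference time
path = os.path.join(tempfile.gettempdir(), 'encoders.joblib')
joblib.dump(encoders, path)

# At inference, reload and call transform() (never fit_transform) so
# new data is mapped with exactly the same categories as training
loaded = joblib.load(path)
new_value = loaded['Source'].transform(['web'])
```

Calling `transform()` on an unseen category raises an error, which is usually preferable to silently inventing a new code.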
The preprocessed data feeds into the model training pipeline:
```python
# From src/models/train_model.py
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load processed dataset
data = pd.read_csv("data/processed/full_dataset.csv")

# Split features and target
class_label = 'Status'
X = data.drop([class_label], axis=1)
y = data[class_label]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, shuffle=True, test_size=0.2
)

# Additional scaling in the pipeline
ct = ColumnTransformer([
    ('se', StandardScaler(), ['Price', 'Discount code'])
], remainder='passthrough')

pipeline = Pipeline([
    ('transformer', ct),
    ('model', GradientBoostingClassifier(random_state=42))
])
```
Additional preprocessing at training time: Numerical features (Price, Discount code) are scaled using StandardScaler within the model pipeline. This ensures proper scaling is applied during both training and inference.
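Because the scaler lives inside the pipeline, a single fitted object handles both phases: `fit()` learns the scaling statistics from the training data, and `predict()` reapplies them automatically to new rows. The sketch below demonstrates this on a tiny hypothetical frame (the column names follow the document; the values are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the processed dataset (hypothetical values)
X = pd.DataFrame({
    'Price': [100, 200, 300, 400],
    'Discount code': [0, 1, 0, 1],
    'Source': [2, 0, 1, 2],
})
y = pd.Series([0, 1, 0, 1])

ct = ColumnTransformer([
    ('se', StandardScaler(), ['Price', 'Discount code'])
], remainder='passthrough')

pipeline = Pipeline([
    ('transformer', ct),
    ('model', GradientBoostingClassifier(random_state=42))
])

# Fitting learns the scaling statistics together with the model
pipeline.fit(X, y)

# At inference the same fitted scaler is applied before prediction,
# so callers never scale inputs by hand
preds = pipeline.predict(X.head(2))
```

Keeping preprocessing inside the pipeline removes a whole class of train/serve skew bugs, since there is no separate scaling step to forget or misconfigure.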
The src/data/data_cleaning.py module provides a reusable DataCleaning class with methods for common preprocessing tasks:
load_dataset()
```python
def load_dataset(self, dataset_path):
    """Load a dataset and return it as a DataFrame."""
    dataframe = pd.read_csv(f"{self.main_path}/{dataset_path}")
    return dataframe
```
Loads CSV files from specified paths.
inspect_dataset()
```python
def inspect_dataset(self, dataframe):
    """Inspect information from dataset"""
    print(dataframe.shape)
    print(dataframe.head())
    dataframe.info()
    print(dataframe.isnull().sum())
```
Comprehensive dataset inspection including shape, preview, types, and null counts.
handle_missing_values()
```python
def handle_missing_values(self, dataframe, columns, method='drop'):
    """Handle null values in a DataFrame.

    Parameters:
        method: 'drop', 'mean', 'median', or 'mode'
    """
```
Flexible missing value handling with multiple strategies.
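The method body is not shown in this document, but an implementation consistent with the docstring might look like the standalone sketch below (a hypothetical reconstruction, not the project's actual code; `self` is dropped for brevity):

```python
import pandas as pd

def handle_missing_values(dataframe, columns, method='drop'):
    """Sketch of the four strategies named in the docstring."""
    if method == 'drop':
        # Remove rows with nulls in any of the given columns
        return dataframe.dropna(subset=columns)
    for col in columns:
        if method == 'mean':
            fill = dataframe[col].mean()
        elif method == 'median':
            fill = dataframe[col].median()
        elif method == 'mode':
            fill = dataframe[col].mode()[0]
        else:
            raise ValueError(f"Unknown method: {method}")
        dataframe[col] = dataframe[col].fillna(fill)
    return dataframe

# Example usage on a toy column with one missing value
df = pd.DataFrame({'Price': [100.0, None, 300.0]})
filled = handle_missing_values(df.copy(), ['Price'], method='mean')
dropped = handle_missing_values(df.copy(), ['Price'], method='drop')
```

Raising on an unknown `method` keeps typos from silently leaving nulls in place.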
drop_columns()
```python
def drop_columns(self, dataframe, columns_to_drop):
    """Remove specific columns from a DataFrame."""
    return dataframe.drop(columns=columns_to_drop, errors='ignore')
```