Overview
Data cleaning fixes systematic problems and errors in raw data. This process ensures the dataset is reliable and ready for feature engineering and model training.
Missing Value Handling
Loss Reason Column
The Loss Reason column requires special handling based on the Status value:
Handle 'Closed Lost' status
For offers with Status = “Closed Lost” and a null Loss Reason, fill with “no response”:
import numpy as np

# Fill null values in 'Loss Reason'
# If 'Status' is 'Closed Lost', fill in 'no response'
full_dataset_preprocessed['Loss Reason'] = np.where(
(full_dataset_preprocessed['Status'] == 'Closed Lost') &
(full_dataset_preprocessed['Loss Reason'].isnull()),
'no response',
full_dataset_preprocessed['Loss Reason']
)
This assumes that lost deals without a recorded reason were likely due to no response from the prospect.
Handle 'Closed Won' status
For offers with Status = “Closed Won” and a null Loss Reason, fill with the mode of Loss Reason among won deals:
# If 'Status' is 'Closed Won', fill with 'Loss Reason' mode for 'Closed Won'
mode_closed_won = full_dataset_preprocessed.loc[
full_dataset_preprocessed['Status'] == 'Closed Won',
'Loss Reason'
].mode()[0]
mask_won = (
    (full_dataset_preprocessed['Status'] == 'Closed Won') &
    (full_dataset_preprocessed['Loss Reason'].isnull())
)
full_dataset_preprocessed.loc[mask_won, 'Loss Reason'] = mode_closed_won
Using the mode (most frequent value) keeps the imputed entries plausible for won deals, although heavy mode imputation over-represents the most common category rather than preserving the full distribution.
General Imputation Strategy
After handling Loss Reason, all remaining missing values are imputed based on data type:
for col in full_dataset_preprocessed.columns:
if full_dataset_preprocessed[col].dtype in ['object', 'datetime64[ns]']:
# Categorical and datetime columns: fill with mode
full_dataset_preprocessed[col] = full_dataset_preprocessed[col].fillna(
full_dataset_preprocessed[col].mode()[0]
)
elif full_dataset_preprocessed[col].dtype in ['int64', 'float64', 'int32', 'float32']:
# Numerical columns: fill with mean
full_dataset_preprocessed[col] = full_dataset_preprocessed[col].fillna(
full_dataset_preprocessed[col].mean()
)
Strategy:
- Categorical/Datetime columns: Fill with mode (most frequent value)
- Numerical columns: Fill with mean (average value)
This imputation strategy assumes values are missing at random. If missingness is systematic (e.g., missing prices indicate special cases), mean and mode filling may introduce bias.
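One quick way to probe whether missingness is systematic is to compare the missing rate of a column across groups. The sketch below uses made-up data; the column names only mirror the dataset for illustration:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
df = pd.DataFrame({
    'Status': ['Closed Lost', 'Closed Lost', 'Closed Won', 'Closed Won'],
    'Price': [None, None, 100.0, 120.0],
})

# Missing rate of Price within each Status group. If the rates differ
# sharply, missingness is systematic and mean imputation may bias the data.
missing_rate = df['Price'].isnull().groupby(df['Status']).mean()
```

If the rates are roughly equal across groups, simple global imputation is more defensible.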
Data Cleaning Class
The project includes a comprehensive DataCleaning class in src/data/data_cleaning.py with additional cleaning methods:
Available Methods
Inspection Methods
data_cleaning = DataCleaning(main_path="data/raw")
# Load dataset
df = data_cleaning.load_dataset(dataset_path="leads.csv")
# Inspect dataset structure and missing values
data_cleaning.inspect_dataset(df)
The inspect_dataset method provides:
- Dataset shape and overview
- Data types distribution
- Null value counts and proportions
- Visual representation of missing data
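The method's actual internals live in src/data/data_cleaning.py; a minimal sketch of the same idea (hypothetical, not the project's code) could look like:

```python
import pandas as pd

def inspect_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Simplified sketch of a dataset inspection helper
    (not the actual implementation in src/data/data_cleaning.py)."""
    print(f"Shape: {df.shape}")
    print(df.dtypes.value_counts())
    # Null counts and proportions per column
    return pd.DataFrame({
        'nulls': df.isnull().sum(),
        'null_pct': df.isnull().mean().round(3),
    })

df = pd.DataFrame({'Price': [100.0, None, 120.0], 'City': ['A', 'B', None]})
report = inspect_dataset(df)
```

Returning the null summary as a DataFrame makes it easy to sort columns by missingness before deciding on an imputation strategy.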
Handling Single-Value Columns
Columns with only one unique value don’t provide information for modeling:
# Identify columns with single values
counts = data_cleaning.identify_columns_with_single_value(df)
# Remove zero-variance predictors
df_cleaned = data_cleaning.delete_columns_with_single_value(df, counts)
Zero-variance predictors (columns with a single value) are useless for modeling and should be removed.
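The same check can be sketched with plain pandas, assuming a small toy frame:

```python
import pandas as pd

# Toy frame for illustration.
df = pd.DataFrame({
    'constant': ['same', 'same', 'same'],
    'varies': [1, 2, 3],
})

# Columns with exactly one unique value carry no signal for a model.
single_value_cols = [
    col for col in df.columns if df[col].nunique(dropna=False) == 1
]
df_cleaned = df.drop(columns=single_value_cols)
```

Passing dropna=False to nunique treats NaN as a value, so a column that is all NaN also counts as single-valued.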
Handling Duplicate Rows
Duplicate data can mislead model evaluation:
# Identify duplicate rows
data_cleaning.identify_row_with_duplicate_data(df)
# Remove duplicates
df_cleaned = data_cleaning.delete_row_with_duplicate_data(df)
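The equivalent checks in plain pandas, on a toy frame:

```python
import pandas as pd

# Toy frame with one exact duplicate row.
df = pd.DataFrame({'Id': [1, 2, 2], 'City': ['A', 'B', 'B']})

# Count rows that exactly repeat an earlier row, then drop them.
n_duplicates = int(df.duplicated().sum())
df_cleaned = df.drop_duplicates().reset_index(drop=True)
```

Resetting the index after dropping rows avoids gaps that can surprise later positional operations.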
Dropping Specific Columns
Remove irrelevant or problematic columns:
df_cleaned = data_cleaning.drop_columns(
df,
columns_to_drop=['Id', 'First Name', 'Acquisition Campaign']
)
Flexible Missing Value Handling
The class provides multiple imputation strategies:
# Drop rows with missing values
df_cleaned = data_cleaning.handle_missing_values(
df,
columns=['Id'],
method='drop'
)
# Fill with mean (numerical columns)
df_cleaned = data_cleaning.handle_missing_values(
df,
columns=['Price'],
method='mean'
)
# Fill with median (numerical columns)
df_cleaned = data_cleaning.handle_missing_values(
df,
columns=['Price'],
method='median'
)
# Fill with mode (categorical columns)
df_cleaned = data_cleaning.handle_missing_values(
df,
columns=['City'],
method='mode'
)
Group-Based Imputation
Impute missing values based on group statistics:
# Impute using mode of each group
df_cleaned = data_cleaning.impute_nulls_by_group_using_mode(
df,
group_columns=['Status', 'Source'],
target_column='Loss Reason'
)
# Impute using median of each group
df_cleaned = data_cleaning.impute_nulls_by_group_using_median(
df,
group_columns=['Status', 'Source'],
target_column='Price'
)
Group-based imputation is more sophisticated and preserves relationships between features better than global imputation.
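A plain-pandas sketch of the group-median idea, on made-up data (the class methods above may differ in detail):

```python
import pandas as pd

# Toy frame; Price is missing in both groups.
df = pd.DataFrame({
    'Status': ['Won', 'Won', 'Won', 'Lost', 'Lost'],
    'Price': [100.0, None, 120.0, 50.0, None],
})

# Fill each missing Price with the median of its Status group.
df['Price'] = df['Price'].fillna(
    df.groupby('Status')['Price'].transform('median')
)
```

transform returns a Series aligned with the original index, so fillna can use it directly as a row-wise lookup of each group's median.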
Cleaned Data Output
The preprocessing pipeline produces:
- data/interim/leads_data_cleaned.csv - Cleaned leads data
- data/interim/full_dataset.csv - Merged but not fully preprocessed
- data/processed/full_dataset.csv - Final cleaned and preprocessed dataset
Summary
The data cleaning pipeline:
- Removes rows with null values in critical columns (Id)
- Drops columns with excessive missing data (>80% null)
- Applies conditional imputation to Loss Reason based on Status
- Fills categorical columns with mode
- Fills numerical columns with mean
- Removes duplicate rows (if any)
- Eliminates zero-variance predictors
This ensures a clean, complete dataset ready for feature engineering.
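The steps above can be sketched end to end as a single function. This is a hypothetical outline of the order of operations, not the project's actual pipeline (which lives in src/data/data_cleaning.py):

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the cleaning steps in order."""
    df = df.dropna(subset=['Id'])                    # nulls in critical column
    df = df.loc[:, df.isnull().mean() <= 0.8]        # drop >80%-null columns
    df = df.drop_duplicates()                        # drop duplicate rows
    df = df.loc[:, df.nunique(dropna=False) > 1].copy()  # zero-variance columns
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].fillna(df[col].mode()[0])   # categorical: mode
        elif pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())      # numerical: mean
    return df

# Toy demonstration frame (invented values).
raw = pd.DataFrame({
    'Id': [1, 2, 2, None],
    'const': ['x', 'x', 'x', 'x'],
    'Price': [10.0, None, None, 30.0],
    'City': ['A', None, None, 'B'],
})
cleaned = clean(raw)
```

Note that the conditional Loss Reason step from earlier in this page would slot in before the generic mode/mean loop.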