
Overview

Feature engineering transforms raw data into meaningful features that improve model performance. This process includes date extraction, categorical encoding, and feature scaling.

Date Feature Extraction

Converting to DateTime

First, date columns are converted from strings to datetime objects:
full_dataset_preprocessed['Created Date'] = pd.to_datetime(
    full_dataset_preprocessed['Created Date'], 
    format="%Y-%m-%d"
)
full_dataset_preprocessed['Close Date'] = pd.to_datetime(
    full_dataset_preprocessed['Close Date'], 
    format="%Y-%m-%d"
)
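As a self-contained sketch of what this conversion does (the column names match the snippet above; the values are hypothetical):

```python
import pandas as pd

# Toy frame using the same "%Y-%m-%d" string format assumed above
df = pd.DataFrame({
    "Created Date": ["2021-01-15", "2021-03-02"],
    "Close Date": ["2021-02-01", "2021-04-10"],
})

for col in ["Created Date", "Close Date"]:
    # An explicit format fails fast on malformed dates instead of guessing
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")

print(df.dtypes)  # both columns become datetime64[ns]
```

Passing an explicit `format` is also faster than letting pandas infer it per value.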

Extracting Year and Month

Temporal features are extracted from the datetime columns:
full_dataset_preprocessed['Created Year'] = full_dataset_preprocessed['Created Date'].dt.year
full_dataset_preprocessed['Created Month'] = full_dataset_preprocessed['Created Date'].dt.month
full_dataset_preprocessed['Close Year'] = full_dataset_preprocessed['Close Date'].dt.year
full_dataset_preprocessed['Close Month'] = full_dataset_preprocessed['Close Date'].dt.month
This creates four new features:
  • Created Year: Year when the offer was created
  • Created Month: Month when the offer was created
  • Close Year: Year when the offer was closed
  • Close Month: Month when the offer was closed
Extracting year and month as separate features allows the model to capture seasonal patterns and temporal trends in lead conversion.
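The `.dt` accessor used above can be checked on a small example (values hypothetical):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2021-01-15", "2022-11-30"]))

years = dates.dt.year    # integer year component
months = dates.dt.month  # integer month component, 1-12

print(years.tolist(), months.tolist())  # [2021, 2022] [1, 11]
```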

Dropping Original Date Columns

After extraction, the original datetime columns are removed:
full_dataset_preprocessed = full_dataset_preprocessed.drop(
    ['Created Date', 'Close Date'], 
    axis=1
)

Target Variable Mapping

Status Column Transformation

The Status column contains the target variable. Minority classes are grouped to address class imbalance:
clase_mapping = {'Closed Won': 'Closed Won', 'Closed Lost': 'Closed Lost'}
# Assign 'Other' to all classes that are not 'Closed Won' or 'Closed Lost'
full_dataset_preprocessed['Status'] = full_dataset_preprocessed['Status'].map(clase_mapping).fillna('Other')
This creates three target classes:
  • Closed Won: Successfully converted leads
  • Closed Lost: Lost opportunities
  • Other: All other statuses (e.g., In Progress, Nurturing)
Grouping minority classes into “Other” helps prevent overfitting on rare categories and improves model generalization.
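The `map` plus `fillna` pattern above can be sketched on a small Series (the non-closed status values are hypothetical examples):

```python
import pandas as pd

status = pd.Series(["Closed Won", "In Progress", "Closed Lost", "Nurturing"])

# Only the two closed classes are kept; everything else maps to NaN...
mapping = {"Closed Won": "Closed Won", "Closed Lost": "Closed Lost"}

# ...and the NaNs then collapse into the catch-all 'Other' class
grouped = status.map(mapping).fillna("Other")

print(grouped.tolist())  # ['Closed Won', 'Other', 'Closed Lost', 'Other']
```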

Label Encoding

Why Label Encoding?

Machine learning models require numerical inputs. Label Encoding converts each categorical variable into an integer representation. The implied ordering between integers is generally harmless for tree-based models such as gradient boosting, though it can mislead linear models.

Implementation

1. Create LabelEncoder instance

from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder
label_encoder = LabelEncoder()

2. Identify categorical columns

Select all columns with object or datetime data types:
# List of categorical columns to encode
categorical_columns = full_dataset_preprocessed.select_dtypes(
    ['object', 'datetime64[ns]']
).columns
categorical_columns = list(set(categorical_columns))

3. Apply encoding

Transform each categorical column into numerical values:
# Apply LabelEncoder to each categorical column
for column in categorical_columns:
    if column in full_dataset_preprocessed.columns:
        full_dataset_preprocessed[column] = label_encoder.fit_transform(
            full_dataset_preprocessed[column]
        )
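Note that refitting a single LabelEncoder per column (as above) discards the previous column's mapping, so the integer codes cannot later be decoded back to labels. A common variant, shown here as a sketch rather than the pipeline's actual code, keeps one fitted encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns and values
df = pd.DataFrame({"Source": ["Inbound", "Outbound", "Inbound"],
                   "City": ["Lima", "Bogota", "Lima"]})

encoders = {}  # column name -> its fitted LabelEncoder
for column in df.columns:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

# Integer codes can now be mapped back to the original labels
restored = encoders["Source"].inverse_transform(df["Source"])
print(restored.tolist())  # ['Inbound', 'Outbound', 'Inbound']
```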

Encoded Features

The following categorical features are encoded:
  • Source (Inbound, Outbound, etc.)
  • City (Various cities)
  • Loss Reason (Reasons for lost opportunities)
  • Pain (Customer pain level)
  • Discount code (Applied discount codes)
  • Status (Closed Won, Closed Lost, Other)
  • Use Case (Type of use case)
Each unique category is assigned a unique integer. For example, if Source has values [“Inbound”, “Outbound”, “Referral”], they might be encoded as [0, 1, 2].
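This assignment can be verified directly; LabelEncoder assigns integers in sorted (here alphabetical) order of the unique values, regardless of the order they appear in the data:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Outbound", "Inbound", "Referral", "Inbound"])

print(list(le.classes_))   # sorted unique categories
print(codes.tolist())      # [1, 0, 2, 0]
```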

Data Scaling

After encoding, numerical features are scaled using StandardScaler to normalize their ranges:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Scale the feature columns (assumed here: everything except the encoded target)
feature_columns = full_dataset_preprocessed.columns.drop('Status')
full_dataset_preprocessed[feature_columns] = scaler.fit_transform(
    full_dataset_preprocessed[feature_columns]
)
StandardScaler transforms each feature to zero mean and unit variance. Tree-based models such as gradient boosting are largely insensitive to feature scale, but scaling keeps the pipeline consistent and benefits any scale-sensitive models evaluated alongside them.
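The zero-mean, unit-variance property is easy to confirm on a toy matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical values)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```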

Final Output

The fully preprocessed dataset is saved for model training:
full_dataset_preprocessed.to_csv("data/processed/full_dataset.csv", index=False)

Summary

The feature engineering pipeline:
  1. Extracts temporal features (year, month) from dates
  2. Maps the target variable to three consolidated classes (Closed Won, Closed Lost, Other)
  3. Encodes all categorical variables using LabelEncoder
  4. Scales numerical features using StandardScaler
  5. Outputs a clean, model-ready dataset
This processed dataset is now ready for training the Gradient Boosting classifier.
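The whole pipeline can be sketched end-to-end on a toy frame (column names and values are hypothetical, and the scaling step follows the assumption above of scaling everything except the target):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "Created Date": ["2021-01-15", "2021-06-20"],
    "Close Date": ["2021-02-01", "2021-07-05"],
    "Source": ["Inbound", "Outbound"],
    "Status": ["Closed Won", "In Progress"],
})

# 1. Extract temporal features, then drop the raw dates
for col in ["Created Date", "Close Date"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d")
    df[col.replace("Date", "Year")] = df[col].dt.year
    df[col.replace("Date", "Month")] = df[col].dt.month
df = df.drop(["Created Date", "Close Date"], axis=1)

# 2. Group the target into three classes
df["Status"] = df["Status"].map(
    {"Closed Won": "Closed Won", "Closed Lost": "Closed Lost"}
).fillna("Other")

# 3. Label-encode the remaining object columns
for col in df.select_dtypes("object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# 4. Scale features (everything except the target)
features = df.columns.drop("Status")
df[features] = StandardScaler().fit_transform(df[features])

print(df.shape)  # (2, 6)
```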
