
Dataset Overview

The lead scoring model uses two primary datasets that represent different stages of the sales process:

leads.csv

Contains data for all potential clients who entered the sales pipeline, regardless of whether they progressed to the offer stage.

offers.csv

Contains data for clients who reached the demo meeting stage and received formal offers.

The datasets are merged using the Id field as the unique identifier, creating a complete view of each lead's journey through the sales process.

leads.csv Schema

This dataset captures the initial contact and qualification phase of potential clients.

Field Definitions

Id

Type: String/Integer
Description: Unique identifier for the lead; used to merge with the offers dataset.
Preprocessing: Rows with null values in the Id column are removed during preprocessing, as they cannot be matched to offers.
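That row filter can be sketched as follows (a minimal sketch with illustrative data; the real leads.csv has more columns):

```python
import pandas as pd

# Illustrative rows; the real leads.csv has more columns
leads = pd.DataFrame({'Id': [101, None, 103], 'City': ['A', 'B', 'C']})

# Drop rows whose Id is null, since they cannot be matched to offers
leads_clean = leads.dropna(subset=['Id'])
# → rows with Id 101 and 103 remain
```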
First Name

Type: String
Description: Lead's first name.
Preprocessing: Dropped during preprocessing; it is personally identifiable information (PII) and not predictive of conversion.
Use Case

Type: Categorical
Description: Type of use case for the potential client (e.g., specific event types or business needs).
Preprocessing: Dropped from the leads dataset, as it duplicates the Use Case field in offers.csv.
Source

Type: Categorical
Description: Lead acquisition source.
Possible Values:
  • Inbound (e.g., website inquiries, content downloads)
  • Outbound (e.g., sales outreach, cold calls)
Preprocessing: Label encoded for model input.
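A binary categorical like this can be label encoded as sketched below (assuming scikit-learn's LabelEncoder; the Series values are illustrative):

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

source = pd.Series(['Inbound', 'Outbound', 'Inbound', 'Outbound'])
le = LabelEncoder()
source_encoded = le.fit_transform(source)
# Classes are sorted alphabetically: 'Inbound' → 0, 'Outbound' → 1
```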
Status

Type: Categorical
Description: Current status of the lead in the qualification pipeline.
Preprocessing: Dropped from the leads dataset; it reflects an intermediate status, while the final Status from offers.csv is used as the target variable.
Type: Categorical
Description: Reason for lead discard or placement in nurturing workflow.
Preprocessing: Dropped - Over 80% null values make this field unsuitable for modeling.
Type: Categorical
Description: Marketing campaign that generated the lead.
Preprocessing: Dropped - Over 80% null values make this field unsuitable for modeling.
Created Date

Type: Date (YYYY-MM-DD)
Description: Lead creation date.
Preprocessing: Dropped from the leads dataset, as it duplicates the Created Date field in offers.csv.
Converted

Type: Binary (0/1)
Description: Target variable indicating whether the lead converted.
  • 1 = Converted
  • 0 = Not converted
Preprocessing: Dropped from the leads dataset, as the Status field from offers.csv provides more granular information.
City

Type: Categorical
Description: Geographic city location of the lead.
Preprocessing: Label encoded. Missing values are imputed with the mode.
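Mode imputation for a categorical column can be sketched as follows (illustrative city values, not real data):

```python
import pandas as pd

city = pd.Series(['Madrid', 'Barcelona', None, 'Madrid', None])
city_imputed = city.fillna(city.mode()[0])
# The most frequent value, 'Madrid', replaces both nulls
```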

offers.csv Schema

This dataset contains detailed information about leads who progressed to the offer stage.

Field Definitions

Id

Type: String/Integer
Description: Unique identifier for the offer, matching the Id field in leads.csv.
Preprocessing: Used to merge the datasets, then dropped, as it is not a predictive feature.
Use Case

Type: Categorical
Description: Type of use case for the offer (e.g., corporate events, weddings, conferences).
Preprocessing: Label encoded for model input.

Status

Type: Categorical
Description: TARGET VARIABLE - Final status of the offer representing the conversion outcome.
Original Values: Multiple status categories exist in the raw data.
Preprocessed Values:

```python
# Class mapping to address imbalance;
# all statuses outside this mapping are grouped as 'Other'
clase_mapping = {
    'Closed Won': 'Closed Won',
    'Closed Lost': 'Closed Lost'
}
```
  • Closed Won - Lead successfully converted to paying customer
  • Closed Lost - Lead did not convert, opportunity lost
  • Other - Minority status categories grouped together
This is the primary target variable for the classification model. The three-class structure addresses class imbalance in the original data.
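The grouping described above can be sketched by applying the mapping with a fallback (the raw status values below are illustrative; real data may contain other categories):

```python
import pandas as pd

# Illustrative raw statuses
status = pd.Series(['Closed Won', 'Negotiation', 'Closed Lost', 'On Hold'])
clase_mapping = {'Closed Won': 'Closed Won', 'Closed Lost': 'Closed Lost'}

# Statuses missing from the mapping become NaN, then fall back to 'Other'
status_grouped = status.map(clase_mapping).fillna('Other')
```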
Created Date

Type: Date (YYYY-MM-DD)
Description: Offer creation date.
Preprocessing: Decomposed into temporal features:

```python
created_date = pd.to_datetime(full_dataset_preprocessed['Created Date'])
full_dataset_preprocessed['Created Year'] = created_date.dt.year
full_dataset_preprocessed['Created Month'] = created_date.dt.month
```

The original date field is dropped after feature extraction.
Close Date

Type: Date (YYYY-MM-DD)
Description: Date when the offer was closed (won or lost).
Preprocessing: Decomposed into temporal features:

```python
close_date = pd.to_datetime(full_dataset_preprocessed['Close Date'])
full_dataset_preprocessed['Close Year'] = close_date.dt.year
full_dataset_preprocessed['Close Month'] = close_date.dt.month
```

The original date field is dropped after feature extraction.
Price

Type: Numerical (Float)
Description: Offer price in the local currency.
Preprocessing:
  • Missing values imputed with the mean
  • Scaled using StandardScaler in the model pipeline

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer([
    ('se', StandardScaler(), ['Price', 'Discount code'])
], remainder='passthrough')
```
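Applying this transformer can be sketched as follows (a minimal sketch with illustrative values; the extra Pain column just demonstrates the passthrough behavior):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Price': [100.0, 200.0, 300.0],
    'Discount code': [0.0, 10.0, 20.0],
    'Pain': [1, 2, 3],  # passed through unscaled by remainder='passthrough'
})
ct = ColumnTransformer([
    ('se', StandardScaler(), ['Price', 'Discount code'])
], remainder='passthrough')
scaled = ct.fit_transform(df)
# Scaled columns come first, followed by the passthrough columns
```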
Discount code

Type: Numerical (Float)
Description: Applied discount code value or percentage.
Preprocessing:
  • Missing values imputed with the mean
  • Scaled using StandardScaler in the model pipeline
  • Also label encoded as a categorical in some preprocessing steps
Pain

Type: Categorical
Description: Customer's pain level or urgency of need.
Possible Values: Various levels indicating the severity or urgency of the customer's problem.
Preprocessing: Label encoded. Missing values imputed with the mode.
Loss Reason

Type: Categorical
Description: Reason for offer loss (for Closed Lost cases).
Preprocessing: Special handling:

```python
import numpy as np

# If Status is 'Closed Lost' and Loss Reason is null, label it 'no response'
full_dataset_preprocessed['Loss Reason'] = np.where(
    (full_dataset_preprocessed['Status'] == 'Closed Lost') &
    (full_dataset_preprocessed['Loss Reason'].isnull()),
    'no response',
    full_dataset_preprocessed['Loss Reason']
)

# Impute any remaining nulls with the mode among Closed Won cases
mode_closed_won = full_dataset_preprocessed.loc[
    full_dataset_preprocessed['Status'] == 'Closed Won',
    'Loss Reason'
].mode()[0]
full_dataset_preprocessed['Loss Reason'] = (
    full_dataset_preprocessed['Loss Reason'].fillna(mode_closed_won)
)
```

Then label encoded.

Merged Dataset Structure

After merging and preprocessing, the final dataset structure is:
Categorical Features (Label Encoded):
  • Use Case
  • Source
  • City
  • Pain
  • Loss Reason
Numerical Features:
  • Price (scaled)
  • Discount code (scaled)
  • Created Year
  • Created Month
  • Close Year
  • Close Month

Data Quality Considerations

Missing Values

Missing values are handled through:
  • Mode imputation for categorical features
  • Mean imputation for numerical features
  • Conditional imputation for Loss Reason based on Status
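The mean-imputation strategy for numerical features can be sketched as follows (illustrative values):

```python
import pandas as pd

price = pd.Series([100.0, None, 300.0])
price_imputed = price.fillna(price.mean())
# The null becomes 200.0, the mean of the observed values
```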

Class Imbalance

The target variable Status originally had multiple classes. Minority classes are grouped into “Other” to address imbalance while preserving the main Closed Won/Closed Lost distinction.

Data Fusion

Left join from offers to leads ensures all offers are retained, with missing lead data handled through imputation.
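The retention behavior of this left join can be sketched as follows (illustrative frames; the real columns differ):

```python
import pandas as pd

# Illustrative frames
offers = pd.DataFrame({'Id': [1, 2, 3], 'Price': [100.0, 200.0, 300.0]})
leads = pd.DataFrame({'Id': [1, 2], 'City': ['A', 'B']})

merged = pd.merge(offers, leads, on='Id', how='left')
# All three offers are retained; offer 3 gets City = NaN, imputed later
```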

Temporal Features

Date fields are decomposed into year and month components to capture seasonality and trends while maintaining numerical format.

Example Data Flow

```python
import pandas as pd

# 1. Load raw datasets
leads_data = pd.read_csv("data/raw/leads.csv")
offers_data = pd.read_csv("data/raw/offers.csv")

# 2. Clean leads dataset: drop rows with null Id, then drop unused columns
leads_data_cleaned = leads_data.dropna(subset=['Id'])
leads_data_cleaned = leads_data_cleaned.drop([
    'First Name', 'Use Case', 'Created Date', 'Status', 'Converted'
], axis=1)

# 3. Merge datasets (left join keeps all offers)
full_dataset = pd.merge(offers_data, leads_data_cleaned, on='Id', how='left')

# 4. Result: unified dataset with all fields
# See the preprocessing documentation for transformation details
```

Next: Preprocessing

Learn how the data is transformed and prepared for machine learning
