Dataset Overview
The lead scoring model uses two primary datasets that represent different stages of the sales process:
leads.csv
Contains data for all potential clients who entered the sales pipeline, regardless of whether they progressed to the offer stage.
offers.csv
Contains data for clients who reached the demo meeting stage and received formal offers.
The datasets are merged using the Id field as the unique identifier, creating a complete view of each lead’s journey through the sales process.
leads.csv Schema
This dataset captures the initial contact and qualification phase of potential clients.
Field Definitions
Id
Type: String/Integer
Description: Unique identifier for the lead. This field is used to merge with the offers dataset.
Rows with null values in the Id column are removed during preprocessing as they cannot be matched to offers.
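The null-Id filter described above can be sketched with pandas. This is a minimal illustration on a toy frame, not the project's actual preprocessing code; the sample values are made up and only the column names come from the schema.

```python
import pandas as pd

# Toy stand-in for leads.csv; the values are invented for illustration.
leads = pd.DataFrame({
    "Id": ["a1", None, "c3"],
    "Source": ["Inbound", "Outbound", "Inbound"],
})

# Rows without an Id cannot be matched to an offer, so they are removed.
leads_clean = leads.dropna(subset=["Id"]).reset_index(drop=True)
```

Restricting `dropna` to `subset=["Id"]` keeps rows that have gaps in other columns, which are handled later by imputation.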
First Name
Type: String
Description: Lead’s first name.
Preprocessing: This field is dropped during preprocessing as it is personally identifiable information (PII) and not predictive of conversion.
Use Case
Type: Categorical
Description: Type of use case for the potential client (e.g., specific event types or business needs).
Preprocessing: Dropped from the leads dataset as it duplicates the Use Case field in offers.csv.
Source
Type: Categorical
Description: Lead acquisition source.
Possible Values:
- Inbound (e.g., website inquiries, content downloads)
- Outbound (e.g., sales outreach, cold calls)
Status
Type: Categorical
Description: Current status of the lead in the qualification pipeline.
Preprocessing: Dropped from the leads dataset because it records only the intermediate status; the final Status from offers.csv is used as the target variable.
Discarded/Nurturing Reason
Type: Categorical
Description: Reason for lead discard or placement in nurturing workflow.
Preprocessing: Dropped because over 80% of its values are null, which makes this field unsuitable for modeling.
Acquisition Campaign
Type: Categorical
Description: Marketing campaign that generated the lead.
Preprocessing: Dropped because over 80% of its values are null, which makes this field unsuitable for modeling.
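One way to find columns that exceed the 80% null threshold is to compute the share of missing values per column. The sketch below assumes a pandas DataFrame; the `NULL_THRESHOLD` constant and the sample campaign name are illustrative, not taken from the project.

```python
import pandas as pd

NULL_THRESHOLD = 0.8  # the "over 80% null" rule described above

# Toy stand-in for leads.csv; "Spring Promo" is a made-up campaign name.
leads = pd.DataFrame({
    "Source": ["Inbound", "Outbound", "Inbound", "Inbound", "Outbound", "Inbound"],
    "Acquisition Campaign": [None, None, None, None, None, "Spring Promo"],
})

# isna().mean() gives the fraction of missing values in each column.
null_share = leads.isna().mean()
sparse_columns = null_share[null_share > NULL_THRESHOLD].index.tolist()
leads_reduced = leads.drop(columns=sparse_columns)
```

Here five of six campaign values are missing (about 83%), so only that column is dropped.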
Created Date
Type: Date (YYYY-MM-DD)
Description: Lead creation date.
Preprocessing: Dropped from the leads dataset as it duplicates the Created Date field in offers.csv.
Converted
Type: Binary (0/1)
Description: Target variable indicating whether the lead converted.
- 1 = Converted
- 0 = Not converted
City
Type: Categorical
Description: Geographic city location of the lead.
Preprocessing: Label encoded. Missing values are imputed with the mode.
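Label encoding, as applied to categorical fields such as Source and City, maps each distinct category to an integer. The sketch below uses scikit-learn's `LabelEncoder` and assumes missing values have already been filled; the city names and the `encoders` dictionary are illustrative choices, not part of the documented pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with no missing values; the cities are made up.
df = pd.DataFrame({
    "City": ["Madrid", "Lisbon", "Madrid"],
    "Source": ["Inbound", "Outbound", "Inbound"],
})

encoders = {}
for column in ["City", "Source"]:
    encoder = LabelEncoder()
    # Classes are sorted before numbering, e.g. Lisbon -> 0, Madrid -> 1.
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder  # kept so the mapping can be reused or inverted
```

Keeping one fitted encoder per column makes it possible to apply the same mapping to new leads and to decode predictions with `inverse_transform`.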
offers.csv Schema
This dataset contains detailed information about leads who progressed to the offer stage.
Field Definitions
Id
Type: String/Integer
Description: Unique identifier for the offer, matching the Id field in leads.csv.
Preprocessing: Used for merging datasets, then dropped as it’s not a predictive feature.
Use Case
Type: Categorical
Description: Type of use case for the offer (e.g., corporate events, weddings, conferences).
Preprocessing: Label encoded for model input.
Status
Type: Categorical
Description: TARGET VARIABLE - Final status of the offer representing the conversion outcome.
Original Values: Multiple status categories exist in the raw data.
Preprocessed Values:
- Closed Won - Lead successfully converted to paying customer
- Closed Lost - Lead did not convert, opportunity lost
- Other - Minority status categories grouped together
Created Date
Type: Date (YYYY-MM-DD)
Description: Offer creation date.
Preprocessing: Decomposed into temporal features (Created Year and Created Month). The original date field is dropped after feature extraction.
Close Date
Type: Date (YYYY-MM-DD)
Description: Date when the offer was closed (won or lost).
Preprocessing: Decomposed into temporal features (Close Year and Close Month). The original date field is dropped after feature extraction.
Price
Type: Numerical (Float)
Description: Offer price in the local currency.
Preprocessing:
- Missing values imputed with the mean
- Scaled using StandardScaler in the model pipeline
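The two numerical steps above can be combined in a small scikit-learn pipeline. This is a sketch under the assumption that imputation and scaling run together in one pipeline; the source only states that StandardScaler is part of the model pipeline. The step names and sample prices are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up prices with one missing value; shape is (n_samples, 1).
prices = np.array([[100.0], [np.nan], [300.0]])

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fills the gap with the column mean (200.0)
    ("scale", StandardScaler()),                 # rescales to zero mean and unit variance
])
scaled = numeric_steps.fit_transform(prices)
```

Fitting both steps inside a pipeline means the mean and the scaling statistics are learned from the training data only and then reused unchanged on new data.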
Discount code
Type: Numerical (Float)
Description: Applied discount code value or percentage.
Preprocessing:
- Missing values imputed with the mean
- Scaled using StandardScaler in the model pipeline
- Also label encoded as categorical in some preprocessing steps
Pain
Type: Categorical
Description: Customer’s pain level or urgency of need.
Possible Values: Various levels indicating the severity or urgency of the customer’s problem.
Preprocessing: Label encoded. Missing values imputed with mode.
Loss Reason
Type: Categorical
Description: Reason for offer loss (for Closed Lost cases).
Preprocessing: Special handling (missing values are imputed conditionally, based on Status), then label encoded.
Merged Dataset Structure
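After merging and preprocessing, the final dataset structure is:
Features
Categorical Features (Label Encoded):
- Use Case
- Source
- City
- Pain
- Loss Reason
Numerical Features (Scaled):
- Price
- Discount code
Temporal Features:
- Created Year
- Created Month
- Close Year
- Close Month
Target
- Status (Closed Won, Closed Lost, Other)
Dropped Fields
- From leads.csv: First Name, Use Case, Status, Discarded/Nurturing Reason, Acquisition Campaign, Created Date
- From offers.csv: Id (after the merge), Created Date and Close Date (after feature extraction)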
Data Quality Considerations
Missing Values
Missing values are handled through:
- Mode imputation for categorical features
- Mean imputation for numerical features
- Conditional imputation for Loss Reason based on Status
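The three strategies above can be sketched as follows. The mode and mean steps follow the description directly. The conditional rule for Loss Reason is an assumption, because the exact rule is not spelled out here: this sketch gives offers that were not lost a placeholder value ("Not Applicable", a made-up label) and fills lost offers with the most common reason among lost offers.

```python
import pandas as pd

# Toy stand-in for offers.csv; all values are invented.
offers = pd.DataFrame({
    "Status": ["Closed Won", "Closed Lost", "Closed Lost", "Closed Lost"],
    "Pain": ["High", None, "High", "Low"],
    "Price": [100.0, None, 300.0, 200.0],
    "Loss Reason": [None, "Price", None, "Price"],
})

# Mode imputation for a categorical feature.
offers["Pain"] = offers["Pain"].fillna(offers["Pain"].mode().iloc[0])

# Mean imputation for a numerical feature.
offers["Price"] = offers["Price"].fillna(offers["Price"].mean())

# Conditional imputation for Loss Reason (ASSUMED rule, see the note above).
lost = offers["Status"] == "Closed Lost"
offers.loc[~lost, "Loss Reason"] = offers.loc[~lost, "Loss Reason"].fillna("Not Applicable")
lost_mode = offers.loc[lost, "Loss Reason"].mode().iloc[0]
offers.loc[lost, "Loss Reason"] = offers.loc[lost, "Loss Reason"].fillna(lost_mode)
```

Splitting the fill by Status keeps a missing reason on a won offer from being treated as if the offer had been lost for the most common reason.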
Class Imbalance
The target variable Status originally had multiple classes. Minority classes are grouped into “Other” to address imbalance while preserving the main Closed Won/Closed Lost distinction.
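A minimal sketch of this grouping, assuming the raw statuses sit in a pandas Series. "Negotiation" and "On Hold" are made-up examples of minority statuses, and `MAIN_CLASSES` is an illustrative name.

```python
import pandas as pd

MAIN_CLASSES = ["Closed Won", "Closed Lost"]

# Made-up raw statuses; only the two main class names come from the schema.
status = pd.Series(["Closed Won", "Closed Lost", "Negotiation", "Closed Lost", "On Hold"])

# Keep the main classes as they are and relabel everything else as "Other".
status_grouped = status.where(status.isin(MAIN_CLASSES), "Other")
```

Calling `status_grouped.value_counts()` afterwards is a quick way to check how balanced the three resulting classes are.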
Data Fusion
The datasets are combined with a left join from offers to leads on the Id field. This retains every offer; lead fields that are missing after the join are handled through imputation.
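The join can be sketched with `DataFrame.merge`. The toy frames are made up. The `indicator=True` flag is an optional extra, not something the source describes; it adds a `_merge` column that shows which offers found no matching lead.

```python
import pandas as pd

# Toy stand-ins for the two files; the values are invented.
leads = pd.DataFrame({
    "Id": ["a1", "b2"],
    "Source": ["Inbound", "Outbound"],
    "City": ["Madrid", "Lisbon"],
})
offers = pd.DataFrame({
    "Id": ["a1", "b2", "z9"],
    "Price": [100.0, 250.0, 80.0],
})

# Left join from offers: every offer row is kept, matched or not.
merged = offers.merge(leads, on="Id", how="left", indicator=True)

# Offers without a matching lead have NaN in the lead columns.
unmatched = merged[merged["_merge"] == "left_only"]
```

Offer `z9` has no lead record, so its Source and City are missing after the join and would be filled by the imputation step.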
Temporal Features
Date fields are decomposed into year and month components to capture seasonality and trends while maintaining numerical format.
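The decomposition can be sketched with pandas datetime accessors. The sample dates are made up; the output column names follow the feature list in this document.

```python
import pandas as pd

# Toy stand-in for the date columns of offers.csv.
offers = pd.DataFrame({
    "Created Date": ["2023-01-15", "2023-11-02"],
    "Close Date": ["2023-02-20", "2024-01-10"],
})

for prefix in ["Created", "Close"]:
    # Parse the YYYY-MM-DD strings, then extract integer year and month.
    dates = pd.to_datetime(offers[f"{prefix} Date"], format="%Y-%m-%d")
    offers[f"{prefix} Year"] = dates.dt.year
    offers[f"{prefix} Month"] = dates.dt.month
    # The original date column is dropped after feature extraction.
    offers = offers.drop(columns=f"{prefix} Date")
```

The result holds four integer columns (Created Year, Created Month, Close Year, Close Month) and no raw date strings.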
Example Data Flow
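As a rough illustration, the sketch below follows one lead from the raw files to a model-ready row. The `prepare` helper is hypothetical, the order of the steps is an assumption, and the sample values are made up. Imputation, encoding, and scaling are left out to keep the trace short.

```python
import pandas as pd

def prepare(leads: pd.DataFrame, offers: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper chaining the documented steps in one assumed order."""
    # Leads: remove rows without an Id and keep only the retained lead fields.
    leads = leads.dropna(subset=["Id"])[["Id", "Source", "City"]]
    # Left join from offers to leads on Id.
    df = offers.merge(leads, on="Id", how="left")
    # Group minority statuses into "Other".
    df["Status"] = df["Status"].where(df["Status"].isin(["Closed Won", "Closed Lost"]), "Other")
    # Decompose the creation date, then drop the raw date and the Id.
    created = pd.to_datetime(df["Created Date"])
    df["Created Year"] = created.dt.year
    df["Created Month"] = created.dt.month
    return df.drop(columns=["Id", "Created Date"])

leads = pd.DataFrame({
    "Id": ["a1", None],
    "Source": ["Inbound", "Outbound"],
    "City": ["Madrid", "Lisbon"],
})
offers = pd.DataFrame({
    "Id": ["a1"],
    "Status": ["Closed Won"],
    "Created Date": ["2023-05-04"],
    "Price": [120.0],
})

row = prepare(leads, offers).iloc[0].to_dict()
```

For this input, `row` holds the offer's Status and Price, the lead's Source and City, and the derived Created Year and Created Month, with the Id and the raw date removed.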