
Overview

The Lead Scoring Model uses data from two separate CSV files that reflect different phases of the sales process. Data fusion combines these datasets into a single comprehensive dataset for analysis and model training.

Source Datasets

leads.csv

Contains data about all potential clients who have shown interest:
  • Id: Unique identifier for the lead
  • First Name: Lead’s first name
  • Use Case: Type of use case for the potential client
  • Source: Lead source (e.g., Inbound, Outbound)
  • Status: Current status of the lead
  • Discarded/Nurturing Reason: Reason for lead discard or nurturing
  • Acquisition Campaign: Acquisition campaign that generated the lead
  • Created Date: Lead creation date
  • Converted: Whether the lead converted (1) or not (0)
  • City: City of the lead

offers.csv

Contains data about clients who reached at least the demo meeting phase:
  • Id: Unique identifier for the offer
  • Use Case: Type of use case for the offer
  • Status: Current status of the offer (target variable)
  • Created Date: Offer creation date
  • Close Date: Offer closing date
  • Price: Offer price
  • Discount code: Applied discount code
  • Pain: Potential customer’s pain level
  • Loss Reason: Reason for offer loss
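
The two files share the Id key, which later drives the merge. A minimal sketch with made-up rows (all values below are illustrative, not real data) shows the relationship between the schemas:

```python
import pandas as pd

# Toy stand-ins for the two CSVs; every value is invented for illustration
leads = pd.DataFrame({
    "Id": ["L1", "L2", "L3"],
    "Source": ["Inbound", "Outbound", "Inbound"],
    "City": ["Madrid", "Lisbon", "Porto"],
})
offers = pd.DataFrame({
    "Id": ["L1", "L3"],  # only the leads that reached the demo meeting phase
    "Price": [1200.0, 800.0],
    "Status": ["Closed Won", "Closed Lost"],
})

# Every offer should trace back to a lead via the shared 'Id' key
assert set(offers["Id"]) <= set(leads["Id"])
```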

Fusion Process

1. Load the datasets

Both CSV files are loaded into pandas DataFrames:
import pandas as pd

leads_data = pd.read_csv("data/raw/leads.csv")
offers_data = pd.read_csv("data/raw/offers.csv")
2. Clean the leads dataset

Remove rows with missing identifiers and drop redundant columns from the leads data:
# Delete rows with null values in the 'Id' column
leads_data_cleaned = leads_data.dropna(subset=['Id'])

# Drop redundant columns:
# - 'First Name' is irrelevant to scoring
# - 'Use Case' and 'Created Date' duplicate the same columns in the offers dataset
# - 'Status' and 'Converted' refer to information captured by the 'Status'
#   column of the offers dataset
leads_data_cleaned = leads_data_cleaned.drop(
    ['First Name', 'Use Case', 'Created Date', 'Status', 'Converted'], 
    axis=1
)
Columns like Use Case, Created Date, Status, and Converted from leads.csv are dropped because they duplicate information already present in offers.csv.
3. Merge the datasets

Perform a left join using the Id column as the key:
# Merge the datasets using the 'Id' column as a key
full_dataset = pd.merge(offers_data, leads_data_cleaned, on='Id', how='left')
full_dataset.to_csv("data/interim/full_dataset.csv", index=False)
This creates a unified dataset that combines offer details with lead information.
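
A left join preserves every offer row and fills the lead fields with NaN where no matching Id exists. One way to sanity-check the join, sketched here with toy data (the values are invented; indicator=True is a standard pandas merge option, not part of the original pipeline):

```python
import pandas as pd

# Toy versions of the two frames (illustrative values, not real data)
offers_data = pd.DataFrame({"Id": ["L1", "L3"], "Price": [1200.0, 800.0]})
leads_data_cleaned = pd.DataFrame({"Id": ["L1", "L2"], "Source": ["Inbound", "Outbound"]})

# indicator=True adds a '_merge' column flagging where each row came from
merged = pd.merge(offers_data, leads_data_cleaned, on="Id", how="left", indicator=True)

# A left join keeps exactly one row per offer
assert len(merged) == len(offers_data)

# Offers with no matching lead show up as 'left_only' and get NaN lead fields
unmatched = (merged["_merge"] == "left_only").sum()
```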
4. Remove high-null columns

Drop the identifier column and any columns with more than 80% missing values:
# Drop multiple columns:
# - 'Id' is an identifier and carries no predictive signal
# - 'Discarded/Nurturing Reason' and 'Acquisition Campaign' are more than
#   80% null; columns this sparse are unlikely to provide valuable
#   information for the model
full_dataset_preprocessed = full_dataset.drop(
    ['Id', 'Discarded/Nurturing Reason', 'Acquisition Campaign'], 
    axis=1
)
Columns with excessive missing data (>80% null values) are removed as they don’t provide reliable information for model training.
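
The 80% threshold can also be applied programmatically rather than by hard-coding column names. A sketch with toy data (the frame, its values, and the threshold variable are illustrative, not the project's actual code):

```python
import pandas as pd
import numpy as np

# Toy frame: 'Acquisition Campaign' is 5/6 null (~83%); the others are mostly filled
df = pd.DataFrame({
    "Price": [100.0, 200.0, 150.0, 120.0, 90.0, 110.0],
    "Acquisition Campaign": [np.nan, np.nan, np.nan, np.nan, np.nan, "Summer"],
    "Pain": [3, np.nan, 2, 1, 2, 3],
})

threshold = 0.8
null_fraction = df.isna().mean()  # fraction of missing values per column
to_drop = null_fraction[null_fraction > threshold].index.tolist()
df_reduced = df.drop(columns=to_drop)
```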

Resulting Dataset

After fusion, the combined dataset contains:
  • All offer-related features (Use Case, Status, dates, Price, Discount code, Pain, Loss Reason)
  • Relevant lead information (Source, City)
  • No duplicate columns or irrelevant identifiers
The fused dataset is saved to data/interim/full_dataset.csv for further preprocessing.

Next Steps

After data fusion, the dataset undergoes:
  1. Data Cleaning - Handle missing values and duplicates
  2. Feature Engineering - Transform and create new features
