
Data Profiling with YData

Comprehensive exploratory data analysis was performed using the ydata_profiling library (formerly pandas-profiling), which generates an automated profiling report:
from ydata_profiling import ProfileReport

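# Generate a comprehensive profiling report for the customer DataFrame `df`
profile = ProfileReport(df, explorative=True)
# As the last expression in a notebook cell, this renders the report inline
profile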
The ProfileReport provides automated analysis including:
  • Variable distributions and statistics
  • Correlation matrices
  • Missing value analysis
  • Duplicate detection
  • Data quality warnings

Distribution Analysis

The numeric features are roughly symmetric around their means, with no extreme values:

Summary Statistics

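| Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| Avg. Session Length | 500 | 33.05 | 0.99 | 29.53 | 32.34 | 33.08 | 33.71 | 36.14 |
| Time on App | 500 | 12.05 | 0.99 | 8.51 | 11.39 | 11.98 | 12.75 | 15.13 |
| Time on Website | 500 | 37.06 | 1.01 | 33.91 | 36.35 | 37.07 | 37.72 | 40.01 |
| Length of Membership | 500 | 3.53 | 1.00 | 0.27 | 2.93 | 3.53 | 4.13 | 6.92 |
| Yearly Amount Spent | 500 | 499.31 | 79.31 | 256.67 | 445.04 | 498.89 | 549.31 | 765.52 |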
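
A table of this shape can be produced directly with pandas. The sketch below is one possible way to do it, not necessarily how the table above was generated; the helper name `summary_table` and the toy rows are illustrative assumptions.

```python
import pandas as pd


def summary_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return count, mean, std, min, quartiles, and max for each numeric column."""
    numeric = df.select_dtypes(include="number")
    # describe() puts the statistics in rows; transpose so each feature is a row.
    return numeric.describe().T.round(2)


# Toy rows for illustration only; these are NOT the customer data.
toy = pd.DataFrame({
    "Time on App": [11.0, 12.0, 13.0],
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
})
toy_summary = summary_table(toy)
print(toy_summary)
```

Calling `summary_table(df)` on the real customer DataFrame should give one row per numeric feature, with non-numeric columns such as the email address left out.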

Distribution Characteristics

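Key Observation: All features show approximately normal distributions, with medians close to the means and quartiles spread roughly evenly around them. Linear regression does not strictly require normally distributed features (its normality assumption concerns the residuals), but symmetric, outlier-free inputs make the fit less likely to be dominated by a few extreme points.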
  1. Avg. Session Length
    • Tight distribution around 33 minutes
    • Low variability (SD: 0.99)
    • Minimal outliers
  2. Time on App
    • Moderate engagement averaging 12 minutes
    • Consistent behavior across customers
    • Range indicates varying usage patterns
  3. Time on Website
    • Highest mean engagement time (37 minutes)
    • Slightly higher variability than app usage
    • Suggests website browsing behavior differs from app
  4. Length of Membership
    • Good spread from new to long-term customers
    • Mean of 3.5 years indicates established customer base
    • Important predictor for customer lifetime value
  5. Yearly Amount Spent
    • Wide range from $256.67 to $765.52
    • Good variance for regression analysis
    • No extreme outliers that would skew the model
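
The claims about symmetry and outliers above can be checked numerically. The following is a minimal sketch, assuming a pandas DataFrame of numeric features; the helper name `shape_report` and the 1.5 × IQR fence are illustrative choices, not something the profiling report prescribes.

```python
import pandas as pd


def shape_report(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Per-feature sample skewness and count of values outside the k * IQR fences."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the DataFrame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return pd.DataFrame({"skew": numeric.skew(), "iqr_outliers": outside.sum()})


# Toy columns for illustration only; these are NOT the customer data.
toy = pd.DataFrame({
    "sym": [1, 2, 3, 4, 5, 6],       # symmetric, no outliers
    "skewed": [1, 2, 3, 4, 5, 100],  # one large value on the right
})
report = shape_report(toy)
print(report)
```

A skewness near zero and a small outlier count for every feature would be consistent with the observations listed above.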

Correlation Insights

The exploratory analysis reveals important relationships between features:

Expected Correlations

Positive Correlations

Features that increase together:
  • Length of Membership ↔ Yearly Amount Spent
  • Time on App ↔ Yearly Amount Spent
  • Avg. Session Length ↔ Yearly Amount Spent

Weak Correlations

Features with minimal linear relationship:
  • Time on Website shows weaker correlation with spending
  • Suggests website engagement alone doesn’t drive purchases
Important Finding: The correlation analysis hints that mobile app engagement may be more strongly associated with customer spending than website usage - a key insight for the business decision.
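
One way to rank features by their linear relationship with spending is to take each feature's Pearson correlation with the target. This is a sketch under assumptions: the helper name `correlations_with_target` and the toy columns are hypothetical, and only the target column name `Yearly Amount Spent` comes from the dataset described here.

```python
import pandas as pd


def correlations_with_target(
    df: pd.DataFrame, target: str = "Yearly Amount Spent"
) -> pd.Series:
    """Pearson correlation of each numeric feature with the target, strongest first."""
    corr = df.select_dtypes(include="number").corr()  # Pearson by default
    return corr[target].drop(target).sort_values(ascending=False)


# Toy rows for illustration only; these are NOT the customer data.
toy = pd.DataFrame({
    "up": [1.0, 2.0, 3.0, 4.0],    # rises with the target
    "down": [4.0, 3.0, 2.0, 1.0],  # falls as the target rises
    "Yearly Amount Spent": [2.0, 4.0, 6.0, 8.0],
})
ranked = correlations_with_target(toy)
print(ranked)
```

On the real data, comparing the values for `Time on App` and `Time on Website` in this ranking would put a number on the app-versus-website observation.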

Visualization Analysis

The ProfileReport includes automated visualizations:

Histogram Distributions

  • All numeric features display approximately normal distributions
  • No significant skewness detected
  • Supports using the features as-is as inputs to a linear regression

Missing Value Analysis

  • Result: Zero missing values across all 500 records
  • Complete dataset enables full utilization of all customer records
  • No imputation or data cleaning required

Duplicate Analysis

  • Each customer record is unique (identified by email)
  • No duplicate entries found
  • Data integrity confirmed
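
The missing-value and duplicate results can also be reproduced outside the report. The sketch below assumes the email column is named `Email`; that column name and the helper `quality_checks` are assumptions for illustration.

```python
import pandas as pd


def quality_checks(df: pd.DataFrame, key: str = "Email") -> dict:
    """Count missing cells, fully duplicated rows, and repeated key values."""
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_keys": int(df.duplicated(subset=[key]).sum()),
    }


# Toy rows for illustration only; these are NOT the customer data.
toy = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "a@example.com"],
    "Time on App": [12.0, None, 12.0],
})
checks = quality_checks(toy)
print(checks)
```

For a dataset matching the findings above, all three counts would be zero.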

Key EDA Findings

  1. Data Quality: Excellent - complete dataset with no missing values or duplicates
  2. Feature Suitability: All numeric features show appropriate distributions for linear regression
  3. Sample Size: 500 customers provide sufficient data for reliable model training
  4. Variable Relationships: Clear positive relationships exist between engagement metrics and spending
  5. Business Insight Preview: Mobile app engagement appears more influential than website usage for driving customer spending
These exploratory findings validate the dataset’s readiness for linear regression modeling and suggest that the model will likely find significant relationships between customer engagement and annual spending.

Next Steps

With the exploratory analysis complete, the data is confirmed suitable for:
  • Linear regression model training
  • Predictive analysis of customer spending
  • Business decision-making regarding app vs. website investment
