Overview
Exploratory Data Analysis (EDA) is the critical process of investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions through statistical summaries and graphical representations.
Why EDA Matters:
Reveals data quality issues before modeling
Uncovers hidden patterns and relationships
Guides feature engineering decisions
Validates business assumptions
Complete EDA Workflow
This comprehensive example analyzes e-commerce sales data from ComercioYA:
Setup and Configuration
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pathlib import Path
# Configuration
sns.set(style="whitegrid", context="talk")
plt.rcParams["figure.figsize"] = (10, 6)
1. Initial Data Assessment (IDA)
Load and Inspect Data
print("=== LESSON 1: Initial EDA (IDA) ===")

# Path to the CSV file (placeholder -- adjust to your environment)
DATA_PATH = Path("data") / "comercioya_sales.csv"

# Load data without a header row, then assign column names manually
df = pd.read_csv(DATA_PATH, header=None)
df.columns = ["Order Date", "Product", "Category", "Region",
              "Quantity", "Sales", "Profit"]

# Convert date column
df["Order Date"] = pd.to_datetime(df["Order Date"])

print("First rows:")
print(df.head())
print("\nDataFrame information:")
df.info()  # info() prints directly; wrapping it in print() would also print "None"
Classify Variables
# Identify variable types
variables_numericas = ["Quantity", "Sales", "Profit"]
variables_categoricas = ["Product", "Category", "Region"]

print("Numeric variables:", variables_numericas)
print("Categorical variables:", variables_categoricas)

# Check for missing values
missing = df.isna().sum()
print("\nMissing values per column:")
print(missing)
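Hardcoding the variable lists works for a small, known schema, but pandas can also infer them from dtypes. A minimal sketch on a tiny illustrative frame (the column values here are made up):

```python
import pandas as pd

# Tiny frame mirroring part of the schema above
df = pd.DataFrame({
    "Product": ["A", "B"],
    "Sales": [100.0, 250.0],
    "Quantity": [1, 3],
})

# Infer numeric vs. categorical columns from dtypes
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

print(numeric_cols)      # ['Sales', 'Quantity']
print(categorical_cols)  # ['Product']
```

This keeps the classification in sync with the data if columns are added later.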
Create Temporal Features
# Extract time-based features
df["Year"] = df["Order Date"].dt.year
df["Month"] = df["Order Date"].dt.month
df["Quarter"] = df["Order Date"].dt.quarter
df["DayOfWeek"] = df["Order Date"].dt.dayofweek  # Monday=0 ... Sunday=6
Always create temporal features early—they’re essential for time-series analysis and trend detection.
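As a quick sanity check of those `.dt` accessors on synthetic dates (the dates are chosen arbitrarily):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2023-01-15", "2023-07-04"]))

print(dates.dt.year.tolist())       # [2023, 2023]
print(dates.dt.quarter.tolist())    # [1, 3]
print(dates.dt.dayofweek.tolist())  # [6, 1] -- Sunday=6, Tuesday=1
```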
2. Descriptive Statistics
Statistical Summary
print("\n=== LESSON 2: Descriptive Statistics ===")

# Extended descriptive statistics
desc = df[variables_numericas].describe().T
desc["variance"] = df[variables_numericas].var().values
print("Extended descriptive statistics (including variance):")
print(desc)
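Note that `describe()` already reports the standard deviation; the variance added above is just its square (pandas uses the sample statistic, `ddof=1`, by default). A quick numeric check on made-up values:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
std = x.std(ddof=1)  # sample standard deviation, as pandas computes it
var = x.var(ddof=1)  # sample variance

print(np.isclose(var, std ** 2))  # True
```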
Distribution Analysis
# Generate histograms for numeric variables
for col in variables_numericas:
    plt.figure(figsize=(8, 5))
    sns.histplot(df[col], bins=30, kde=True)
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.savefig(f"hist_{col.lower()}.png")
    plt.close()
Outlier Detection
# Detect outliers using the IQR method
Q1 = df["Profit"].quantile(0.25)
Q3 = df["Profit"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_profit = df[(df["Profit"] < lower_bound) | (df["Profit"] > upper_bound)]
print(f"Number of outliers in Profit: {len(outliers_profit)}")
print(f"Percentage: {len(outliers_profit) / len(df) * 100:.2f}%")
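The same IQR rule can be wrapped in a small reusable helper; on a toy array where 100 is the obvious outlier (values invented for illustration):

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(data))  # [100.]
```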
3. Correlation Analysis
Correlation Matrix
print("\n=== LESSON 3: Correlation ===")

# Calculate Pearson correlation
corr_matrix = df[variables_numericas].corr(method="pearson")
print("Correlation matrix (Pearson):")
print(corr_matrix)
Correlation Heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix - Numeric Variables")
plt.tight_layout()
plt.savefig("heatmap_correlation.png")
plt.close()
Strong correlations (|r| > 0.7) indicate potential multicollinearity, which matters for regression models.
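One way to surface such pairs programmatically (using the 0.7 threshold above; the data here is synthetic and deliberately built so Profit tracks Sales):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sales = rng.normal(100, 20, 500)
df = pd.DataFrame({
    "Sales": sales,
    "Profit": 0.3 * sales + rng.normal(0, 1, 500),  # nearly collinear with Sales
    "Quantity": rng.integers(1, 10, 500),
})

corr = df.corr(numeric_only=True)
# Collect each upper-triangle pair with |r| above the threshold
strong = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.7
]
print(strong)  # only the (Sales, Profit) pair should appear
```

Flagged pairs are candidates to drop, combine, or handle with regularization before regression.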
Scatter Plots
# Sales vs Profit by Category
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df, x="Sales", y="Profit", hue="Category")
plt.title("Sales vs Profit by Category")
plt.tight_layout()
plt.savefig("scatter_sales_profit_category.png")
plt.close()

# Quantity vs Profit by Region
plt.figure(figsize=(7, 5))
sns.scatterplot(data=df, x="Quantity", y="Profit", hue="Region")
plt.title("Quantity vs Profit by Region")
plt.tight_layout()
plt.savefig("scatter_quantity_profit_region.png")
plt.close()
4. Linear Regression Analysis
Simple Linear Regression
print("\n=== LESSON 4: Linear Regression ===")

y = df["Profit"]

# Simple model: Profit ~ Sales
X_simple = sm.add_constant(df[["Sales"]])
model_simple = sm.OLS(y, X_simple).fit()
print("Simple model: Profit ~ Sales")
print(model_simple.summary())
Key Metrics and Interpretation
R-squared: proportion of variance explained (0 to 1)
Coefficients: effect of each predictor on the outcome
P-values: statistical significance (conventionally, < 0.05 is significant)
Residuals: differences between actual and predicted values
Profit = β₀ + β₁ × Sales
- β₀ (intercept): expected profit when sales = 0
- β₁ (slope): change in profit per unit change in sales
- If β₁ = 0.3, each $1 increase in sales is associated with $0.30 more profit
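Plugging hypothetical coefficients into that equation makes the interpretation concrete (β₀ = 5 and β₁ = 0.3 are made-up numbers, not values fitted from the data above):

```python
beta0, beta1 = 5.0, 0.3  # hypothetical intercept and slope

def predict_profit(sales):
    """Predicted profit from the fitted line: beta0 + beta1 * sales."""
    return beta0 + beta1 * sales

print(predict_profit(100))                         # ~35.0
print(predict_profit(101) - predict_profit(100))   # ~0.3, the slope: extra profit per extra $1
```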
Multiple Linear Regression
# Multiple model: Profit ~ Sales + Quantity
X_multi = sm.add_constant(df[["Sales", "Quantity"]])
model_multi = sm.OLS(y, X_multi).fit()
print("\nMultiple model: Profit ~ Sales + Quantity")
print(model_multi.summary())
Visualize Regression
plt.figure(figsize=(7, 5))
sns.regplot(data=df, x="Sales", y="Profit", scatter_kws={"alpha": 0.5})
plt.title("Simple Linear Regression: Profit ~ Sales")
plt.tight_layout()
plt.savefig("reg_simple_profit_sales.png")
plt.close()
5. Advanced Visualizations with Seaborn
Pairplot: Multi-variable Relationships
print("\n=== LESSON 5: Visual Analysis with Seaborn ===")

# Pairplot for all numeric variables
sns.pairplot(df[["Sales", "Profit", "Quantity"]])
plt.savefig("pairplot_numeric.png")
plt.close()
Violin Plot: Distribution by Groups
# Distribution of Profit by Region
plt.figure(figsize=(8, 5))
sns.violinplot(data=df, x="Region", y="Profit", inner="quartile")
plt.title("Distribution of Profit by Region")
plt.tight_layout()
plt.savefig("violin_profit_region.png")
plt.close()
Joint Plot: Bivariate Analysis
# Sales vs Profit with marginal distributions
jp = sns.jointplot(data=df, x="Sales", y="Profit", kind="hex", height=7)
jp.fig.suptitle("Joint Plot: Sales vs Profit", y=1.02)
jp.savefig("joint_sales_profit.png")
plt.close()
FacetGrid: Multi-faceted Analysis
# Distribution of Sales by Region and Category
g = sns.FacetGrid(df, col="Region", hue="Category", col_wrap=2, height=4)
g.map(sns.histplot, "Sales", bins=20, alpha=0.6)
g.add_legend()
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Sales Distribution by Region and Category")
g.savefig("facet_sales_region_category.png")
plt.close()
When to Use Each Visualization:
Pairplot : Explore all variable relationships at once
Violinplot : Compare distributions across categories
Jointplot : Deep dive into two-variable relationship
FacetGrid : Analyze patterns across multiple dimensions
6. Custom Matplotlib Visualizations
Multi-panel Dashboard
print("\n=== LESSON 6: Custom Visualizations with Matplotlib ===")

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Panel 1: Total Sales by Year
sales_by_year = df.groupby("Year")["Sales"].sum()
axes[0].bar(sales_by_year.index, sales_by_year.values, color="steelblue")
axes[0].set_title("Total Sales by Year")
axes[0].set_xlabel("Year")
axes[0].set_ylabel("Sales")

# Panel 2: Total Sales by Category
sales_by_cat = df.groupby("Category")["Sales"].sum().sort_values(ascending=False)
axes[1].bar(sales_by_cat.index, sales_by_cat.values, color="seagreen")
axes[1].set_title("Total Sales by Category")
axes[1].set_xlabel("Category")
axes[1].tick_params(axis="x", rotation=45)

# Panel 3: Total Profit by Region
profit_by_region = df.groupby("Region")["Profit"].sum()
axes[2].bar(profit_by_region.index, profit_by_region.values, color="darkorange")
axes[2].set_title("Total Profit by Region")
axes[2].set_xlabel("Region")

fig.suptitle("ComercioYA Performance Summary", fontsize=16)
plt.tight_layout()
plt.subplots_adjust(top=0.85)
plt.savefig("summary_subplots.png")
plt.close()
Time Series Analysis
# Monthly sales trend
monthly_sales = df.groupby(["Year", "Month"])["Sales"].sum().reset_index()
monthly_sales["Date"] = pd.to_datetime(monthly_sales[["Year", "Month"]].assign(day=1))

plt.figure(figsize=(12, 5))
plt.plot(monthly_sales["Date"], monthly_sales["Sales"], marker="o")
plt.title("Monthly Sales Trend")
plt.xlabel("Date")
plt.ylabel("Total Sales")
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("time_series_sales.png")
plt.close()
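An equivalent, often tidier route to the monthly series is `resample` on a datetime index. A sketch on a few synthetic orders:

```python
import pandas as pd

orders = pd.DataFrame({
    "Order Date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10"]),
    "Sales": [100.0, 50.0, 80.0],
})

# "MS" = month start; rows are grouped by calendar month
monthly = orders.set_index("Order Date")["Sales"].resample("MS").sum()
print(monthly)
# 2023-01-01    150.0
# 2023-02-01     80.0
```

Unlike the groupby-on-Year/Month approach, `resample` also emits empty months as zeros, which keeps the time axis continuous.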
Aggregation and Grouping Analysis
Multi-level Aggregations
# Summary by customer (these columns come from a separate transactions
# dataset, not the sales schema above, hence the guard)
if "id_cliente" in df.columns and "monto" in df.columns:
    resumen_cliente = (
        df.groupby("id_cliente")
        .agg(
            monto_total=("monto", "sum"),
            monto_promedio=("monto", "mean"),
            transacciones=("monto", "count"),
        )
        .reset_index()
    )
    print("\nSummary by customer (first rows):")
    print(resumen_cliente.head())
Pivot Tables
# Create pivot table: Region x Category
pivot = df.pivot_table(
    index="Region",
    columns="Category",
    values="Sales",
    aggfunc="sum",
    margins=True,  # add row/column totals (labeled "All")
)
print("\nPivot Table: Sales by Region and Category")
print(pivot)
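On a tiny synthetic frame the `margins=True` behavior is easy to verify: the extra row and column labeled "All" hold the totals.

```python
import pandas as pd

tiny = pd.DataFrame({
    "Region": ["North", "North", "South"],
    "Category": ["A", "B", "A"],
    "Sales": [10.0, 20.0, 30.0],
})

pivot = tiny.pivot_table(index="Region", columns="Category",
                         values="Sales", aggfunc="sum", margins=True)
print(pivot)
print(pivot.loc["All", "All"])  # grand total: 60.0
```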
Key Insights Generation
Descriptive Summary
# Generate business insights
print("\n=== KEY INSIGHTS ===")

# Top performing products
top_products = df.groupby("Product")["Sales"].sum().nlargest(5)
print("\nTop 5 Products by Sales:")
print(top_products)

# Most profitable regions
profit_by_region = df.groupby("Region")["Profit"].sum().sort_values(ascending=False)
print("\nProfit by Region (descending):")
print(profit_by_region)

# Category performance
category_metrics = df.groupby("Category").agg({
    "Sales": "sum",
    "Profit": "sum",
    "Quantity": "sum",
})
print("\nCategory Performance Metrics:")
print(category_metrics)
EDA Best Practices
Start Simple, Then Go Deep
Begin with univariate analysis (one variable at a time)
Move to bivariate analysis (relationships between two variables)
Explore multivariate patterns (three or more variables)
Visualize Everything
Numbers alone can be misleading (Anscombe's Quartet)
Use multiple visualization types for the same data
Choose chart types appropriate to your data
Question Your Assumptions
Don’t assume data is clean or complete
Verify business logic with domain experts
Look for unexpected patterns or anomalies
Document Everything
Record all findings and insights
Note data quality issues discovered
Keep track of visualization interpretations
Visualization Selection Guide
Distribution: Histogram, KDE, Box Plot, Violin Plot
Relationship: Scatter Plot, Line Plot, Regression Plot
Comparison: Bar Chart, Grouped Bar, Stacked Bar
Composition: Pie Chart, Stacked Area, Treemap
Time Series: Line Plot, Area Chart
Correlation: Heatmap, Pair Plot, Correlation Matrix
Export Analysis Results
# Create the output directory before saving any figures
FIGURES_DIR = Path("figures")
FIGURES_DIR.mkdir(exist_ok=True)

print(f"\nAll visualizations saved to '{FIGURES_DIR}/'")
print("Use these results to create your technical EDA report.")
Define FIGURES_DIR at the top of the script and pass paths like FIGURES_DIR / "hist_sales.png" to plt.savefig; otherwise the figures land in the working directory rather than in figures/.
Next Steps
Data Wrangling: clean data for better analysis
Machine Learning: build predictive models from insights
EDA is iterative—return to exploration as you learn more about your data and business problem.