Data Visualization and Summary Statistics

This guide continues the exploratory data analysis (EDA) series, focusing on visualization techniques and statistical analysis using rideshare data from Chicago.

Introduction

You’ll work with Chicago rideshare data from 2022, published on the City of Chicago Data Portal. The file used here is a cleaned and reduced version of the full dataset, ready for analysis.

Learning Objectives

Probability

Apply probability concepts to real-world transportation data.

Descriptive Statistics

Compute and interpret mean, median, standard deviation, and quartiles.

Visualization

Create box plots, scatter plots, and geographic visualizations.

Correlation

Analyze relationships between variables using correlation coefficients.

Setup

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
from folium.plugins import FastMarkerCluster

Load Dataset

# Note: parse_dates converts the listed columns to datetime while reading the file
df = pd.read_csv(
    "data/rideshare_2022_cleaned.csv",
    parse_dates=['trip_start_timestamp', 'date']
)

df.head()
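To see what parse_dates does without the real file, here is a minimal sketch that reads a three-row CSV held in memory. The rows are made up for illustration; only the column names trip_start_timestamp, date, and fare come from this guide.

```python
import io

import pandas as pd

# Hypothetical rows for illustration; not taken from the Chicago dataset
csv_text = """trip_start_timestamp,date,fare
2022-01-01 08:15:00,2022-01-01,12.5
2022-01-01 23:45:00,2022-01-01,30.0
2022-01-02 05:05:00,2022-01-02,41.0
"""

toy = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["trip_start_timestamp", "date"],
)

# Both columns are now datetime64, so the .dt accessor is available
print(toy.dtypes)
print(toy["trip_start_timestamp"].dt.hour.tolist())  # → [8, 23, 5]
```

Checking df.dtypes right after loading is a quick way to confirm the timestamp columns were parsed rather than left as strings.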

Summary Statistics

The .describe() method provides a comprehensive statistical overview:

df.describe()

This returns:

count: Number of observations
mean: Average value
std: Standard deviation (spread)
min/max: Range boundaries
25%, 50%, 75%: First quartile, median, third quartile

Questions to Ask When Reviewing Statistics

Range: What are the shortest and longest trips? How far apart are min and max values?
Central Tendency: Is the mean closer to the minimum or maximum?
Spread: How much variation exists? (Check standard deviation)
Skewness: Compare mean vs. median. If mean > median, the data is usually right-skewed (long tail on the right)

Be cautious when interpreting aggregated statistics. This dataset pools rides from a full year, every hour of the day, and many parts of the city, so simple averages may hide important patterns.
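The mean-versus-median check from the list above can be tried on synthetic data. This is a minimal sketch, assuming a lognormal sample as a stand-in for a right-skewed column such as fare; the parameters are arbitrary and not fitted to the Chicago data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for fares (not the Chicago data)
rng = np.random.default_rng(seed=0)
fares = pd.Series(rng.lognormal(mean=2.5, sigma=0.6, size=10_000), name="fare")

summary = fares.describe()
mean_minus_median = summary["mean"] - summary["50%"]

print(summary[["mean", "50%", "std"]].round(2))
# A positive gap is what a long right tail produces
print(f"mean - median = {mean_minus_median:.2f}")
```

The same two lines (describe, then mean minus the 50% row) should work on any numeric column of df.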

Box Plots

Creating Box Plots

Box plots visualize summary statistics elegantly:

column_to_plot = 'fare'

plt.figure()
df.boxplot(column_to_plot)
plt.show()

Understanding Box Plot Components

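Box Components

Box: Interquartile range (IQR) from Q1 to Q3
Line inside the box: Median (Q2)
Lines extending from box: Whiskers, reaching to the most extreme data points within 1.5 × IQR of the box edges

Outliers

Points beyond the whiskers are drawn individually as outliers.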
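The 1.5 × IQR whisker rule can be checked by hand. The sketch below computes the same quantities a box plot draws for a small made-up series; box_plot_stats is a hypothetical helper written for this illustration, not a pandas or matplotlib function.

```python
import pandas as pd


def box_plot_stats(values: pd.Series) -> dict:
    """Return the numbers a box plot draws, using the 1.5 x IQR whisker rule."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "n_outliers": int(len(values) - len(inside)),
    }


# Made-up values with one obviously large observation
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
stats = box_plot_stats(toy)
print(stats)
```

Here Q1 is 3.25 and Q3 is 7.75, so the upper fence sits at 14.5: the upper whisker ends at 9 and the value 50 is the only point drawn as an outlier.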

Comparing Distributions

Visualize fare distribution to understand outliers:

df.hist('fare', density=True)
plt.show()

The fare distribution is heavily right-skewed with a long tail. This explains why many large fares appear as “outliers” - they’re rare but valid high-fare trips.

Grouped Box Plots

Compare distributions across categories:

df.boxplot(column='tip', by='weekday')

# Limit y-axis for better visibility
plt.ylim(-2, 52)
plt.show()

Understanding Conditional Distributions

When you group by weekday, you’re analyzing conditional distributions:

Tip | Monday
Tip | Tuesday
… and so on

This reveals how tipping behavior changes across days.
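A conditional distribution such as Tip | Monday is just the tip column restricted to Monday rows, and groupby repeats that restriction for every day. The sketch below shows the equivalence on made-up data; it assumes weekday holds day names, which may differ from how the real column is encoded.

```python
import pandas as pd

# Made-up rides for two weekdays
toy = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Monday", "Tuesday", "Tuesday", "Tuesday"],
    "tip": [0.0, 0.0, 3.0, 0.0, 2.0, 4.0],
})

# Tip | Monday: filter first, then summarize
tip_given_monday = toy.loc[toy["weekday"] == "Monday", "tip"]

# The same conditioning for every weekday at once
conditional_medians = toy.groupby("weekday")["tip"].median()

print(tip_given_monday.median())
print(conditional_medians)
```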

Analyzing Tips by Weekday

Get detailed statistics:

df.groupby('weekday')['tip'].describe()

Key Findings:

Sunday: At least 75% of riders don’t tip (Q1, Q2, Q3 all equal 0)
Other days: Median tip is still 0, but Q3 is positive
Implication: When most riders don’t tip, the box sits at or near 0 and ordinary positive tips are drawn as “outliers”

When analyzing tips, separate tippers from non-tippers for clearer insights:

df_tippers = df[df['tip'] > 0]
df_tippers.boxplot(column='tip', by='weekday')
plt.show()

Time-Based Analysis

Tips by Hour of Day

Extract hour from timestamp:

# Add hour column using the vectorized .dt accessor
df["hour"] = df["trip_start_timestamp"].dt.hour

# Filter to tippers only
df_tippers = df[df['tip'] > 0]

# Plot tip distribution by hour
plt.figure()
df_tippers.boxplot(column='tip', by='hour')
plt.ylim(-2, 52)
plt.show()

# Calculate tipping percentage by hour
percentage_tippers = (
    df_tippers.groupby(["hour"])["tip"].count() / 
    df.groupby(["hour"])["tip"].count() * 100
)

plt.figure()
percentage_tippers.plot(marker="o", title="Percentage of Tippers")
plt.show()
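An alternative way to get the same percentage is to average a boolean column, since the mean of True/False values is the fraction that are True. This is a sketch on made-up data, not a replacement for the calculation above.

```python
import pandas as pd

# Made-up rides: hour of pickup and tip amount
toy = pd.DataFrame({
    "hour": [0, 0, 1, 1, 2, 2],
    "tip": [0.0, 2.0, 0.0, 0.0, 5.0, 1.0],
})

# The mean of a boolean column is the fraction of True values
pct_tippers = (toy["tip"] > 0).groupby(toy["hour"]).mean() * 100
print(pct_tippers)
```

Two differences from the count-ratio version are worth knowing: an hour with no tippers shows up as 0 rather than as a missing value, and a ride with a missing tip is counted as a non-tipper instead of being excluded.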

Tips are higher in early morning hours. But is this due to time of day or other factors?

Trip Length by Hour

Check if longer trips explain higher tips:

df.boxplot(column='trip_miles', by='hour')
plt.ylim(-10, 210)
plt.show()

Insight: Early morning trips are longer, potentially explaining higher tips.
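One way to probe whether trip length explains the pattern is to compare tips within distance bands, so that early-morning and other rides of similar length are set side by side. The sketch below uses made-up rides in which tips depend only on distance; the "early" window (hours 4 to 6) and the band edges (5 and 15 miles) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rides; tips rise with distance and ignore the hour, by construction
toy = pd.DataFrame({
    "hour": [5, 5, 5, 5, 14, 14, 14, 14],
    "trip_miles": [18.0, 22.0, 19.0, 3.0, 2.0, 4.0, 3.5, 20.0],
    "tip": [8.0, 10.0, 9.0, 2.0, 1.0, 2.0, 2.0, 9.0],
})

# Hypothetical definitions: "early" is hours 4-6, bands split at 5 and 15 miles
toy["period"] = np.where(toy["hour"].between(4, 6), "early", "other")
toy["distance_band"] = pd.cut(
    toy["trip_miles"], bins=[0, 5, 15, np.inf], labels=["short", "medium", "long"]
)

# Median tip by period, ignoring distance
overall = toy.groupby("period")["tip"].median()

# Median tip by period within each distance band
by_band = (
    toy.groupby(["distance_band", "period"], observed=True)["tip"].median().unstack()
)

print(overall)
print(by_band)
```

In this toy data the overall medians differ (8.5 early versus 2.0 otherwise), but within each band they match, which is the signature of distance driving the gap. Whether the real data behaves this way is something to check, not something this sketch shows.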

Correlation Analysis

Scatter Plots

Visualize relationships:

df_tippers.plot(kind='scatter', x='trip_miles', y='tip', marker=".")
plt.show()

Computing Correlation

Quantify relationships:

correlation = df_tippers['tip'].corr(df_tippers['trip_miles'])
print(f"Correlation: {correlation:.3f}")

Understanding Correlation Coefficients

The Pearson correlation coefficient, which .corr() computes by default, ranges from -1 to 1:

1: Perfect positive correlation
0.7: Strong positive correlation
0.5: Moderate positive correlation
0: No linear correlation
-0.5: Moderate negative correlation
-1: Perfect negative correlation

A correlation of 0.637 indicates a moderate positive correlation between trip length and tip amount among riders who tipped.
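To make the coefficient less of a black box, the sketch below computes Pearson's r by hand on five made-up (miles, tip) pairs and compares it with Series.corr. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up trip lengths (miles) and tips (dollars)
miles = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
tips = pd.Series([1.5, 1.0, 3.5, 3.0, 6.0])

# Pearson r: co-variation divided by the product of the individual spreads
dx = miles - miles.mean()
dy = tips - tips.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = miles.corr(tips)  # Series.corr defaults to method="pearson"
print(f"manual: {r_manual:.3f}, pandas: {r_pandas:.3f}")
```

Because r only measures linear association, a value near 0 does not rule out a curved relationship; the scatter plot is still worth a look.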

Try comparing different variable pairs:

# Tip vs Fare
df_tippers['tip'].corr(df_tippers['fare'])

# Trip miles vs Fare
df_tippers['trip_miles'].corr(df_tippers['fare'])
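When several pairs are of interest, DataFrame.corr returns all pairwise coefficients in one table. A minimal sketch on made-up values, using the three column names from this guide:

```python
import pandas as pd

# Made-up rides
toy = pd.DataFrame({
    "trip_miles": [1.0, 2.0, 3.0, 4.0, 5.0],
    "fare": [6.0, 8.5, 11.0, 15.0, 17.5],
    "tip": [1.5, 1.0, 3.5, 3.0, 6.0],
})

# Pairwise Pearson correlations; the diagonal is always 1
corr_matrix = toy[["tip", "fare", "trip_miles"]].corr()
print(corr_matrix.round(3))
```

The table is symmetric, so each pair appears twice; reading one triangle is enough.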

Geographic Visualization

2D Histogram

Visualize pickup location density:

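# Extract coordinates, dropping only rows that are missing a pickup location
coords = df[["pickup_centroid_latitude", "pickup_centroid_longitude"]].dropna()
latitude = coords["pickup_centroid_latitude"].to_numpy()
longitude = coords["pickup_centroid_longitude"].to_numpy()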

# Create 2D histogram
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
hist = ax.hist2d(longitude, latitude, bins=50, density=True)
ax.set_aspect(1.3, "box")
fig.colorbar(hist[3])
ax.set_xlabel("Longitude (degrees)")
ax.set_ylabel("Latitude (degrees)")
plt.show()

This is a joint distribution: it shows how ride pickups are distributed across two variables at once (latitude and longitude).
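The link between a joint distribution and a marginal one can be verified numerically. The sketch below builds a 2D histogram with np.histogram2d on synthetic coordinates (random points around an arbitrary centre, not real pickups), checks that the joint probabilities sum to 1, and recovers the longitude marginal by summing over latitude.

```python
import numpy as np

# Synthetic pickup coordinates scattered around an arbitrary centre
rng = np.random.default_rng(seed=1)
lon = rng.normal(loc=-87.65, scale=0.05, size=5_000)
lat = rng.normal(loc=41.88, scale=0.04, size=5_000)

# density=True returns probability per unit area for each (lon, lat) cell
density, lon_edges, lat_edges = np.histogram2d(lon, lat, bins=20, density=True)

# Multiply by cell area to turn densities into probabilities
cell_area = np.outer(np.diff(lon_edges), np.diff(lat_edges))
joint_prob = density * cell_area

# Summing the joint over latitude leaves the marginal for longitude
marginal_lon = joint_prob.sum(axis=1)
lon_counts, _ = np.histogram(lon, bins=lon_edges)

print(joint_prob.sum())
print(np.allclose(marginal_lon, lon_counts / lon.size))
```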

Interactive Maps

Create an interactive Folium map:

def interactive_map(df, n_samples=4000):
    # Keep the first n_samples rows that have a pickup location (not a random sample)
    points = df[["pickup_centroid_longitude", "pickup_centroid_latitude"]].dropna()[0:n_samples]
    
    latitude = points.iloc[0]["pickup_centroid_latitude"]
    longitude = points.iloc[0]["pickup_centroid_longitude"]
    
    map3 = folium.Map(location=[latitude, longitude], zoom_start=9)
    marker_cluster = FastMarkerCluster([]).add_to(map3)
    
    for index, row in points.iterrows():
        lat = row["pickup_centroid_latitude"]
        lon = row["pickup_centroid_longitude"]
        folium.Marker(
            (lat, lon),
            icon=folium.Icon(color="green")
        ).add_to(marker_cluster)
    
    return map3

interactive_map(df)

If the map doesn’t render, try re-running the cell or restarting the kernel. This is a resource-intensive operation.
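The function above plots the first n_samples rows, which may over-represent whatever part of the file comes first. If a random subset is preferred, one option is to sample before plotting. sample_pickups below is a hypothetical helper for this sketch, and the coordinates are made up.

```python
import numpy as np
import pandas as pd


def sample_pickups(df: pd.DataFrame, n_samples: int = 4000, seed: int = 0) -> pd.DataFrame:
    """Return up to n_samples randomly chosen rows that have a pickup location."""
    cols = ["pickup_centroid_latitude", "pickup_centroid_longitude"]
    points = df[cols].dropna()
    # Never ask for more rows than exist
    n = min(n_samples, len(points))
    return points.sample(n=n, random_state=seed)


# Made-up coordinates, with one missing value that should be dropped
toy = pd.DataFrame({
    "pickup_centroid_latitude": [41.88, 41.90, np.nan, 41.97, 41.79],
    "pickup_centroid_longitude": [-87.63, -87.65, -87.70, -87.90, -87.60],
})
sampled = sample_pickups(toy, n_samples=3)
print(sampled)
```

Fixing random_state keeps the map reproducible between runs; the sampled frame can then be passed to a plotting function in place of the full df.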

Analyzing Airport Rides

Filter rides from O’Hare Airport:

# Filter by an approximate bounding box around O'Hare
airport_rides = df[
    (df["pickup_centroid_longitude"] < -87.9) &
    (df["pickup_centroid_latitude"] > 41.97) &
    (df["pickup_centroid_latitude"] < 41.99)
]

airport_tippers = airport_rides[airport_rides['tip'] > 0]

# Plot tips by hour
plt.figure()
airport_tippers.boxplot(column='tip', by='hour')
plt.show()

# Calculate tipping percentage
airport_tip_pct = (
    airport_tippers.groupby(["hour"])["tip"].count() / 
    airport_rides.groupby(["hour"])["tip"].count() * 100
)

plt.figure()
airport_tip_pct.plot(marker="o", title="Airport Tipping Percentage")
plt.show()

Airport rides show much higher tipping rates! This is valuable information for drivers choosing where to work.
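To put a single number on the airport-versus-elsewhere comparison, the bounding box can be kept as a boolean mask and used as a grouping key. This sketch runs on six made-up rides, so the resulting percentages say nothing about the real tipping rates.

```python
import pandas as pd

# Made-up rides; coordinates chosen to fall inside or outside the box used above
toy = pd.DataFrame({
    "pickup_centroid_longitude": [-87.91, -87.905, -87.92, -87.63, -87.65, -87.70],
    "pickup_centroid_latitude": [41.98, 41.975, 41.985, 41.88, 41.90, 41.98],
    "tip": [5.0, 0.0, 3.0, 0.0, 0.0, 2.0],
})

# Same bounding box as the airport filter, kept as a boolean mask
is_airport = (
    (toy["pickup_centroid_longitude"] < -87.9)
    & (toy["pickup_centroid_latitude"] > 41.97)
    & (toy["pickup_centroid_latitude"] < 41.99)
)

# Share of rides with a positive tip, inside and outside the box
tip_rate = (toy["tip"] > 0).groupby(is_airport).mean() * 100
tip_rate.index = tip_rate.index.map({True: "airport", False: "elsewhere"})
print(tip_rate.round(1))
```

Keeping the mask instead of a filtered copy makes it easy to compare both groups from the same DataFrame.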

Key Concepts Covered

Descriptive Statistics

Mean, median, standard deviation, quartiles - the foundation of data understanding.

Box Plots

Visualize distribution, identify outliers, compare groups.

Joint Distribution

2D histograms and maps show how variables relate spatially.

Marginal Distribution

The distribution of a single variable on its own, ignoring the others (for example, the fare histogram).

Correlation

Quantify linear relationships between variables.

Conditional Analysis

Understand how distributions change across categories.

This practical exercise demonstrates why visualization is essential. Summary statistics alone can’t reveal patterns like geographic clustering or time-based trends.
