This guide continues the exploratory data analysis (EDA) series, focusing on visualization techniques and statistical analysis using rideshare data from Chicago.

Introduction

You’ll work with rideshare data from Chicago in 2022, available from the City of Chicago Data Portal. This is a cleaned and reduced version of the full dataset, ready for analysis.

Learning Objectives

Probability

Apply probability concepts to real-world transportation data.

Descriptive Statistics

Compute and interpret mean, median, standard deviation, and quartiles.

Visualization

Create box plots, scatter plots, and geographic visualizations.

Correlation

Analyze relationships between variables using correlation coefficients.

Setup

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
from folium.plugins import FastMarkerCluster

2. Load Dataset

# Note: parse_dates converts the listed columns to datetime while reading
df = pd.read_csv(
    "data/rideshare_2022_cleaned.csv",
    parse_dates=['trip_start_timestamp', 'date']
)

df.head()

Summary Statistics

The .describe() method provides a comprehensive statistical overview:
df.describe()
This returns:
  • count: Number of observations
  • mean: Average value
  • std: Standard deviation (spread)
  • min/max: Range boundaries
  • 25%, 50%, 75%: First quartile, median, third quartile
When reading the output, ask:
  1. Range: What are the shortest and longest trips? How far apart are min and max values?
  2. Central Tendency: Is the mean closer to the minimum or maximum?
  3. Spread: How much variation exists? (Check standard deviation)
  4. Skewness: Compare mean vs. median. If mean > median, the data is likely right-skewed (long tail on the right)
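The checks above can be read straight off the .describe() output. The following is a minimal sketch on synthetic, deliberately right-skewed values (generated here for illustration, not taken from the Chicago file); the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "fares" for illustration only (not the Chicago data):
# a lognormal sample has many small values and a few large ones.
rng = np.random.default_rng(0)
fares = pd.Series(rng.lognormal(mean=2.5, sigma=0.6, size=10_000), name="fare")

summary = fares.describe()
mean, median = summary["mean"], summary["50%"]

print(f"range:  {summary['max'] - summary['min']:.2f}")
print(f"mean:   {mean:.2f}")
print(f"median: {median:.2f}")
print(f"std:    {summary['std']:.2f}")

# Sample skewness is positive when the long tail is on the right.
skewness = fares.skew()
print(f"skew:   {skewness:.2f}")
```

Comparing mean and median is a quick heuristic; Series.skew() gives a numeric measure of the same asymmetry.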
Be cautious when interpreting aggregated statistics. Simple averages over all trips can hide important patterns, such as differences between weekdays, hours, or pickup areas.

Box Plots

Creating Box Plots

Box plots visualize summary statistics elegantly:
column_to_plot = 'fare'

plt.figure()
df.boxplot(column_to_plot)
plt.show()

Understanding Box Plot Components

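  • Box: Interquartile range (IQR) from Q1 to Q3
  • Line inside the box: Median (Q2)
  • Lines extending from box: Whiskers, which reach the most extreme data points within 1.5 × IQR of the box edges
  • Points beyond the whiskers: Outliers, drawn individually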
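To make the components concrete, here is a small sketch that computes them by hand using the common 1.5 × IQR rule. The sample values are made up for illustration, and the variable names are assumptions.

```python
import pandas as pd

# Small made-up sample (not from the dataset); 40.0 is deliberately extreme.
fare = pd.Series([5.0, 7.5, 8.0, 9.0, 10.0, 11.0, 12.5, 14.0, 40.0])

q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1

# "Fences" mark how far a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = fare[(fare >= lower_fence) & (fare <= upper_fence)]
outliers = fare[(fare < lower_fence) | (fare > upper_fence)]

# Whiskers end at the most extreme observations still inside the fences.
whisker_low, whisker_high = inside.min(), inside.max()

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers.tolist()}")
```

Note that the upper whisker stops at the largest observed value inside the fence, not at the fence itself.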

Comparing Distributions

Visualize fare distribution to understand outliers:
df.hist('fare', density=True)
plt.show()
The fare distribution is heavily right-skewed with a long tail. This explains why many large fares appear as “outliers”: they’re rare but valid high-fare trips.

Grouped Box Plots

Compare distributions across categories:
df.boxplot(column='tip', by='weekday')

# Limit y-axis for better visibility
plt.ylim(-2, 52)
plt.show()
When you group by weekday, you’re analyzing conditional distributions:
  • Tip | Monday
  • Tip | Tuesday
  • … and so on
This reveals how tipping behavior changes across days.

Analyzing Tips by Weekday

Get detailed statistics:
df.groupby('weekday')['tip'].describe()
Key Findings:
  • Sunday: At least 75% of riders don’t tip (Q1, Q2, Q3 all equal 0)
  • Other days: Median tip is still 0, but Q3 is positive
  • Implication: When most riders don’t tip, the box collapses to 0 and every nonzero tip is drawn as an “outlier”
When analyzing tips, separate tippers from non-tippers for clearer insights:
df_tippers = df[df['tip'] > 0]
df_tippers.boxplot(column='tip', by='weekday')
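
The share of non-tippers behind those quartiles can also be computed directly. This sketch uses a toy table with the tutorial's column names; the rows are invented for illustration and do not reproduce the real figures.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the tutorial's dataset.
toy = pd.DataFrame({
    "weekday": ["Mon"] * 5 + ["Sun"] * 5,
    "tip": [0.0, 0.0, 0.0, 2.0, 5.0,   # Monday: two of five riders tip
            0.0, 0.0, 0.0, 0.0, 3.0],  # Sunday: one of five riders tips
})

# The mean of a boolean column is the fraction of True values.
share_no_tip = toy["tip"].eq(0).groupby(toy["weekday"]).mean()

# Quartiles of each conditional distribution Tip | weekday.
quartiles = toy.groupby("weekday")["tip"].quantile([0.25, 0.5, 0.75]).unstack()

print(share_no_tip)
print(quartiles)
```

When the share of zero tips in a group reaches 75% or more, Q3 for that group is 0, which is the pattern described for Sunday.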

Time-Based Analysis

Tips by Hour of Day

Extract hour from timestamp:
# Add hour column (vectorized datetime accessor)
df["hour"] = df["trip_start_timestamp"].dt.hour

# Filter to tippers only
df_tippers = df[df['tip'] > 0]

# Plot tip distribution by hour
# (boxplot with by= creates its own figure, so no plt.figure() call is needed)
df_tippers.boxplot(column='tip', by='hour')
plt.ylim(-2, 52)
plt.show()

# Calculate tipping percentage by hour
percentage_tippers = (
    df_tippers.groupby(["hour"])["tip"].count() / 
    df.groupby(["hour"])["tip"].count() * 100
)

plt.figure()
percentage_tippers.plot(marker="o", title="Percentage of Tippers")
plt.show()
Tips are higher in early morning hours. But is this due to time of day or other factors?
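One detail of the percentage calculation is worth a sketch: dividing two grouped counts leaves a gap for any hour that has rides but no tippers, whereas averaging a boolean flag does not. The toy data below is invented to show that edge case; it is an alternative formulation, not the tutorial's method.

```python
import pandas as pd

# Toy data for illustration: hour 1 has rides but no tippers.
toy = pd.DataFrame({
    "hour": [0, 0, 1, 1, 1, 2],
    "tip":  [3.0, 0.0, 0.0, 0.0, 0.0, 4.0],
})

# Ratio of counts: hour 1 is absent from the numerator,
# so index alignment produces NaN for that hour.
tippers = toy[toy["tip"] > 0]
ratio = (
    tippers.groupby("hour")["tip"].count()
    / toy.groupby("hour")["tip"].count() * 100
)

# Mean of a boolean flag: every hour present in the data gets a value.
pct = toy["tip"].gt(0).groupby(toy["hour"]).mean() * 100

print(ratio)
print(pct)
```

Both approaches agree wherever the ratio is defined; the boolean-mean version reports 0% instead of a missing value for hours without tippers.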

Trip Length by Hour

Check if longer trips explain higher tips:
df.boxplot(column='trip_miles', by='hour')
plt.ylim(-10, 210)
plt.show()
Insight: Early morning trips are longer, potentially explaining higher tips.
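One way to probe this is to compare tips by hour within groups of similar trip length. The sketch below uses invented rows and a hypothetical 5-mile cutoff for "short" versus "long" trips; both are assumptions chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; the 5-mile cutoff is a hypothetical choice.
toy = pd.DataFrame({
    "hour":       [5,   5,    5,    17,  17,  17],
    "trip_miles": [2.0, 18.0, 20.0, 1.5, 2.5, 19.0],
    "tip":        [2.0, 8.0,  9.0,  2.0, 2.0, 8.0],
})
toy["length_bin"] = np.where(toy["trip_miles"] <= 5, "short", "long")

# Median tip per hour, ignoring trip length.
overall = toy.groupby("hour")["tip"].median()

# Median tip per hour, conditioned on trip length.
by_length = toy.groupby(["hour", "length_bin"])["tip"].median().unstack()

print(overall)
print(by_length)
```

In this toy example the early hour looks far more generous overall, but the gap largely disappears once trips of similar length are compared, which is what you would expect if trip length drives the difference.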

Correlation Analysis

Scatter Plots

Visualize relationships:
df_tippers.plot(kind='scatter', x='trip_miles', y='tip', marker=".")
plt.show()

Computing Correlation

Quantify relationships:
correlation = df_tippers['tip'].corr(df_tippers['trip_miles'])
print(f"Correlation: {correlation:.3f}")
The Pearson correlation coefficient (the default for .corr()) ranges from -1 to 1:
  • 1: Perfect positive correlation
  • 0.7: Strong positive correlation
  • 0.5: Moderate positive correlation
  • 0: No linear correlation
  • -0.5: Moderate negative correlation
  • -1: Perfect negative correlation
A correlation of 0.637 indicates moderate positive correlation between trip length and tips.
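For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on synthetic data (generated here, not the rideshare file) and compares the result with Series.corr().

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: tips grow with distance, plus noise.
rng = np.random.default_rng(1)
miles = pd.Series(rng.uniform(1, 20, size=500))
tip = pd.Series(0.4 * miles.to_numpy() + rng.normal(0, 2.0, size=500))

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means.
dx = miles - miles.mean()
dy = tip - tip.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = tip.corr(miles)  # pandas uses Pearson by default

print(f"manual: {r_manual:.6f}")
print(f"pandas: {r_pandas:.6f}")
```

Because the coefficient only measures linear association, a value near 0 does not rule out a nonlinear relationship.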
Try comparing different variable pairs:
# Tip vs Fare
df_tippers['tip'].corr(df_tippers['fare'])

# Trip miles vs Fare
df_tippers['trip_miles'].corr(df_tippers['fare'])
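
Several pairs can be compared at once with DataFrame.corr(), which returns a symmetric matrix of pairwise coefficients. The data below is synthetic and only mimics the column names used in this guide.

```python
import numpy as np
import pandas as pd

# Synthetic columns for illustration only.
rng = np.random.default_rng(2)
miles = rng.uniform(1, 20, size=300)
toy = pd.DataFrame({
    "trip_miles": miles,
    "fare": 3.0 + 1.5 * miles + rng.normal(0, 2.0, size=300),  # tied to distance
    "tip": rng.uniform(1, 10, size=300),                        # independent of distance
})

# Pairwise Pearson correlations; the diagonal is always 1.
corr_matrix = toy[["tip", "fare", "trip_miles"]].corr()
print(corr_matrix.round(3))
```

Reading along a row shows which variables move together with that row's variable.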

Geographic Visualization

2D Histogram

Visualize pickup location density:
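# Extract coordinates, dropping only rows where a pickup coordinate is missing
coords = df[["pickup_centroid_latitude", "pickup_centroid_longitude"]].dropna()
latitude = coords["pickup_centroid_latitude"].to_numpy()
longitude = coords["pickup_centroid_longitude"].to_numpy()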

# Create 2D histogram
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
hist = ax.hist2d(longitude, latitude, bins=50, density=True)
ax.set_aspect(1.3, "box")
fig.colorbar(hist[3])
ax.set_xlabel("Longitude (degrees)")
ax.set_ylabel("Latitude (degrees)")
plt.show()
This is a joint distribution: it shows how pickups are distributed across two variables at once (latitude and longitude), with color indicating the density of rides in each cell.
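To see what density=True means numerically, the same binning can be done with np.histogram2d. This is a sketch on synthetic coordinates scattered around a made-up centre, not real pickup data.

```python
import numpy as np

# Synthetic coordinates for illustration only.
rng = np.random.default_rng(3)
longitude = rng.normal(-87.65, 0.05, size=5_000)
latitude = rng.normal(41.88, 0.04, size=5_000)

# density has shape (50, 50): first axis is longitude, second is latitude.
density, lon_edges, lat_edges = np.histogram2d(
    longitude, latitude, bins=50, density=True
)

# With density=True each cell holds probability per unit area,
# so density * cell_area sums to 1 over the whole grid.
cell_area = np.outer(np.diff(lon_edges), np.diff(lat_edges))
joint_prob = density * cell_area
total_probability = joint_prob.sum()

# Summing the joint probabilities over latitude leaves the
# marginal distribution of longitude.
marginal_lon = joint_prob.sum(axis=1)

print(f"total probability: {total_probability:.6f}")
print(f"marginal bins: {marginal_lon.shape[0]}")
```

Summing over one axis of the joint histogram is how a marginal distribution is recovered from a joint one.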

Interactive Maps

Create an interactive Folium map:
def interactive_map(df, n_samples=4000):
    """Plot the first n_samples pickup locations on a clustered Folium map."""
    points = df[["pickup_centroid_longitude", "pickup_centroid_latitude"]].dropna().head(n_samples)

    # Center the map on the first point
    latitude = points.iloc[0]["pickup_centroid_latitude"]
    longitude = points.iloc[0]["pickup_centroid_longitude"]

    map3 = folium.Map(location=[latitude, longitude], zoom_start=9)
    marker_cluster = FastMarkerCluster([]).add_to(map3)

    # Add one marker per pickup; the cluster groups nearby markers when zoomed out
    for _, row in points.iterrows():
        lat = row["pickup_centroid_latitude"]
        lon = row["pickup_centroid_longitude"]
        folium.Marker(
            (lat, lon),
            icon=folium.Icon(color="green")
        ).add_to(marker_cluster)

    return map3

interactive_map(df)
If the map doesn’t render, try re-running the cell or restarting the kernel. This is a resource-intensive operation.
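The function above plots the first n_samples rows. If the file happens to be ordered (for example by time), a random sample may be more representative at the same rendering cost. The helper below is a hypothetical sketch, not part of the original guide; sample_pickups and its seed parameter are names introduced here.

```python
import numpy as np
import pandas as pd

def sample_pickups(df, n_samples=4000, seed=0):
    """Return up to n_samples random rows that have both pickup coordinates.

    Hypothetical helper for illustration; not part of the original guide.
    """
    cols = ["pickup_centroid_latitude", "pickup_centroid_longitude"]
    points = df[cols].dropna()
    # Cap n at the number of available rows so sample() does not raise.
    return points.sample(n=min(n_samples, len(points)), random_state=seed)

# Toy frame with missing coordinates, for illustration only.
toy = pd.DataFrame({
    "pickup_centroid_latitude": [41.88, 41.97, np.nan, 41.90],
    "pickup_centroid_longitude": [-87.63, -87.90, -87.70, np.nan],
})
sampled = sample_pickups(toy, n_samples=10)
print(sampled)
```

The result could be passed to interactive_map in place of the full DataFrame, since that function only reads the two coordinate columns.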

Analyzing Airport Rides

Filter rides from O’Hare Airport:
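# Filter by an approximate bounding box around O'Hare:
# pickup centroids west of -87.9 longitude, between 41.97 and 41.99 latitude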
airport_rides = df[
    (df["pickup_centroid_longitude"] < -87.9) &
    (df["pickup_centroid_latitude"] > 41.97) &
    (df["pickup_centroid_latitude"] < 41.99)
]

airport_tippers = airport_rides[airport_rides['tip'] > 0]

# Plot tips by hour (uses the 'hour' column added earlier;
# boxplot with by= creates its own figure)
airport_tippers.boxplot(column='tip', by='hour')
plt.show()

# Calculate tipping percentage
airport_tip_pct = (
    airport_tippers.groupby(["hour"])["tip"].count() / 
    airport_rides.groupby(["hour"])["tip"].count() * 100
)

plt.figure()
airport_tip_pct.plot(marker="o", title="Airport Tipping Percentage")
plt.show()
Airport rides show much higher tipping rates! This is valuable information for drivers choosing where to work.
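To put airport and non-airport rides side by side, the bounding-box condition can be kept as a boolean flag and used as a grouping key. The rows below are invented for illustration and do not reproduce the real tipping rates.

```python
import pandas as pd

# Invented rows for illustration; the bounding box matches the filter above.
toy = pd.DataFrame({
    "pickup_centroid_longitude": [-87.91, -87.92, -87.63, -87.65, -87.70],
    "pickup_centroid_latitude":  [41.98,  41.98,  41.88,  41.90,  41.98],
    "tip":                       [5.0,    0.0,    0.0,    0.0,    2.0],
})

is_airport = (
    (toy["pickup_centroid_longitude"] < -87.9)
    & (toy["pickup_centroid_latitude"] > 41.97)
    & (toy["pickup_centroid_latitude"] < 41.99)
)

# Percentage of rides with a nonzero tip, inside vs. outside the box.
tip_rate = toy["tip"].gt(0).groupby(is_airport).mean() * 100
tip_rate.index = tip_rate.index.map({True: "airport", False: "other"})

print(tip_rate)
```

Keeping the flag as a column or Series makes it easy to reuse the same split for other comparisons, such as trip length or fare.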

Key Concepts Covered

Descriptive Statistics

Mean, median, standard deviation, quartiles - the foundation of data understanding.

Box Plots

Visualize distribution, identify outliers, compare groups.

Joint Distribution

2D histograms and maps show how variables relate spatially.

Marginal Distribution

The distribution of a single variable on its own, ignoring the others (for example, the histogram of fares).

Correlation

Quantify linear relationships between variables.

Conditional Analysis

Understand how distributions change across categories.
This practical exercise demonstrates why visualization is essential. Summary statistics alone can’t reveal patterns like geographic clustering or time-based trends.
