Introduction
You’ll work with rideshare data from Chicago in 2022, available from the City of Chicago Data Portal. This is a cleaned and reduced version of the full dataset, ready for analysis.
Learning Objectives
Probability
Apply probability concepts to real-world transportation data.
Descriptive Statistics
Compute and interpret mean, median, standard deviation, and quartiles.
Visualization
Create box plots, scatter plots, and geographic visualizations.
Correlation
Analyze relationships between variables using correlation coefficients.
Setup
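The setup code is not shown here. A minimal sketch of loading the data with pandas might look like the following. The column names (`trip_start_timestamp`, `trip_miles`, `fare`, `tip`) are assumptions for illustration, and the in-memory sample stands in for the real CSV file.

```python
import io

import pandas as pd


def load_rides(source, timestamp_col="trip_start_timestamp"):
    """Read a rideshare CSV and parse the (assumed) timestamp column."""
    return pd.read_csv(source, parse_dates=[timestamp_col])


# Tiny made-up sample so the sketch runs without the real file.
sample_csv = io.StringIO(
    "trip_start_timestamp,trip_miles,fare,tip\n"
    "2022-01-03 08:15:00,2.1,9.5,0\n"
    "2022-01-03 17:40:00,11.4,27.0,5\n"
)
rides = load_rides(sample_csv)
print(rides.dtypes)
```

With the real dataset, you would pass the path of the downloaded CSV to `load_rides` instead of the sample.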
Summary Statistics
The .describe() method provides a statistical overview of each numeric column:
- count: Number of observations
- mean: Average value
- std: Standard deviation (spread)
- min/max: Range boundaries
- 25%, 50%, 75%: First quartile, median, third quartile
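As a rough illustration of the rows listed above, here is `.describe()` on a small made-up table. The values and column names are assumptions, not taken from the Chicago data.

```python
import pandas as pd

# Made-up values for illustration only.
rides = pd.DataFrame({
    "trip_miles": [1.0, 2.0, 3.0, 4.0, 10.0],
    "fare": [5.0, 7.5, 10.0, 12.5, 30.0],
})

summary = rides.describe()
print(summary)

# Individual statistics can be read by row label and column name.
median_miles = summary.loc["50%", "trip_miles"]
print("median trip_miles:", median_miles)
```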
Questions to Ask When Reviewing Statistics
- Range: What are the shortest and longest trips? How far apart are min and max values?
- Central Tendency: Is the mean closer to the minimum or maximum?
- Spread: How much variation exists? (Check standard deviation)
- Skewness: Compare mean vs. median. If mean > median, data is right-skewed (long tail on right)
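The checks above can be sketched in a few lines. The fares are toy values chosen so that one large fare pulls the mean above the median. Comparing mean and median is a rule of thumb, so the sample skewness from `Series.skew()` is printed as a second opinion.

```python
import pandas as pd

# Toy fares (not real data): six ordinary fares and one very large one.
fares = pd.Series([5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 60.0])

mean_fare = fares.mean()
median_fare = fares.median()

print(f"range: {fares.min()} to {fares.max()}")
print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, std={fares.std():.2f}")

# Rule of thumb: mean above median suggests a long right tail.
is_right_skewed = mean_fare > median_fare
print("right-skewed?", is_right_skewed, "| sample skewness:", round(fares.skew(), 2))
```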
Be cautious when interpreting aggregated statistics. Simple averages over the whole dataset may hide important patterns, such as differences between weekdays or hours of the day.
Box Plots
Creating Box Plots
Box plots visualize summary statistics compactly:
Understanding Box Plot Components
- Box: Interquartile range (IQR) from Q1 to Q3
- Orange line in box: Median (Q2)
- Lines extending from box: Whiskers, reaching to the most extreme data point within 1.5 × IQR of the box
- Outliers: Points drawn individually beyond the whiskers
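To make the components concrete, the sketch below computes the quartiles, the IQR, and the whisker ends by hand on toy data, then draws the same data with matplotlib. It assumes matplotlib's default whisker rule of 1.5 × IQR. The values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)  # toy data

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that lie inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, whiskers=({whisker_low}, {whisker_high})")
print("outliers:", outliers.tolist())

fig, ax = plt.subplots()
ax.boxplot(values.to_numpy())
ax.set_ylabel("value")
```

In a script, call `plt.show()` afterwards to display the figure. In a notebook the figure usually renders on its own.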
Comparing Distributions
Visualize fare distribution to understand outliers:
The fare distribution is heavily right-skewed with a long tail. This explains why many large fares appear as “outliers”: they’re rare but valid high-fare trips.
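One way to see a long right tail without drawing anything is to bin the values and print the counts. This sketch uses made-up fares and arbitrary 10-unit bins, and is only meant to show the shape described above.

```python
import numpy as np

# Toy fares (not real data): most are small, a few are large.
fares = np.array([6, 7, 7, 8, 8, 9, 9, 10, 11, 12, 14, 18, 25, 40, 75], dtype=float)

counts, edges = np.histogram(fares, bins=[0, 10, 20, 30, 40, 50, 60, 70, 80])

# Text histogram: one row per bin, one '#' per ride.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>3.0f}-{right:<3.0f} {'#' * int(n)}")
```

Most of the mass sits in the first two bins, while a handful of rides stretch far to the right.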
Grouped Box Plots
Compare distributions across categories:
Understanding Conditional Distributions
When you group by weekday, you’re analyzing conditional distributions:
- Tip | Monday
- Tip | Tuesday
- … and so on
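In pandas terms, each conditional distribution is the tip column restricted to one weekday. A minimal sketch, assuming hypothetical `weekday` and `tip` columns and toy values:

```python
import pandas as pd

# Toy rides; the column names are assumptions for illustration.
rides = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Monday", "Tuesday", "Tuesday", "Tuesday"],
    "tip": [0.0, 0.0, 3.0, 0.0, 2.0, 4.0],
})

# One conditional distribution: Tip | Monday.
tip_given_monday = rides.loc[rides["weekday"] == "Monday", "tip"]
print(tip_given_monday.describe())

# groupby performs the same split for every weekday at once.
mean_tip_by_day = rides.groupby("weekday")["tip"].mean()
print(mean_tip_by_day)
```

`rides.boxplot(column="tip", by="weekday")` would draw one box per group from the same split.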
Analyzing Tips by Weekday
Get detailed statistics:
- Sunday: At least 75% of riders don’t tip (Q1, Q2, Q3 all equal 0)
- Other days: Median tip is still 0, but Q3 is positive
- Implication: When Q1 and Q3 are both 0, the IQR is 0, so every nonzero tip falls beyond the whiskers and is drawn as an “outlier”
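The pattern above can be reproduced on toy data. The tips below are made up so that Sunday has Q3 = 0 and Monday has a positive Q3. The column names are assumptions.

```python
import pandas as pd

# Toy tips: four of five Sunday riders and three of five Monday riders tip nothing.
rides = pd.DataFrame({
    "weekday": ["Sunday"] * 5 + ["Monday"] * 5,
    "tip": [0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 2.0, 4.0],
})

# Quartiles of tip within each weekday, one row per day.
quartiles = rides.groupby("weekday")["tip"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)

# With IQR = 0 the upper fence collapses to 0, so any positive tip is beyond it.
sunday = rides.loc[rides["weekday"] == "Sunday", "tip"]
upper_fence = quartiles.loc["Sunday", "q3"] + 1.5 * quartiles.loc["Sunday", "iqr"]
n_outliers = int((sunday > upper_fence).sum())
print(n_outliers, "of", int((sunday > 0).sum()), "positive Sunday tips lie beyond the fence")
```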
Time-Based Analysis
Tips by Hour of Day
Extract hour from timestamp:
Tips are higher in early morning hours. But is this due to time of day or other factors?
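A minimal sketch of the hour extraction, assuming a parsed datetime column named `trip_start_timestamp` (a hypothetical name). The toy rows are built so that early rides are also longer. They illustrate the question of confounding and are not a finding from the data.

```python
import pandas as pd

# Toy rides: two early-morning trips and two afternoon trips.
rides = pd.DataFrame({
    "trip_start_timestamp": pd.to_datetime([
        "2022-03-01 05:10", "2022-03-01 05:45",
        "2022-03-01 14:20", "2022-03-01 14:55",
    ]),
    "trip_miles": [15.0, 17.0, 3.0, 5.0],
    "tip": [4.0, 6.0, 0.0, 2.0],
})

# The .dt accessor exposes datetime parts; .dt.hour gives 0-23.
rides["hour"] = rides["trip_start_timestamp"].dt.hour

# Mean tip next to mean trip length, per hour of day.
by_hour = rides.groupby("hour")[["tip", "trip_miles"]].mean()
print(by_hour)
```

Placing another variable beside the tip average is one simple way to start checking whether time of day is the real driver.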
Trip Length by Hour
Check if longer trips explain higher tips:
Correlation Analysis
Scatter Plots
Visualize relationships:
Computing Correlation
Quantify relationships:
Understanding Correlation Coefficients
Correlation ranges from -1 to 1:
- 1: Perfect positive correlation
- 0.7: Strong positive correlation
- 0.5: Moderate positive correlation
- 0: No linear correlation
- -0.5: Moderate negative correlation
- -1: Perfect negative correlation
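Two ends of this scale can be checked on toy data. In the sketch below, `fare` is constructed as an exact linear function of `trip_miles`, and `tip` is constructed to have no linear trend against it. The values and column names are made up. pandas computes the Pearson coefficient by default.

```python
import pandas as pd

rides = pd.DataFrame({
    "trip_miles": [1.0, 2.0, 3.0, 4.0, 5.0],
    "fare": [4.0, 6.0, 8.0, 10.0, 12.0],  # exactly 2 * trip_miles + 2
    "tip": [2.0, 0.0, 3.0, 0.0, 2.0],     # no linear trend against trip_miles
})

# Full correlation matrix (Pearson by default).
corr = rides.corr()
print(corr.round(2))

# Correlation of a single pair of columns.
r_miles_fare = rides["trip_miles"].corr(rides["fare"])
r_miles_tip = rides["trip_miles"].corr(rides["tip"])
print(f"miles vs fare: {r_miles_fare:.2f}, miles vs tip: {r_miles_tip:.2f}")
```

A coefficient near 0 only rules out a linear relationship. A scatter plot is still worth checking for other patterns.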
Geographic Visualization
2D Histogram
Visualize pickup location density:
This is a joint distribution - showing how a variable (ride frequency) is distributed across two dimensions (latitude and longitude).
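The counts behind a 2D histogram can be computed directly with NumPy. The sketch below uses six made-up pickup coordinates and a coarse 2 × 2 grid. The bin ranges are arbitrary choices for illustration.

```python
import numpy as np

# Toy pickup coordinates (made up), loosely in the Chicago area.
lat = np.array([41.88, 41.88, 41.89, 41.98, 41.98, 41.79])
lon = np.array([-87.63, -87.62, -87.63, -87.90, -87.91, -87.60])

# Count rides in each (latitude bin, longitude bin) cell.
counts, lat_edges, lon_edges = np.histogram2d(
    lat, lon, bins=[2, 2], range=[[41.75, 42.05], [-87.95, -87.55]]
)
print(counts)

# Dividing by the total turns counts into an empirical joint distribution.
joint = counts / counts.sum()

# Summing over longitude bins leaves the distribution over latitude bins.
lat_marginal = joint.sum(axis=1)
print(lat_marginal)
```

`plt.hist2d(lon, lat, bins=...)` from matplotlib draws the same kind of counts as a heat map.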
Interactive Maps
Create an interactive Folium map:
If the map doesn’t render, try re-running the cell or restarting the kernel. This is a resource-intensive operation.
Analyzing Airport Rides
Filter rides from O’Hare Airport:
Key Concepts Covered
Descriptive Statistics
Mean, median, standard deviation, quartiles - the foundation of data understanding.
Box Plots
Visualize distribution, identify outliers, compare groups.
Joint Distribution
2D histograms and maps show how variables relate spatially.
Marginal Distribution
The distribution of a single variable on its own, summed over all values of the others.
Correlation
Quantify linear relationships between variables.
Conditional Analysis
Understand how distributions change across categories.
This practical exercise demonstrates why visualization is essential. Summary statistics alone can’t reveal patterns like geographic clustering or time-based trends.
