Overview
The YellowTaxiData class provides a complete pipeline for processing NYC Yellow Taxi trip data. The workflow involves initializing the object with date ranges, importing raw data, cleaning, enriching with calculated columns, generating metrics, formatting, and exporting results.
Complete Workflow
Initialize YellowTaxiData
Create an instance of the YellowTaxiData class with your desired date range. The constructor automatically:
- Generates monthly date ranges between start and end dates
- Creates URLs for downloading parquet files from AWS CloudFront
- Initializes empty DataFrames for storing processed data
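The month and URL generation described above can be sketched as follows. This is a minimal illustration, not the class's actual constructor: the function name `monthly_urls`, the `BASE_URL` constant, and the file-name pattern are assumptions based on the publicly documented TLC trip-record layout.

```python
import pandas as pd

# Assumption: CloudFront host and file-name pattern used by the public
# TLC trip-record files; check them against the class's own constants.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def monthly_urls(start_date: str, end_date: str) -> list[str]:
    """Return one yellow-taxi parquet URL per calendar month in the range."""
    # period_range snaps both ends to whole months, so partial months
    # at either end of the range are still included.
    months = pd.period_range(start=start_date, end=end_date, freq="M")
    return [
        f"{BASE_URL}/yellow_tripdata_{month.strftime('%Y-%m')}.parquet"
        for month in months
    ]


urls = monthly_urls("2023-01-15", "2023-03-10")
```

Here `urls` holds three entries, one each for January, February, and March 2023.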
Import Data
Download and concatenate parquet files from the NYC Taxi data source. This method:
- Downloads parquet files for each month in your date range
- Filters to essential columns: tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, total_amount
- Sets a multi-level index for efficient querying
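A rough sketch of the import step, assuming pandas is used for reading. The helper names `load_month` and `combine_months` are hypothetical, and the choice of the two datetime columns as index levels is an assumption, since the actual index levels are not stated here.

```python
import pandas as pd

COLUMNS = [
    "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count",
    "trip_distance", "RatecodeID", "total_amount",
]


def load_month(url: str) -> pd.DataFrame:
    """Read one monthly parquet file, keeping only the essential columns."""
    return pd.read_parquet(url, columns=COLUMNS)


def combine_months(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate monthly frames and set a two-level datetime index."""
    combined = pd.concat(frames, ignore_index=True)[COLUMNS]
    # Assumption: pickup and dropoff timestamps serve as the index levels.
    return combined.set_index(
        ["tpep_pickup_datetime", "tpep_dropoff_datetime"]
    ).sort_index()
```

Passing `columns=` to `read_parquet` avoids loading the unused columns at all, which matters at several million rows per month.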
Clean Data
Remove invalid and outlier records. Cleaning operations include:
- Removing duplicate records
- Dropping rows with missing datetime or passenger count values
- Filtering trips within the specified date range
- Removing trips where dropoff is before pickup
- Filtering out trips shorter than 60 seconds
- Removing trips with average speeds over 100 mph (likely data errors)
- Filtering trips with distance > 0 and amount between 5000
- Ensuring passenger count > 0
Add Calculated Columns
Enrich the dataset with time-based grouping columns. New columns added:
- year_month: Format YYYY-MM for monthly grouping
- year_week: Format YYYY-WWW for weekly grouping
- year_month_day: Format YYYY-MM-DD for daily analysis
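One possible way to derive these three columns with pandas is sketched below. The function name `add_time_columns` is hypothetical. Reading YYYY-WWW as the ISO year followed by a zero-padded three-digit ISO week number is an assumption; the class may number weeks differently.

```python
import pandas as pd


def add_time_columns(df, pickup_col="tpep_pickup_datetime"):
    """Return a copy of df with year_month, year_week and year_month_day."""
    out = df.copy()
    pickup = out[pickup_col]
    out["year_month"] = pickup.dt.strftime("%Y-%m")
    out["year_month_day"] = pickup.dt.strftime("%Y-%m-%d")

    # Assumption: ISO calendar weeks, padded to three digits.
    iso = pickup.dt.isocalendar()
    out["year_week"] = (
        iso["year"].astype(str) + "-" + iso["week"].astype(str).str.zfill(3)
    )
    return out


trips = pd.DataFrame(
    {"tpep_pickup_datetime": pd.to_datetime(["2023-03-15 14:30:00"])}
)
enriched = add_time_columns(trips)
```

Using the ISO year together with the ISO week keeps dates in early January or late December from being attached to the wrong year's week.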
Generate Weekly Metrics
Calculate aggregated statistics by week. This step generates the minimum, maximum, and mean of trip time, distance, and amount, plus the total service count and the week-over-week percentage variation.
See Weekly Metrics for detailed information.
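The weekly aggregation can be approximated with a pandas groupby, as in this sketch. The function name `weekly_metrics`, the `trip_time` column (assumed to hold trip duration), and the output column names are illustrative assumptions, not the class's actual names.

```python
import pandas as pd


def weekly_metrics(df):
    """Aggregate min/max/mean per week plus service counts and their variation."""
    grouped = df.groupby("year_week")
    metrics = grouped[["trip_time", "trip_distance", "total_amount"]].agg(
        ["min", "max", "mean"]
    )
    # Flatten the (column, statistic) MultiIndex into names like "min_trip_time".
    metrics.columns = [f"{stat}_{col}" for col, stat in metrics.columns]
    metrics["total_services"] = grouped.size()
    # Week-over-week change in the number of services, in percent.
    metrics["pct_variation"] = metrics["total_services"].pct_change() * 100
    return metrics
```

The first week has no predecessor, so its `pct_variation` is NaN.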
Generate Monthly Metrics
Calculate metrics broken down by rate code and day type. This step creates separate DataFrames for Regular, JFK, and Other rate codes, each with weekday/weekend breakdowns.
See Monthly Metrics for detailed information.
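As a rough illustration of the rate-code and day-type split, the sketch below assumes the TLC data dictionary mapping (RatecodeID 1 is the standard rate, 2 is JFK) and treats Saturday and Sunday as the weekend. The function name `monthly_metrics` and the specific statistics computed are placeholders rather than the class's actual metric set.

```python
import pandas as pd


def monthly_metrics(df):
    """Return one metrics DataFrame per rate group, split by month and day type."""
    out = df.copy()
    # Assumption: 1 = standard rate ("Regular"), 2 = JFK, anything else = "Other".
    out["rate_group"] = out["RatecodeID"].map({1: "Regular", 2: "JFK"}).fillna("Other")
    is_weekend = out["tpep_pickup_datetime"].dt.dayofweek >= 5
    out["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

    summary = out.groupby(["rate_group", "year_month", "day_type"]).agg(
        services=("total_amount", "size"),
        total_distance=("trip_distance", "sum"),
        mean_amount=("total_amount", "mean"),
    )
    return {
        name: frame.droplevel("rate_group")
        for name, frame in summary.groupby(level="rate_group")
    }
```

Returning a dictionary keyed by rate group maps naturally onto one output sheet per group later in the pipeline.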
Format Results
Prepare data for export. This method:
- Rounds numeric values to 2 decimal places
- Resets indexes on monthly metric DataFrames
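Both formatting operations map directly onto pandas calls. The sketch below assumes the monthly metrics are held in a dictionary of DataFrames keyed by rate group; that container and the function name `format_results` are hypothetical.

```python
import pandas as pd


def format_results(weekly, monthly):
    """Round numeric values to 2 decimals and flatten monthly indexes."""
    weekly_out = weekly.round(2)
    monthly_out = {
        name: frame.round(2).reset_index() for name, frame in monthly.items()
    }
    return weekly_out, monthly_out
```

`DataFrame.round` only affects numeric columns, so string columns such as year_week pass through unchanged.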
Export Results
Save processed data to CSV and Excel files. This step generates:
- processed_data.csv: Weekly metrics (pipe-delimited)
- processed_data.xlsx: Monthly metrics across three sheets
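A minimal sketch of the two exports, assuming the weekly metrics are a single DataFrame and the monthly metrics are a dictionary with one DataFrame per sheet. The function names, sheet names, and the decision to omit the index are assumptions. Writing .xlsx files with pandas also requires an Excel engine such as openpyxl to be installed.

```python
import pandas as pd


def export_weekly_csv(weekly, path="processed_data.csv"):
    """Write weekly metrics as a pipe-delimited CSV file."""
    weekly.to_csv(path, sep="|", index=False)


def export_monthly_excel(monthly, path="processed_data.xlsx"):
    """Write each monthly metrics DataFrame to its own worksheet."""
    # Assumption: one sheet per rate group, named after the dictionary key.
    with pd.ExcelWriter(path) as writer:
        for sheet_name, frame in monthly.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)
```

To load the CSV back, pass the same delimiter: `pd.read_csv("processed_data.csv", sep="|")`.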
Complete Example
The complete workflow is implemented in the main execution block of main.py.
Performance Considerations
Memory Usage
- Each month of data contains approximately 3-4 million trip records
- For a 3-month date range, expect ~10-12 million records before cleaning
- After cleaning, this typically reduces to ~8-10 million valid records
- Recommended minimum RAM: 8GB for processing 3+ months of data
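For a back-of-the-envelope check on these figures, the raw size of the six retained columns can be estimated as below. The 8 bytes per value is an assumption (64-bit floats, integers, and timestamps), and the helper `estimate_raw_bytes` is illustrative only.

```python
def estimate_raw_bytes(rows: int, columns: int = 6, bytes_per_value: int = 8) -> int:
    """Lower-bound size of the raw column data, ignoring index and copies."""
    return rows * columns * bytes_per_value


# 12 million rows (the upper estimate for a 3-month range) at 48 bytes per row.
raw_gib = estimate_raw_bytes(12_000_000) / 2**30
print(f"{raw_gib:.2f} GiB")
```

This comes to roughly half a gibibyte for the column data alone. Concatenation, filtering, and added columns each create intermediate copies, so peak usage can be several times higher.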
Customizing Date Ranges
You can process any date range supported by the NYC Taxi dataset.
Next Steps
- Learn about Weekly Metrics calculation
- Understand Monthly Metrics by rate code
- Explore Exporting Results formats