Overview
The YellowTaxiData class provides a complete pipeline for importing, cleaning, analyzing, and exporting NYC Yellow Taxi trip data. It downloads Parquet files from the NYC TLC dataset, performs data quality checks, generates metrics, and exports the results as CSV and Excel files.
Constructor
start_date - Start date for data analysis in ‘YYYY-MM-DD’ format (e.g., ‘2022-01-01’)
end_date - End date for data analysis in ‘YYYY-MM-DD’ format (e.g., ‘2022-03-31’)
Class Attributes
After initialization, the following attributes are available:
start_date - The start date of the analysis period
end_date - The end date of the analysis period
List of month strings in ‘YYYY-MM’ format generated from start_date to end_date
Weekly date range from start_date to end_date (ending on Sundays)
List of URLs pointing to NYC TLC Parquet files for each month in the date range
data - Main DataFrame containing the raw and processed taxi trip data
DataFrame for storing weekly date ranges (initialized as empty)
DataFrame for storing monthly date ranges (initialized as empty)
jfk_df - DataFrame containing metrics for JFK airport trips (RatecodeID = 2)
regular_df - DataFrame containing metrics for regular/standard rate trips (RatecodeID = 1)
other_df - DataFrame containing metrics for all other rate codes
csv_df - DataFrame containing weekly aggregated metrics for CSV export
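The date-derived attributes described above (month strings, Sunday-ending weekly dates, and per-month file URLs) could be computed along the following lines. This is a minimal sketch, not the class's actual code: the function name build_date_attributes, the dictionary keys, and the BASE_URL template are assumptions made for illustration.

```python
import pandas as pd

# Assumed URL template for the public NYC TLC Parquet files; the template
# the class really uses is not shown in this document.
BASE_URL = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_{month}.parquet"
)


def build_date_attributes(start_date: str, end_date: str) -> dict:
    """Derive month strings, weekly dates, and file URLs from a date range."""
    start = pd.Timestamp(start_date)
    end = pd.Timestamp(end_date)
    if start > end:
        raise ValueError("start_date must not be after end_date")

    # One 'YYYY-MM' string per calendar month touched by the range.
    months = pd.period_range(start, end, freq="M").strftime("%Y-%m").tolist()

    # Every Sunday that falls inside [start, end].
    weeks = pd.date_range(start, end, freq="W-SUN")

    # One Parquet URL per month (hypothetical template, see above).
    urls = [BASE_URL.format(month=m) for m in months]
    return {"months": months, "weeks": weeks, "urls": urls}


attrs = build_date_attributes("2022-01-01", "2022-03-31")
```

For the example range in the Constructor section, this yields three month strings and thirteen Sundays, the first being 2022-01-02.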
Methods
The class provides the following methods, organized by functionality:
Data Import and Cleaning
import_data() - Downloads and imports Parquet files from the NYC TLC dataset
clean_data() - Applies data quality filters and removes invalid records
add_more_columns() - Adds derived date columns for analysis
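This document does not list the exact filters or derived columns, so the following is only a sketch of what a cleaning step and a date-column step might look like. The column names tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance, and total_amount come from the public TLC schema; the filter rules, the function names clean_trips and add_date_columns, and the output columns month, week_end, and is_weekend are assumptions.

```python
import pandas as pd


def clean_trips(df: pd.DataFrame, start_date: str, end_date: str) -> pd.DataFrame:
    """Drop rows that fail basic sanity checks (illustrative rules only)."""
    start = pd.Timestamp(start_date)
    # Exclusive upper bound so the whole final day is kept.
    end = pd.Timestamp(end_date) + pd.Timedelta(days=1)
    pickup = df["tpep_pickup_datetime"]
    mask = (
        (pickup >= start)
        & (pickup < end)
        & (df["tpep_dropoff_datetime"] > pickup)
        & (df["trip_distance"] > 0)
        & (df["total_amount"] > 0)
    )
    return df.loc[mask].copy()


def add_date_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add hypothetical 'month', 'week_end', and 'is_weekend' columns."""
    out = df.copy()
    pickup = out["tpep_pickup_datetime"]
    out["month"] = pickup.dt.strftime("%Y-%m")
    # Sunday that closes the Monday-to-Sunday week containing the pickup.
    out["week_end"] = pickup.dt.to_period("W-SUN").dt.end_time.dt.normalize()
    out["is_weekend"] = pickup.dt.dayofweek >= 5
    return out


raw = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2022-01-03 08:00", "2022-01-08 23:30",
             "2021-12-31 23:50", "2022-01-05 10:00"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2022-01-03 08:20", "2022-01-09 00:05",
             "2022-01-01 00:10", "2022-01-05 09:00"]
        ),
        "trip_distance": [2.5, 17.0, 1.2, 3.0],
        "total_amount": [14.0, 70.0, 9.5, 12.0],
    }
)
trips = add_date_columns(clean_trips(raw, "2022-01-01", "2022-03-31"))
```

In the sample data, the third row is dropped because its pickup falls before the start date, and the fourth because its dropoff precedes its pickup.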
Metrics Generation
generate_week_metrics() - Calculates weekly aggregated statistics
generate_month_metrics() - Generates monthly metrics by rate code and day type
format_data() - Formats and prepares data for export
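As a rough illustration of the metrics step, the sketch below splits trips by RatecodeID (2 for JFK, 1 for the standard rate, everything else as other) and aggregates per Sunday-ending week. The function names and the metric columns trips, avg_distance, and total_revenue are assumptions; the metrics the class actually computes are not listed in this document.

```python
import pandas as pd


def split_by_rate_code(df: pd.DataFrame):
    """Return (jfk, regular, other) subsets based on RatecodeID."""
    jfk = df[df["RatecodeID"] == 2]
    regular = df[df["RatecodeID"] == 1]
    # Everything else, including rows with a missing rate code.
    other = df[~df["RatecodeID"].isin([1, 2])]
    return jfk, regular, other


def weekly_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate trips into weeks ending on Sunday (hypothetical metrics)."""
    grouped = df.groupby(pd.Grouper(key="tpep_pickup_datetime", freq="W-SUN"))
    return grouped.agg(
        trips=("total_amount", "count"),
        avg_distance=("trip_distance", "mean"),
        total_revenue=("total_amount", "sum"),
    )


sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2022-01-03 08:00", "2022-01-05 09:00",
             "2022-01-11 12:00", "2022-01-12 13:00"]
        ),
        "RatecodeID": [1, 2, 1, 5],
        "trip_distance": [2.0, 18.0, 4.0, 6.0],
        "total_amount": [10.0, 70.0, 20.0, 30.0],
    }
)
jfk, regular, other = split_by_rate_code(sample)
weekly = weekly_metrics(sample)
```

Each row of weekly is labeled with the Sunday that ends the week, which matches the Sunday-ending weekly range described under Class Attributes.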
Data Export
export_data() - Exports all processed data to files
export_csv_data() - Exports weekly metrics to CSV format
export_excel_data() - Exports monthly metrics to Excel format with multiple sheets
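The two export formats could be produced with pandas roughly as follows. This is a sketch under assumptions: the function names, file names, and sheet names are hypothetical, and the Excel helper needs an Excel engine such as openpyxl installed, so only the CSV path is exercised here.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_weekly_csv(weekly: pd.DataFrame, path: Path) -> Path:
    """Write weekly metrics to a CSV file, keeping the index column."""
    weekly.to_csv(path, index=True)
    return path


def export_monthly_excel(sheets: dict, path: Path) -> Path:
    """Write one worksheet per DataFrame (requires an Excel engine)."""
    with pd.ExcelWriter(path) as writer:
        for name, frame in sheets.items():
            # Excel limits sheet names to 31 characters.
            frame.to_excel(writer, sheet_name=name[:31])
    return path


weekly = pd.DataFrame(
    {"trips": [2, 2]},
    index=pd.Index(["2022-01-09", "2022-01-16"], name="week_end"),
)
out_dir = Path(tempfile.mkdtemp())
csv_path = export_weekly_csv(weekly, out_dir / "weekly_metrics.csv")

# Read the file back to confirm the round trip.
round_trip = pd.read_csv(csv_path, index_col="week_end")
```

A call such as export_monthly_excel({"JFK": jfk_df, "Regular": regular_df, "Other": other_df}, out_dir / "monthly_metrics.xlsx") would then place each rate-code DataFrame on its own sheet.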
Usage Example
Here’s a complete example from the source code:
All DataFrames (data, jfk_df, regular_df, other_df, csv_df) are initialized as empty pandas DataFrames and populated by calling the respective methods in sequence.