import_data()
Downloads and imports Yellow Taxi trip data from NYC TLC Parquet files.
Description
This method reads Parquet files from the URLs in self.urls_list, concatenates them into a single DataFrame, filters to keep only the necessary columns, and sets a multi-level index.
Parameters
No parameters required.
Returns
No return value (modifies self.data in place).
Side Effects
- Modifies self.data: Populates with imported trip data
- Filters columns: Keeps only tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count, trip_distance, RatecodeID, and total_amount
- Sets index: Creates multi-level index on tpep_pickup_datetime, tpep_dropoff_datetime, and RatecodeID (columns are not dropped)
Implementation Details
From source/main.py:24-36
This method uses the PyArrow engine for efficient Parquet file reading. Ensure PyArrow is installed in your environment.
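The import steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from source/main.py: the helper names read_trip_files, combine_trip_frames, and _sample_frame are hypothetical, and small in-memory frames stand in for the downloaded files so the example runs without network access.

```python
import pandas as pd

KEEP_COLUMNS = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "RatecodeID",
    "total_amount",
]
INDEX_COLUMNS = ["tpep_pickup_datetime", "tpep_dropoff_datetime", "RatecodeID"]


def read_trip_files(urls):
    """Read each Parquet file with the PyArrow engine (needs network access)."""
    return [pd.read_parquet(url, engine="pyarrow") for url in urls]


def combine_trip_frames(frames):
    """Concatenate frames, keep the documented columns, set the multi-level index."""
    data = pd.concat(frames, ignore_index=True)
    data = data[KEEP_COLUMNS]
    # drop=False keeps the index columns available as regular columns too.
    return data.set_index(INDEX_COLUMNS, drop=False)


def _sample_frame(pickup, dropoff):
    """Hypothetical one-row frame with an extra column that gets filtered out."""
    return pd.DataFrame({
        "VendorID": [1],
        "tpep_pickup_datetime": pd.to_datetime([pickup]),
        "tpep_dropoff_datetime": pd.to_datetime([dropoff]),
        "passenger_count": [1.0],
        "trip_distance": [2.5],
        "RatecodeID": [1.0],
        "total_amount": [14.3],
    })


frames = [
    _sample_frame("2022-01-15 08:00", "2022-01-15 08:20"),
    _sample_frame("2022-02-01 09:00", "2022-02-01 09:10"),
]
combined = combine_trip_frames(frames)
print(combined.index.names)
```

With real data, the list returned by read_trip_files(self.urls_list) would take the place of the sample frames.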
clean_data()
Applies comprehensive data quality filters to remove invalid and outlier records.
Description
Performs multiple data cleaning operations, including removing duplicates, handling missing values, filtering by date range, validating trip times and speeds, and ensuring positive values for key metrics.
Parameters
No parameters required.
Returns
No return value (modifies self.data in place).
Side Effects
- Modifies self.data: Filters rows based on multiple quality checks
- Removes duplicates: Drops duplicate rows
- Removes null values: Drops rows with missing pickup/dropoff times or passenger count
- Filters by date range: Keeps only trips within start_date and end_date
Data Quality Filters
The method applies the following filters:
- Removes duplicates: Ensures unique trip records
- Null values: Drops rows missing pickup time, dropoff time, or passenger count
- Date range: Pickup >= start_date AND dropoff <= end_date
- Valid trip duration: Dropoff time must be after pickup time
- Minimum duration: Trip duration must be >= 60 seconds
- Speed limit: Average speed must be <= 100 mph (160 km/h)
- Positive distance: Trip distance must be > 0
- Valid amount: Total amount must be > 0 and <= $5,000
- Valid passenger count: Passenger count must be > 0
Implementation Details
From source/main.py:39-63
The speed limit filter (100 mph) helps remove unrealistic trips that may be data entry errors or system glitches.
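The filters listed above might be combined along these lines. This is a sketch under stated assumptions rather than the actual implementation: the function name clean_trips is hypothetical, start_date and end_date are passed as arguments instead of being read from the instance, and average speed is computed as miles divided by hours.

```python
import pandas as pd


def clean_trips(df, start_date, end_date):
    """Apply the documented quality filters and return the surviving rows."""
    df = df.drop_duplicates()
    df = df.dropna(
        subset=["tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count"]
    )

    duration_s = (
        df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    ).dt.total_seconds()
    # Average speed in mph; where() turns non-positive durations into NaN,
    # so those rows fail the speed comparison instead of dividing by zero.
    speed_mph = df["trip_distance"] / (duration_s.where(duration_s > 0) / 3600)

    keep = (
        (df["tpep_pickup_datetime"] >= start_date)
        & (df["tpep_dropoff_datetime"] <= end_date)
        & (duration_s >= 60)  # also implies dropoff is after pickup
        & (speed_mph <= 100)
        & (df["trip_distance"] > 0)
        & (df["total_amount"] > 0)
        & (df["total_amount"] <= 5000)
        & (df["passenger_count"] > 0)
    )
    return df[keep]


# Hypothetical sample: one valid trip, its duplicate, and five invalid rows.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2022-01-10 08:00:00", "2022-01-10 08:00:00", "2022-01-11 09:00:00",
        "2022-01-12 10:00:00", "2022-01-13 11:00:00", "2021-12-31 12:00:00",
        "2022-01-14 13:00:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2022-01-10 08:15:00", "2022-01-10 08:15:00", "2022-01-11 09:00:30",
        "2022-01-12 10:10:00", "2022-01-13 11:20:00", "2021-12-31 12:20:00",
        "2022-01-14 13:10:00",
    ]),
    "passenger_count": [1, 1, 1, 1, None, 2, 1],
    "trip_distance": [3.0, 3.0, 0.2, 50.0, 4.0, 5.0, 0.0],
    "total_amount": [15.5, 15.5, 5.0, 120.0, 18.0, 20.0, 8.0],
})

cleaned = clean_trips(
    trips, pd.Timestamp("2022-01-01"), pd.Timestamp("2022-01-31 23:59:59")
)
print(len(cleaned))
```

In the sample, the 50-mile trip completed in 10 minutes works out to 300 mph, so the speed filter rejects it. The other rows fall to the duplicate, minimum-duration, null, date-range, and distance checks, leaving a single valid trip.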
add_more_columns()
Adds derived date and time columns for temporal analysis.
Description
Creates additional columns from the dropoff datetime for grouping and aggregation purposes, including year-month, year-week, and date components.
Parameters
No parameters required.
Returns
No return value (modifies self.data in place).
Side Effects
Adds the following columns to self.data:
- year_month: Format ‘YYYY-MM’ (e.g., ‘2022-01’)
- year_dt: Year as string (e.g., ‘2022’)
- week_dt: ISO week number zero-padded to 3 digits (e.g., ‘001’, ‘052’)
- year_week: Format ‘YYYY-WWW’ (e.g., ‘2022-001’)
- year_month_day: Format ‘YYYY-MM-DD’ (e.g., ‘2022-01-15’)
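One way to derive columns in these formats is shown below. This is an illustrative sketch only: the function name add_date_columns is hypothetical, and the use of isocalendar() for the week number is an assumption based on the "ISO week" wording, not a confirmed detail of source/main.py.

```python
import pandas as pd


def add_date_columns(df, source_col="tpep_dropoff_datetime"):
    """Return a copy of df with the derived date columns added."""
    out = df.copy()
    ts = out[source_col].dt
    out["year_month"] = ts.strftime("%Y-%m")
    out["year_dt"] = ts.year.astype(str)
    # ISO week number as a string, zero-padded to three digits.
    out["week_dt"] = ts.isocalendar().week.astype(str).str.zfill(3)
    out["year_week"] = out["year_dt"] + "-" + out["week_dt"]
    out["year_month_day"] = ts.strftime("%Y-%m-%d")
    return out


sample = pd.DataFrame(
    {"tpep_dropoff_datetime": pd.to_datetime(["2022-01-15 08:20:00"])}
)
with_dates = add_date_columns(sample)
print(with_dates.iloc[0].to_dict())
```

Because the week number is zero-padded, year_week strings sort chronologically when compared as text. Under this assumption, dates in early January can carry an ISO week of 52 or 53 belonging to the previous ISO year, so combining the calendar year with the ISO week may place those days in an unexpected bucket.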
Implementation Details
From source/main.py:66-71
generate_week_metrics()
Generates weekly aggregated statistics and percentage variations.
Description
Calculates trip duration, then aggregates data by week to compute minimum, maximum, and mean values for trip time, distance, and amount. Also calculates week-over-week percentage variation in total services.
Parameters
No parameters required.
Returns
No return value (populates self.csv_df and adds columns to self.data in place).
Side Effects
- Adds columns to self.data:
  - trip_time: Timedelta between dropoff and pickup
  - trip_time_in_seconds: Trip duration in seconds
- Populates self.csv_df: DataFrame with weekly aggregated metrics
Output Columns
The csv_df DataFrame contains:
Week identifier in ‘YYYY-WWW’ format
Minimum trip time in seconds for the week
Maximum trip time in seconds for the week
Average trip time in seconds for the week
Minimum trip distance in miles for the week
Maximum trip distance in miles for the week
Average trip distance in miles for the week
Minimum trip amount in dollars for the week
Maximum trip amount in dollars for the week
Average trip amount in dollars for the week
Total number of trips for the week
Week-over-week percentage change in total services (NaN for first week)
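The weekly aggregation described above could look roughly like this. It is a sketch, not the original code: the function name weekly_metrics and every output column name (min_trip_time, total_services, pct_variation, and so on) are placeholders chosen for illustration, since the actual column names are not shown here.

```python
import pandas as pd


def weekly_metrics(df):
    """Aggregate trips by year_week and add week-over-week variation."""
    df = df.copy()
    df["trip_time"] = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    df["trip_time_in_seconds"] = df["trip_time"].dt.total_seconds()

    weekly = df.groupby("year_week").agg(
        min_trip_time=("trip_time_in_seconds", "min"),
        max_trip_time=("trip_time_in_seconds", "max"),
        mean_trip_time=("trip_time_in_seconds", "mean"),
        min_trip_distance=("trip_distance", "min"),
        max_trip_distance=("trip_distance", "max"),
        mean_trip_distance=("trip_distance", "mean"),
        min_trip_amount=("total_amount", "min"),
        max_trip_amount=("total_amount", "max"),
        mean_trip_amount=("total_amount", "mean"),
        total_services=("trip_time_in_seconds", "count"),
    )
    # Percentage change versus the previous week; NaN for the first week.
    weekly["pct_variation"] = weekly["total_services"].pct_change() * 100
    return weekly


# Hypothetical sample: two trips in one week, three in the next.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2022-01-11 08:00:00", "2022-01-12 08:00:00", "2022-01-18 08:00:00",
        "2022-01-19 08:00:00", "2022-01-20 08:00:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2022-01-11 08:10:00", "2022-01-12 08:15:00", "2022-01-18 08:05:00",
        "2022-01-19 08:10:00", "2022-01-20 08:15:00",
    ]),
    "trip_distance": [2.0, 4.0, 1.0, 2.0, 3.0],
    "total_amount": [10.0, 20.0, 8.0, 12.0, 16.0],
    "year_week": ["2022-002", "2022-002", "2022-003", "2022-003", "2022-003"],
})
weekly = weekly_metrics(trips)
print(weekly[["total_services", "pct_variation"]])
```

Going from two trips to three is a 50% increase, so the second week's variation is 50.0 while the first week has no previous week to compare against.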
Implementation Details
From source/main.py:74-93
generate_month_metrics()
Generates monthly metrics segmented by rate code type and day type.
Description
Segments data by rate code (Regular, JFK, Other) and day type (weekday vs weekend), then calculates monthly aggregated statistics for services, distances, and passengers.
Parameters
No parameters required.
Returns
No return value (populates self.jfk_df, self.regular_df, and self.other_df in place).
Side Effects
- Adds column to self.data:
  - day_type: 1 for weekdays (Mon-Fri), 2 for weekends (Sat-Sun)
- Converts RatecodeID: Changes to integer type
- Populates three DataFrames: regular_df, jfk_df, and other_df
Rate Code Categories
- Regular (RatecodeID = 1): Standard rate trips
- JFK (RatecodeID = 2): JFK airport trips
- Other (RatecodeID != 1 and != 2): All other rate codes (negotiated, Newark, Nassau/Westchester, etc.)
Output DataFrames Structure
Each DataFrame (regular_df, jfk_df, other_df) contains:
Month identifier in ‘YYYY-MM’ format
1 for weekdays, 2 for weekends
Total number of trips for the month and day type
Sum of all trip distances in miles
Sum of all passengers transported
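The segmentation and monthly aggregation could be sketched as follows. Several details are assumptions made for illustration: the function name monthly_metrics_by_rate, the output column names services, distances, and passengers, and the choice of the dropoff datetime as the basis for day_type.

```python
import pandas as pd


def monthly_metrics_by_rate(df):
    """Return a dict of monthly metrics for regular, JFK, and other rate codes."""
    df = df.copy()
    # dayofweek: Monday=0 ... Sunday=6, so >= 5 means Saturday or Sunday.
    is_weekend = df["tpep_dropoff_datetime"].dt.dayofweek >= 5
    df["day_type"] = is_weekend.astype(int) + 1  # 1 = weekday, 2 = weekend
    df["RatecodeID"] = df["RatecodeID"].astype(int)

    segments = {
        "regular": df[df["RatecodeID"] == 1],
        "jfk": df[df["RatecodeID"] == 2],
        "other": df[~df["RatecodeID"].isin([1, 2])],
    }
    return {
        name: part.groupby(["year_month", "day_type"]).agg(
            services=("RatecodeID", "count"),
            distances=("trip_distance", "sum"),
            passengers=("passenger_count", "sum"),
        )
        for name, part in segments.items()
    }


# Hypothetical sample: Friday, Saturday, Sunday, and Monday trips.
trips = pd.DataFrame({
    "tpep_dropoff_datetime": pd.to_datetime([
        "2022-01-14 09:00:00", "2022-01-15 09:00:00",
        "2022-01-16 09:00:00", "2022-01-17 09:00:00",
    ]),
    "RatecodeID": [1.0, 1.0, 2.0, 5.0],
    "trip_distance": [2.0, 3.0, 17.5, 10.0],
    "passenger_count": [1, 2, 1, 3],
    "year_month": ["2022-01"] * 4,
})
result = monthly_metrics_by_rate(trips)
print(result["regular"])
```

Each resulting frame is indexed by year_month and day_type until its index is reset.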
Implementation Details
From source/main.py:96-123
Day type uses pandas dayofweek, where Monday=0 and Sunday=6. Days >= 5 (Saturday and Sunday) are marked as weekends (day_type=2).
format_data()
Formats and prepares all DataFrames for export.
Description
Applies final formatting to all result DataFrames, including rounding numeric values and resetting indexes.
Parameters
No parameters required.
Returns
No return value (modifies DataFrames in place).
Side Effects
- Rounds self.csv_df: All numeric values rounded to 2 decimal places
- Resets indexes: jfk_df, regular_df, and other_df have their indexes reset
Implementation Details
From source/main.py:126-131
Call this method after generate_week_metrics() and generate_month_metrics() to ensure all data is properly formatted before export.
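The two formatting steps can be illustrated on small stand-in frames. This is a sketch only: the sample values and the column names mean_trip_distance and services are hypothetical, and the frames here are plain variables rather than attributes of the class.

```python
import pandas as pd

# Hypothetical weekly metrics with unrounded values.
csv_df = pd.DataFrame(
    {"mean_trip_distance": [2.34567, 3.14159]},
    index=pd.Index(["2022-001", "2022-002"], name="year_week"),
)
# Hypothetical monthly metrics indexed by year_month and day_type.
jfk_df = pd.DataFrame(
    {"services": [10, 4]},
    index=pd.MultiIndex.from_tuples(
        [("2022-01", 1), ("2022-01", 2)], names=["year_month", "day_type"]
    ),
)

csv_df = csv_df.round(2)      # numeric values to 2 decimal places
jfk_df = jfk_df.reset_index() # index levels become regular columns

print(csv_df)
print(jfk_df)
```

After reset_index(), year_month and day_type appear as ordinary columns, which is usually what a flat export format expects.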