Overview
The clean_data() method applies a series of validation rules to ensure data quality and remove anomalies. Each rule addresses a specific data quality issue found in real-world taxi trip data.
Cleaning Pipeline
Data cleaning happens in a specific order to maximize efficiency and ensure logical consistency:
main.py:39-63
Validation Rules
1. Duplicate Removal
Rule: Remove Duplicate Records
main.py:40
- Data may be accidentally recorded multiple times due to system errors
- Duplicates would inflate trip counts and revenue metrics
- Ensures each trip is counted exactly once in the analysis
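The deduplication step can be sketched without any libraries. This is only an illustration: it assumes each trip is a plain dict of hashable values, and `remove_duplicates` is a hypothetical helper, not the code at main.py:40.

```python
def remove_duplicates(trips):
    """Return trips with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for trip in trips:
        # A record's identity is the full set of its field/value pairs.
        key = tuple(sorted(trip.items()))
        if key not in seen:
            seen.add(key)
            unique.append(trip)
    return unique


trips = [
    {"pickup": "2024-01-05 08:00:00", "dropoff": "2024-01-05 08:15:00", "total_amount": 18.5},
    {"pickup": "2024-01-05 08:00:00", "dropoff": "2024-01-05 08:15:00", "total_amount": 18.5},
    {"pickup": "2024-01-05 09:00:00", "dropoff": "2024-01-05 09:10:00", "total_amount": 12.0},
]
deduped = remove_duplicates(trips)
print(len(trips), "->", len(deduped))  # prints: 3 -> 2
```

Only records that match on every field are treated as duplicates here, so two distinct trips that happen to share a pickup time are both kept.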
2. Null Value Handling
Rule: Remove Records with Missing Critical Fields
main.py:41
- Pickup and dropoff times are essential for calculating trip duration
- Cannot compute time-based metrics without valid timestamps
- Passenger count is required for occupancy analysis
- These fields are fundamental to trip definition
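A minimal sketch of the null check follows. The field names are an assumption borrowed from the TLC yellow taxi schema, and `has_critical_fields` is a hypothetical helper; the actual names used at main.py:41 may differ.

```python
from datetime import datetime

# Assumed field names (TLC yellow taxi schema); main.py may use different ones.
CRITICAL_FIELDS = ("tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count")


def has_critical_fields(trip):
    """True when every critical field is present and not None."""
    return all(trip.get(field) is not None for field in CRITICAL_FIELDS)


trips = [
    {
        "tpep_pickup_datetime": datetime(2024, 1, 5, 8, 0),
        "tpep_dropoff_datetime": datetime(2024, 1, 5, 8, 15),
        "passenger_count": 2,
    },
    {   # missing dropoff time: trip duration cannot be computed
        "tpep_pickup_datetime": datetime(2024, 1, 5, 9, 0),
        "tpep_dropoff_datetime": None,
        "passenger_count": 1,
    },
    {   # passenger_count key absent entirely
        "tpep_pickup_datetime": datetime(2024, 1, 5, 10, 0),
        "tpep_dropoff_datetime": datetime(2024, 1, 5, 10, 20),
    },
]
complete = [t for t in trips if has_critical_fields(t)]
print(len(complete))  # prints: 1
```

The check compares against None explicitly rather than relying on truthiness, so a passenger count of 0 is not treated as missing here; that case is left to the passenger count rule.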
3. Date Range Filtering
Rule: Filter to Analysis Date Range
main.py:43-46
- Focuses analysis on the specific time period of interest
- Removes trips that span outside the analysis window
- Prevents edge cases where trips start before or end after the target period
- Ensures consistent temporal boundaries
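One way to express this filter is to require both timestamps to fall inside the window. The sketch below assumes a half-open window covering January 2024; the real bounds at main.py:43-46 are not shown in this page, so `WINDOW_START`, `WINDOW_END`, and `in_analysis_window` are hypothetical.

```python
from datetime import datetime

# Hypothetical analysis window: all of January 2024, as a half-open interval.
WINDOW_START = datetime(2024, 1, 1)
WINDOW_END = datetime(2024, 2, 1)


def in_analysis_window(pickup, dropoff, start=WINDOW_START, end=WINDOW_END):
    """True only when the trip both starts and ends inside [start, end)."""
    return start <= pickup < end and start <= dropoff < end


inside = in_analysis_window(datetime(2024, 1, 15, 8, 0), datetime(2024, 1, 15, 8, 30))
starts_early = in_analysis_window(datetime(2023, 12, 31, 23, 50), datetime(2024, 1, 1, 0, 10))
ends_late = in_analysis_window(datetime(2024, 1, 31, 23, 50), datetime(2024, 2, 1, 0, 5))
print(inside, starts_early, ends_late)  # prints: True False False
```

A half-open interval avoids having to pick a "last instant" such as 23:59:59 for the end of the month.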
4. Time Logic Validation
Rule: Dropoff Must Be After Pickup
main.py:48
- Basic logical constraint: trips cannot end before they start
- Indicates data entry errors or system clock issues
- Zero-duration trips (pickup == dropoff) are not valid trips
- Prevents negative trip durations
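As a small illustration of this constraint, the comparison has to be strict so that zero-duration records are rejected along with reversed ones. `has_valid_time_order` is a hypothetical name, not taken from main.py.

```python
from datetime import datetime


def has_valid_time_order(pickup, dropoff):
    """True when the dropoff is strictly later than the pickup."""
    return dropoff > pickup


pickup = datetime(2024, 1, 5, 8, 0)
print(has_valid_time_order(pickup, datetime(2024, 1, 5, 8, 15)))  # prints: True
print(has_valid_time_order(pickup, pickup))                       # prints: False
print(has_valid_time_order(pickup, datetime(2024, 1, 5, 7, 45)))  # prints: False
```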
5. Minimum Trip Duration
Rule: Trip Must Last At Least 60 Seconds
main.py:50-52
- Extremely short trips are likely meter activation errors
- Passenger changed their mind immediately after meter started
- System test records or false starts
- Real taxi trips require time to travel even short distances
- Prevents skewing of average trip time metrics
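The duration check can be sketched with the standard library. The 60-second threshold comes from the rule above; the helper names are hypothetical.

```python
from datetime import datetime

MIN_DURATION_SECONDS = 60


def trip_duration_seconds(pickup, dropoff):
    """Elapsed time between pickup and dropoff, in seconds."""
    # total_seconds() covers the whole interval; the .seconds attribute
    # would discard the days component on multi-day intervals.
    return (dropoff - pickup).total_seconds()


def meets_minimum_duration(pickup, dropoff, minimum=MIN_DURATION_SECONDS):
    return trip_duration_seconds(pickup, dropoff) >= minimum


pickup = datetime(2024, 1, 5, 8, 0, 0)
print(meets_minimum_duration(pickup, datetime(2024, 1, 5, 8, 0, 59)))  # prints: False
print(meets_minimum_duration(pickup, datetime(2024, 1, 5, 8, 1, 0)))   # prints: True
```

Because a negative duration is also below the minimum, this check would reject reversed timestamps as well, although the pipeline handles that case in its own earlier step.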
6. Maximum Speed Validation
Rule: Average Speed Must Not Exceed 100 mph
main.py:54-58
- Given NYC speed limits and traffic, a sustained average of 100+ mph is not plausible
- Indicates odometer errors or incorrect time recording
- Data entry mistakes (wrong distance or duration)
- GPS/meter calibration issues
- Protects against extreme outliers
This checks average speed over the entire trip, not instantaneous speed. Even highway trips in NYC rarely average above 50 mph due to traffic.
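The average speed calculation can be sketched as distance divided by elapsed hours. This assumes distance in miles and duration in seconds; the function names are hypothetical and the code at main.py:54-58 may compute the value differently.

```python
MAX_AVG_SPEED_MPH = 100.0


def average_speed_mph(distance_miles, duration_seconds):
    """Average speed over the whole trip, in miles per hour."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive; apply the time rules first")
    # miles / (seconds / 3600) rearranged to keep the arithmetic exact
    # for round inputs.
    return distance_miles * 3600 / duration_seconds


def within_speed_limit(distance_miles, duration_seconds, limit=MAX_AVG_SPEED_MPH):
    return average_speed_mph(distance_miles, duration_seconds) <= limit


print(average_speed_mph(12.0, 1200))   # 12 miles in 20 minutes, prints: 36.0
print(within_speed_limit(12.0, 1200))  # prints: True
print(within_speed_limit(30.0, 600))   # 30 miles in 10 minutes is 180 mph, prints: False
```

The division is by duration, so this step depends on the earlier time rules having already removed zero and negative durations.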
7. Distance Validation
Rule: Trip Distance Must Be Greater Than Zero
main.py:60
- Zero-distance trips represent meter errors or cancelled trips
- Negative distances are data errors (impossible)
- All valid trips must cover some distance
- Prevents division-by-zero errors in calculations that divide by distance (e.g., per-mile metrics)
- Ensures meaningful distance-based metrics
8. Fare Amount Validation
Rule: Total Amount Must Be Between $0 and $5,000
main.py:61
- All trips have minimum charges (base fare + surcharges)
- Zero fares indicate cancelled trips or data errors
- Negative amounts suggest refunds or system errors
- Typical NYC taxi trips cost up to roughly $100
- Even long-distance trips rarely exceed $500
- $5,000+ suggests data entry errors (extra zeros, decimal errors)
- Prevents extreme outliers from skewing revenue analysis
The $5,000 threshold is intentionally conservative to catch obvious errors while allowing for legitimate high-value trips (e.g., airport runs with multiple passengers and tolls).
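A sketch of the fare range check follows. It assumes both bounds are exclusive, since the reasons above treat zero fares and $5,000+ totals as errors; whether main.py:61 includes or excludes exactly $5,000 is not stated here. `has_plausible_total` is a hypothetical name.

```python
MIN_TOTAL = 0.0     # exclusive: zero and negative totals are rejected
MAX_TOTAL = 5000.0  # assumed exclusive: totals of $5,000 or more are rejected


def has_plausible_total(total_amount):
    """True when the total fare lies strictly between the two bounds."""
    return MIN_TOTAL < total_amount < MAX_TOTAL


for amount in (18.5, 0.0, -4.5, 1850.0, 50000.0):
    print(amount, has_plausible_total(amount))
```

Note that a misplaced decimal point such as 1850.0 instead of 18.50 still passes, so this rule only catches gross errors and does not replace outlier analysis on the cleaned data.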
9. Passenger Count Validation
Rule: Passenger Count Must Be Greater Than Zero
main.py:63
- A completed taxi trip must have at least one passenger
- Zero passengers indicates one of:
  - Driver forgot to enter passenger count
  - System default value not updated
  - Cancelled trip still recorded
- Negative values are data errors
- Required for passenger volume and occupancy analysis
The TLC dataset allows passenger counts up to 6+ (the typical taxi capacity). No upper bound is enforced in this cleaning step.
Data Quality Impact
Applying these rules typically removes:
- 5-15% of raw records (varies by month)
- Duplicate and null entries
- System errors and test records
- Fraudulent or anomalous trips
Cleaning Order Matters
Rules are applied in a specific sequence:
- Remove duplicates and nulls first (reduces dataset size)
- Apply temporal filters (date range, time logic)
- Apply computed validations (duration, speed)
- Apply value range checks (distance, amount, passengers)
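To make the ordering concrete, here is a library-free sketch that applies the rules in the documented sequence and counts how many records each step removes. Everything in it is an assumption made for illustration: the short field names, the January 2024 window, the exclusive fare bounds, and the `clean` function itself, which is not the clean_data() implementation in main.py.

```python
from datetime import datetime

# Hypothetical half-open analysis window: January 2024.
START, END = datetime(2024, 1, 1), datetime(2024, 2, 1)


def duration_s(t):
    return (t["dropoff"] - t["pickup"]).total_seconds()


def speed_mph(t):
    return t["distance"] * 3600 / duration_s(t)


# (label, keep-predicate) pairs in the documented order. Later predicates rely
# on earlier ones: speed_mph() divides by a duration already known to be >= 60 s.
STEPS = [
    ("nulls", lambda t: None not in (t["pickup"], t["dropoff"], t["passengers"])),
    ("date range", lambda t: START <= t["pickup"] < END and START <= t["dropoff"] < END),
    ("time logic", lambda t: t["dropoff"] > t["pickup"]),
    ("min duration", lambda t: duration_s(t) >= 60),
    ("max speed", lambda t: speed_mph(t) <= 100),
    ("distance", lambda t: t["distance"] > 0),
    ("amount", lambda t: 0 < t["amount"] < 5000),
    ("passengers", lambda t: t["passengers"] > 0),
]


def clean(trips):
    """Apply all rules in order; return (kept_trips, removed_count_per_step)."""
    seen, kept = set(), []
    for t in trips:  # step 1: drop exact duplicates, keeping the first copy
        key = tuple(sorted(t.items()))
        if key not in seen:
            seen.add(key)
            kept.append(t)
    report = {"duplicates": len(trips) - len(kept)}
    for label, keep in STEPS:
        before = len(kept)
        kept = [t for t in kept if keep(t)]
        report[label] = before - len(kept)
    return kept, report


def trip(**overrides):
    """Build a valid sample trip, then apply overrides to break one rule."""
    base = {
        "pickup": datetime(2024, 1, 5, 8, 0),
        "dropoff": datetime(2024, 1, 5, 8, 20),
        "distance": 5.0,
        "amount": 22.5,
        "passengers": 1,
    }
    base.update(overrides)
    return base


raw = [
    trip(),
    trip(),                                            # exact duplicate
    trip(passengers=None),                             # missing critical field
    trip(pickup=datetime(2023, 12, 31, 23, 50),
         dropoff=datetime(2024, 1, 1, 0, 10)),         # starts before the window
    trip(dropoff=datetime(2024, 1, 5, 7, 50)),         # ends before it starts
    trip(dropoff=datetime(2024, 1, 5, 8, 0, 30)),      # lasts 30 seconds
    trip(distance=50.0,
         dropoff=datetime(2024, 1, 5, 8, 10)),         # 300 mph average
    trip(distance=0.0),
    trip(amount=0.0),
    trip(passengers=0),
]
cleaned, report = clean(raw)
for label, removed in report.items():
    print(f"{label}: removed {removed}")
print(f"kept {len(cleaned)} of {len(raw)}")  # prints: kept 1 of 10
```

Running the null and time checks first is what lets the duration and speed steps do arithmetic safely; reordering the list would require extra guards inside those predicates.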
Next Steps
- Understand the Data Model structure
- Learn about Metrics calculated from cleaned data