Overview
The Yellow Taxi NYC Data Analytics project processes trip data from the NYC Taxi and Limousine Commission (TLC). The data is stored in Parquet format and contains detailed information about each taxi trip.
Data Source
Trip data is downloaded from the NYC TLC public dataset.
Parquet Format Benefits
Parquet is a columnar storage format that offers:
- Efficient compression (smaller file sizes)
- Fast column-based queries
- Built-in schema preservation
- Better performance with pandas and PyArrow
Data Schema
The application uses a subset of columns from the full TLC dataset. The fields used in the analysis are described below.
Core Fields
The date and time when the taxi meter was engaged. Used as part of the multi-index.
The date and time when the taxi meter was disengaged. Used as part of the multi-index.
The number of passengers in the vehicle. This is a driver-entered value.
Must be greater than 0 after data cleaning validation.
The elapsed trip distance in miles reported by the taximeter.
Must be greater than 0 to pass data validation.
The final rate code in effect at the end of the trip. Used as part of the multi-index.
Rate Code Types:
- 1 = Standard rate (Regular)
- 2 = JFK Airport
- Other values = Negotiated fare, Nassau/Westchester, Group ride, etc.
The total amount charged to passengers, including all fees and surcharges.
Valid range: 5,000
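The validation rules listed for these fields can be sketched as a single boolean filter. This is a minimal illustration, not the project's cleaning code: the column names `passenger_count`, `trip_distance`, and `total_amount` are assumed to follow the TLC Parquet files, the helper `drop_invalid_trips` is hypothetical, and treating 5,000 as an upper bound on the total amount (with 0 as the lower bound) is an assumption about the valid range.

```python
import pandas as pd


def drop_invalid_trips(trips: pd.DataFrame, max_total: float) -> pd.DataFrame:
    """Keep only rows that pass the field-level checks (hypothetical helper)."""
    valid = (
        (trips["passenger_count"] > 0)            # driver-entered value must be positive
        & (trips["trip_distance"] > 0)            # taximeter distance must be positive
        & (trips["total_amount"] > 0)             # assumed lower bound
        & (trips["total_amount"] <= max_total)    # assumed upper bound
    )
    # Rows with missing values compare as False and are dropped as well.
    return trips[valid]


sample = pd.DataFrame({
    "passenger_count": [1, 0, 2, 1],
    "trip_distance": [2.5, 1.0, 0.0, 3.1],
    "total_amount": [14.8, 9.5, 7.0, 6200.0],
})

cleaned = drop_invalid_trips(sample, max_total=5000)
print(len(cleaned))  # only the first row passes every check
```

Combining the checks into one mask keeps the filter to a single pass over the DataFrame and makes each rule easy to adjust on its own.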
Data Structure
The data is loaded and indexed as follows:
main.py:33-36
Multi-Index Design
The DataFrame uses a composite index consisting of:
- tpep_pickup_datetime
- tpep_dropoff_datetime
- RatecodeID
This design supports:
- Efficient time-based queries
- Quick lookups by rate code
- Better duplicate detection
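A small, self-contained sketch of how such a composite index might be used with pandas. The sample rows and the `trip_distance` column are made up for illustration, and the project's actual code in main.py may differ.

```python
import pandas as pd

INDEX_COLUMNS = ["tpep_pickup_datetime", "tpep_dropoff_datetime", "RatecodeID"]

# Illustrative rows only; the first two are exact duplicates on the index.
raw = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-01 08:00", "2023-01-01 08:00",
        "2023-01-01 09:30", "2023-01-02 10:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-01-01 08:20", "2023-01-01 08:20",
        "2023-01-01 10:15", "2023-01-02 10:12",
    ]),
    "RatecodeID": [1, 1, 2, 1],
    "trip_distance": [3.2, 3.2, 17.5, 1.9],
})

trips = raw.set_index(INDEX_COLUMNS).sort_index()

# Duplicate detection: rows sharing pickup time, dropoff time, and rate code.
duplicate_mask = trips.index.duplicated(keep="first")
deduplicated = trips[~duplicate_mask]

# Lookup by rate code: select one level of the index.
jfk_trips = deduplicated.xs(2, level="RatecodeID")

# Time-based query: filter on the pickup level.
pickups = deduplicated.index.get_level_values("tpep_pickup_datetime")
first_day = deduplicated[pickups < pd.Timestamp("2023-01-02")]

print(int(duplicate_mask.sum()), len(jfk_trips), len(first_day))
```

Sorting the index after `set_index` is optional here, but pandas generally performs level-based selection more efficiently on a sorted MultiIndex.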
Rate Code Categories
The application segments trips into three categories based on RatecodeID:
Regular (RatecodeID = 1)
Standard metered trips within NYC. This represents the majority of yellow taxi trips.
main.py:109-110
JFK Airport (RatecodeID = 2)
Trips to/from John F. Kennedy International Airport using the flat-rate fare.
These trips are analyzed separately due to their unique pricing structure and distance characteristics.
Other (RatecodeID ≠ 1 or 2)
All other rate codes including:
- Negotiated fares
- Nassau or Westchester counties
- Group rides
- Other special arrangements
main.py:111-112
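The three categories above can be sketched as boolean masks over the RatecodeID index level. This is an assumption-labelled illustration: `split_by_rate_code` is a hypothetical helper, the sample data is invented, and the referenced lines of main.py may implement the split differently.

```python
import pandas as pd


def split_by_rate_code(trips: pd.DataFrame) -> dict:
    """Split an indexed trips DataFrame into regular, JFK, and other segments."""
    rate_codes = trips.index.get_level_values("RatecodeID")
    return {
        "regular": trips[rate_codes == 1],
        "jfk": trips[rate_codes == 2],
        # Everything else, including missing rate codes, lands in "other".
        "other": trips[~rate_codes.isin([1, 2])],
    }


# Invented sample rows for illustration.
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-01 08:00", "2023-01-01 09:00",
        "2023-01-01 10:00", "2023-01-01 11:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-01-01 08:15", "2023-01-01 09:50",
        "2023-01-01 10:40", "2023-01-01 11:10",
    ]),
    "RatecodeID": [1, 2, 5, 1],
    "total_amount": [12.0, 70.0, 45.0, 9.5],
}).set_index(["tpep_pickup_datetime", "tpep_dropoff_datetime", "RatecodeID"])

segments = split_by_rate_code(sample)
print({name: len(frame) for name, frame in segments.items()})
```

Because the three masks are mutually exclusive and cover every row, the segment sizes always add up to the size of the input.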
Data Loading Process
The import_data() method handles the data ingestion:
main.py:24-36
Performance Note
The application uses PyArrow as the Parquet engine for optimal read performance. Multiple monthly files are concatenated into a single DataFrame for analysis.
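The loading flow described here (read each monthly file with the PyArrow engine, concatenate, then set the composite index) might look roughly like the sketch below. The function names `load_trips` and `combine_months` are hypothetical and are not the project's import_data() method. The optional `columns` argument is an added assumption that shows how Parquet's columnar layout allows reading only the fields needed.

```python
from typing import Iterable, Optional, Sequence

import pandas as pd

INDEX_COLUMNS = ["tpep_pickup_datetime", "tpep_dropoff_datetime", "RatecodeID"]


def combine_months(frames: Iterable[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate monthly DataFrames and apply the composite index."""
    combined = pd.concat(list(frames), ignore_index=True)
    return combined.set_index(INDEX_COLUMNS)


def load_trips(paths: Sequence[str],
               columns: Optional[Sequence[str]] = None) -> pd.DataFrame:
    """Read monthly Parquet files with PyArrow and combine them (hypothetical helper)."""
    selected = list(columns) if columns is not None else None
    frames = [
        # columns=None reads every column; a list restricts the read to those fields.
        pd.read_parquet(path, engine="pyarrow", columns=selected)
        for path in paths
    ]
    return combine_months(frames)


# In-memory stand-ins for two monthly files, to show the combining step.
january = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-05 08:00", "2023-01-06 09:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-01-05 08:30", "2023-01-06 09:20"]),
    "RatecodeID": [1, 2],
    "trip_distance": [4.0, 18.2],
})
february = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-02-01 12:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-02-01 12:10"]),
    "RatecodeID": [1],
    "trip_distance": [1.1],
})

trips = combine_months([january, february])
print(len(trips), list(trips.index.names))
```

Calling `load_trips` with a list of local monthly Parquet paths would return the same shape of DataFrame as `combine_months` produces here. Separating the read step from the combine step keeps the indexing logic testable without touching the file system.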
Next Steps
- Learn about Data Cleaning rules and validation
- Explore Metrics calculation and aggregation