Data Model

Overview

The Yellow Taxi NYC Data Analytics project processes trip data from the NYC Taxi and Limousine Commission (TLC). The data is stored in Parquet format and contains detailed information about each taxi trip.

Data Source

Trip data is downloaded from the NYC TLC public dataset:

https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{YYYY-MM}.parquet
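The `{YYYY-MM}` placeholder stands for a four-digit year and a zero-padded month. A minimal sketch of how such URLs could be generated is shown below; the helper name `build_urls` is hypothetical and is not taken from the project code.

```python
BASE_URL = (
    "https://d37ci6vzurychx.cloudfront.net/trip-data/"
    "yellow_tripdata_{year:04d}-{month:02d}.parquet"
)


def build_urls(year, months):
    """Return one download URL per requested month of the given year.

    Hypothetical helper for illustration; not part of the project.
    """
    for month in months:
        if not 1 <= month <= 12:
            raise ValueError(f"month must be in 1..12, got {month}")
    return [BASE_URL.format(year=year, month=month) for month in months]


urls = build_urls(2022, [1, 2, 3])
print(urls[0])
```

A list built this way could serve as the `urls_list` that the loading code iterates over, assuming that attribute holds plain URL strings.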

Parquet Format Benefits

Parquet is a columnar storage format that offers:

Efficient compression (smaller file sizes)
Fast column-based queries
Built-in schema preservation
Better performance with pandas and PyArrow

Data Schema

The application uses a subset of columns from the full TLC dataset. Here are the fields used in the analysis:

Core Fields

tpep_pickup_datetime

datetime

required

The date and time when the taxi meter was engaged. Used as part of the multi-index.

# Example: 2022-01-15 08:30:45

tpep_dropoff_datetime

datetime

required

The date and time when the taxi meter was disengaged. Used as part of the multi-index.

# Example: 2022-01-15 08:52:10

passenger_count

float

required

The number of passengers in the vehicle. This is a driver-entered value.

Must be greater than 0 to pass data cleaning validation.

trip_distance

float

required

The elapsed trip distance in miles reported by the taximeter.

Must be greater than 0 to pass data validation.

RatecodeID

float

required

The final rate code in effect at the end of the trip. Used as part of the multi-index.

Rate Code Types:

1 = Standard rate (Regular)
2 = JFK Airport
Other values = Negotiated fare, Nassau/Westchester, Group ride, etc.

total_amount

float

required

The total amount charged to passengers, including all fees and surcharges.

Valid range: 0 < total_amount ≤ 5,000
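The value ranges stated for passenger_count, trip_distance, and total_amount can be combined into a single boolean mask. The sketch below is an illustration under that assumption, not the project's actual cleaning code; the function name `is_valid_trip` and the sample rows are made up.

```python
import pandas as pd

MAX_TOTAL_AMOUNT = 5_000  # upper bound stated in the total_amount description


def is_valid_trip(df):
    """Return a boolean mask for rows that satisfy the documented ranges.

    Hypothetical helper for illustration; not part of the project.
    """
    return (
        (df["passenger_count"] > 0)
        & (df["trip_distance"] > 0)
        & (df["total_amount"] > 0)
        & (df["total_amount"] <= MAX_TOTAL_AMOUNT)
    )


# Made-up sample rows, for illustration only.
sample = pd.DataFrame(
    {
        "passenger_count": [1.0, 0.0, 2.0, 1.0],
        "trip_distance": [2.5, 1.0, 0.0, 3.2],
        "total_amount": [14.8, 9.5, 7.0, 6000.0],
    }
)
valid = sample[is_valid_trip(sample)]
print(len(valid))  # only the first row passes every check
```

Because comparisons against a missing value evaluate to False in pandas, rows with missing values in any of these columns would also be excluded by this mask.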

Data Structure

The data is loaded and indexed as follows:

main.py:33-36

self.data = self.data[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 
                       'passenger_count', 'trip_distance',
                       'RatecodeID','total_amount']]
self.data.set_index(['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'RatecodeID'],
                    inplace=True, drop=False)
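The effect of `drop=False` can be checked on a small frame: the three columns become index levels and also remain available as regular columns, which is why later filters such as `self.data['RatecodeID']` keep working. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2022-01-15 08:30:45", "2022-01-15 09:00:00"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2022-01-15 08:52:10", "2022-01-15 09:20:30"]
        ),
        "passenger_count": [1.0, 2.0],
        "trip_distance": [3.1, 5.4],
        "RatecodeID": [1.0, 2.0],
        "total_amount": [18.3, 70.0],
    }
)
index_cols = ["tpep_pickup_datetime", "tpep_dropoff_datetime", "RatecodeID"]
df.set_index(index_cols, inplace=True, drop=False)

print(list(df.index.names))        # the three index levels
print("RatecodeID" in df.columns)  # the column is kept alongside the index
```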

Multi-Index Design

The DataFrame uses a composite index consisting of:

tpep_pickup_datetime
tpep_dropoff_datetime
RatecodeID

This indexing strategy enables:

Time-based queries on the pickup and dropoff levels (fastest when the index is sorted)
Lookups by rate code
Duplicate detection on the (pickup, dropoff, rate code) key
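As a rough illustration of these uses, the sketch below flags rows whose composite key repeats an earlier row and then selects one day of pickups from the first index level. The sample rows and variable names are made up; the project may detect duplicates differently.

```python
import pandas as pd

# Made-up rows; the first two share the same composite key.
trips = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2022-01-15 08:30:45", "2022-01-15 08:30:45", "2022-01-16 10:00:00"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2022-01-15 08:52:10", "2022-01-15 08:52:10", "2022-01-16 10:25:00"]
        ),
        "RatecodeID": [1.0, 1.0, 2.0],
        "total_amount": [18.3, 18.3, 70.0],
    }
)
trips = trips.set_index(
    ["tpep_pickup_datetime", "tpep_dropoff_datetime", "RatecodeID"], drop=False
)

# True for rows whose (pickup, dropoff, rate code) key repeats an earlier row.
duplicate_mask = trips.index.duplicated(keep="first")
deduplicated = trips[~duplicate_mask]

# Select the pickups of a single day using the first index level.
pickups = deduplicated.index.get_level_values("tpep_pickup_datetime")
day_start = pd.Timestamp("2022-01-15")
day_end = pd.Timestamp("2022-01-16")
jan_15 = deduplicated[(pickups >= day_start) & (pickups < day_end)]
print(len(deduplicated), len(jan_15))
```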

Rate Code Categories

The application segments trips into three categories based on RatecodeID:

Regular (RatecodeID = 1)

Standard metered trips within NYC. This represents the majority of yellow taxi trips.

main.py:109-110

if rate_code_id_dict[rc_id] in [1, 2]:
    df = self.data[self.data['RatecodeID'] == rate_code_id_dict[rc_id]]

JFK Airport (RatecodeID = 2)

Trips to/from John F. Kennedy International Airport using the flat-rate fare. These trips are analyzed separately due to their unique pricing structure and distance characteristics.

Other (RatecodeID ≠ 1 and ≠ 2)

All other rate codes including:

Negotiated fares
Nassau or Westchester counties
Group rides
Other special arrangements

main.py:111-112

else:
    df = self.data[(self.data['RatecodeID'] != 1) & (self.data['RatecodeID'] != 2)]
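The two branches shown in the excerpts can be combined into one standalone function for experimentation. The sketch below assumes a label-to-code mapping named `RATE_CODE_LABELS`; it stands in for the project's `rate_code_id_dict`, whose contents are not shown here, so its keys and values are hypothetical.

```python
import pandas as pd

# Hypothetical mapping standing in for rate_code_id_dict (contents assumed).
RATE_CODE_LABELS = {"Regular": 1, "JFK Airport": 2, "Other": None}


def select_category(data, label):
    """Return the rows of `data` that belong to the given category label."""
    code = RATE_CODE_LABELS[label]
    if code in (1, 2):
        return data[data["RatecodeID"] == code]
    # Everything that is neither Regular nor JFK Airport.
    return data[(data["RatecodeID"] != 1) & (data["RatecodeID"] != 2)]


# Made-up sample, including one missing rate code.
sample = pd.DataFrame({"RatecodeID": [1.0, 2.0, 5.0, 1.0, float("nan")]})
counts = {label: len(select_category(sample, label)) for label in RATE_CODE_LABELS}
print(counts)
```

Note that a missing RatecodeID compares as not equal to both 1 and 2, so with this style of filter such rows land in the "Other" category unless they are removed beforehand.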

Data Loading Process

The import_data() method handles the data ingestion:

main.py:24-36

def import_data(self):
    dataframes_list = [
        pd.read_parquet(
            path=url,
            engine='pyarrow'
        ) for url in self.urls_list
    ]

    self.data = pd.concat(dataframes_list, ignore_index=True)
    self.data = self.data[['tpep_pickup_datetime', 'tpep_dropoff_datetime', 
                           'passenger_count', 'trip_distance',
                           'RatecodeID','total_amount']]
    self.data.set_index(['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'RatecodeID'],
                        inplace=True, drop=False)

Performance Note

The application uses PyArrow as the Parquet engine for optimal read performance. Multiple monthly files are concatenated into a single DataFrame for analysis.
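The concatenation step can be tried without downloading any files. The sketch below uses two small in-memory frames as stand-ins for monthly files and shows that `ignore_index=True` replaces the per-file row labels with one continuous range.

```python
import pandas as pd

# Made-up stand-ins for two monthly files.
january = pd.DataFrame({"trip_distance": [1.2, 3.4]})
february = pd.DataFrame({"trip_distance": [0.8]})

combined = pd.concat([january, february], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2]
```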

Next Steps

Learn about Data Cleaning rules and validation
Explore Metrics calculation and aggregation
