Quickstart

Get your first Yellow Taxi data analysis running in minutes. This guide walks you through processing three months of NYC taxi trip data (January-March 2022) and generating weekly and monthly metrics reports.

This quickstart requires an internet connection to download parquet files directly from the NYC TLC Trip Record Data CDN.

Prerequisites

Python 3.9 or higher installed
Internet connection (for downloading data files)
~500MB of free disk space
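
If you want to confirm the first and last prerequisites before cloning, a minimal sketch along these lines may help. The helper name check_prerequisites is hypothetical and not part of the project; the thresholds are taken from the list above.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)              # from the prerequisites list
MIN_FREE_BYTES = 500 * 1024**2   # ~500MB, from the prerequisites list


def check_prerequisites(path="."):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found "
            f"{sys.version_info.major}.{sys.version_info.minor}"
        )
    free = shutil.disk_usage(path).free
    if free < MIN_FREE_BYTES:
        problems.append(f"Only {free // 1024**2}MB free, ~500MB recommended")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

The script prints nothing when both checks pass.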

Clone and navigate to the project

git clone <your-repository-url>
cd yellow-taxi-analytics

Set up virtual environment

Create and activate a Python virtual environment:

python3 -m venv venv
source venv/bin/activate

On Windows, use venv\Scripts\activate instead.

Install dependencies

Install all required packages from requirements.txt:

pip install -r requirements.txt

This installs pandas, numpy, pyarrow, openpyxl, and other dependencies.

Run the analysis

Execute the main script to process January-March 2022 data:

python main.py

You’ll see real-time progress output:

Init objects ...
*** 0.00547895899999995 seconds ***
Importing data ...
*** 7.456109084 seconds ***
Cleaning data ...
*** 5.637706417 seconds ***
Adding more columns ...
*** 30.625991959 seconds ***
Generating week metrics ...
*** 1.227824708 seconds ***
Generating month metrics ...
*** 7.938639625 seconds ***
Formatting results ...
*** 0.0005992090000006556 seconds ***
Exporting results ...
*** 0.09505266699999737 seconds ***
Execution time: 52.987694084 seconds

Processing time may vary based on your internet speed and system performance. Expect 50-90 seconds for 3 months of data.

View the results

Two output files are generated in your project directory:

1. Weekly Metrics CSV (processed_data.csv)

Pipe-delimited file with weekly trip statistics:

head processed_data.csv

Columns include:

year_week: Year and ISO week number (e.g., 2022-001)
min_trip_time, max_trip_time, mean_trip_time: Trip duration stats (seconds)
min_trip_distance, max_trip_distance, mean_trip_distance: Distance stats (miles)
min_trip_amount, max_trip_amount, mean_trip_amount: Fare stats (USD)
total_services: Number of trips in the week
percentage_variation: Week-over-week change in trip volume
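
Because the file is pipe-delimited rather than comma-delimited, it needs an explicit delimiter when read programmatically. The sketch below uses only the standard library and an in-memory sample (illustrative values, not real output). The week-over-week formula shown is an assumption about how percentage_variation is derived; check generate_week_metrics() in main.py for the actual calculation.

```python
import csv
import io


def read_weekly_metrics(text_stream):
    """Parse pipe-delimited weekly metrics into a list of dicts."""
    return list(csv.DictReader(text_stream, delimiter="|"))


def week_over_week_change(previous, current):
    """Percentage change in trip volume (assumed formula)."""
    if previous == 0:
        return None  # undefined when the previous week has no trips
    return (current - previous) / previous * 100


# Illustrative sample with two of the columns listed above
sample = io.StringIO(
    "year_week|total_services\n"
    "2022-001|1000\n"
    "2022-002|1250\n"
)
rows = read_weekly_metrics(sample)
change = week_over_week_change(
    int(rows[0]["total_services"]), int(rows[1]["total_services"])
)
print(change)  # 25.0
```

To read the real file, pass an open file handle instead of the sample, for example open("processed_data.csv", newline="").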

2. Monthly Metrics Excel (processed_data.xlsx)

Multi-sheet workbook with three sheets:

JFK Sheet: Trips to/from JFK Airport (RateCodeID: 2)
Regular Sheet: Standard rate trips (RateCodeID: 1)
Others Sheet: All other rate types

Each sheet contains:

year_month: Month (e.g., 2022-01)
day_type: 1 = Weekday, 2 = Weekend
services: Number of trips
distances: Total miles traveled
passengers: Total passenger count
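
As a rough illustration of the year_month and day_type encodings, the sketch below derives both from a date. It assumes that "weekend" means Saturday and Sunday, which this guide does not state explicitly; the function names are hypothetical and not taken from main.py.

```python
from datetime import date


def day_type(d):
    """Return 1 for Monday-Friday and 2 for Saturday-Sunday (assumed)."""
    return 2 if d.weekday() >= 5 else 1


def year_month(d):
    """Format a date as YYYY-MM, matching the year_month example above."""
    return d.strftime("%Y-%m")


monday = date(2022, 1, 3)
saturday = date(2022, 1, 8)
print(year_month(monday), day_type(monday))      # 2022-01 1
print(year_month(saturday), day_type(saturday))  # 2022-01 2
```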

Open in Excel, LibreOffice, or Google Sheets:

# macOS
open processed_data.xlsx

# Linux
xdg-open processed_data.xlsx

# Windows
start processed_data.xlsx

Understanding the Code

The analysis uses the YellowTaxiData class (main.py:5-148). Here’s the basic usage pattern:

import time
from main import YellowTaxiData

# Initialize with date range
yellow_taxi_data = YellowTaxiData(
    start_date='2022-01-01', 
    end_date='2022-03-31'
)

# Run the complete pipeline
yellow_taxi_data.import_data()           # Download parquet files
yellow_taxi_data.clean_data()            # Validate and filter
yellow_taxi_data.add_more_columns()      # Add date/time columns
yellow_taxi_data.generate_week_metrics() # Weekly aggregations
yellow_taxi_data.generate_month_metrics() # Monthly aggregations
yellow_taxi_data.format_data()           # Round and format
yellow_taxi_data.export_data()           # Save CSV and Excel
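
The per-step timings in the progress output could be produced by a small wrapper like the one below. This is a sketch only: run_step is a hypothetical helper, and main.py may time its steps differently. A stand-in step is used so the example runs without downloading data.

```python
import time


def run_step(label, step):
    """Print a label, run the step, then print the elapsed seconds."""
    print(f"{label} ...")
    started = time.perf_counter()
    result = step()
    elapsed = time.perf_counter() - started
    print(f"*** {elapsed} seconds ***")
    return result, elapsed


# Stand-in for a pipeline method such as yellow_taxi_data.import_data
total, elapsed = run_step("Importing data", lambda: sum(range(1000)))
```

With the real class, each pipeline method can be passed in place of the lambda, for example run_step("Cleaning data", yellow_taxi_data.clean_data).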

Customizing the Date Range

To analyze a different time period, modify the date parameters in main.py:156:

# Example: Analyze full year 2022
yellow_taxi_data = YellowTaxiData(
    start_date='2022-01-01', 
    end_date='2022-12-31'
)

Processing more months increases execution time and memory usage. A full year (12 months) may take 3-5 minutes and require 4-8GB of RAM.

Next Steps

Learn more about the installation process and troubleshooting
Explore the code in main.py to understand the data transformations
Run tests with pytest to validate your setup
Customize metrics by modifying the aggregation functions in generate_week_metrics() and generate_month_metrics()

Common Issues

ModuleNotFoundError: No module named 'pandas'

Make sure you’ve activated your virtual environment and installed dependencies:

source venv/bin/activate
pip install -r requirements.txt
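
To see which packages are missing from the active environment, a quick check like this can help. The package list comes from the install step in this guide; requirements.txt may contain others.

```python
import importlib.util

# Packages named in the install step; requirements.txt may list more
REQUIRED = ["pandas", "numpy", "pyarrow", "openpyxl"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print(missing_packages(REQUIRED) or "all present")
```

If this prints package names while your virtual environment appears active, pip may have installed into a different Python interpreter.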

Connection timeout or download errors

The script downloads parquet files from d37ci6vzurychx.cloudfront.net. If downloads fail:

Check your internet connection
Verify the date range exists in NYC’s data repository
Try reducing the date range to fewer months
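
To check whether a date range exists, it helps to know which files it maps to. The sketch below lists one URL per month in the range. The /trip-data/yellow_tripdata_YYYY-MM.parquet path is an assumption based on the public TLC file naming; this guide only gives the hostname, so confirm the exact URL in import_data() in main.py.

```python
from datetime import date

# Assumed base path; only the hostname appears in this guide
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def months_between(start, end):
    """Yield (year, month) pairs from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def expected_urls(start_date, end_date):
    """Return one assumed parquet URL per month in the date range."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    return [
        f"{BASE_URL}/yellow_tripdata_{y}-{m:02d}.parquet"
        for y, m in months_between(start, end)
    ]


for url in expected_urls("2022-01-01", "2022-03-31"):
    print(url)
```

Opening one of the printed URLs in a browser is a quick way to tell a missing month apart from a local network problem.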

Memory error or system slowdown

Processing large date ranges requires significant RAM:

Start with 1-3 months of data
Close other applications to free memory
Consider processing data in smaller batches
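
One way to process data in smaller batches is to split a long date range into chunks of a few months and run the pipeline once per chunk. The helper below is a hypothetical sketch of the splitting step only; how to combine the per-batch outputs is not covered here.

```python
import calendar
from datetime import date


def month_batches(start_date, end_date, months_per_batch=3):
    """Split a date range into consecutive (start, end) ISO date pairs."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    batches = []
    batch_start = start
    while batch_start <= end:
        # Zero-based month index of the last month in this batch
        last_index = (
            batch_start.year * 12 + batch_start.month - 1 + months_per_batch - 1
        )
        last_year, last_month = divmod(last_index, 12)
        last_month += 1
        last_day = calendar.monthrange(last_year, last_month)[1]
        batch_end = min(date(last_year, last_month, last_day), end)
        batches.append((batch_start.isoformat(), batch_end.isoformat()))
        # The next batch starts on the first day of the following month
        next_year, next_month = divmod(last_index + 1, 12)
        batch_start = date(next_year, next_month + 1, 1)
    return batches


for batch_start, batch_end in month_batches("2022-01-01", "2022-12-31"):
    print(batch_start, batch_end)
```

Each pair can then be passed as start_date and end_date to YellowTaxiData, as shown in Customizing the Date Range.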
