
Quickstart

Get your first Yellow Taxi data analysis running in minutes. This guide walks you through processing three months of NYC taxi trip data (January-March 2022) and generating weekly and monthly metrics reports.
This quickstart requires an internet connection to download parquet files directly from the NYC TLC Trip Record Data CDN.

Prerequisites

  • Python 3.9 or higher installed
  • Internet connection (for downloading data files)
  • ~500MB of free disk space

1. Clone and navigate to the project

git clone <your-repository-url>
cd yellow-taxi-analytics

2. Set up virtual environment

Create and activate a Python virtual environment:
python3 -m venv venv
source venv/bin/activate
On Windows, use venv\Scripts\activate instead.
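
To confirm that activation worked, you can ask the interpreter itself. This is a minimal sketch, not part of the project; the helper name `in_virtualenv` is hypothetical. It relies on the standard behavior that `sys.prefix` differs from `sys.base_prefix` inside a venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
```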

3. Install dependencies

Install all required packages from requirements.txt:
pip install -r requirements.txt
This installs pandas, numpy, pyarrow, openpyxl, and other dependencies.
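
If you want to check the installation before running the analysis, a short script along these lines can report which of the packages named above are not importable. It is a sketch only: the helper `missing_packages` is hypothetical, and the list covers just the four packages mentioned here, not everything in requirements.txt.

```python
from importlib.util import find_spec

# Top-level packages named in this guide; requirements.txt may list more.
REQUIRED = ["pandas", "numpy", "pyarrow", "openpyxl"]


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```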

4. Run the analysis

Execute the main script to process January-March 2022 data:
python main.py
You’ll see real-time progress output:
Init objects ...
*** 0.00547895899999995 seconds ***
Importing data ...
*** 7.456109084 seconds ***
Cleaning data ...
*** 5.637706417 seconds ***
Adding more columns ...
*** 30.625991959 seconds ***
Generating week metrics ...
*** 1.227824708 seconds ***
Generating month metrics ...
*** 7.938639625 seconds ***
Formatting results ...
*** 0.0005992090000006556 seconds ***
Exporting results ...
*** 0.09505266699999737 seconds ***
Execution time: 52.987694084 seconds
Processing time may vary based on your internet speed and system performance. Expect 50-90 seconds for 3 months of data.
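
The per-step timings in the output above can be reproduced with a small context manager. This is a sketch of one way to produce that log format, assuming `time.perf_counter`; main.py may measure time differently, and `timed_step` is a hypothetical helper.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_step(label, timings=None):
    """Print the step label, run the wrapped block, then print its duration."""
    print(f"{label} ...")
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if timings is not None:
            timings[label] = elapsed  # keep the value so a total can be reported
        print(f"*** {elapsed} seconds ***")


if __name__ == "__main__":
    timings = {}
    with timed_step("Cleaning data", timings):
        sum(range(1_000_000))  # stand-in for real work
    print(f"Execution time: {sum(timings.values())} seconds")
```

Wrapping each pipeline call in `with timed_step(...)` makes it easy to see which stage dominates when you change the date range.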

5. View the results

Two output files are generated in your project directory:

1. Weekly Metrics CSV (processed_data.csv)

Pipe-delimited file with weekly trip statistics:
head processed_data.csv
Columns include:
  • year_week: Year and ISO week number (e.g., 2022-001)
  • min_trip_time, max_trip_time, mean_trip_time: Trip duration stats (seconds)
  • min_trip_distance, max_trip_distance, mean_trip_distance: Distance stats (miles)
  • min_trip_amount, max_trip_amount, mean_trip_amount: Fare stats (USD)
  • total_services: Number of trips in the week
  • percentage_variation: Week-over-week change in trip volume
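
As an illustration of the pipe-delimited layout and the week-over-week column, the sketch below parses a tiny made-up sample with the standard `csv` module and recomputes a percent change from `total_services`. The sample values are invented, and the formula (change relative to the previous week, in percent) is an assumption; the project may compute or round `percentage_variation` differently.

```python
import csv
import io

# Made-up sample in the pipe-delimited layout described above; real files
# contain more columns and different values.
SAMPLE = "year_week|total_services\n2022-001|100\n2022-002|150\n2022-003|120\n"


def read_weekly(text):
    """Parse pipe-delimited text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


def week_over_week(rows):
    """Percent change in total_services versus the previous row (None for the first)."""
    changes = []
    previous = None
    for row in rows:
        current = int(row["total_services"])
        if previous in (None, 0):
            changes.append(None)  # no baseline to compare against
        else:
            changes.append(round((current - previous) / previous * 100, 2))
        previous = current
    return changes


if __name__ == "__main__":
    print(week_over_week(read_weekly(SAMPLE)))  # [None, 50.0, -20.0]
```

To read the real file, replace the sample with the contents of `open("processed_data.csv", newline="")`.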
2. Monthly Metrics Excel (processed_data.xlsx)

Multi-sheet workbook with three sheets:
  • JFK Sheet: Trips to/from JFK Airport (RateCodeID: 2)
  • Regular Sheet: Standard rate trips (RateCodeID: 1)
  • Others Sheet: All other rate types
Each sheet contains:
  • year_month: Month (e.g., 2022-01)
  • day_type: 1 = Weekday, 2 = Weekend
  • services: Number of trips
  • distances: Total miles traveled
  • passengers: Total passenger count
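
The sheet split and the `day_type` encoding described above can be summarized in a few lines. This is a sketch of the classification rules as this guide states them, not the code from main.py; the function names and the short sheet labels are hypothetical.

```python
from datetime import date


def sheet_for_rate_code(rate_code_id):
    """Map a RateCodeID to the sheet it belongs to, per the list above."""
    if rate_code_id == 2:
        return "JFK"
    if rate_code_id == 1:
        return "Regular"
    return "Others"


def day_type(day):
    """Return 1 for weekdays (Mon-Fri) and 2 for weekends (Sat-Sun)."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return 1 if day.weekday() < 5 else 2


if __name__ == "__main__":
    print(sheet_for_rate_code(2), day_type(date(2022, 1, 1)))  # JFK 2
```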
Open in Excel, LibreOffice, or Google Sheets:
# macOS
open processed_data.xlsx

# Linux
xdg-open processed_data.xlsx

# Windows
start processed_data.xlsx

Understanding the Code

The analysis uses the YellowTaxiData class (main.py:5-148). Here’s the basic usage pattern:
import time
from main import YellowTaxiData

# Initialize with date range
yellow_taxi_data = YellowTaxiData(
    start_date='2022-01-01', 
    end_date='2022-03-31'
)

# Run the complete pipeline
yellow_taxi_data.import_data()           # Download parquet files
yellow_taxi_data.clean_data()            # Validate and filter
yellow_taxi_data.add_more_columns()      # Add date/time columns
yellow_taxi_data.generate_week_metrics() # Weekly aggregations
yellow_taxi_data.generate_month_metrics() # Monthly aggregations
yellow_taxi_data.format_data()           # Round and format
yellow_taxi_data.export_data()           # Save CSV and Excel

Customizing the Date Range

To analyze a different time period, modify the date parameters in main.py:156:
# Example: Analyze full year 2022
yellow_taxi_data = YellowTaxiData(
    start_date='2022-01-01', 
    end_date='2022-12-31'
)
Processing more months increases execution time and memory usage. A full year (12 months) may take 3-5 minutes and require 4-8GB of RAM.

Next Steps

  • Learn more about the installation process and troubleshooting
  • Explore the code in main.py to understand the data transformations
  • Run tests with pytest to validate your setup
  • Customize metrics by modifying the aggregation functions in generate_week_metrics() and generate_month_metrics()

Common Issues

If Python reports a missing module (for example, ModuleNotFoundError), make sure you've activated your virtual environment and installed dependencies:
source venv/bin/activate
pip install -r requirements.txt
The script downloads parquet files from d37ci6vzurychx.cloudfront.net. If downloads fail:
  • Check your internet connection
  • Verify the date range exists in NYC’s data repository
  • Try reducing the date range to fewer months
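
To check which monthly files a date range needs, you can list the expected URLs and open them in a browser. The sketch below assumes one file per calendar month and the path pattern `trip-data/yellow_tripdata_YYYY-MM.parquet` on the host named above; this guide only confirms the host, so verify the pattern against the URLs main.py actually requests. Both helper names are hypothetical.

```python
from datetime import date

# Host taken from this guide; the path and file-name pattern are assumptions.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def months_between(start_date, end_date):
    """List every YYYY-MM month touched by the inclusive ISO date range."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    year, month = start.year, start.month
    months = []
    while (year, month) <= (end.year, end.month):
        months.append(f"{year}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def parquet_urls(start_date, end_date):
    """Build the assumed download URL for each month in the range."""
    return [
        f"{BASE_URL}/yellow_tripdata_{m}.parquet"
        for m in months_between(start_date, end_date)
    ]


if __name__ == "__main__":
    for url in parquet_urls("2022-01-01", "2022-03-31"):
        print(url)
```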
Processing large date ranges requires significant RAM:
  • Start with 1-3 months of data
  • Close other applications to free memory
  • Consider processing data in smaller batches
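
One way to process data in smaller batches is to split a long range into consecutive chunks of a few months and run the pipeline once per chunk. The sketch below only computes the chunk boundaries; `month_batches` is a hypothetical helper and is not part of main.py.

```python
import calendar
from datetime import date


def month_batches(start_date, end_date, months_per_batch=3):
    """Split an inclusive ISO date range into (start, end) pairs, each
    covering at most months_per_batch calendar months."""
    if months_per_batch < 1:
        raise ValueError("months_per_batch must be at least 1")
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    batches = []
    current = start
    while current <= end:
        # Index of the last month in this batch, counting months from year 0.
        last = current.year * 12 + (current.month - 1) + months_per_batch - 1
        last_year, last_month = last // 12, last % 12 + 1
        last_day = calendar.monthrange(last_year, last_month)[1]
        batch_end = min(date(last_year, last_month, last_day), end)
        batches.append((current.isoformat(), batch_end.isoformat()))
        # The next batch starts on the first day of the following month.
        nxt = last + 1
        current = date(nxt // 12, nxt % 12 + 1, 1)
    return batches


if __name__ == "__main__":
    for start, end in month_batches("2022-01-01", "2022-12-31"):
        print(start, end)
```

Each pair can be passed as `start_date` and `end_date` to `YellowTaxiData`. Before looping, check whether `export_data()` writes to fixed file names; if it does, move or rename the outputs between batches so that later runs do not overwrite earlier ones.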
