Quickstart
Get your first Yellow Taxi data analysis running in minutes. This guide walks you through processing 3 months of NYC taxi trip data and generating comprehensive metrics reports.
This quickstart requires an internet connection to download parquet files directly from the NYC TLC Trip Record Data CDN.
Prerequisites
- Python 3.9 or higher installed
- Internet connection (for downloading data files)
- ~500MB of free disk space
Install dependencies
Install all required packages from requirements.txt with pip install -r requirements.txt. This installs pandas, numpy, pyarrow, openpyxl, and other dependencies.
Run the analysis
Execute the main script, main.py, to process January-March 2022 data (for example, python main.py). The script prints real-time progress as it runs.
Processing time may vary based on your internet speed and system performance. Expect 50-90 seconds for 3 months of data.
View the results
Two output files are generated in your project directory:
1. Weekly Metrics CSV (processed_data.csv)
Pipe-delimited file with weekly trip statistics:
- year_week: Year and ISO week number (e.g., 2022-001)
- min_trip_time, max_trip_time, mean_trip_time: Trip duration stats (seconds)
- min_trip_distance, max_trip_distance, mean_trip_distance: Distance stats (miles)
- min_trip_amount, max_trip_amount, mean_trip_amount: Fare stats (USD)
- total_services: Number of trips in the week
- percentage_variation: Week-over-week change in trip volume
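As a rough illustration of the weekly file's layout, the sketch below parses a pipe-delimited sample with the standard library and recomputes a week-over-week change. The sample rows are made up, the helper names are hypothetical, and the formula (change relative to the previous week, in percent) is an assumption about how percentage_variation is defined, not a copy of main.py.

```python
import csv
import io

# Made-up rows in the documented pipe-delimited layout; real values will differ.
SAMPLE = """year_week|total_services|percentage_variation
2022-001|1000|
2022-002|1100|10.0
2022-003|990|-10.0
"""


def load_rows(text):
    """Parse pipe-delimited text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


def week_over_week(counts):
    """Percent change versus the previous entry; None where it is undefined."""
    if not counts:
        return []
    changes = [None]  # the first week has no predecessor
    for prev, cur in zip(counts, counts[1:]):
        if prev == 0:
            changes.append(None)
        else:
            changes.append(round((cur - prev) / prev * 100, 2))
    return changes


rows = load_rows(SAMPLE)
counts = [int(row["total_services"]) for row in rows]
changes = week_over_week(counts)
for row, change in zip(rows, changes):
    print(row["year_week"], row["total_services"], change)
```

To inspect the real output, the same DictReader call should work on open("processed_data.csv", newline="") in place of the in-memory sample.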
2. Monthly Metrics Excel (processed_data.xlsx)
Multi-sheet workbook with three sheets:
- JFK Sheet: Trips to/from JFK Airport (RateCodeID: 2)
- Regular Sheet: Standard rate trips (RateCodeID: 1)
- Others Sheet: All other rate types
Columns include:
- year_month: Month (e.g., 2022-01)
- day_type: 1 = Weekday, 2 = Weekend
- services: Number of trips
- distances: Total miles traveled
- passengers: Total passenger count
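The grouping behind the workbook can be pictured with a small standard-library sketch. It assumes that day_type follows the calendar weekday (Monday-Friday = 1, Saturday-Sunday = 2) and that trips are routed to sheets by RateCodeID as listed above. The function names, the short sheet labels, and the sample trips are illustrative only; the actual implementation in main.py may differ.

```python
from collections import defaultdict
from datetime import date


def day_type(d):
    """Return 1 for Monday-Friday and 2 for Saturday-Sunday."""
    return 1 if d.weekday() < 5 else 2


def sheet_name(rate_code_id):
    """Map a RateCodeID to a sheet label (labels are illustrative)."""
    if rate_code_id == 2:
        return "JFK"
    if rate_code_id == 1:
        return "Regular"
    return "Others"


# Made-up trips: (pickup date, RateCodeID, miles, passengers)
trips = [
    (date(2022, 1, 3), 1, 2.5, 1),   # Monday
    (date(2022, 1, 4), 1, 1.5, 2),   # Tuesday
    (date(2022, 1, 8), 2, 17.0, 3),  # Saturday
    (date(2022, 1, 9), 5, 4.0, 1),   # Sunday
]

# Accumulate services, distances, and passengers per (sheet, month, day_type).
totals = defaultdict(lambda: {"services": 0, "distances": 0.0, "passengers": 0})
for pickup, rate_code, miles, passengers in trips:
    year_month = f"{pickup.year}-{pickup.month:02d}"
    key = (sheet_name(rate_code), year_month, day_type(pickup))
    totals[key]["services"] += 1
    totals[key]["distances"] += miles
    totals[key]["passengers"] += passengers

for key in sorted(totals):
    print(key, totals[key])
```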
Understanding the Code
The analysis uses the YellowTaxiData class (main.py:5-148). Here’s the basic usage pattern:
Customizing the Date Range
To analyze a different time period, modify the date parameters in main.py:156:
Next Steps
- Learn more about the installation process and troubleshooting
- Explore the code in main.py to understand the data transformations
- Run tests with pytest to validate your setup
- Customize metrics by modifying the aggregation functions in generate_week_metrics() and generate_month_metrics()
Common Issues
ModuleNotFoundError: No module named 'pandas'
Make sure you’ve activated your virtual environment and installed the dependencies (pip install -r requirements.txt).
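To see which packages the active interpreter cannot find, a quick check along these lines may help. It is a minimal sketch: the package list only covers the four dependencies named in this guide, and the missing_packages helper is hypothetical, not part of the project.

```python
import importlib.util

# Only the packages named in this guide; requirements.txt may list more.
REQUIRED = ["pandas", "numpy", "pyarrow", "openpyxl"]


def missing_packages(names):
    """Return the top-level package names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

If a package shows up as missing even after installation, the script is probably running under a different interpreter than the one the virtual environment provides.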
Connection timeout or download errors
The script downloads parquet files from d37ci6vzurychx.cloudfront.net. If downloads fail:
- Check your internet connection
- Verify the date range exists in NYC’s data repository
- Try reducing the date range to fewer months
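When checking whether a date range exists, it can help to list the monthly files it maps to. The sketch below builds the URLs for a range of months, assuming the yellow_tripdata_YYYY-MM.parquet naming used by the TLC trip record files. The helper names are hypothetical, and how main.py constructs its URLs is not shown here, so treat the pattern as an assumption to verify.

```python
# Host taken from this guide; the path and file-name pattern are assumptions.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def month_range(start, end):
    """Yield (year, month) pairs from start to end, both inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def parquet_urls(start, end):
    """Build one URL per month in the range."""
    return [
        f"{BASE_URL}/yellow_tripdata_{year}-{month:02d}.parquet"
        for year, month in month_range(start, end)
    ]


urls = parquet_urls((2022, 1), (2022, 3))
for url in urls:
    print(url)
```

Opening one of the printed URLs in a browser is a simple way to confirm that the month is published before running the full analysis.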
Memory error or system slowdown
Processing large date ranges requires significant RAM:
- Start with 1-3 months of data
- Close other applications to free memory
- Consider processing data in smaller batches
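Batching works for min, max, and mean metrics because they can be merged from partial results: keeping a running count, sum, minimum, and maximum is enough, so no batch has to stay in memory after it is processed. The RunningStats class below is a hypothetical illustration of that idea with made-up values, not code from main.py.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mergeable summary of a numeric column, updated one batch at a time."""

    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add_batch(self, values):
        """Fold a batch of values into the summary."""
        for value in values:
            self.count += 1
            self.total += value
            self.minimum = min(self.minimum, value)
            self.maximum = max(self.maximum, value)

    @property
    def mean(self):
        """Mean over everything seen so far, or None before any data arrives."""
        return self.total / self.count if self.count else None


# Made-up trip distances arriving in three separate batches.
stats = RunningStats()
for batch in ([2.0, 4.0], [6.0], [1.0, 7.0]):
    stats.add_batch(batch)

print(stats.minimum, stats.maximum, stats.mean)
```

This matters for weekly metrics in particular, since a week that spans two monthly files needs its partial results combined rather than reported twice.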