Overview
Theexport_data() method generates two output files containing processed metrics:
- processed_data.csv: Weekly metrics in pipe-delimited format
- processed_data.xlsx: Monthly metrics across three Excel sheets
Export Methods
The export process is split into two methods called byexport_data():
main.py
CSV Export: Weekly Metrics
File Format
Filename:processed_data.csvDelimiter: Pipe character (
|)Encoding: UTF-8
Index: Not included
Implementation
main.py
Column Structure
The CSV contains 12 columns:| Column Name | Data Type | Description |
|---|---|---|
| year_week | string | Week identifier (format: YYYY-WWW) |
| min_trip_time | float | Minimum trip duration in seconds |
| max_trip_time | float | Maximum trip duration in seconds |
| mean_trip_time | float | Average trip duration in seconds |
| min_trip_distance | float | Minimum trip distance in miles |
| max_trip_distance | float | Maximum trip distance in miles |
| mean_trip_distance | float | Average trip distance in miles |
| min_trip_amount | float | Minimum fare amount in dollars |
| max_trip_amount | float | Maximum fare amount in dollars |
| mean_trip_amount | float | Average fare amount in dollars |
| total_services | integer | Total number of trips |
| percentage_variation | float | Week-over-week change in services (%) |
Example CSV Content
Reading the CSV
- Python (pandas)
- Excel
- Command Line
Excel Export: Monthly Metrics
File Format
Filename:processed_data.xlsxFormat: Excel 2010+ (.xlsx)
Engine: openpyxl
Sheets: 3 (JFK, Regular, Others)
Implementation
main.py
Sheet Structure
All three sheets share the same column structure:| Column Name | Data Type | Description |
|---|---|---|
| year_month | string | Month identifier (format: YYYY-MM) |
| day_type | integer | 1 = Weekday, 2 = Weekend |
| services | integer | Total number of trips |
| distances | float | Total distance traveled in miles |
| passengers | integer | Total passengers transported |
Sheet Contents
- JFK Sheet
- Regular Sheet
- Others Sheet
Airport flat-fare trips (RatecodeID = 2):Typical patterns:
- Average distance: ~17 miles per trip
- Higher weekend passenger counts (leisure travel)
- Seasonal peaks during holidays
Reading the Excel File
- Python (pandas)
- Excel
- Python (openpyxl)
File Locations
Both files are created in the current working directory where the script is executed:Common Use Cases
Time Series Analysis (CSV)
Best for analyzing trends over time:Rate Code Comparison (Excel)
Best for comparing service patterns across rate codes:Weekday vs Weekend Analysis (Excel)
Best for understanding weekly patterns:Revenue Estimation
Combine data from both exports:Data Validation
Verify export integrity:Best Practices
File Management
- Archive old exports before running new analyses
- Use descriptive filenames with date ranges:
processed_data_2022Q1.csv - Keep raw data separate from processed exports
- Document any manual modifications to exported files
Performance Tips
- CSV files are faster to read/write than Excel for large datasets
- Use chunking for processing very large CSV files:
- For Excel, specify
engine='openpyxl'explicitly for better compatibility
Integration with BI Tools
- Tableau/Power BI: Import CSV directly for automatic refresh capabilities
- Google Sheets: Use Google Drive API to upload Excel files programmatically
- SQL Databases: Load CSV data using bulk import:
Next Steps
- Review Weekly Metrics to understand CSV columns
- Explore Monthly Metrics to understand Excel sheets
- Return to the complete Processing Data workflow