Installation
This guide will walk you through setting up your development environment for processing NYC Yellow Taxi trip data.
System Requirements
Python Version
Python 3.9 or higher
Memory
4GB RAM minimum (8GB recommended)
Disk Space
500MB - 2GB free space
Network
Internet connection required
The CI/CD pipeline uses Python 3.9 on Ubuntu 22.04. This is the tested and recommended version.
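The Python version and disk space requirements above can be checked programmatically. The following is a minimal sketch, not part of the project: the function name check_requirements and the use of the 500MB lower bound as the threshold are assumptions made for illustration.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)               # minimum version from the requirements above
MIN_FREE_BYTES = 500 * 1024**2    # assumed threshold: lower bound of "500MB - 2GB"


def check_requirements(version_info, free_bytes):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]} or higher required"
        )
    if free_bytes < MIN_FREE_BYTES:
        problems.append(
            f"only {free_bytes // 1024**2} MB free, at least 500 MB required"
        )
    return problems


if __name__ == "__main__":
    # Check the interpreter running this script and the current directory's disk.
    free = shutil.disk_usage(".").free
    issues = check_requirements(sys.version_info, free)
    print("OK" if not issues else "\n".join(issues))
```

Memory is not checked here because the standard library has no portable way to query total RAM.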
Step-by-Step Installation
Verify Python installation
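Check that you have Python 3.9 or higher installed, for example by running python3 --version (or python --version on Windows). The output should report version 3.9 or later.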
Python not installed or wrong version?
- macOS: install Python using Homebrew
- Ubuntu/Debian: install Python using the system package manager
- Windows: download from python.org and install Python 3.9 or higher
Clone the repository
Clone the project to your local machine, change into the project directory, and list its contents to verify you’re in the correct place.
Create virtual environment
Create an isolated Python environment, for example with python3 -m venv venv. This creates a venv/ directory containing the Python interpreter and package installation space.
Activate virtual environment
Activate the virtual environment (source venv/bin/activate on macOS and Linux). Your terminal prompt should now show a (venv) prefix.
Install dependencies
Install all required packages from requirements.txt. This installs the following key packages:
- pandas 2.2.3 - Data manipulation and analysis
- numpy 2.0.2 - Numerical computing
- pyarrow 18.0.0 - Parquet file reading
- openpyxl 3.1.5 - Excel file writing
- pytest 8.3.3 - Testing framework (for development)
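To confirm that the installed packages match the versions listed above, the standard library's importlib.metadata can be queried. This is a sketch under assumptions: the helper name find_mismatches is hypothetical, and the EXPECTED table simply copies the pinned versions from the list above.

```python
from importlib import metadata

# Pinned versions copied from the key package list above.
EXPECTED = {
    "pandas": "2.2.3",
    "numpy": "2.0.2",
    "pyarrow": "18.0.0",
    "openpyxl": "3.1.5",
    "pytest": "8.3.3",
}


def find_mismatches(expected, get_version=metadata.version):
    """Map package name -> (expected, found); found is None if not installed."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    bad = find_mismatches(EXPECTED)
    for name, (wanted, found) in bad.items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
    if not bad:
        print("All key packages match the pinned versions.")
```

Run it with the virtual environment activated; otherwise it reports on the system Python's packages instead.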
Full dependency list
The complete requirements.txt includes the key packages listed above, plus additional transitive dependencies for compatibility and testing.
Post-Installation
Quick Test Run
Verify everything works by running a quick analysis.
Deactivating the Virtual Environment
When you’re done working, deactivate the virtual environment by running deactivate. The (venv) prefix will disappear from your prompt.
Troubleshooting
pip install fails with 'command not found'
This usually means the virtual environment isn’t activated. Activate it with source venv/bin/activate, then try the pip install command again.
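If it is unclear whether the environment is active, Python can report it directly. The sketch below assumes the environment was created with the standard venv module, where sys.prefix differs from sys.base_prefix; the function name in_virtualenv is chosen here for illustration.

```python
import shutil
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv.

    Inside a venv, sys.prefix points at the environment while
    sys.base_prefix still points at the base installation.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("pip on PATH:", shutil.which("pip") or "not found")
    # Running pip through the interpreter avoids the PATH lookup entirely.
    print("alternative:", sys.executable, "-m pip install -r requirements.txt")
```

Invoking pip as a module of the current interpreter (python -m pip) is a common workaround when the pip executable itself is not found.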
Permission denied when installing packages
Never use sudo pip install! This can break your system Python. Instead:
- Make sure you’re using a virtual environment
- Activate it with source venv/bin/activate
- Install packages normally with pip install
ImportError: cannot import name 'pyarrow' after installation
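PyArrow can have compatibility issues. Reinstalling it in the activated virtual environment is a reasonable first step; further platform-specific steps may be needed if issues persist on macOS.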
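Before reinstalling anything, it helps to see exactly which import fails and why. The following diagnostic is a sketch, not a fix: the helper try_import is hypothetical, and the module names probed in the main section are assumptions based on the package list in this guide.

```python
import importlib


def try_import(module_name):
    """Return (ok, detail): the module version on success, the error text otherwise."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:
        # Binary incompatibilities can surface as errors other than ImportError,
        # so any exception raised during import is reported.
        return False, f"{type(exc).__name__}: {exc}"
    return True, getattr(module, "__version__", "unknown version")


if __name__ == "__main__":
    for name in ("pyarrow", "pyarrow.parquet", "pandas"):
        ok, detail = try_import(name)
        print(f"{name}: {'OK' if ok else 'FAILED'} ({detail})")
```

The error text usually shows whether the package is missing entirely or installed but failing to load.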
SSL certificate errors when downloading data
If you see SSL errors during data import, the certificate store used by Python is the usual cause. On macOS, you may also need to install certificates for your Python installation.
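To see where Python looks for CA certificates, the ssl module exposes its default verification paths. This sketch only diagnoses and does not repair anything; the function name describe_ca_paths is an illustrative choice.

```python
import ssl


def describe_ca_paths():
    """Return the default CA locations this Python's OpenSSL will consult."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,                  # resolved bundle file, or None
        "capath": paths.capath,                  # resolved cert directory, or None
        "openssl_cafile": paths.openssl_cafile,  # compiled-in default file
        "openssl_capath": paths.openssl_capath,  # compiled-in default directory
    }


if __name__ == "__main__":
    print(ssl.OPENSSL_VERSION)
    for key, value in describe_ca_paths().items():
        print(f"{key}: {value}")
```

If both cafile and capath come back as None, Python has no default certificate bundle to verify against, which is consistent with the errors described above.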
Tests fail with 'file not found' errors
Make sure you’re running pytest from the project root directory. The test files expect sample .parquet files to be in the project directory.
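A quick way to confirm both conditions is to print the working directory and list the .parquet files in it. This sketch assumes the sample files sit directly in the project root rather than in a subdirectory; find_parquet_files is a hypothetical helper.

```python
from pathlib import Path


def find_parquet_files(root="."):
    """Return sorted names of .parquet files directly inside root."""
    return sorted(p.name for p in Path(root).glob("*.parquet") if p.is_file())


if __name__ == "__main__":
    files = find_parquet_files()
    print(f"working directory: {Path.cwd()}")
    print("parquet files:", ", ".join(files) if files else "none found")
```

If the list is empty, either the working directory is wrong or the sample files have not been placed in the project directory.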
Windows: 'cannot be loaded because running scripts is disabled'
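If you see this error when activating the virtual environment in Windows PowerShell, change the execution policy to allow local scripts (for example, Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser), then try activating again.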
Development Setup
Running Tests
The project includes pytest-based tests. Run them with pytest from the project root directory.
Code Quality
For development, consider installing additional tools:
- black: Code formatting
- flake8: Linting
- mypy: Type checking
Next Steps
Now that you’re set up:
- Run Your First Analysis: Follow the quickstart guide to process your first dataset
- Understand the Architecture: Learn about the data pipeline and capabilities
System-Specific Notes
macOS
- Use Homebrew for Python installation
- You may need Xcode Command Line Tools: xcode-select --install
- Apple Silicon (M1/M2) users: All dependencies are compatible
Linux
- Ubuntu 22.04 is the tested platform (same as CI/CD)
- Install python3.9-dev for building some dependencies
- Use python3 and pip3 explicitly if multiple Python versions are installed
Windows
- Use Command Prompt or PowerShell, not Git Bash, to activate the virtual environment
- Windows Terminal is recommended for a better experience
- Ensure Python was added to PATH during installation