
Installation

This guide will walk you through setting up your development environment for processing NYC Yellow Taxi trip data.

System Requirements

Python Version

Python 3.9 or higher

Memory

4GB RAM minimum (8GB recommended)

Disk Space

500MB - 2GB free space

Network

Internet connection required
The CI/CD pipeline uses Python 3.9 on Ubuntu 22.04. This is the tested and recommended version.
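The requirements above can also be checked from Python before installing anything. This is a minimal sketch and not part of the project: the function name check_requirements and the 2 GB free-space threshold (the upper end of the range listed above) are assumptions chosen for illustration.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)            # minimum version stated in this guide
MIN_FREE_BYTES = 2 * 1024**3   # assumption: upper end of the 500MB - 2GB range


def check_requirements(path=".", version=None, free_bytes=None):
    """Return a list of human-readable problems; an empty list means OK."""
    # Fall back to the running interpreter and the disk holding `path`.
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free

    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, 3.9 or higher required"
        )
    if free_bytes < MIN_FREE_BYTES:
        problems.append(f"only {free_bytes / 1024**3:.1f} GB free disk space")
    return problems


if __name__ == "__main__":
    issues = check_requirements()
    print("OK" if not issues else "\n".join(issues))
```

The version and free_bytes parameters exist only so the check can be exercised with fixed values; called without arguments, it inspects the current interpreter and working directory.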

Step-by-Step Installation

1

Verify Python installation

Check that you have Python 3.9 or higher installed:
python3 --version
Expected output (any version from 3.9 upward is acceptable):
Python 3.9.x
macOS (using Homebrew):
brew install python@3.9
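Ubuntu/Debian (if your release does not package python3.9, apt will not find these packages and you will need another source for Python 3.9):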
sudo apt update
sudo apt install python3.9 python3.9-venv python3.9-dev
Windows: Download from python.org and install Python 3.9 or higher.
2

Clone the repository

Clone the project to your local machine:
git clone <your-repository-url>
cd yellow-taxi-analytics
Verify you’re in the correct directory:
ls -la
You should see:
main.py
requirements.txt
README.md
tests/
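The same directory check can be scripted. The sketch below is illustrative only: the helper name missing_entries is hypothetical, and the list of expected entries is copied from the listing above.

```python
from pathlib import Path

# Entries this guide says should appear in the project root.
EXPECTED_ENTRIES = ("main.py", "requirements.txt", "README.md", "tests")


def missing_entries(root="."):
    """Return the expected files or directories that are absent under root."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing from current directory:", ", ".join(missing))
    else:
        print("Project root looks correct.")
```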
3

Create virtual environment

Create an isolated Python environment:
python3 -m venv venv
This creates a venv/ directory containing the Python interpreter and package installation space.
Using a virtual environment prevents package conflicts with other Python projects on your system.
4

Activate virtual environment

Activate the virtual environment (the command below is for macOS/Linux; on Windows, run venv\Scripts\activate instead):
source venv/bin/activate
Your terminal prompt should now show a (venv) prefix:
(venv) user@machine:~/yellow-taxi-analytics$
You must activate the virtual environment every time you open a new terminal session before running the analysis.
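If the prompt prefix is hidden or customized, Python itself can report whether a virtual environment is active. The following sketch relies on the standard sys.prefix and sys.base_prefix attributes; the function name in_virtualenv is an assumption for illustration.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run: source venv/bin/activate")
```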
5

Upgrade pip

Ensure you have the latest version of pip:
python -m pip install --upgrade pip
6

Install dependencies

Install all required packages from requirements.txt:
pip install -r requirements.txt
This installs the following key packages:
  • pandas 2.2.3 - Data manipulation and analysis
  • numpy 2.0.2 - Numerical computing
  • pyarrow 18.0.0 - Parquet file reading
  • openpyxl 3.1.5 - Excel file writing
  • pytest 8.3.3 - Testing framework (for development)
The requirements.txt file includes these pinned versions:
pandas==2.2.3
numpy==2.0.2
pyarrow==18.0.0
openpyxl==3.1.5
pytest==8.3.3
python-dateutil==2.9.0.post0
pytz==2024.2
tzdata==2024.2
click==8.1.7
fsspec==2024.10.0
urllib3==2.2.3
certifi==2024.8.30
Plus additional transitive dependencies for compatibility and testing.
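To confirm that the installed versions match the pins, one option is to compare them with importlib.metadata from the standard library. This is a sketch under assumptions: parse_pins and find_mismatches are hypothetical helper names, and the parser handles only simple name==version lines like those shown above.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in requirements_text.splitlines():
        # Drop trailing comments and environment markers before parsing.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} for packages that differ.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    # Two pins from this guide; pass the contents of requirements.txt instead.
    sample = "pandas==2.2.3\nnumpy==2.0.2\n"
    for name, (pinned, installed) in find_mismatches(parse_pins(sample)).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```

An empty result from find_mismatches means every pinned package is installed at exactly the pinned version.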
7

Verify installation

Test that all packages are installed correctly:
python -c "import pandas, numpy, pyarrow, openpyxl; print('All packages imported successfully!')"
Expected output:
All packages imported successfully!
Run the test suite:
pytest
The tests use sample .parquet files included in the repository and do not require an internet connection.
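When the one-line import check fails, it stops at the first missing package. A small variant that lists every package it cannot locate might look like the sketch below; missing_modules is a hypothetical helper, and the module names are the four from the command above.

```python
import importlib.util

# Import names used by the one-line check in this guide.
REQUIRED_MODULES = ("pandas", "numpy", "pyarrow", "openpyxl")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All packages found.")
```

Note that find_spec only locates a module without importing it, so a package with a broken binary build would still pass this check; the import command above remains the stricter test.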

Post-Installation

Quick Test Run

Verify everything works by running a quick analysis:
python main.py
You should see output starting with:
Init objects ...
*** 0.005... seconds ***
Importing data ...

Deactivating the Virtual Environment

When you’re done working, deactivate the virtual environment:
deactivate
The (venv) prefix will disappear from your prompt.

Troubleshooting

This usually means the virtual environment isn’t activated. Run:
source venv/bin/activate  # macOS/Linux
venv\Scripts\activate     # Windows
Then try the pip install command again.
Never use sudo pip install! This can break your system Python. Instead:
  1. Make sure you’re using a virtual environment
  2. Activate it with source venv/bin/activate
  3. Install packages normally with pip install
PyArrow can have compatibility issues. Try:
pip uninstall pyarrow
pip install pyarrow==18.0.0
If issues persist on macOS:
brew install apache-arrow
pip install pyarrow==18.0.0
If you see SSL errors during data import:
pip install --upgrade certifi urllib3
On macOS, if you installed Python with the python.org installer, you may also need to install certificates:
/Applications/Python\ 3.9/Install\ Certificates.command
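When diagnosing SSL errors, it can help to see where Python looks for trusted certificates. The sketch below uses ssl.get_default_verify_paths() from the standard library and certifi.where() if certifi is installed; the function name describe_ca_sources and the dictionary keys are assumptions for illustration.

```python
import ssl


def describe_ca_sources():
    """Collect where Python looks for trusted CA certificates."""
    paths = ssl.get_default_verify_paths()
    info = {
        "openssl_cafile": paths.cafile,  # None if the default file is absent
        "openssl_capath": paths.capath,  # None if the default dir is absent
    }
    try:
        import certifi  # pinned in requirements.txt
        info["certifi_bundle"] = certifi.where()
    except ImportError:
        info["certifi_bundle"] = None
    return info


if __name__ == "__main__":
    for key, value in describe_ca_sources().items():
        print(f"{key}: {value}")
```

If both OpenSSL entries print None, the interpreter has no default CA store, which is consistent with the macOS certificate step above being needed.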
Make sure you’re running pytest from the project root directory:
cd /path/to/yellow-taxi-analytics
pytest
The test files expect sample .parquet files to be in the project directory.
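To see which sample files are visible from the directory you are in, a short search can help. This is an illustrative sketch: find_parquet_files is a hypothetical helper, and searching recursively is an assumption, since the exact location of the sample files inside the project is not stated here.

```python
from pathlib import Path


def find_parquet_files(root="."):
    """Return sorted .parquet paths found anywhere under root."""
    return sorted(Path(root).rglob("*.parquet"))


if __name__ == "__main__":
    files = find_parquet_files()
    if files:
        for path in files:
            print(path)
    else:
        print("No .parquet files found; are you in the project root?")
```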
If Windows PowerShell refuses to run the activation script because of its execution policy, allow locally created scripts for your user account:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Then try activating again:
venv\Scripts\Activate.ps1

Development Setup

Running Tests

The project includes pytest-based tests:
pytest
Run tests with verbose output:
pytest -v

Code Quality

For development, consider installing additional tools:
pip install black flake8 mypy
  • black: Code formatting
  • flake8: Linting
  • mypy: Type checking

Next Steps

Now that you’re set up:

Run Your First Analysis

Follow the quickstart guide to process your first dataset

Understand the Architecture

Learn about the data pipeline and capabilities

System-Specific Notes

macOS

  • Use Homebrew for Python installation
  • You may need Xcode Command Line Tools: xcode-select --install
  • Apple Silicon (M1/M2) users: All dependencies are compatible

Linux

  • Ubuntu 22.04 is the tested platform (same as CI/CD)
  • Install python3.9-dev for building some dependencies
  • Use python3 and pip3 explicitly if multiple Python versions are installed

Windows

  • Use Command Prompt or PowerShell, not Git Bash, to activate the virtual environment
  • Windows Terminal is recommended for a better experience
  • Ensure Python was added to PATH during installation
