Testing Framework

This project uses pytest as its testing framework. The test suite lives in test_yellow_taxi_data.py and validates the core data-processing functionality.

Running Tests

To run the test suite, execute the following command from the project root:
pytest
Make sure you have installed all dependencies first:
pip install -r requirements.txt

Test File Structure

The test suite is organized using pytest fixtures and focused test functions:

Fixture Pattern

The taxi_data fixture (test_yellow_taxi_data.py:6-10) provides a reusable test instance:
@pytest.fixture
def taxi_data():
    data_instance = YellowTaxiData(start_date='2022-03-01', end_date='2022-03-31')
    data_instance.data = pd.read_parquet('yellow_tripdata_2022-03.parquet')
    return data_instance
This fixture:
  • Creates a YellowTaxiData instance for March 2022
  • Loads data from a local parquet file (not remote URL)
  • Returns the configured instance for use in tests
  • Automatically runs before each test function that uses it
Local vs Remote Data: Tests use local parquet files to avoid network dependencies and ensure fast, reliable test execution. The production code uses remote URLs from the AWS CloudFront CDN.

Test Functions

test_import_data

Validates that the data import process works correctly:
def test_import_data(taxi_data):
    taxi_data.import_data()
    assert not taxi_data.data.empty
What it tests:
  • Data can be imported successfully
  • The resulting DataFrame is not empty
  • The import method completes without errors

test_clean_data

Verifies that the data cleaning process reduces or maintains the dataset size:
def test_clean_data(taxi_data):
    initial_len = taxi_data.data.shape[0]
    taxi_data.clean_data()
    assert len(taxi_data.data) <= initial_len
What it tests:
  • Cleaning removes invalid records (or keeps the same count)
  • No records are accidentally added during cleaning
  • The cleaning method completes without errors
Cleaning operations include:
  • Removing duplicates
  • Dropping rows with missing critical fields
  • Filtering trips outside the date range
  • Removing trips with invalid durations (<60 seconds)
  • Filtering trips with unrealistic speeds (>100 mph)
  • Removing trips with invalid distances or amounts
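The cleaning steps above could be sketched as a chain of pandas filters. This is a hypothetical illustration, not the project's actual implementation: the column names (tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance, total_amount) assume the standard TLC yellow-taxi schema, and the exact thresholds mirror the list above.

```python
import pandas as pd

def clean_data(df, start='2022-03-01', end='2022-03-31'):
    # Hypothetical sketch of the filters listed above.
    df = df.drop_duplicates()
    df = df.dropna(subset=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
    # Keep only trips whose pickup falls inside the requested date range
    df = df[df['tpep_pickup_datetime'].between(start, end)]
    # Drop trips shorter than 60 seconds
    duration_s = (df['tpep_dropoff_datetime']
                  - df['tpep_pickup_datetime']).dt.total_seconds()
    df = df[duration_s >= 60]
    # Drop trips with unrealistic average speeds (> 100 mph)
    speed_mph = df['trip_distance'] / (duration_s.loc[df.index] / 3600)
    df = df[speed_mph <= 100]
    # Drop trips with non-positive distances or amounts
    return df[(df['trip_distance'] > 0) & (df['total_amount'] > 0)]
```

Applying each filter to a fresh copy of the frame keeps the steps independent, which makes it easy to test the count-reduction property asserted by test_clean_data.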

CI/CD Setup

The project uses GitHub Actions for continuous integration. Tests run automatically on every push to main and on all pull requests.

Workflow Configuration

The workflow is defined in .github/workflows/python-tests.yml:
name: Python Tests

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-22.04

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: 3.9

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Run pytest
      run: pytest
Key features:
  • Runs on Ubuntu 22.04
  • Uses Python 3.9
  • Automatically installs dependencies
  • Executes the full test suite

Writing New Tests

Follow these guidelines when adding new tests to the suite:

1. Use the Existing Fixture

Leverage the taxi_data fixture for consistency:
def test_new_feature(taxi_data):
    # Your test code here
    pass

2. Test One Thing Per Function

Keep tests focused and atomic:
def test_add_more_columns(taxi_data):
    taxi_data.add_more_columns()
    assert 'year_month' in taxi_data.data.columns
    assert 'year_week' in taxi_data.data.columns

3. Use Descriptive Names

Name tests clearly to indicate what they validate:
def test_week_metrics_calculates_percentage_variation(taxi_data):
    # Test implementation
    pass

4. Test Edge Cases

Consider boundary conditions and error scenarios:
def test_clean_data_removes_negative_passenger_counts(taxi_data):
    # Add a row with a negative passenger_count
    bad_row = taxi_data.data.iloc[[0]].copy()
    bad_row['passenger_count'] = -1
    taxi_data.data = pd.concat([taxi_data.data, bad_row], ignore_index=True)
    # Run clean_data() and assert the invalid row was removed
    taxi_data.clean_data()
    assert (taxi_data.data['passenger_count'] >= 0).all()

Testing with Local vs Remote Data

Local Parquet Files (Tests)

Advantages:
  • Fast execution (no network latency)
  • Deterministic results
  • Works offline
  • No external dependencies
data_instance.data = pd.read_parquet('yellow_tripdata_2022-03.parquet')

Remote URLs (Production)

Used in production for accessing the latest data:
url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet'
pd.read_parquet(path=url, engine='pyarrow')
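The CloudFront path embeds the year and month, so production code presumably builds the URL from the requested date. A hypothetical helper (the function name tripdata_url is not from the project) might look like:

```python
from datetime import date

def tripdata_url(d: date) -> str:
    # CloudFront trip-data files are keyed by year-month (YYYY-MM).
    return ('https://d37ci6vzurychx.cloudfront.net/trip-data/'
            f'yellow_tripdata_{d:%Y-%m}.parquet')

print(tripdata_url(date(2022, 3, 1)))
# → https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet
```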
The test parquet files should be committed to the repository and kept in sync with the data structure expected by the production code.
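One way to keep the committed test file small is to downsample the full monthly file deterministically. This is a sketch, not part of the project; the in-memory frame stands in for the real dataset, and the write step is shown only as a comment:

```python
import pandas as pd

# Stand-in for the full monthly dataset; in practice this would come from
# pd.read_parquet on the full yellow_tripdata file.
full = pd.DataFrame({'trip_distance': [float(i) for i in range(1000)]})

# A fixed random_state keeps the sample reproducible, so the committed
# test file stays stable across regenerations.
sample = full.sample(n=100, random_state=42)

# The real workflow would then write the sample out for tests to load:
# sample.to_parquet('yellow_tripdata_2022-03.parquet')
print(len(sample))
```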

Best Practices

  1. Run tests before committing: Always run pytest locally before pushing changes
  2. Keep tests fast: Use small sample datasets in local parquet files
  3. Mock external dependencies: Avoid network calls in unit tests
  4. Test data quality: Validate that cleaning operations work correctly
  5. Add tests for bug fixes: When fixing a bug, add a test to prevent regression
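Practice 3 (mocking external dependencies) can be done with the standard library's unittest.mock. The sketch below is illustrative: import_remote is a hypothetical stand-in for production code that fetches a remote parquet file, and patching pandas.read_parquet keeps the test off the network.

```python
import pandas as pd
from unittest.mock import patch

def import_remote(url):
    # Hypothetical stand-in for production code that reads a remote file.
    return pd.read_parquet(url)

# Replace the network-backed call with a small in-memory frame for the test.
fake = pd.DataFrame({'trip_distance': [1.2, 3.4]})
with patch('pandas.read_parquet', return_value=fake):
    df = import_remote('https://example.com/yellow_tripdata_2022-03.parquet')

assert not df.empty
```

With pytest specifically, the monkeypatch fixture offers the same substitution without an explicit context manager.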

Next Steps

For performance optimization of the code being tested, see Performance.
