Testing Framework
This project uses pytest as its testing framework. The test suite lives in test_yellow_taxi_data.py and validates the core data processing functionality.
Running Tests
To run the test suite, run pytest from the project root.
Test File Structure
The test suite is organized using pytest fixtures and focused test functions.
Fixture Pattern
The taxi_data fixture (test_yellow_taxi_data.py:6-10) provides a reusable test instance:
- Creates a YellowTaxiData instance for March 2022
- Loads data from a local parquet file (not a remote URL)
- Returns the configured instance for use in tests
- Automatically runs before each test function that uses it
Local vs Remote Data: Tests use local parquet files to avoid network dependencies and ensure fast, reliable test execution. The production code uses remote URLs from the AWS CloudFront CDN.
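The fixture described above might look roughly like the following. This is a sketch rather than the project's code: the stand-in YellowTaxiData class, its constructor arguments, the build_taxi_data helper, and the parquet file name are all assumptions made so the example runs on its own.

```python
import pytest


class YellowTaxiData:
    """Hypothetical stand-in; the real class is defined in the project."""

    def __init__(self, year, month, source):
        self.year = year
        self.month = month
        self.source = source  # where the trip data is read from


def build_taxi_data():
    # Assumed file name; tests read a local parquet file, not a remote URL.
    return YellowTaxiData(year=2022, month=3, source="yellow_tripdata_2022-03.parquet")


@pytest.fixture
def taxi_data():
    # Function scope is pytest's default, so every test that requests
    # taxi_data receives a freshly built instance.
    return build_taxi_data()
```

Keeping the construction in a plain helper is a design choice of this sketch: pytest does not allow fixture functions to be called directly, so the helper makes the same setup reusable outside a test run.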
Test Functions
test_import_data
Validates that the data import process works correctly:
- Data can be imported successfully
- The resulting DataFrame is not empty
- The import method completes without errors
test_clean_data
Verifies that the data cleaning process reduces or maintains the dataset size:
- Cleaning removes invalid records (or keeps the same count)
- No records are accidentally added during cleaning
- The cleaning method completes without errors
The cleaning process includes:
- Removing duplicates
- Dropping rows with missing critical fields
- Filtering trips outside the date range
- Removing trips with invalid durations (<60 seconds)
- Filtering trips with unrealistic speeds (>100 mph)
- Removing trips with invalid distances or amounts
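The duration and speed rules in the list above can be sketched in plain Python. The field names (pickup, dropoff, distance_miles) and the is_plausible helper are illustrative assumptions; the project's own cleaning code works on a DataFrame and may be structured differently.

```python
from datetime import datetime

MIN_DURATION_S = 60   # trips shorter than this are dropped
MAX_SPEED_MPH = 100   # trips faster than this are dropped


def is_plausible(trip):
    """Return True if a trip passes the duration and speed checks."""
    duration_s = (trip["dropoff"] - trip["pickup"]).total_seconds()
    if duration_s < MIN_DURATION_S:
        return False
    speed_mph = trip["distance_miles"] / (duration_s / 3600)
    return speed_mph <= MAX_SPEED_MPH


trips = [
    # 10 minutes, 2 miles: 12 mph, kept
    {"pickup": datetime(2022, 3, 1, 8, 0), "dropoff": datetime(2022, 3, 1, 8, 10), "distance_miles": 2.0},
    # 30 seconds: too short, dropped
    {"pickup": datetime(2022, 3, 1, 8, 0), "dropoff": datetime(2022, 3, 1, 8, 0, 30), "distance_miles": 0.1},
    # 5 minutes, 20 miles: 240 mph, dropped
    {"pickup": datetime(2022, 3, 1, 9, 0), "dropoff": datetime(2022, 3, 1, 9, 5), "distance_miles": 20.0},
]

cleaned = [trip for trip in trips if is_plausible(trip)]

# The property test_clean_data checks: cleaning never adds records.
assert len(cleaned) <= len(trips)
```

Checking the duration first also guards the speed calculation against division by zero for zero-length trips.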
CI/CD Setup
The project uses GitHub Actions for continuous integration. Tests run automatically on every push to main and on all pull requests.
Workflow Configuration
The workflow is defined in .github/workflows/python-tests.yml:
- Runs on Ubuntu 22.04
- Uses Python 3.9
- Automatically installs dependencies
- Executes the full test suite
Writing New Tests
1. Use the Existing Fixture
Leverage the taxi_data fixture for consistency with the existing tests.
2. Test One Thing Per Function
Keep tests focused and atomic.
3. Use Descriptive Names
Name tests clearly to indicate what they validate.
4. Test Edge Cases
Consider boundary conditions and error scenarios, for example trips that sit exactly at the 60-second duration or 100 mph speed thresholds.
Testing with Local vs Remote Data
Local Parquet Files (Tests)
Advantages:
- Fast execution (no network latency)
- Deterministic results
- Works offline
- No external dependencies
Remote URLs (Production)
Remote URLs are used in production to access the latest data.
The test parquet files should be committed to the repository and kept in sync with the data structure expected by the production code.
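To make the two data sources concrete, the sketch below builds a remote URL and a local path for the same month. The host is the public CloudFront address that serves NYC TLC trip records, and it is an assumption that the project uses it. The helper names and the tests/data directory are hypothetical.

```python
from pathlib import Path

# Public CloudFront host for NYC TLC trip data; assumed to be the one the project uses.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def file_name(year, month):
    # Monthly files follow a yellow_tripdata_YYYY-MM.parquet naming pattern.
    return f"yellow_tripdata_{year:04d}-{month:02d}.parquet"


def remote_url(year, month):
    """URL a production run would download from."""
    return f"{BASE_URL}/{file_name(year, month)}"


def local_path(year, month, data_dir="tests/data"):
    """Path a test would read from; data_dir is a hypothetical location."""
    return Path(data_dir) / file_name(year, month)
```

Sharing one file-name helper keeps the local test file and the remote file aligned on the same month.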
Best Practices
- Run tests before committing: Always run pytest locally before pushing changes
- Keep tests fast: Use small sample datasets in local parquet files
- Mock external dependencies: Avoid network calls in unit tests
- Test data quality: Validate that cleaning operations work correctly
- Add tests for bug fixes: When fixing a bug, add a test to prevent regression
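As one way to follow the advice on mocking external dependencies, the sketch below replaces a network call with a mock from the standard library. The download_bytes helper and the example URL are hypothetical; the project's download code may use a different library, in which case the patch target would change accordingly.

```python
import urllib.request
from unittest import mock


def download_bytes(url):
    """Hypothetical helper that fetches a remote file."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def test_download_bytes_without_network():
    # MagicMock supports the context-manager protocol, so it can stand in
    # for the response object returned by urlopen.
    fake_response = mock.MagicMock()
    fake_response.__enter__.return_value.read.return_value = b"parquet-bytes"

    url = "https://example.invalid/data.parquet"
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_open:
        assert download_bytes(url) == b"parquet-bytes"

    # The helper asked for exactly the URL it was given, and nothing else.
    fake_open.assert_called_once_with(url)
```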