NBA Data Preprocessing Pipeline

A deterministic data preprocessing and feature engineering pipeline designed for streaming execution under memory and compute constraints. Process NBA salary data efficiently on edge devices and servers alike.

Streaming & Batch Modes
Adaptive Chunk Sizing
Hardware Telemetry
Reproducible Benchmarks
Drift Detection
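The streaming mode highlighted above can be sketched with pandas chunked reads; this is a minimal illustration, not the pipeline's actual implementation, and the chunk size shown is a placeholder (the real pipeline adapts it to available memory):

```python
import pandas as pd

def process_stream(path, chunk_size=500):
    """Process a CSV in fixed-size chunks so peak memory stays bounded
    by the chunk size rather than the full dataset size."""
    total_rows = 0
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        # Each chunk is an ordinary DataFrame; per-chunk transforms go here.
        total_rows += len(chunk)
    return total_rows
```

Because only one chunk is resident at a time, this trades some throughput for a much lower memory ceiling, which matches the batch-vs-streaming numbers reported in the quickstart below.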

Quick start

Get your first pipeline running in minutes

1. Install dependencies

Install Python dependencies from requirements.txt:
pip install -r requirements.txt
Required packages: numpy, pandas, scikit-learn, matplotlib, psutil, joblib
2. Prepare your dataset

The pipeline expects NBA salary data in CSV format with these required columns:
  • version — Game version (e.g., “NBA2K20”)
  • salary — Player salary
  • b_day — Birth date (MM/DD/YY format)
  • draft_year — Draft year (YYYY format)
  • height — Player height
  • weight — Player weight
Place your dataset at data/nba2k-full.csv or specify a custom path.
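Before running the pipeline, you can check that a file carries the required columns with a quick header-only read (a sketch; the column set mirrors the table above):

```python
import pandas as pd

REQUIRED_COLUMNS = {"version", "salary", "b_day", "draft_year", "height", "weight"}

def validate_schema(path):
    """Read only the header row and fail fast on missing required columns."""
    header = pd.read_csv(path, nrows=0)  # nrows=0 loads column names only
    missing = REQUIRED_COLUMNS - set(header.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return True
```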
3. Run the pipeline

Execute the pipeline with default settings:
cd "NBA Data Preprocessing/task"
python run_pipeline.py --input ../data/nba2k-full.csv
The pipeline will process your data in both batch and streaming modes, generate benchmarks, and write artifacts to the artifacts/ directory. A benchmark summary similar to the following is produced:
{
  "dataset_fingerprint": {
    "sha256": "abc123...",
    "rows": 4550
  },
  "batch": {
    "mode": "batch",
    "rows": 4550,
    "latency_s": 2.34,
    "throughput_rows_s": 1944.2,
    "peak_memory_mb": 128.5
  },
  "streaming": {
    "mode": "streaming",
    "rows": 4550,
    "latency_s": 3.12,
    "throughput_rows_s": 1458.3,
    "peak_memory_mb": 64.2
  }
}
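The summary can also be inspected programmatically. A minimal sketch, assuming the JSON schema shown above (the artifact file name here is a guess; check your artifacts/ directory for the actual name):

```python
import json

def compare_modes(summary):
    """Return streaming's memory saving (fraction) and throughput slowdown
    (ratio) relative to batch mode, from a benchmark summary dict."""
    batch, stream = summary["batch"], summary["streaming"]
    mem_saving = 1 - stream["peak_memory_mb"] / batch["peak_memory_mb"]
    slowdown = batch["throughput_rows_s"] / stream["throughput_rows_s"]
    return mem_saving, slowdown

# Typical usage (path is an assumption):
# with open("artifacts/benchmark.json") as f:
#     mem_saving, slowdown = compare_modes(json.load(f))
```

With the example numbers above, streaming halves peak memory at roughly a 1.3x throughput cost.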

Explore the pipeline

Deep dive into each stage of the data processing pipeline

Architecture

Understand the staged pipeline design and data flow

Execution modes

Learn about batch and streaming processing modes

Resource constraints

Handle memory and compute limits with adaptive sizing

Ingestion

Load data and generate reproducible fingerprints

Feature engineering

Create temporal and rolling features from raw data
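As an illustration of the kind of temporal feature this stage derives, here is a sketch computing age at draft from the quickstart's raw date columns (the derived column name is hypothetical, not the pipeline's actual output):

```python
import pandas as pd

def add_temporal_features(df):
    """Derive a player's age at draft from b_day (MM/DD/YY) and draft_year."""
    df = df.copy()
    df["b_day"] = pd.to_datetime(df["b_day"], format="%m/%d/%y")
    df["draft_year"] = df["draft_year"].astype(int)
    df["age_at_draft"] = df["draft_year"] - df["b_day"].dt.year
    return df
```

Note that two-digit years are ambiguous: pandas maps %y values below 69 to 20xx, which is the desired behavior for birth dates in this dataset.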

Validation

Detect drift, validate schemas, and monitor quality

Performance & deployment

Optimize your pipeline and deploy to constrained environments

Hardware profiling

Profile CPU, memory, and energy consumption with operator-level breakdown

Benchmarking

Run reproducible experiments and statistical significance tests

Edge devices

Deploy to memory-constrained edge devices with spill-to-disk support

Server deployment

Scale up for high-throughput server environments

Ready to process your data?

Follow our quickstart guide to set up the pipeline and process your first dataset. It takes less than five minutes.

View Quickstart Guide