NBA Data Preprocessing Pipeline

A deterministic data preprocessing and feature engineering pipeline designed for streaming execution under memory and compute constraints. Process NBA salary data efficiently on edge devices and servers alike.

Streaming & Batch Modes
Adaptive Chunk Sizing
Hardware Telemetry
Reproducible Benchmarks
Drift Detection
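The streaming mode highlighted above can be sketched with pandas chunked reads; this is a minimal illustration, not the pipeline's actual implementation, and the chunk size shown is a placeholder (the real pipeline adapts it to available memory):

```python
import pandas as pd

def process_stream(path, chunk_size=500):
    """Process a CSV in fixed-size chunks so peak memory stays bounded
    by the chunk size rather than the full dataset size."""
    total_rows = 0
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        # Each chunk is an ordinary DataFrame; per-chunk transforms go here.
        total_rows += len(chunk)
    return total_rows
```

Because only one chunk is resident at a time, this trades some throughput for a much lower memory ceiling, which matches the batch-vs-streaming numbers reported in the quickstart below.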

Quick start

Get your first pipeline running in minutes

1. Install dependencies

Install Python dependencies from requirements.txt:
pip install -r requirements.txt
Required packages: numpy, pandas, scikit-learn, matplotlib, psutil, joblib
2. Prepare your dataset

The pipeline expects NBA salary data in CSV format with these required columns:
  • version — Game version (e.g., “NBA2K20”)
  • salary — Player salary
  • b_day — Birth date (MM/DD/YY format)
  • draft_year — Draft year (YYYY format)
  • height — Player height
  • weight — Player weight
Place your dataset at data/nba2k-full.csv or specify a custom path.
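Before running the pipeline, you can check that a file carries the required columns with a quick header-only read (a sketch; the column set mirrors the table above):

```python
import pandas as pd

REQUIRED_COLUMNS = {"version", "salary", "b_day", "draft_year", "height", "weight"}

def validate_schema(path):
    """Read only the header row and fail fast on missing required columns."""
    header = pd.read_csv(path, nrows=0)  # nrows=0 loads column names only
    missing = REQUIRED_COLUMNS - set(header.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return True
```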
3. Run the pipeline

Execute the pipeline with default settings:
cd "NBA Data Preprocessing/task"
python run_pipeline.py --input ../data/nba2k-full.csv
The pipeline will process your data in both batch and streaming modes, generate benchmarks, and write artifacts to the artifacts/ directory. A benchmark summary similar to the following is produced:
{
  "dataset_fingerprint": {
    "sha256": "abc123...",
    "rows": 4550
  },
  "batch": {
    "mode": "batch",
    "rows": 4550,
    "latency_s": 2.34,
    "throughput_rows_s": 1944.2,
    "peak_memory_mb": 128.5
  },
  "streaming": {
    "mode": "streaming",
    "rows": 4550,
    "latency_s": 3.12,
    "throughput_rows_s": 1458.3,
    "peak_memory_mb": 64.2
  }
}
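The summary can also be inspected programmatically. A minimal sketch, assuming the JSON schema shown above (the artifact file name here is a guess; check your artifacts/ directory for the actual name):

```python
import json

def compare_modes(summary):
    """Return streaming's memory saving (fraction) and throughput slowdown
    (ratio) relative to batch mode, from a benchmark summary dict."""
    batch, stream = summary["batch"], summary["streaming"]
    mem_saving = 1 - stream["peak_memory_mb"] / batch["peak_memory_mb"]
    slowdown = batch["throughput_rows_s"] / stream["throughput_rows_s"]
    return mem_saving, slowdown

# Typical usage (path is an assumption):
# with open("artifacts/benchmark.json") as f:
#     mem_saving, slowdown = compare_modes(json.load(f))
```

With the example numbers above, streaming halves peak memory at roughly a 1.3x throughput cost.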

Explore the pipeline

Deep dive into each stage of the data processing pipeline

Architecture

Understand the staged pipeline design and data flow

Execution modes

Learn about batch and streaming processing modes

Resource constraints

Handle memory and compute limits with adaptive sizing

Ingestion

Load data and generate reproducible fingerprints

Feature engineering

Create temporal and rolling features from raw data
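As an illustration of the kind of temporal feature this stage derives, here is a sketch computing age at draft from the quickstart's raw date columns (the derived column name is hypothetical, not the pipeline's actual output):

```python
import pandas as pd

def add_temporal_features(df):
    """Derive a player's age at draft from b_day (MM/DD/YY) and draft_year."""
    df = df.copy()
    df["b_day"] = pd.to_datetime(df["b_day"], format="%m/%d/%y")
    df["draft_year"] = df["draft_year"].astype(int)
    df["age_at_draft"] = df["draft_year"] - df["b_day"].dt.year
    return df
```

Note that two-digit years are ambiguous: pandas maps %y values below 69 to 20xx, which is the desired behavior for birth dates in this dataset.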

Validation

Detect drift, validate schemas, and monitor quality

Performance & deployment

Optimize your pipeline and deploy to constrained environments

Hardware profiling

Profile CPU, memory, and energy consumption with operator-level breakdown

Benchmarking

Run reproducible experiments and statistical significance tests

Edge devices

Deploy to memory-constrained edge devices with spill-to-disk support

Server deployment

Scale up for high-throughput server environments

Ready to process your data?

Follow our quickstart guide to set up the pipeline and process your first dataset. It takes less than five minutes.

View Quickstart Guide