NBA Data Preprocessing Pipeline
A deterministic data preprocessing and feature engineering pipeline designed for streaming execution under memory and compute constraints. Process NBA salary data efficiently on edge devices and servers alike.
Streaming & Batch Modes
Adaptive Chunk Sizing
Hardware Telemetry
Reproducible Benchmarks
Drift Detection
Quick start
Get your first pipeline running in minutes
Install dependencies
Install Python dependencies from requirements.txt (pip install -r requirements.txt).
Required packages: numpy, pandas, scikit-learn, matplotlib, psutil, joblib
Prepare your dataset
The pipeline expects NBA salary data in CSV format with these required columns:
version — Game version (e.g., "NBA2K20")
salary — Player salary
b_day — Birth date (MM/DD/YY format)
draft_year — Draft year (YYYY format)
height — Player height
weight — Player weight
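As a minimal sketch (pandas assumed; the helper name and constant are hypothetical, not part of the pipeline's API), the required columns can be checked before processing begins:

```python
import pandas as pd

REQUIRED_COLUMNS = ["version", "salary", "b_day", "draft_year", "height", "weight"]

def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return the list of required columns missing from the frame."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]

# Example: a frame missing the 'weight' column fails validation.
sample = pd.DataFrame({
    "version": ["NBA2K20"],
    "salary": ["$37436858"],
    "b_day": ["12/30/84"],
    "draft_year": ["2003"],
    "height": ["6-9 / 2.06"],
})
missing = validate_schema(sample)
```

Failing fast on schema problems keeps errors near ingestion instead of deep inside feature engineering.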
Place the file at data/nba2k-full.csv or specify a custom path.
Explore the pipeline
Deep dive into each stage of the data processing pipeline
Architecture
Understand the staged pipeline design and data flow
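The staged design can be sketched as a sequence of named transforms applied in order, so data flows one way through the pipeline (a simplified illustration; the stage names and runner are hypothetical, not the project's actual classes):

```python
from typing import Callable

import pandas as pd

Stage = Callable[[pd.DataFrame], pd.DataFrame]

def run_pipeline(df: pd.DataFrame, stages: list[tuple[str, Stage]]) -> pd.DataFrame:
    """Apply each named stage in order; each stage takes and returns a DataFrame."""
    for name, stage in stages:
        df = stage(df)
    return df

# Two toy stages: clean salary strings, then derive a numeric column.
stages = [
    ("clean", lambda df: df.assign(salary=df["salary"].str.lstrip("$").astype(float))),
    ("derive", lambda df: df.assign(salary_m=df["salary"] / 1e6)),
]
out = run_pipeline(pd.DataFrame({"salary": ["$1000000", "$2500000"]}), stages)
```

Keeping stages as plain DataFrame-to-DataFrame functions makes each one testable in isolation and easy to reorder or drop.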
Execution modes
Learn about batch and streaming processing modes
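The difference between the two modes can be illustrated with pandas (the in-memory CSV here stands in for a real file): batch loads everything at once, while streaming iterates over fixed-size chunks so peak memory stays bounded.

```python
import io

import pandas as pd

csv_data = "salary\n1\n2\n3\n4\n5\n"

# Batch mode: load the full dataset at once.
batch_total = pd.read_csv(io.StringIO(csv_data))["salary"].sum()

# Streaming mode: process fixed-size chunks; only one chunk is in memory at a time.
stream_total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    stream_total += chunk["salary"].sum()
```

For aggregations like this the two modes produce identical results; streaming simply trades a little overhead for a bounded memory footprint.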
Resource constraints
Handle memory and compute limits with adaptive sizing
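One way adaptive sizing can work (a sketch built on psutil, which is in the dependency list; the function, budget fraction, and bounds are illustrative assumptions) is to size each chunk against the memory currently free:

```python
import psutil

def adaptive_chunk_size(row_bytes: int, budget_fraction: float = 0.1,
                        min_rows: int = 1_000, max_rows: int = 1_000_000) -> int:
    """Size chunks so one chunk uses at most a fraction of currently free memory,
    clamped to sane lower and upper bounds."""
    available = psutil.virtual_memory().available
    rows = int(available * budget_fraction) // row_bytes
    return max(min_rows, min(rows, max_rows))

# Assuming roughly 512 bytes per row of parsed data.
chunk_rows = adaptive_chunk_size(row_bytes=512)
```

Because the size is recomputed from live telemetry, the same code shrinks its chunks on a constrained edge device and grows them on a roomy server.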
Ingestion
Load data and generate reproducible fingerprints
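A reproducible fingerprint can be as simple as a truncated content hash of the raw input bytes (a sketch; the helper and the 12-character truncation are illustrative choices, not the pipeline's documented format):

```python
import hashlib

def fingerprint(data: bytes, length: int = 12) -> str:
    """Deterministic short fingerprint of raw input bytes, for logging
    exactly which data a run ingested."""
    return hashlib.sha256(data).hexdigest()[:length]

csv_bytes = b"version,salary\nNBA2K20,37436858\n"
fp = fingerprint(csv_bytes)
```

The same bytes always yield the same fingerprint, so two runs can be compared by their logged fingerprints alone; any edit to the file changes it.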
Feature engineering
Create temporal and rolling features from raw data
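A sketch of both feature families on the columns described in the quickstart (the fixed reference date and window size are illustrative assumptions that keep the output reproducible):

```python
import pandas as pd

df = pd.DataFrame({
    "b_day": ["12/30/84", "01/26/96"],
    "draft_year": ["2003", "2016"],
    "salary": [100.0, 200.0],
})

# Temporal features: age and years since draft, relative to a fixed
# reference date so the output is identical run to run.
ref = pd.Timestamp("2020-01-01")
df["age"] = (ref - pd.to_datetime(df["b_day"], format="%m/%d/%y")).dt.days // 365
df["experience"] = ref.year - df["draft_year"].astype(int)

# Rolling feature: trailing mean of salary over a 2-row window.
df["salary_roll"] = df["salary"].rolling(window=2, min_periods=1).mean()
```

Note the MM/DD/YY format in the source data: pandas pivots two-digit years (69-99 map to the 1900s), which matters for players born before 1969.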
Validation
Detect drift, validate schemas, and monitor quality
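One common drift statistic that needs only numpy is the Population Stability Index, sketched below (the thresholds in the docstring are the conventional rule of thumb; the function is illustrative, not necessarily the pipeline's exact detector):

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor empty bins to avoid log(0).
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 5000)
same = rng.normal(0, 1, 5000)
shifted = rng.normal(1.5, 1, 5000)
```

Comparing each incoming batch against a stored baseline turns this into a cheap per-batch quality gate.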
Performance & deployment
Optimize your pipeline and deploy to constrained environments
Hardware profiling
Profile CPU, memory, and energy consumption with operator-level breakdown
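An operator-level breakdown can be gathered by wrapping each operator with timing and resident-memory sampling via psutil (a sketch; the wrapper and the stats dictionary layout are illustrative):

```python
import time

import psutil

def profile_operator(name, fn, *args):
    """Run one pipeline operator and record wall time and resident-memory delta."""
    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    rss_delta = proc.memory_info().rss - rss_before
    return result, {"operator": name, "seconds": elapsed, "rss_delta_bytes": rss_delta}

result, stats = profile_operator(
    "square_sum", lambda n: sum(i * i for i in range(n)), 100_000
)
```

Collecting one record per operator per chunk is what makes it possible to say which stage dominates time or memory on a given device.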
Benchmarking
Run reproducible experiments and statistical significance tests
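The core of a reproducible benchmark is repeated timed runs summarised with a mean and spread, so two configurations can be compared against run-to-run noise (a minimal sketch using only the standard library; the repeat count is an illustrative default):

```python
import statistics
import time

def benchmark(fn, repeats: int = 5) -> dict:
    """Time repeated runs of fn and summarise with mean and standard deviation."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(samples),
        "stdev_s": statistics.stdev(samples),
        "samples": samples,
    }

stats = benchmark(lambda: sum(range(50_000)))
```

Keeping the raw samples (not just the mean) is what allows a significance test between two configurations afterwards.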
Edge devices
Deploy to memory-constrained edge devices with spill-to-disk support
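Spill-to-disk can be sketched as follows (the function, the row budget, and the CSV spill format are illustrative assumptions): accumulate chunks until a memory budget is hit, write the buffer to disk, and reassemble only at the end.

```python
import os
import tempfile

import pandas as pd

def process_with_spill(chunks, budget_rows: int, spill_dir: str) -> pd.DataFrame:
    """Buffer chunks up to a row budget, spill the buffer to disk when it is
    exceeded, and reassemble everything at the end."""
    held, spilled, rows = [], [], 0
    for i, chunk in enumerate(chunks):
        held.append(chunk)
        rows += len(chunk)
        if rows >= budget_rows:
            path = os.path.join(spill_dir, f"spill_{i}.csv")
            pd.concat(held).to_csv(path, index=False)
            spilled.append(path)
            held, rows = [], 0
    parts = [pd.read_csv(p) for p in spilled] + held
    return pd.concat(parts, ignore_index=True)

with tempfile.TemporaryDirectory() as d:
    chunks = [pd.DataFrame({"x": [i, i + 1]}) for i in range(0, 6, 2)]
    out = process_with_spill(chunks, budget_rows=2, spill_dir=d)
```

Disk traffic is the price paid for staying under the device's memory ceiling; on edge hardware that trade is usually the right one.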
Server deployment
Scale up for high-throughput server environments
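On a server, independent chunks parallelise cleanly across cores; a sketch with joblib (which is in the dependency list; the transform and worker count are illustrative):

```python
import pandas as pd
from joblib import Parallel, delayed

def transform(chunk: pd.DataFrame) -> pd.DataFrame:
    """Per-chunk work; chunks share no state, so they can run in parallel."""
    return chunk.assign(salary_m=chunk["salary"] / 1e6)

chunks = [pd.DataFrame({"salary": [1e6 * i, 2e6 * i]}) for i in range(1, 4)]
results = Parallel(n_jobs=2)(delayed(transform)(c) for c in chunks)
combined = pd.concat(results, ignore_index=True)
```

Because results come back in submission order, concatenating them reproduces the output the sequential pipeline would have produced.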
Ready to process your data?
Follow our quickstart guide to set up the pipeline and process your first dataset. Takes less than 5 minutes.
View Quickstart Guide