What is the NBA Data Preprocessing Pipeline?
The NBA Data Preprocessing Pipeline is a deterministic data preprocessing and feature engineering system built for the NBA2K salary dataset. It is designed to handle real-world constraints on edge devices and servers by supporting both batch and streaming execution modes. Unlike traditional batch-only pipelines, this system can process data in chunks with explicit memory and compute limits, making it suitable for deployment on resource-constrained devices.

Key Features
Dual Execution Modes
Run in batch mode for full-dataset processing or streaming mode for chunked processing under memory constraints
Resource Awareness
Adaptive chunk sizing and memory monitoring prevent failures on constrained devices
Deterministic Processing
Seeded runs and reproducible artifacts ensure consistent results across executions
Comprehensive Telemetry
Built-in hardware profiling, energy monitoring, and performance benchmarking
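The dual-mode idea above can be sketched in a few lines. This is an illustrative sketch only: the function names (run_batch, run_streaming) and the chunk_rows parameter are assumptions, not the pipeline's actual API.

```python
# Illustrative sketch of batch vs. streaming execution over the same schema.
# run_batch, run_streaming, and chunk_rows are hypothetical names.
import csv
import io

def run_batch(csv_text):
    """Load every row at once: simplest, but memory scales with the file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [float(r["salary"]) for r in rows]

def run_streaming(csv_text, chunk_rows=2):
    """Process fixed-size chunks so peak memory stays bounded."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out, chunk = [], []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_rows:
            out.extend(float(r["salary"]) for r in chunk)
            chunk = []
    out.extend(float(r["salary"]) for r in chunk)  # flush the final partial chunk
    return out

data = "salary\n100\n200\n300\n"
assert run_batch(data) == run_streaming(data)  # both modes agree on the result
```

Because both modes operate on the same schema, results can be compared directly across execution modes.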
Pipeline Architecture
The system is organized as a five-stage pipeline. It includes a resource-aware runtime with CPU/memory/energy telemetry, adaptive chunk control, and optional disk spill for resilience.
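A staged runner with per-stage telemetry can be sketched as below. The five stage names here are illustrative assumptions, not the pipeline's actual stage list, and the timing dictionary stands in for the stage-level profile the runtime collects.

```python
# Sketch of a staged pipeline runner that records per-stage wall-clock time,
# in the spirit of stage-level operator profiling. Stage names are hypothetical.
import time

def load(d): return d
def clean(d): return [r for r in d if r is not None]
def engineer(d): return d
def encode(d): return d
def export(d): return d

STAGES = [load, clean, engineer, encode, export]

def run_pipeline(data):
    profile = {}  # stage name -> elapsed seconds
    for stage in STAGES:
        t0 = time.perf_counter()
        data = stage(data)
        profile[stage.__name__] = time.perf_counter() - t0
    return data, profile

result, profile = run_pipeline([1, None, 2])
assert result == [1, 2] and len(profile) == 5
```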
Use Cases
Edge Device Deployment
Process NBA salary data on edge devices with limited memory (256MB) and compute resources.
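One way a 256MB budget can translate into a chunk size is sketched below. The memory_limit_mb knob, the per-row estimate, and the 4x safety factor are all assumptions for illustration.

```python
# Hypothetical sketch: derive a streaming chunk size from an edge device's
# memory budget. The safety factor leaves room for intermediate feature columns.
def chunk_rows_for(memory_limit_mb, bytes_per_row, safety_factor=4):
    budget = memory_limit_mb * 1024 * 1024 // safety_factor
    return max(1, budget // bytes_per_row)

# Assuming ~2 KB per decoded row on a 256 MB device:
rows = chunk_rows_for(memory_limit_mb=256, bytes_per_row=2048)
assert rows == 32768
```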
High-Performance Server Processing
Maximize throughput on server hardware with parallel execution.
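Parallel chunk processing in the spirit of n_jobs > 1 might look like the following sketch; the chunking, the worker-pool wiring, and process_chunk are assumptions, not the pipeline's API.

```python
# Sketch of fanning chunks out to a worker pool while keeping output order
# deterministic. process_chunk stands in for real feature engineering work.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    return [x * 2 for x in chunk]

def run_parallel(data, n_jobs=4, chunk_size=2):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map preserves input order, so results stay deterministic
        # even though chunks may finish out of order.
        results = pool.map(process_chunk, chunks)
    return [x for chunk in results for x in chunk]

assert run_parallel([1, 2, 3, 4, 5]) == [2, 4, 6, 8, 10]
```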
Reproducible Research
Ensure deterministic results for research and benchmarking.
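Seeded runs can be sketched with a local random generator so no global state leaks between runs. run_split is a hypothetical helper, not part of the pipeline's API.

```python
# Sketch of a seeded, reproducible train/test split: the same seed always
# yields the same shuffle, so results are identical across executions.
import random

def run_split(rows, seed=42, test_frac=0.2):
    rng = random.Random(seed)   # instance-level RNG, independent of random.seed()
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

a = run_split(list(range(10)))
b = run_split(list(range(10)))
assert a == b  # identical across executions
```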
Design Philosophy
The pipeline prioritizes specific trade-offs for production deployment.

Determinism over Maximum Throughput
Seeded runs and fixed artifact naming improve reproducibility but can reduce peak speed. This trade-off ensures consistent behavior across environments.
Adaptive Chunk Sizing over Static Sizing
Memory failures are reduced through dynamic chunk adjustment, though per-chunk control logic adds overhead. Critical for edge deployment.
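The dynamic adjustment described above could be implemented as a simple feedback rule; the thresholds, halving step, and growth factor below are assumptions, not the pipeline's tuned values.

```python
# Sketch of adaptive chunk control: shrink the chunk when measured memory
# nears the limit, grow it back when there is headroom, otherwise hold.
def adapt_chunk_size(current, used_mb, limit_mb, floor=64, ceiling=65536):
    if used_mb > 0.8 * limit_mb:        # near the limit: back off hard
        return max(floor, current // 2)
    if used_mb < 0.5 * limit_mb:        # plenty of headroom: grow gently
        return min(ceiling, int(current * 1.25))
    return current                      # in the comfort band: hold steady

assert adapt_chunk_size(1024, used_mb=230, limit_mb=256) == 512
assert adapt_chunk_size(1024, used_mb=100, limit_mb=256) == 1280
assert adapt_chunk_size(1024, used_mb=180, limit_mb=256) == 1024
```

Asymmetric steps (halve on pressure, grow by 25% on headroom) are a common choice because running out of memory is far more costly than a slightly undersized chunk.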
Spill-to-Disk Resilience over Latency
Constrained devices can complete runs by spilling to disk, accepting increased wall-clock time from I/O amplification.
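A minimal spill-to-disk sketch is shown below; the JSON spill format and the spill/unspill helper names are assumptions for illustration.

```python
# Sketch of spilling a chunk to disk: serialize it to a temp file, keep only
# the path in memory, and re-read the rows when they are needed again.
import json
import tempfile

def spill(chunk):
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(chunk, f)
        return f.name  # the rows leave memory; only the path remains

def unspill(path):
    with open(path) as f:
        return json.load(f)

path = spill([{"salary": 100}, {"salary": 200}])
assert unspill(path) == [{"salary": 100}, {"salary": 200}]
```

Each spill/unspill round trip adds serialization and I/O cost, which is the I/O amplification the trade-off above accepts.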
Parallel Benchmarks over Timing Stability
Setting n_jobs > 1 accelerates constraint sweeps while introducing timing variance. Choose based on your priority.

Generated Artifacts
Each pipeline run generates comprehensive artifacts for analysis:
- Reports: pipeline_report.json, streaming_chunks.jsonl
- Benchmarks: constraint_experiment.csv, significance_tests.csv
- Visualizations: latency vs accuracy, memory vs accuracy, 3D plots
- Profiles: operator_profile.csv for stage-level performance analysis
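The report formats above can be sketched as follows. Only the file names come from this page; the report fields and chunk counts are illustrative assumptions.

```python
# Sketch of writing the run artifacts: a JSON run summary plus one JSON line
# per streaming chunk (JSONL). Field names here are hypothetical.
import json
import tempfile
from pathlib import Path

out = Path(tempfile.mkdtemp())

report = {"mode": "streaming", "seed": 42, "rows_processed": 300}
(out / "pipeline_report.json").write_text(json.dumps(report, indent=2))

with open(out / "streaming_chunks.jsonl", "w") as f:
    for i, n in enumerate([128, 128, 44]):  # rows processed per chunk
        f.write(json.dumps({"chunk": i, "rows": n}) + "\n")

lines = (out / "streaming_chunks.jsonl").read_text().splitlines()
assert len(lines) == 3 and json.loads(lines[0])["rows"] == 128
```

JSONL suits streaming runs because each chunk record is appended independently, so a partial file is still readable after an interrupted run.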
Assumptions and Requirements
The pipeline expects:
- Input CSV with required columns: version, salary, b_day, draft_year, height, weight
- Write permissions for creating output directories
- Python 3.8+ with dependencies from requirements.txt
- Streaming and batch runs operate on the same schema
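The required columns listed above can be checked up front before a run. The column set comes from this page; the missing_columns helper is a hypothetical name.

```python
# Sketch of validating the input CSV header against the required schema.
import csv
import io

REQUIRED = {"version", "salary", "b_day", "draft_year", "height", "weight"}

def missing_columns(csv_text):
    header = next(csv.reader(io.StringIO(csv_text)))
    return sorted(REQUIRED - set(header))

ok = "version,salary,b_day,draft_year,height,weight\n"
assert missing_columns(ok) == []
assert missing_columns("salary,height\n") == ["b_day", "draft_year", "version", "weight"]
```

Failing fast on a missing column is cheaper than discovering it mid-run, especially in streaming mode where a run may be hours into its chunks.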
Next Steps
Quickstart Guide
Get the pipeline running in minutes
Installation
Complete setup instructions