What is the NBA Data Preprocessing Pipeline?
The NBA Data Preprocessing Pipeline is a deterministic data preprocessing and feature engineering system built for the NBA2K salary dataset. It is designed to handle real-world constraints on edge devices and servers by supporting both batch and streaming execution modes. Unlike traditional batch-only pipelines, this system can process data in chunks with explicit memory and compute limits, making it suitable for deployment on resource-constrained devices.

Key Features
Dual Execution Modes
Run in batch mode for full-dataset processing or streaming mode for chunked processing under memory constraints
Resource Awareness
Adaptive chunk sizing and memory monitoring prevent failures on constrained devices
Deterministic Processing
Seeded runs and reproducible artifacts ensure consistent results across executions
Comprehensive Telemetry
Built-in hardware profiling, energy monitoring, and performance benchmarking
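The dual-mode idea above can be sketched in a few lines. This is an illustrative sketch only: the function names (run_batch, run_streaming) and the chunk_rows parameter are assumptions, not the pipeline's actual API.

```python
# Illustrative sketch of batch vs. streaming execution over the same schema.
# run_batch, run_streaming, and chunk_rows are hypothetical names.
import csv
import io

def run_batch(csv_text):
    """Load every row at once: simplest, but memory scales with the file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return [float(r["salary"]) for r in rows]

def run_streaming(csv_text, chunk_rows=2):
    """Process fixed-size chunks so peak memory stays bounded."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out, chunk = [], []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_rows:
            out.extend(float(r["salary"]) for r in chunk)
            chunk = []
    out.extend(float(r["salary"]) for r in chunk)  # flush the final partial chunk
    return out

data = "salary\n100\n200\n300\n"
assert run_batch(data) == run_streaming(data)  # both modes agree on the result
```

Because both modes operate on the same schema, results can be compared directly across execution modes.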
Pipeline Architecture
The system is organized as a five-stage pipeline. It includes a resource-aware runtime with CPU/memory/energy telemetry, adaptive chunk control, and optional disk spill for resilience.
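A staged runner with per-stage telemetry can be sketched as below. The five stage names here are illustrative assumptions, not the pipeline's actual stage list, and the timing dictionary stands in for the stage-level profile the runtime collects.

```python
# Sketch of a staged pipeline runner that records per-stage wall-clock time,
# in the spirit of stage-level operator profiling. Stage names are hypothetical.
import time

def load(d): return d
def clean(d): return [r for r in d if r is not None]
def engineer(d): return d
def encode(d): return d
def export(d): return d

STAGES = [load, clean, engineer, encode, export]

def run_pipeline(data):
    profile = {}  # stage name -> elapsed seconds
    for stage in STAGES:
        t0 = time.perf_counter()
        data = stage(data)
        profile[stage.__name__] = time.perf_counter() - t0
    return data, profile

result, profile = run_pipeline([1, None, 2])
assert result == [1, 2] and len(profile) == 5
```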
Use Cases
Edge Device Deployment
Process NBA salary data on edge devices with limited memory (256MB) and compute resources.
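One way a 256MB budget can translate into a chunk size is sketched below. The memory_limit_mb knob, the per-row estimate, and the 4x safety factor are all assumptions for illustration.

```python
# Hypothetical sketch: derive a streaming chunk size from an edge device's
# memory budget. The safety factor leaves room for intermediate feature columns.
def chunk_rows_for(memory_limit_mb, bytes_per_row, safety_factor=4):
    budget = memory_limit_mb * 1024 * 1024 // safety_factor
    return max(1, budget // bytes_per_row)

# Assuming ~2 KB per decoded row on a 256 MB device:
rows = chunk_rows_for(memory_limit_mb=256, bytes_per_row=2048)
assert rows == 32768
```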
High-Performance Server Processing
Maximize throughput on server hardware with parallel execution.
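Parallel chunk processing in the spirit of n_jobs > 1 might look like the following sketch; the chunking, the worker-pool wiring, and process_chunk are assumptions, not the pipeline's API.

```python
# Sketch of fanning chunks out to a worker pool while keeping output order
# deterministic. process_chunk stands in for real feature engineering work.
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    return [x * 2 for x in chunk]

def run_parallel(data, n_jobs=4, chunk_size=2):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map preserves input order, so results stay deterministic
        # even though chunks may finish out of order.
        results = pool.map(process_chunk, chunks)
    return [x for chunk in results for x in chunk]

assert run_parallel([1, 2, 3, 4, 5]) == [2, 4, 6, 8, 10]
```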
Reproducible Research
Ensure deterministic results for research and benchmarking.
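Seeded runs can be sketched with a local random generator so no global state leaks between runs. run_split is a hypothetical helper, not part of the pipeline's API.

```python
# Sketch of a seeded, reproducible train/test split: the same seed always
# yields the same shuffle, so results are identical across executions.
import random

def run_split(rows, seed=42, test_frac=0.2):
    rng = random.Random(seed)   # instance-level RNG, independent of random.seed()
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

a = run_split(list(range(10)))
b = run_split(list(range(10)))
assert a == b  # identical across executions
```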
Design Philosophy
The pipeline prioritizes specific trade-offs for production deployment.

Determinism over Maximum Throughput
Seeded runs and fixed artifact naming improve reproducibility but can reduce peak speed. This trade-off ensures consistent behavior across environments.
Adaptive Chunk Sizing over Static Sizing
Memory failures are reduced through dynamic chunk adjustment, though per-chunk control logic adds overhead. Critical for edge deployment.
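The dynamic adjustment described above could be implemented as a simple feedback rule; the thresholds, halving step, and growth factor below are assumptions, not the pipeline's tuned values.

```python
# Sketch of adaptive chunk control: shrink the chunk when measured memory
# nears the limit, grow it back when there is headroom, otherwise hold.
def adapt_chunk_size(current, used_mb, limit_mb, floor=64, ceiling=65536):
    if used_mb > 0.8 * limit_mb:        # near the limit: back off hard
        return max(floor, current // 2)
    if used_mb < 0.5 * limit_mb:        # plenty of headroom: grow gently
        return min(ceiling, int(current * 1.25))
    return current                      # in the comfort band: hold steady

assert adapt_chunk_size(1024, used_mb=230, limit_mb=256) == 512
assert adapt_chunk_size(1024, used_mb=100, limit_mb=256) == 1280
assert adapt_chunk_size(1024, used_mb=180, limit_mb=256) == 1024
```

Asymmetric steps (halve on pressure, grow by 25% on headroom) are a common choice because running out of memory is far more costly than a slightly undersized chunk.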
Spill-to-Disk Resilience over Latency
Constrained devices can complete runs by spilling to disk, accepting increased wall-clock time from I/O amplification.
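A minimal spill-to-disk sketch is shown below; the JSON spill format and the spill/unspill helper names are assumptions for illustration.

```python
# Sketch of spilling a chunk to disk: serialize it to a temp file, keep only
# the path in memory, and re-read the rows when they are needed again.
import json
import tempfile

def spill(chunk):
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(chunk, f)
        return f.name  # the rows leave memory; only the path remains

def unspill(path):
    with open(path) as f:
        return json.load(f)

path = spill([{"salary": 100}, {"salary": 200}])
assert unspill(path) == [{"salary": 100}, {"salary": 200}]
```

Each spill/unspill round trip adds serialization and I/O cost, which is the I/O amplification the trade-off above accepts.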
Parallel Benchmarks over Timing Stability
Setting n_jobs > 1 accelerates constraint sweeps while introducing timing variance. Choose based on your priority.

Generated Artifacts
Each pipeline run generates comprehensive artifacts for analysis:
- Reports: pipeline_report.json, streaming_chunks.jsonl
- Benchmarks: constraint_experiment.csv, significance_tests.csv
- Visualizations: latency vs accuracy, memory vs accuracy, 3D plots
- Profiles: operator_profile.csv for stage-level performance analysis
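The report formats above can be sketched as follows. Only the file names come from this page; the report fields and chunk counts are illustrative assumptions.

```python
# Sketch of writing the run artifacts: a JSON run summary plus one JSON line
# per streaming chunk (JSONL). Field names here are hypothetical.
import json
import tempfile
from pathlib import Path

out = Path(tempfile.mkdtemp())

report = {"mode": "streaming", "seed": 42, "rows_processed": 300}
(out / "pipeline_report.json").write_text(json.dumps(report, indent=2))

with open(out / "streaming_chunks.jsonl", "w") as f:
    for i, n in enumerate([128, 128, 44]):  # rows processed per chunk
        f.write(json.dumps({"chunk": i, "rows": n}) + "\n")

lines = (out / "streaming_chunks.jsonl").read_text().splitlines()
assert len(lines) == 3 and json.loads(lines[0])["rows"] == 128
```

JSONL suits streaming runs because each chunk record is appended independently, so a partial file is still readable after an interrupted run.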
Assumptions and Requirements
The pipeline expects:
- Input CSV with required columns: version, salary, b_day, draft_year, height, weight
- Write permissions for creating output directories
- Python 3.8+ with dependencies from requirements.txt
- Streaming and batch runs operate on the same schema
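The required columns listed above can be checked up front before a run. The column set comes from this page; the missing_columns helper is a hypothetical name.

```python
# Sketch of validating the input CSV header against the required schema.
import csv
import io

REQUIRED = {"version", "salary", "b_day", "draft_year", "height", "weight"}

def missing_columns(csv_text):
    header = next(csv.reader(io.StringIO(csv_text)))
    return sorted(REQUIRED - set(header))

ok = "version,salary,b_day,draft_year,height,weight\n"
assert missing_columns(ok) == []
assert missing_columns("salary,height\n") == ["b_day", "draft_year", "version", "weight"]
```

Failing fast on a missing column is cheaper than discovering it mid-run, especially in streaming mode where a run may be hours into its chunks.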
Next Steps
Quickstart Guide
Get the pipeline running in minutes
Installation
Complete setup instructions