What is NBA Data Preprocessing Pipeline?

The NBA Data Preprocessing Pipeline is a deterministic data preprocessing and feature engineering system built for the NBA2K salary dataset. It’s designed to handle real-world constraints on edge devices and servers by supporting both batch and streaming execution modes. Unlike traditional batch-only pipelines, this system can process data in chunks with explicit memory and compute limits, making it suitable for deployment on resource-constrained devices.

Key Features

Dual Execution Modes

Run in batch mode for full-dataset processing or streaming mode for chunked processing under memory constraints

Resource Awareness

Adaptive chunk sizing and memory monitoring prevent failures on constrained devices

Deterministic Processing

Seeded runs and reproducible artifacts ensure consistent results across executions

Comprehensive Telemetry

Built-in hardware profiling, energy monitoring, and performance benchmarking
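As a rough illustration of the adaptive chunk sizing idea, a controller might halve the chunk when memory use exceeds the budget and grow it cautiously when well under. This is a hypothetical sketch; the actual controller in `pipeline.streaming` is not shown here, and the function name, thresholds, and bounds are assumptions:

```python
def next_chunk_size(current: int, used_mb: float, budget_mb: float,
                    min_size: int = 8, max_size: int = 1024) -> int:
    """Halve the chunk when memory use exceeds the budget; grow it
    cautiously when well under budget; otherwise hold steady."""
    if used_mb > budget_mb:
        return max(current // 2, min_size)
    if used_mb < 0.5 * budget_mb:
        return min(current * 2, max_size)
    return current
```

The asymmetric policy (fast shrink, bounded growth) is a common way to recover quickly from memory pressure without oscillating.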

Pipeline Architecture

The system is organized as a five-stage pipeline:

1. Ingestion: Load source data and compute a SHA-256 dataset fingerprint for version tracking
2. Preprocessing: Parse and normalize fields, handle missing values, and detect outliers
3. Feature Engineering: Derive temporal and rolling features, then remove high-correlation inputs
4. Validation: Run schema checks and drift monitoring for data quality
5. Evaluation: Train/evaluate models, run constraint experiments, and generate benchmark artifacts
The pipeline includes a resource-aware runtime with CPU/memory/energy telemetry, adaptive chunk control, and optional disk spill for resilience.

Use Cases

Edge Device Deployment

Process NBA salary data on edge devices with limited memory (256MB) and compute resources:
from pipeline.config import PipelineConfig
from pipeline.streaming import RealTimePipelineRunner

config = PipelineConfig(
    chunk_size=64,
    max_memory_mb=256,
    max_compute_units=0.4,
    spill_to_disk=True,
    adaptive_chunk_resize=True
)

runner = RealTimePipelineRunner(config)
report = runner.run_all('data/nba2k-full.csv')

High-Performance Server Processing

Maximize throughput on server hardware with parallel execution:
config = PipelineConfig(
    chunk_size=256,
    batch_size=512,
    max_memory_mb=4096,
    n_jobs=4,
    benchmark_runs=5
)

runner = RealTimePipelineRunner(config)
report = runner.run_all('data/nba2k-full.csv')

Reproducible Research

Ensure deterministic results for research and benchmarking:
from pathlib import Path

config = PipelineConfig(
    random_seed=42,
    benchmark_runs=5,
    output_dir=Path('experiment_001')
)

Design Philosophy

The pipeline prioritizes specific trade-offs for production deployment:
  • Determinism over peak speed: Seeded runs and fixed artifact naming improve reproducibility but can reduce peak throughput. This trade-off ensures consistent behavior across environments.
  • Memory safety over simplicity: Dynamic chunk adjustment reduces memory failures, though the per-chunk control logic adds overhead. Critical for edge deployment.
  • Completion over latency: Constrained devices can complete runs by spilling to disk, accepting increased wall-clock time from I/O amplification.
  • Parallelism over timing stability: Setting n_jobs > 1 accelerates constraint sweeps while introducing timing variance. Choose based on your priority.

Generated Artifacts

Each pipeline run generates comprehensive artifacts for analysis:
  • Reports: pipeline_report.json, streaming_chunks.jsonl
  • Benchmarks: constraint_experiment.csv, significance_tests.csv
  • Visualizations: Latency vs accuracy, memory vs accuracy, 3D plots
  • Profiles: operator_profile.csv for stage-level performance analysis
Energy telemetry requires RAPL support (Intel platforms). Containers and non-Intel hosts use coarse fallback estimation.

Assumptions and Requirements

The pipeline expects:
  • Input CSV with required columns: version, salary, b_day, draft_year, height, weight
  • Write permissions for creating output directories
  • Python 3.8+ with dependencies from requirements.txt
  • Input data that uses the same schema for both streaming and batch runs

Next Steps

Quickstart Guide

Get the pipeline running in minutes

Installation

Complete setup instructions
