## Overview
The NBA Data Preprocessing Pipeline is designed as a staged, deterministic system that supports both batch and streaming execution modes. The architecture prioritizes reproducibility, resource awareness, and resilience under constrained environments.

## Pipeline Stages
The runtime is organized as a five-stage pipeline.

### Ingestion
Load source data (CSV or DataFrame) and compute a SHA-256 dataset fingerprint for reproducibility tracking.
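For illustration, a fingerprint of this kind can be computed by hashing a deterministic serialization of the data. This is a sketch; the exact serialization used by `DataIngestor` is not shown here.

```python
import hashlib

import pandas as pd


def dataset_fingerprint(df: pd.DataFrame) -> str:
    """Compute a SHA-256 fingerprint over a DataFrame's CSV serialization."""
    # Serialize deterministically (fixed column order, no index) so the
    # same data always yields the same digest across runs.
    payload = df.sort_index(axis=1).to_csv(index=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


df = pd.DataFrame({"player": ["A", "B"], "pts": [31, 27]})
print(dataset_fingerprint(df))  # 64-character hex digest, stable across runs
```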
### Preprocessing
Parse and normalize fields, handle missing values, and detect outliers using IQR-based detection.
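IQR-based detection flags values outside `[Q1 - k*IQR, Q3 + k*IQR]`. A minimal sketch, where the multiplier `k = 1.5` is the conventional default rather than necessarily the pipeline's setting:

```python
import pandas as pd


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


pts = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier
print(iqr_outliers(pts).tolist())  # [False, False, False, False, False, True]
```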
### Feature Engineering
Derive temporal and rolling features, then remove high-correlation inputs to reduce multicollinearity.
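A sketch of both steps, assuming a 3-game rolling mean and a 0.9 correlation threshold (both values illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "game": range(6),
    "pts": [20, 25, 30, 22, 28, 33],
})

# Rolling feature: 3-game mean over past games only (shift avoids leakage).
df["pts_roll3"] = df["pts"].shift(1).rolling(3).mean()


def drop_high_corr(frame: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Greedily keep columns whose |corr| with every kept column is <= threshold."""
    corr = frame.corr().abs()
    keep: list[str] = []
    for col in corr.columns:
        if all(corr.loc[col, k] <= threshold for k in keep):
            keep.append(col)
    return frame[keep]


df["pts_double"] = df["pts"] * 2  # perfectly correlated duplicate feature
pruned = drop_high_corr(df[["pts", "pts_double", "game"]])
print(list(pruned.columns))  # ['pts', 'game'] -- the duplicate is dropped
```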
## System Components
### Core Modules
The pipeline is built from modular components defined in `NBA Data Preprocessing/task/pipeline/`:
- `DataIngestor` (`ingestion.py`): Loads CSV files and computes dataset fingerprints
- `Preprocessor` (`preprocessing.py`): Cleans data and detects outliers
- `FeatureEngineer` (`feature_engineering.py`): Builds features for both batch and streaming modes
- `DataValidator` (`validation.py`): Schema validation, drift detection, and quality reporting
- `HardwareMonitor` (`hardware.py`): CPU, memory, and energy telemetry
## Streaming Engine
The `RealTimePipelineRunner` in `streaming/engine.py` orchestrates both execution modes.
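The runner's own API is not reproduced here; the following sketch only illustrates the micro-batching loop shape such an engine typically drives, with all names hypothetical:

```python
from collections.abc import Iterable, Iterator


def micro_batches(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Group an unbounded record stream into fixed-size micro-batches."""
    batch: list[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the trailing partial batch
        yield batch


stream = ({"game_id": i, "pts": 20 + i} for i in range(7))
sizes = [len(b) for b in micro_batches(stream, batch_size=3)]
print(sizes)  # [3, 3, 1]
```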
## Resource-Aware Runtime
The pipeline includes adaptive mechanisms for resource-constrained execution.

### Adaptive Chunk Sizing
Chunk and batch sizes are dynamically adjusted based on memory and compute constraints.
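One plausible shape for this adjustment, with thresholds and scaling factors purely illustrative:

```python
def adapt_chunk_size(current: int, mem_used_frac: float,
                     low: float = 0.5, high: float = 0.8,
                     min_size: int = 1_000, max_size: int = 500_000) -> int:
    """Shrink the chunk when memory pressure is high, grow it when low."""
    if mem_used_frac > high:
        current //= 2  # back off aggressively under pressure
    elif mem_used_frac < low:
        current = int(current * 1.5)  # cautiously ramp back up
    return max(min_size, min(max_size, current))


print(adapt_chunk_size(100_000, 0.9))  # 50000
print(adapt_chunk_size(100_000, 0.3))  # 150000
```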
### Memory Monitoring

Each chunk is monitored for memory usage, with automatic retry and splitting logic.
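A sketch of the retry-and-split idea; a real monitor would track actual process memory (e.g. via `psutil`), so a toy failure stands in here:

```python
def process_with_retry(chunk: list[int], process, max_depth: int = 3):
    """Process a chunk; on MemoryError, split it in half and retry each part."""
    try:
        return process(chunk)
    except MemoryError:
        if max_depth == 0 or len(chunk) <= 1:
            raise  # cannot split further; surface the failure
        mid = len(chunk) // 2
        left = process_with_retry(chunk[:mid], process, max_depth - 1)
        right = process_with_retry(chunk[mid:], process, max_depth - 1)
        return left + right


def toy_process(chunk):
    if len(chunk) > 2:  # stand-in for a real out-of-memory condition
        raise MemoryError
    return [x * 2 for x in chunk]


print(process_with_retry([1, 2, 3, 4, 5], toy_process))  # [2, 4, 6, 8, 10]
```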
### Spill-to-Disk

Optional disk spilling enables processing beyond available RAM.
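A minimal sketch of spilling with `pickle` and temporary files; the pipeline's actual spill format is not specified here:

```python
import os
import pickle
import tempfile
from pathlib import Path


def spill(obj, spill_dir: Path) -> Path:
    """Serialize an intermediate result to disk so it can leave memory."""
    fd, name = tempfile.mkstemp(dir=spill_dir, suffix=".pkl")
    os.close(fd)
    path = Path(name)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path


def restore(path: Path):
    """Load a spilled result back into memory."""
    with path.open("rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as d:
    ref = spill({"chunk": 7, "rows": [1, 2, 3]}, Path(d))
    print(restore(ref))  # {'chunk': 7, 'rows': [1, 2, 3]}
```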
### Architecture Diagram

The following diagram illustrates the complete pipeline flow:

## Design Trade-offs
The architecture makes intentional trade-offs to prioritize specific goals.

### Determinism over Maximum Throughput
Seeded random number generation and fixed artifact naming improve reproducibility but can reduce peak processing speed.
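For example, seeding every random source from one pipeline-wide seed makes draws repeat exactly across runs (stdlib `random` shown; the seed value is illustrative):

```python
import random

SEED = 42  # a single pipeline-wide seed (illustrative)

# Two independently seeded generators produce identical draws, so any
# sampling step in the pipeline reproduces exactly across runs.
a = random.Random(SEED).sample(range(1000), k=5)
b = random.Random(SEED).sample(range(1000), k=5)
print(a == b)  # True
```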
### Adaptive Chunk Sizing over Static Sizing
Dynamic chunk adjustment reduces memory failures but adds per-chunk control overhead.
### Spill-to-Disk Resilience over Latency
Constrained devices can complete runs by spilling intermediate results to disk, but I/O amplification increases wall-clock time.
### Parallel Benchmarks over Strict Timing Stability
Using `n_jobs > 1` accelerates constraint sweeps while introducing timing variance.
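A sketch of why parallel sweeps stay result-deterministic even as timings vary, using `ThreadPoolExecutor` as a stand-in for the pipeline's own worker pool:

```python
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(constraint: int) -> int:
    """Stand-in for one configuration of a constraint sweep."""
    return sum(i * i for i in range(constraint))


constraints = [10_000, 20_000, 30_000, 40_000]

# With more than one worker (akin to n_jobs > 1) the sweep finishes sooner,
# but per-task timings vary run to run; map() still returns results in
# submission order, so the outputs themselves stay deterministic.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(run_benchmark, constraints))

serial = [run_benchmark(c) for c in constraints]
print(parallel == serial)  # True
```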
## Configuration

All pipeline behavior is controlled through `PipelineConfig`.
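A hypothetical sketch of what such a config object might look like; the field names are assumptions, not the actual schema:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PipelineConfig:
    # Field names here are illustrative, not the real schema.
    chunk_size: int = 100_000
    adaptive_chunking: bool = True
    spill_to_disk: bool = False
    n_jobs: int = 1
    seed: int = 42

    @classmethod
    def from_json(cls, path: str) -> "PipelineConfig":
        """Load overrides from a JSON template file."""
        with open(path) as f:
            return cls(**json.load(f))


cfg = PipelineConfig(spill_to_disk=True, n_jobs=2)
print(asdict(cfg))
```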
Configuration templates are available in `configs/pipeline.edge.template.json` and `configs/pipeline.server.template.json`.

## Next Steps
- **Execution Modes**: Learn about batch vs streaming execution
- **Resource Constraints**: Understand resource management