Overview
The run_pipeline.py script is the main entry point for executing the NBA Data Preprocessing Pipeline. It accepts command-line arguments that configure pipeline behavior.
Location: NBA Data Preprocessing/task/run_pipeline.py
Basic Usage
Required Arguments
Path to the raw CSV dataset containing NBA data. This is the only required argument.
Example:
Configuration Arguments
Template Loading
Path to a JSON configuration template file. When provided, the template is loaded first, then CLI arguments override specific values.
Example:
Output Directory
Directory for storing all pipeline outputs, including reports, benchmarks, and intermediate results.
Example:
Performance Parameters
Number of rows processed per chunk during streaming operations. Smaller values reduce memory usage.
Example:
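To illustrate what processing a fixed number of rows per chunk means, here is a minimal Python sketch that streams a CSV in bounded chunks. It is not the pipeline's actual implementation: the helper `iter_chunks` and the sample data are hypothetical, and the real pipeline may use a different reader entirely.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List


def iter_chunks(rows: Iterable[List[str]], chunk_size: int) -> Iterator[List[List[str]]]:
    """Yield lists of at most chunk_size rows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical in-memory CSV standing in for the raw NBA dataset.
sample = io.StringIO("player,points\nA,10\nB,20\nC,30\nD,40\nE,50\n")
reader = csv.reader(sample)
header = next(reader)  # keep the header row out of the data chunks

chunk_lengths = [len(chunk) for chunk in iter_chunks(reader, chunk_size=2)]
print(chunk_lengths)  # [2, 2, 1]
```

Only one chunk is held in memory at a time, which is why a smaller chunk size lowers peak memory use at the cost of more iterations.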
Batch size for vectorized feature computations. It should typically be larger than --chunk-size.
Example:
Number of parallel jobs for multi-threaded operations. Set to -1 to use all available CPU cores.
Example:
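The -1 convention for the job count can be sketched as follows. This is an assumption about how such a value is commonly resolved, not the pipeline's own code; `resolve_n_jobs` is a hypothetical helper, and rejecting other non-positive values is a choice made for this sketch.

```python
import os


def resolve_n_jobs(n_jobs: int) -> int:
    """Translate a requested job count into a concrete worker count (hypothetical)."""
    if n_jobs == -1:
        # os.cpu_count() can return None on some platforms, so fall back to 1.
        return os.cpu_count() or 1
    if n_jobs < 1:
        # Assumption: values other than -1 must be positive.
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs


print(resolve_n_jobs(4))
print(resolve_n_jobs(-1))  # depends on the machine
```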
Resource Limits
Maximum memory allocation in megabytes. The pipeline adapts processing to stay within this limit.
Example:
Relative compute resource allocation (0.0 to 1.0). Used for throttling in shared environments.
Example:
Enable disk spilling when memory limits are reached. Essential for edge devices with limited RAM.
Example:
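The idea behind disk spilling can be shown with a small Python sketch: chunks stay in memory until a byte budget is exhausted, and later chunks are written to a temporary file instead. The `SpillBuffer` class is hypothetical and uses pickled size as a rough stand-in for memory use; how the pipeline actually measures memory and where it spills are not documented here.

```python
import pickle
import tempfile
from typing import Any, Iterator, List


class SpillBuffer:
    """Keep chunks in memory up to a byte budget, then spill to disk (hypothetical)."""

    def __init__(self, max_memory_bytes: int) -> None:
        self.max_memory_bytes = max_memory_bytes
        self._in_memory: List[Any] = []
        self._used_bytes = 0
        self._spill_file = tempfile.TemporaryFile()  # binary mode by default
        self.spilled_count = 0

    def add(self, chunk: Any) -> None:
        # Assumption: pickled size approximates the chunk's memory footprint.
        size = len(pickle.dumps(chunk))
        if self._used_bytes + size > self.max_memory_bytes:
            self._spill_file.seek(0, 2)  # always append at the end of the file
            pickle.dump(chunk, self._spill_file)
            self.spilled_count += 1
        else:
            self._in_memory.append(chunk)
            self._used_bytes += size

    def __iter__(self) -> Iterator[Any]:
        # In-memory chunks come first, so insertion order is not guaranteed.
        yield from self._in_memory
        self._spill_file.seek(0)
        for _ in range(self.spilled_count):
            yield pickle.load(self._spill_file)


buffer = SpillBuffer(max_memory_bytes=200)
for start in range(0, 300, 100):
    buffer.add(list(range(start, start + 100)))

restored = sorted(value for chunk in buffer for value in chunk)
print(len(restored), buffer.spilled_count)
```

Spilling trades speed for a bounded memory footprint, which is the trade-off that matters on devices with little RAM.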
Reliability Settings
Disable automatic chunk size adjustment. By default, adaptive resizing is enabled.
Example:
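One plausible form of adaptive chunk resizing is sketched below. The policy (halve when over budget, grow by 25% when well under it), the thresholds, and the function name `next_chunk_size` are all assumptions chosen for illustration; the pipeline's real adjustment rule is not described in this document.

```python
def next_chunk_size(
    current: int,
    observed_mb: float,
    max_memory_mb: float,
    min_size: int = 1_000,
    max_size: int = 1_000_000,
) -> int:
    """Propose the next chunk size from observed memory use (hypothetical policy)."""
    if observed_mb > max_memory_mb:
        proposed = current // 2           # over budget: back off quickly
    elif observed_mb < 0.5 * max_memory_mb:
        proposed = int(current * 1.25)    # plenty of headroom: grow slowly
    else:
        proposed = current                # close to the budget: hold steady
    # Clamp so the chunk size never collapses or grows without bound.
    return max(min_size, min(max_size, proposed))


print(next_chunk_size(10_000, observed_mb=600, max_memory_mb=512))  # 5000
print(next_chunk_size(10_000, observed_mb=100, max_memory_mb=512))  # 12500
```

Disabling adjustment corresponds to always returning `current`, which keeps run-to-run behavior fixed and is useful when comparing benchmarks.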
Benchmarking
Number of times to repeat benchmark operations for statistical analysis.
Example:
Random seed for reproducible results across pipeline runs.
Example:
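The interplay of repeat count and seed can be sketched in Python as follows. The `benchmark` function, the `workload` stand-in, and the result keys are hypothetical; they show the general pattern of re-seeding before each repetition and summarizing the timings, not the pipeline's benchmark code.

```python
import random
import statistics
import time
from typing import Callable, Dict


def benchmark(func: Callable[[], object], repeats: int, seed: int) -> Dict[str, float]:
    """Time func several times with a fixed seed (hypothetical helper)."""
    timings = []
    for _ in range(repeats):
        random.seed(seed)  # every repetition sees the same pseudo-random input
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return {
        "repeats": repeats,
        "mean_s": statistics.mean(timings),
        # stdev needs at least two samples
        "stdev_s": statistics.stdev(timings) if repeats > 1 else 0.0,
        "min_s": min(timings),
    }


def workload() -> list:
    # Stand-in for a pipeline stage: sort 10,000 pseudo-random numbers.
    return sorted(random.random() for _ in range(10_000))


report = benchmark(workload, repeats=5, seed=42)
print(report["repeats"], report["mean_s"] >= report["min_s"])
```

Repeating the measurement makes the spread visible, while the fixed seed ensures that variation comes from the machine rather than from different inputs.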
Complete Examples
Edge Device Configuration
Run on a resource-constrained device:
Server Configuration
Run on a high-performance server:
Template with Overrides
Load a template but override specific values: n_jobs, max_memory_mb, and output_dir.
Reproducible Benchmark Run
Run with a fixed seed and multiple benchmark iterations:
Conservative Memory Usage
Process large datasets with strict memory limits:
Output
The pipeline prints a JSON report to stdout containing:
- Processing statistics
- Performance metrics
- Feature engineering results
- Benchmark data
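Because the report is JSON on stdout, it can be consumed programmatically. The sketch below parses captured output with the standard `json` module. The key names in `example_stdout` are invented for illustration, since the report's actual schema is not listed here, and the sketch assumes stdout contains nothing but the JSON document.

```python
import json


def parse_report(stdout_text: str) -> dict:
    """Parse the pipeline's stdout as a JSON object (hypothetical helper)."""
    report = json.loads(stdout_text)
    if not isinstance(report, dict):
        raise ValueError("expected a JSON object at the top level")
    return report


# Hypothetical output; the real report's keys may differ.
example_stdout = '{"processing": {"rows": 1000}, "benchmark": {"mean_s": 0.42}}'
report = parse_report(example_stdout)
sections = sorted(report)
print(sections)  # ['benchmark', 'processing']
```

In practice the text would come from capturing the script's stdout, for instance with `subprocess.run(..., capture_output=True, text=True)`.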
Argument Priority
When both template and CLI arguments are provided, values are applied in this order:
- Template values: loaded from the JSON file
- CLI arguments: override template values
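The precedence order above can be sketched with `argparse` and a dictionary merge. This is a minimal illustration, not the script's real argument parser: the template contents are made up, `merge_config` is a hypothetical helper, and the `chunk_size` key is assumed to mirror the `--chunk-size` flag.

```python
import argparse
import json


def merge_config(template: dict, cli_overrides: dict) -> dict:
    """Start from template values, then apply CLI values that were actually set."""
    merged = dict(template)
    merged.update({key: value for key, value in cli_overrides.items() if value is not None})
    return merged


# Hypothetical template contents; a real template would be read from the JSON file.
template = json.loads(
    '{"n_jobs": 2, "max_memory_mb": 512, "output_dir": "out", "chunk_size": 5000}'
)

parser = argparse.ArgumentParser()
# default=None marks "not given on the command line" so the template value survives.
parser.add_argument("--n-jobs", type=int, default=None)
parser.add_argument("--max-memory-mb", type=int, default=None)
parser.add_argument("--chunk-size", type=int, default=None)

args = parser.parse_args(["--n-jobs", "8", "--max-memory-mb", "2048"])
config = merge_config(template, vars(args))
print(config)
# {'n_jobs': 8, 'max_memory_mb': 2048, 'output_dir': 'out', 'chunk_size': 5000}
```

Using `None` as the default is the design choice that matters here: a parser default such as 1 would silently overwrite the template even when the flag was never passed.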
Error Handling
Common error scenarios:
Missing input file:
Out of memory: consider the following adjustments:
- Reducing --chunk-size and --batch-size
- Lowering --max-memory-mb
- Enabling --spill-to-disk
- Reducing --n-jobs
Next Steps
Configuration Overview
Learn about all configuration options
Configuration Templates
Explore pre-built configuration templates