Overview

The run_pipeline.py script is the main entry point for executing the NBA Data Preprocessing Pipeline. It accepts various command-line arguments to configure pipeline behavior.

Location: NBA Data Preprocessing/task/run_pipeline.py

Basic Usage

python run_pipeline.py --input <path_to_csv> [OPTIONS]

Required Arguments

--input
string
required
Path to the raw CSV dataset containing NBA data. This is the only required argument.
Example:
python run_pipeline.py --input data/raw_nba_data.csv

Configuration Arguments

Template Loading

--config-template
string
Path to a JSON configuration template file. When provided, the template is loaded first, then CLI arguments override specific values.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.edge.template.json
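A template is a plain JSON file whose keys mirror the CLI options. The field names below are an illustrative sketch, not the script's confirmed schema:

```json
{
  "chunk_size": 64,
  "batch_size": 96,
  "n_jobs": 1,
  "max_memory_mb": 256,
  "spill_to_disk": true
}
```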

Output Directory

--output-dir
string
default: "artifacts"
Directory for storing all pipeline outputs, including reports, benchmarks, and intermediate results.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --output-dir results/experiment_001

Performance Parameters

--chunk-size
int
default: 128
Number of rows processed per chunk during streaming operations. Smaller values reduce memory usage.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --chunk-size 64

--batch-size
int
default: 256
Batch size for vectorized feature computations. It should typically be larger than --chunk-size.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --batch-size 512

--n-jobs
int
default: 1
Number of parallel jobs for multi-threaded operations. Set to -1 to use all available CPU cores.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --n-jobs 4
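Conceptually, --chunk-size bounds how many rows are resident in memory at once during streaming. A minimal sketch of the idea using the standard library (the pipeline's actual implementation may differ):

```python
import csv
import io
from itertools import islice

# A tiny in-memory stand-in for a raw NBA CSV file.
raw_csv = io.StringIO(
    "player,team,points\n"
    "A,BOS,21\nB,LAL,34\nC,GSW,18\nD,MIA,27\nE,NYK,12\n"
)

chunk_size = 2  # stands in for --chunk-size
reader = csv.DictReader(raw_csv)
total_rows = 0
chunks = 0

# Only `chunk_size` rows are materialized per iteration, so peak
# memory stays roughly proportional to the chunk, not the file.
while True:
    chunk = list(islice(reader, chunk_size))
    if not chunk:
        break
    chunks += 1
    total_rows += len(chunk)

print(chunks, total_rows)  # 3 5
```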

Resource Limits

--max-memory-mb
int
default: 1024
Maximum memory allocation in megabytes. The pipeline adapts its processing to stay within this limit.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --max-memory-mb 2048

--max-compute-units
float
default: 1.0
Relative compute resource allocation (0.0 to 1.0). Used for throttling in shared environments.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --max-compute-units 0.5

--spill-to-disk
flag
Enable disk spilling when memory limits are reached. Essential for edge devices with limited RAM.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --spill-to-disk
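The spill idea can be sketched as follows: when an in-memory buffer crosses a budget, flush it to a temporary file and start a fresh buffer. Everything here (the item-count budget, pickle as the serialization format) is a hypothetical illustration, not the script's verified mechanism:

```python
import os
import pickle
import tempfile

budget_items = 3          # stands in for a memory threshold
buffer, spill_paths = [], []

def maybe_spill():
    """Write the buffer to a temp file once it hits the budget."""
    global buffer
    if len(buffer) >= budget_items:
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(buffer, f)
        spill_paths.append(path)
        buffer = []

for row in range(7):
    buffer.append(row)
    maybe_spill()

# 7 rows with a budget of 3 -> two spill files and one row left in RAM.
print(len(spill_paths), len(buffer))  # 2 1
```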

Reliability Settings

--disable-adaptive-chunk-resize
flag
Disable automatic chunk size adjustment. By default, adaptive resizing is enabled.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --disable-adaptive-chunk-resize
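One common policy for adaptive resizing, shown here as an assumption rather than the script's verified logic, is to halve the chunk under memory pressure and grow it back when there is ample headroom:

```python
def adapt_chunk_size(chunk_size, memory_mb, max_memory_mb,
                     min_chunk=8, max_chunk=1024):
    """Halve the chunk under memory pressure, grow it gently otherwise."""
    if memory_mb > max_memory_mb:
        return max(min_chunk, chunk_size // 2)
    if memory_mb < 0.5 * max_memory_mb:
        return min(max_chunk, chunk_size * 2)
    return chunk_size

print(adapt_chunk_size(128, 1200, 1024))  # 64  (over budget -> shrink)
print(adapt_chunk_size(128, 300, 1024))   # 256 (plenty of headroom -> grow)
print(adapt_chunk_size(128, 800, 1024))   # 128 (within band -> keep)
```

Passing --disable-adaptive-chunk-resize corresponds to skipping this adjustment and keeping the configured chunk size for the whole run.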

Benchmarking

--benchmark-runs
int
default: 5
Number of times to repeat benchmark operations for statistical analysis.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --benchmark-runs 10

--random-seed
int
default: 42
Random seed for reproducible results across pipeline runs.
Example:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --random-seed 123
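Repeating benchmark operations makes summary statistics meaningful. A sketch of deriving the mean and p95 latency fields (names taken from the example report below) from a list of per-run timings; the nearest-rank percentile here is an assumption about the method:

```python
import statistics

def summarize_latencies(latencies_ms):
    """Mean and 95th-percentile latency from repeated benchmark runs."""
    ordered = sorted(latencies_ms)
    # Nearest-rank p95: smallest value covering 95% of observations.
    rank = max(1, round(0.95 * len(ordered)))
    return {
        "mean_latency_ms": statistics.mean(ordered),
        "p95_latency_ms": ordered[rank - 1],
    }

# Ten timings with one outlier: the outlier drags the mean up
# and lands at p95, which is why both statistics are reported.
runs = [12.1, 11.8, 12.9, 13.4, 11.6, 12.2, 12.0, 12.7, 30.5, 12.3]
print(summarize_latencies(runs))
```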

Complete Examples

Edge Device Configuration

Run on a resource-constrained device:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --chunk-size 64 \
  --batch-size 96 \
  --n-jobs 1 \
  --max-memory-mb 256 \
  --max-compute-units 0.4 \
  --spill-to-disk \
  --output-dir artifacts_edge
Or use the edge template:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.edge.template.json

Server Configuration

Run on a high-performance server:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --chunk-size 256 \
  --batch-size 512 \
  --n-jobs 4 \
  --max-memory-mb 4096 \
  --max-compute-units 1.0 \
  --output-dir artifacts_server
Or use the server template:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.server.template.json

Template with Overrides

Load a template but override specific values:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.server.template.json \
  --n-jobs 8 \
  --max-memory-mb 8192 \
  --output-dir artifacts_custom
The template provides base configuration, while CLI arguments override n_jobs, max_memory_mb, and output_dir.

Reproducible Benchmark Run

Run with fixed seed and multiple benchmark iterations:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --random-seed 42 \
  --benchmark-runs 10 \
  --output-dir benchmarks/run_001

Conservative Memory Usage

Process large datasets with strict memory limits:
python run_pipeline.py \
  --input data/large_nba_dataset.csv \
  --chunk-size 32 \
  --max-memory-mb 512 \
  --spill-to-disk \
  --n-jobs 1

Output

The pipeline prints a JSON report to stdout containing:
  • Processing statistics
  • Performance metrics
  • Feature engineering results
  • Benchmark data
Example output:
{
  "status": "completed",
  "processing_time_seconds": 45.32,
  "total_rows": 50000,
  "features_generated": 127,
  "memory_peak_mb": 892,
  "benchmark_results": {
    "mean_latency_ms": 12.4,
    "p95_latency_ms": 18.7,
    "throughput_rows_per_sec": 1103
  }
}
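Because the report is plain JSON on stdout, downstream tooling can capture and parse it directly. A sketch assuming the report shape shown above:

```python
import json

# stdout captured from a pipeline run (shape taken from the example above).
report_text = """
{
  "status": "completed",
  "processing_time_seconds": 45.32,
  "total_rows": 50000,
  "benchmark_results": {"mean_latency_ms": 12.4, "p95_latency_ms": 18.7}
}
"""

report = json.loads(report_text)
if report["status"] != "completed":
    raise RuntimeError("pipeline did not finish cleanly")

# Throughput can be recomputed from the top-level fields.
throughput = report["total_rows"] / report["processing_time_seconds"]
print(round(throughput))  # 1103 rows per second
```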

Argument Priority

When both template and CLI arguments are provided, values are applied in this order:
  1. Template values - Loaded from JSON file
  2. CLI arguments - Override template values
This allows you to use templates as base configurations and customize specific parameters per run.
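One common way to implement this precedence, shown as a sketch rather than the script's exact code, is to give every argparse option a default of None so that "not passed" is distinguishable from "passed", then merge built-in defaults, the template, and explicit CLI flags in order:

```python
import argparse

DEFAULTS = {"chunk_size": 128, "batch_size": 256, "n_jobs": 1}

def resolve_config(template, cli_args):
    """Precedence: built-in defaults < template < explicit CLI flags."""
    parser = argparse.ArgumentParser()
    # Defaults of None let us tell "not passed" apart from "passed".
    parser.add_argument("--chunk-size", type=int, default=None)
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--n-jobs", type=int, default=None)
    args = vars(parser.parse_args(cli_args))
    overrides = {k: v for k, v in args.items() if v is not None}
    return {**DEFAULTS, **template, **overrides}

# Template sets n_jobs=4 and chunk_size=64; the CLI overrides n_jobs.
config = resolve_config({"n_jobs": 4, "chunk_size": 64}, ["--n-jobs", "8"])
print(config)  # {'chunk_size': 64, 'batch_size': 256, 'n_jobs': 8}
```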

Error Handling

Common error scenarios:

Missing input file:
$ python run_pipeline.py --input missing.csv
Error: Input file 'missing.csv' not found
Invalid template:
$ python run_pipeline.py --input data.csv --config-template invalid.json
Error: Failed to parse template file: invalid.json
Out of memory:
If the pipeline exceeds memory limits, consider:
  • Reducing --chunk-size and --batch-size
  • Lowering --max-memory-mb
  • Enabling --spill-to-disk
  • Reducing --n-jobs

Next Steps

Configuration Overview

Learn about all configuration options

Configuration Templates

Explore pre-built configuration templates