
Introduction

The NBA Data Preprocessing Pipeline uses a centralized configuration system through the PipelineConfig class. This allows you to control performance, resource usage, and behavior across different deployment environments.

Configuration Methods

You can configure the pipeline in three ways:
  1. Default Configuration - Use built-in defaults by instantiating PipelineConfig() without arguments
  2. JSON Templates - Load pre-configured templates optimized for specific environments
  3. CLI Arguments - Override settings via command-line arguments when running the pipeline
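
As a sketch of the second method, a JSON template can be read and unpacked into the config. This page does not specify the actual loading mechanism, so the `load_config` helper below and the minimal stand-in dataclass are illustrative assumptions; the real PipelineConfig defines many more fields.

```python
import json
from dataclasses import dataclass
from pathlib import Path

# Minimal stand-in for illustration only; the real PipelineConfig
# (pipeline/config.py) defines many more fields.
@dataclass
class PipelineConfig:
    chunk_size: int = 128
    batch_size: int = 256

def load_config(template_path: Path) -> PipelineConfig:
    # Hypothetical loader: assumes the template is a flat JSON object
    # whose keys match the dataclass field names.
    data = json.loads(template_path.read_text())
    return PipelineConfig(**data)
```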

Configuration Fields

All configuration options are defined in the PipelineConfig dataclass located at NBA Data Preprocessing/task/pipeline/config.py.

Performance Settings

chunk_size (int, default: 128)
Number of rows processed per chunk during streaming operations. Smaller values reduce memory usage but may decrease throughput.

batch_size (int, default: 256)
Batch size for vectorized operations and feature computations. Should typically be larger than chunk_size for optimal performance.

n_jobs (int, default: 1)
Number of parallel jobs for multi-threaded operations. Set to -1 to use all available CPU cores.
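
The -1 convention for n_jobs can be resolved against the machine's core count. The `resolve_n_jobs` helper below is not part of the pipeline's API; it is a sketch of the usual interpretation.

```python
import os

def resolve_n_jobs(n_jobs: int) -> int:
    # -1 means "use every available CPU core"; positive values are taken as-is.
    if n_jobs == -1:
        return os.cpu_count() or 1
    return n_jobs
```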

Resource Limits

max_memory_mb (int, default: 1024)
Maximum memory allocation in megabytes. The pipeline will adapt processing strategies to stay within this limit.

max_compute_units (float, default: 1.0)
Relative compute resource allocation (0.0 to 1.0). Used for throttling in resource-constrained environments.

spill_to_disk (bool, default: false)
Enable disk spilling when memory limits are reached. Essential for edge devices with limited RAM.
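
The interaction between max_memory_mb and spill_to_disk can be summarized as a simple predicate. This is an illustrative sketch of the decision, not the pipeline's actual internal logic.

```python
def should_spill(current_usage_mb: float, max_memory_mb: int, spill_to_disk: bool) -> bool:
    # Spill only when spilling is enabled AND the configured limit is reached;
    # with spill_to_disk disabled, the pipeline must adapt in memory instead.
    return spill_to_disk and current_usage_mb >= max_memory_mb
```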

Reliability Settings

adaptive_chunk_resize (bool, default: true)
Automatically adjust chunk sizes based on memory pressure and processing time. Improves stability under varying conditions.

max_chunk_retries (int, default: 3)
Maximum number of retry attempts for failed chunk processing before aborting.
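
The retry behavior implied by max_chunk_retries looks roughly like the loop below. The `process_with_retries` helper is a sketch under the assumption that the last error is re-raised once retries are exhausted.

```python
def process_with_retries(chunk, process, max_chunk_retries=3):
    # Attempt the chunk up to max_chunk_retries times; if every attempt
    # fails, re-raise the last error so the pipeline can abort.
    last_error = None
    for _attempt in range(max_chunk_retries):
        try:
            return process(chunk)
        except Exception as exc:
            last_error = exc
    raise last_error
```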

Benchmarking

benchmark_runs (int, default: 5)
Number of times to repeat benchmark operations for statistical analysis.

random_seed (int, default: 42)
Random seed for reproducible results across pipeline runs.
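
Reproducibility via random_seed works by seeding a random generator before any stochastic step, so repeated runs produce identical results. The `shuffled_rows` function is a hypothetical example, not part of the pipeline.

```python
import random

def shuffled_rows(n_rows: int, random_seed: int = 42) -> list:
    # Seeding a private Random instance makes the shuffle order reproducible
    # across runs without touching global random state.
    rng = random.Random(random_seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    return rows
```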

Output

output_dir (Path, default: artifacts)
Root directory for all pipeline outputs. Subdirectories are automatically created for:
  • intermediate/ - Intermediate processing results
  • reports/ - Pipeline execution reports
  • benchmarks/ - Performance benchmark data
  • metadata/ - Dataset and feature metadata
  • profiles/ - Data profiling results
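
A plausible sketch of what config.ensure_output_dirs() does, assuming it simply creates the root and the five subdirectories listed above if they are missing:

```python
from pathlib import Path

SUBDIRS = ("intermediate", "reports", "benchmarks", "metadata", "profiles")

def ensure_output_dirs(output_dir: Path) -> None:
    # Create the output root plus each expected subdirectory; existing
    # directories are left untouched (exist_ok=True).
    for name in SUBDIRS:
        (output_dir / name).mkdir(parents=True, exist_ok=True)
```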

Example: Custom Configuration

from pathlib import Path
from pipeline.config import PipelineConfig

# Create custom configuration
config = PipelineConfig(
    chunk_size=256,
    batch_size=512,
    n_jobs=4,
    max_memory_mb=4096,
    spill_to_disk=False,
    output_dir=Path('my_artifacts')
)

# Ensure output directories exist
config.ensure_output_dirs()

Environment-Specific Templates

For common deployment scenarios, use pre-configured templates:
  • Edge Template (pipeline.edge.template.json) - Optimized for resource-constrained edge devices
  • Server Template (pipeline.server.template.json) - Optimized for high-performance server environments
See the Templates page for detailed specifications.

Next Steps

Configuration Templates

Explore pre-built templates for edge and server deployments

CLI Reference

Learn all command-line options for running the pipeline
