Overview
The PipelineConfig dataclass provides centralized configuration for all pipeline operations. It controls resource limits, reproducibility settings, and output behavior.
Class Definition
~/workspace/source/NBA Data Preprocessing/task/pipeline/config.py:8
Fields
Random seed for reproducibility across all pipeline operations
Number of rows per chunk in streaming operations
Batch size for model training and evaluation
Number of parallel jobs for constraint experiments. Set to -1 for all CPUs
Maximum memory limit in megabytes for streaming operations
Maximum compute units (0.0-1.0 scale) for resource-constrained execution
Number of runs for benchmark statistical analysis
Enable automatic chunk size reduction when memory limits are exceeded
Maximum retry attempts for processing a chunk before failure
Enable spilling intermediate results to disk during streaming
Root directory for all pipeline outputs
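The field list above could map onto a frozen dataclass roughly as sketched below. This is an illustration, not the contents of config.py: only max_memory_mb, max_compute_units, adaptive_chunk_resize, and output_dir are names that appear elsewhere on this page. Every other field name, all types, and all default values are assumptions chosen to match the descriptions.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass(frozen=True)
class PipelineConfig:
    # Names marked "assumed" and ALL default values are illustrative guesses,
    # not taken from config.py.
    random_seed: int = 42               # assumed name: seed for reproducibility
    chunk_size: int = 10_000            # assumed name: rows per streaming chunk
    batch_size: int = 256               # assumed name: training/evaluation batch size
    n_jobs: int = 1                     # assumed name: -1 means all CPUs
    max_memory_mb: int = 1024           # name from the Notes: memory limit in MB
    max_compute_units: float = 1.0      # name from the Notes: 0.0-1.0 scale
    benchmark_runs: int = 5             # assumed name: runs for benchmark statistics
    adaptive_chunk_resize: bool = True  # name from the Notes
    max_chunk_retries: int = 3          # assumed name: retries per chunk before failure
    spill_to_disk: bool = False         # assumed name: spill intermediates to disk
    output_dir: Path = Path("outputs")  # root directory for pipeline outputs


config = PipelineConfig()
field_names = [f.name for f in fields(config)]
print(field_names)
```

Printing the field names is a quick way to compare a sketch like this against the real class once config.py is at hand.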
Methods
ensure_output_dirs
Creates directories:
output_dir, intermediate, reports, benchmarks, metadata, and profiles
Usage Examples
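As a first example, the directory creation performed by ensure_output_dirs can be sketched as follows. OutputConfig is a hypothetical stand-in for PipelineConfig, and placing the five subdirectories directly under output_dir is an assumption; the real method may lay them out differently.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

# Subdirectory names come from the list in this document; treating them as
# direct children of output_dir is an assumption.
SUBDIRS = ("intermediate", "reports", "benchmarks", "metadata", "profiles")


@dataclass(frozen=True)
class OutputConfig:  # hypothetical stand-in for PipelineConfig
    output_dir: Path

    def ensure_output_dirs(self) -> None:
        # parents=True creates output_dir (and any missing ancestors);
        # exist_ok=True makes repeated calls safe.
        self.output_dir.mkdir(parents=True, exist_ok=True)
        for name in SUBDIRS:
            (self.output_dir / name).mkdir(exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    cfg = OutputConfig(output_dir=Path(tmp) / "outputs")
    cfg.ensure_output_dirs()
    cfg.ensure_output_dirs()  # second call changes nothing
    created = sorted(p.name for p in cfg.output_dir.iterdir())
    print(created)
```

Calling the method before the first write means downstream stages do not each need their own existence checks.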
Basic Configuration
Resource-Constrained Configuration
High-Performance Configuration
Debugging Configuration
Notes
- The dataclass is frozen (immutable) to ensure configuration consistency throughout pipeline execution
- All pipeline components accept a PipelineConfig instance and respect its settings
- Resource limits (max_memory_mb, max_compute_units) are soft limits that trigger adaptive behavior rather than hard failures
- When adaptive_chunk_resize=True, the pipeline automatically reduces chunk sizes if memory limits are exceeded
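The notes above can be illustrated with a minimal sketch. MiniConfig and next_chunk_size are hypothetical, chunk_size is an assumed field name, and halving the chunk size is an assumed policy; the source does not say how the pipeline computes the reduced size. The sketch shows that a frozen dataclass rejects in-place assignment, that dataclasses.replace produces a modified copy, and that exceeding the memory limit leads to a smaller chunk rather than an error.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class MiniConfig:  # hypothetical stand-in for PipelineConfig
    chunk_size: int = 10_000  # assumed field name and default
    max_memory_mb: int = 1024
    adaptive_chunk_resize: bool = True


def next_chunk_size(config: MiniConfig, observed_mb: float, min_rows: int = 100) -> int:
    """Return the chunk size to use after observing memory use for one chunk."""
    if config.adaptive_chunk_resize and observed_mb > config.max_memory_mb:
        # Soft limit: shrink instead of failing. Halving is an assumed policy.
        return max(min_rows, config.chunk_size // 2)
    return config.chunk_size


base = MiniConfig()
try:
    base.chunk_size = 500  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False

# To vary settings, build a new instance instead of mutating the old one.
small = replace(base, chunk_size=2_000, max_memory_mb=256)
print(mutated, base.chunk_size, small.chunk_size)
print(next_chunk_size(small, observed_mb=300))  # over the limit: reduced
print(next_chunk_size(small, observed_mb=100))  # under the limit: unchanged
```

Because replace returns a new object, a component holding the original instance keeps seeing the settings it was given.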