
Overview

The PipelineConfig dataclass provides centralized configuration for all pipeline operations. It controls resource limits, reproducibility settings, and output behavior.

Class Definition

@dataclass(frozen=True)
class PipelineConfig
Source: ~/workspace/source/NBA Data Preprocessing/task/pipeline/config.py:8

Fields

random_seed (int, default: 42)
    Random seed for reproducibility across all pipeline operations
chunk_size (int, default: 128)
    Number of rows per chunk in streaming operations
batch_size (int, default: 256)
    Batch size for model training and evaluation
n_jobs (int, default: 1)
    Number of parallel jobs for constraint experiments. Set to -1 to use all CPUs
max_memory_mb (int, default: 1024)
    Maximum memory limit in megabytes for streaming operations
max_compute_units (float, default: 1.0)
    Maximum compute units (0.0-1.0 scale) for resource-constrained execution
benchmark_runs (int, default: 5)
    Number of runs for benchmark statistical analysis
adaptive_chunk_resize (bool, default: True)
    Enable automatic chunk size reduction when memory limits are exceeded
max_chunk_retries (int, default: 3)
    Maximum retry attempts for processing a chunk before failure
spill_to_disk (bool, default: False)
    Enable spilling of intermediate results to disk during streaming
output_dir (Path, default: Path('artifacts'))
    Root directory for all pipeline outputs

Methods

ensure_output_dirs

def ensure_output_dirs(self) -> None
Creates the output directory structure for pipeline artifacts.
Returns: None
Creates output_dir and its subdirectories: intermediate, reports, benchmarks, metadata, and profiles
Example:
from pathlib import Path
from pipeline.config import PipelineConfig

config = PipelineConfig(
    random_seed=42,
    chunk_size=256,
    max_memory_mb=2048,
    output_dir=Path('my_pipeline_outputs')
)

config.ensure_output_dirs()
# Creates:
# - my_pipeline_outputs/
# - my_pipeline_outputs/intermediate/
# - my_pipeline_outputs/reports/
# - my_pipeline_outputs/benchmarks/
# - my_pipeline_outputs/metadata/
# - my_pipeline_outputs/profiles/
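The directory creation above likely amounts to a few idempotent mkdir calls. A minimal standalone sketch (not the actual implementation from pipeline/config.py; the free-function form and the use of mkdir(parents=True, exist_ok=True) are assumptions):

```python
from pathlib import Path
import tempfile

def ensure_output_dirs(output_dir: Path) -> None:
    """Sketch: create the output root and each documented subdirectory."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for sub in ("intermediate", "reports", "benchmarks", "metadata", "profiles"):
        # exist_ok makes repeated calls safe, so pipeline restarts are harmless
        (output_dir / sub).mkdir(exist_ok=True)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_pipeline_outputs"
    ensure_output_dirs(root)
    print(sorted(p.name for p in root.iterdir()))
    # ['benchmarks', 'intermediate', 'metadata', 'profiles', 'reports']
```

Because every mkdir call passes exist_ok, the method can be invoked at any point in a run without clobbering existing artifacts.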

Usage Examples

Basic Configuration

from pipeline.config import PipelineConfig

# Use default settings
config = PipelineConfig()

print(config.random_seed)  # 42
print(config.chunk_size)   # 128

Resource-Constrained Configuration

from pathlib import Path
from pipeline.config import PipelineConfig

# Configure for low-memory environment
config = PipelineConfig(
    random_seed=123,
    chunk_size=64,
    batch_size=128,
    max_memory_mb=512,
    max_compute_units=0.5,
    adaptive_chunk_resize=True,
    output_dir=Path('limited_resources_run')
)

config.ensure_output_dirs()

High-Performance Configuration

from pathlib import Path
from pipeline.config import PipelineConfig

# Configure for maximum performance
config = PipelineConfig(
    random_seed=42,
    chunk_size=512,
    batch_size=1024,
    n_jobs=-1,  # Use all CPU cores
    max_memory_mb=8192,
    max_compute_units=1.0,
    benchmark_runs=10,
    spill_to_disk=False,
    output_dir=Path('high_performance_run')
)

Debugging Configuration

from pathlib import Path
from pipeline.config import PipelineConfig

# Enable disk spilling and adaptive resizing for troubleshooting
config = PipelineConfig(
    random_seed=42,
    chunk_size=128,
    adaptive_chunk_resize=True,
    max_chunk_retries=5,
    spill_to_disk=True,
    output_dir=Path('debug_run')
)

Notes

  • The dataclass is frozen (immutable) to ensure configuration consistency throughout pipeline execution
  • All pipeline components accept a PipelineConfig instance and respect its settings
  • Resource limits (max_memory_mb, max_compute_units) are soft limits that trigger adaptive behavior rather than hard failures
  • When adaptive_chunk_resize=True, the pipeline automatically reduces chunk sizes if memory limits are exceeded
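Because the dataclass is frozen, a configuration cannot be mutated after construction; derive variants with dataclasses.replace instead. A self-contained sketch using a stand-in dataclass with a subset of the fields (the real class lives in pipeline.config):

```python
from dataclasses import dataclass, replace, FrozenInstanceError
from pathlib import Path

# Stand-in mirroring a subset of the frozen PipelineConfig fields
@dataclass(frozen=True)
class PipelineConfig:
    random_seed: int = 42
    chunk_size: int = 128
    output_dir: Path = Path('artifacts')

config = PipelineConfig()

# Direct assignment raises because the instance is immutable
try:
    config.chunk_size = 256
except FrozenInstanceError:
    print("PipelineConfig is immutable")

# dataclasses.replace builds a modified copy without touching the original
low_mem = replace(config, chunk_size=64)
print(low_mem.chunk_size, config.chunk_size)  # 64 128
```

This pattern lets one base configuration spawn per-experiment variants while guaranteeing that components already holding the original instance never see it change underneath them.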
