Introduction
The NBA Data Preprocessing Pipeline uses a centralized configuration system built around the PipelineConfig class. It lets you control performance, resource usage, and behavior across different deployment environments.
Configuration Methods
You can configure the pipeline in three ways:
- Default Configuration - Use built-in defaults by instantiating PipelineConfig() without arguments
- JSON Templates - Load pre-configured templates optimized for specific environments
- CLI Arguments - Override settings via command-line arguments when running the pipeline
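The sketch below illustrates how command-line arguments might be layered over the built-in defaults. It is a minimal example under stated assumptions: the PipelineConfig shown here is a stand-in with hypothetical fields (chunk_size, n_jobs), and the flag names (--chunk-size, --n-jobs) are illustrative rather than the pipeline's actual CLI.

```python
import argparse
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical stand-in: the real class lives in pipeline/config.py,
    # and its field names and defaults may differ.
    chunk_size: int = 10_000
    n_jobs: int = 1


def config_from_cli(argv):
    """Start from the defaults and override only the flags that were passed."""
    parser = argparse.ArgumentParser()
    for f in fields(PipelineConfig):
        # default=None distinguishes "flag not passed" from an explicit value.
        flag = "--" + f.name.replace("_", "-")
        parser.add_argument(flag, type=f.type, default=None)
    args = parser.parse_args(argv)
    overrides = {k: v for k, v in vars(args).items() if v is not None}
    return replace(PipelineConfig(), **overrides)


if __name__ == "__main__":
    # Only chunk_size is overridden; n_jobs keeps its default.
    print(config_from_cli(["--chunk-size", "500"]))
```

Keeping the defaults in one dataclass and treating CLI flags as sparse overrides means an omitted flag never resets a value.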
Configuration Fields
All configuration options are defined in the PipelineConfig dataclass located at NBA Data Preprocessing/task/pipeline/config.py.
Performance Settings
Number of rows processed per chunk during streaming operations. Smaller values reduce memory usage but may decrease throughput.
Batch size for vectorized operations and feature computations. Should typically be larger than chunk_size for optimal performance.
Number of parallel jobs for multi-threaded operations. Set to -1 to use all available CPU cores.
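To make the performance settings concrete, here is a minimal sketch of how a chunk size and a job count of -1 could be interpreted. The helper names (resolve_n_jobs, iter_chunks) and the parameter name n_jobs are assumptions for illustration, not functions taken from the pipeline.

```python
import os
from typing import Iterator, Sequence


def resolve_n_jobs(n_jobs: int) -> int:
    """Map -1 to the number of available CPU cores; pass positive values through."""
    if n_jobs == -1:
        return os.cpu_count() or 1  # os.cpu_count() can return None
    if n_jobs < 1:
        raise ValueError(f"n_jobs must be -1 or a positive integer, got {n_jobs}")
    return n_jobs


def iter_chunks(rows: Sequence, chunk_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


if __name__ == "__main__":
    rows = list(range(10))
    for chunk in iter_chunks(rows, chunk_size=4):
        print(len(chunk), "rows")
    print("workers:", resolve_n_jobs(-1))
```

Smaller chunks keep fewer rows in memory at once, at the cost of more iterations over the same data.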
Resource Limits
Maximum memory allocation in megabytes. The pipeline will adapt processing strategies to stay within this limit.
Relative compute resource allocation (0.0 to 1.0). Used for throttling in resource-constrained environments.
Enable disk spilling when memory limits are reached. Essential for edge devices with limited RAM.
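The resource limits above could be validated and consulted along the following lines. This is a sketch only: ResourceLimits, its field names (max_memory_mb, compute_fraction, spill_to_disk), its defaults, and the should_spill helper are all hypothetical, and the real fields are defined on PipelineConfig.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceLimits:
    # Hypothetical field names and defaults, for illustration only.
    max_memory_mb: int = 1024
    compute_fraction: float = 1.0
    spill_to_disk: bool = False

    def __post_init__(self):
        # Reject values outside the documented ranges as early as possible.
        if self.max_memory_mb <= 0:
            raise ValueError("max_memory_mb must be positive")
        if not 0.0 <= self.compute_fraction <= 1.0:
            raise ValueError("compute_fraction must be between 0.0 and 1.0")


def should_spill(used_mb: float, limits: ResourceLimits) -> bool:
    """Spill only when spilling is enabled and usage has reached the limit."""
    return limits.spill_to_disk and used_mb >= limits.max_memory_mb


if __name__ == "__main__":
    edge = ResourceLimits(max_memory_mb=256, compute_fraction=0.5, spill_to_disk=True)
    print(should_spill(300, edge))
```

Validating at construction time surfaces a bad value such as a compute allocation of 1.5 before any data is processed.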
Reliability Settings
Automatically adjust chunk sizes based on memory pressure and processing time. Improves stability under varying conditions.
Maximum number of retry attempts for failed chunk processing before aborting.
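The two reliability behaviors can be sketched as follows. The policy shown (halve the chunk above 90% memory pressure, double it below 50%) and both function names are assumptions chosen for illustration. The pipeline's actual adjustment rule is not documented here, and this sketch does not model processing time.

```python
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def next_chunk_size(current: int, used_mb: float, max_memory_mb: float,
                    min_size: int = 100, max_size: int = 1_000_000) -> int:
    """Halve the chunk under high memory pressure, double it when there is headroom."""
    pressure = used_mb / max_memory_mb
    if pressure > 0.9:       # illustrative threshold
        proposed = current // 2
    elif pressure < 0.5:     # illustrative threshold
        proposed = current * 2
    else:
        proposed = current
    # Clamp so the size never collapses to zero or grows without bound.
    return max(min_size, min(max_size, proposed))


def process_with_retries(chunk: T, process: Callable[[T], R], max_retries: int) -> R:
    """Call process(chunk); retry up to max_retries times, then re-raise."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    attempt = 0
    while True:
        try:
            return process(chunk)
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: abort
            attempt += 1


if __name__ == "__main__":
    print(next_chunk_size(10_000, used_mb=950, max_memory_mb=1000))
    print(process_with_retries([1, 2, 3], sum, max_retries=2))
```

With max_retries set to 3, a chunk is attempted at most four times: the initial attempt plus three retries.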
Benchmarking
Number of times to repeat benchmark operations for statistical analysis.
Random seed for reproducible results across pipeline runs.
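As a rough illustration of what the repeat count and seed control, the sketch below times a callable several times and draws a seeded random sample. The function names and the returned keys are hypothetical. The pipeline's own benchmark harness may measure and report differently.

```python
import random
import statistics
import time
from typing import Callable, Dict, List


def benchmark(fn: Callable[[], object], repeats: int) -> Dict[str, float]:
    """Run fn `repeats` times and summarize the wall-clock timings."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "mean_s": statistics.mean(timings),
        # stdev needs at least two samples
        "stdev_s": statistics.stdev(timings) if repeats > 1 else 0.0,
    }


def seeded_sample(seed: int, n: int = 5) -> List[float]:
    """Draw n values from a generator seeded independently of global state."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    print(benchmark(lambda: sum(range(100_000)), repeats=5))
    print(seeded_sample(42) == seeded_sample(42))
```

More repeats tighten the estimate of the mean at the cost of longer benchmark runs. A fixed seed makes any randomized step repeatable between runs.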
Output
Root directory for all pipeline outputs. Subdirectories are automatically created for:
- intermediate/ - Intermediate processing results
- reports/ - Pipeline execution reports
- benchmarks/ - Performance benchmark data
- metadata/ - Dataset and feature metadata
- profiles/ - Data profiling results
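The directory layout listed above can be reproduced with a few lines of pathlib. This is a sketch of the idea, not the pipeline's own code: ensure_output_dirs is a hypothetical helper, and only the subdirectory names come from the list above.

```python
from pathlib import Path
from typing import Dict, Union

# Subdirectory names as listed in the documentation.
SUBDIRS = ("intermediate", "reports", "benchmarks", "metadata", "profiles")


def ensure_output_dirs(root: Union[str, Path]) -> Dict[str, Path]:
    """Create the output root and its subdirectories if they do not exist yet."""
    root = Path(root)
    paths = {}
    for name in SUBDIRS:
        path = root / name
        # parents=True also creates the root; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        for name, path in ensure_output_dirs(Path(tmp) / "output").items():
            print(name, "->", path)
```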
Example: Custom Configuration
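A custom configuration amounts to passing keyword arguments for the fields you want to change. The sketch below assumes a stand-in PipelineConfig whose field names and defaults are hypothetical. Check pipeline/config.py for the actual names before copying it.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical stand-in for the real dataclass in pipeline/config.py.
    # Field names and default values here are illustrative assumptions.
    chunk_size: int = 10_000
    batch_size: int = 50_000
    n_jobs: int = 1
    max_memory_mb: int = 4096
    spill_to_disk: bool = False
    max_retries: int = 3
    random_seed: int = 42
    output_dir: str = "output"


# Override only the fields that differ from the defaults.
config = PipelineConfig(
    chunk_size=5_000,        # smaller chunks to reduce memory usage
    batch_size=20_000,       # kept larger than chunk_size
    n_jobs=-1,               # use all available CPU cores
    max_memory_mb=2048,
    spill_to_disk=True,
    output_dir="runs/custom",
)

if __name__ == "__main__":
    for name, value in asdict(config).items():
        print(f"{name} = {value}")
```

Fields that are not passed, such as max_retries in this sketch, keep their default values.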
Environment-Specific Templates
For common deployment scenarios, use pre-configured templates:
- Edge Template (pipeline.edge.template.json) - Optimized for resource-constrained edge devices
- Server Template (pipeline.server.template.json) - Optimized for high-performance server environments
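Loading a template can be as simple as reading the JSON file and passing its keys to the dataclass. The following is a minimal sketch under assumptions: load_template is a hypothetical helper, the stand-in PipelineConfig uses illustrative field names, and the template is assumed to be a flat JSON object whose keys match those fields.

```python
import json
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Union


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical stand-in; the real fields are defined in pipeline/config.py.
    chunk_size: int = 10_000
    max_memory_mb: int = 4096
    spill_to_disk: bool = False


def load_template(path: Union[str, Path]) -> PipelineConfig:
    """Read a flat JSON object and build a config, rejecting unknown keys."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    known = {f.name for f in fields(PipelineConfig)}
    unknown = set(data) - known
    if unknown:
        # Fail loudly so a misspelled key is not silently ignored.
        raise ValueError(f"Unknown configuration keys: {sorted(unknown)}")
    return PipelineConfig(**data)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        template = Path(tmp) / "pipeline.edge.template.json"
        template.write_text(json.dumps({"chunk_size": 1000, "spill_to_disk": True}))
        print(load_template(template))
```

Keys missing from the template fall back to the dataclass defaults, so a template only needs to list the settings that differ for its environment.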
Next Steps
- Configuration Templates - Explore pre-built templates for edge and server deployments
- CLI Reference - Learn all command-line options for running the pipeline