Seed Control
All random operations are controlled by a global seed defined in config.yaml.
Configuration
config.yaml
Implementation
The seed is set globally before any random operations:
src/data.py
src/train.py
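A minimal sketch of global seeding, assuming the parsed config exposes a top-level `seed` key and that the pipeline draws randomness from Python's `random` module and, where installed, NumPy. The `set_global_seed` helper and the `seed` key name are hypothetical and are not taken from the project's source.

```python
import random


def set_global_seed(seed: int) -> None:
    """Seed the stdlib RNG and, if it is installed, NumPy's legacy global RNG."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency in this sketch
    except ImportError:
        return
    np.random.seed(seed)


# Hypothetical stand-in for the mapping parsed from config.yaml.
config = {"seed": 42}

# Seeding twice with the same value replays the same random sequence.
set_global_seed(config["seed"])
first = [random.random() for _ in range(3)]
set_global_seed(config["seed"])
second = [random.random() for _ in range(3)]
```

Calling the helper once at the top of each entry point, before any data loading or model construction, keeps later random draws tied to the configured seed.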
Seeded Components
The seed propagates to all stochastic operations:
Configuration-Driven Training
All training behavior is controlled by declarative configuration, eliminating hard-coded parameters.
Complete Configuration
config.yaml
Configuration Loading
Configuration is loaded once and passed through the pipeline:
src/data.py
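The load-once pattern might look roughly like the sketch below. It assumes the YAML file has already been parsed into a mapping (for example with PyYAML's `yaml.safe_load`), and the `TrainConfig` fields, `build_config`, `load_data`, and `train` are hypothetical names chosen for illustration rather than the project's actual API.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical subset of config.yaml; the real file may hold other keys."""
    seed: int
    n_splits: int


def build_config(raw: Mapping[str, Any]) -> TrainConfig:
    # `raw` is the mapping a YAML parser would return for config.yaml.
    return TrainConfig(seed=int(raw["seed"]), n_splits=int(raw["n_splits"]))


def load_data(cfg: TrainConfig) -> str:
    # Each stage receives the same config object instead of re-reading the file.
    return f"data prepared with seed={cfg.seed}"


def train(cfg: TrainConfig) -> str:
    return f"{load_data(cfg)}; trained with n_splits={cfg.n_splits}"


# Built once at the entry point, then passed down the pipeline.
cfg = build_config({"seed": 42, "n_splits": 5})
summary = train(cfg)
```

Freezing the dataclass is one way to make sure no stage silently mutates a value that another stage relies on.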
Configuration Validation
The configuration structure is validated at runtime through type checking and bounds validation in model initialization.
Lineage Tracking
Every training run generates a lineage manifest with SHA256 hashes of all inputs and outputs.
Lineage Generation
The training script computes hashes of all artifacts:
src/train.py
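As a hedged illustration of artifact hashing, the sketch below streams each file through `hashlib.sha256` and collects the digests into a manifest. The `sha256_file` and `build_lineage` helpers and the `inputs`/`outputs` manifest keys are assumptions made for this example; the project's real manifest layout may differ.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def sha256_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_lineage(inputs: Iterable[Path], outputs: Iterable[Path]) -> Dict[str, Dict[str, str]]:
    # Hypothetical manifest layout: file path mapped to its hex digest.
    return {
        "inputs": {str(p): sha256_file(p) for p in inputs},
        "outputs": {str(p): sha256_file(p) for p in outputs},
    }
```

A manifest built this way could then be serialized with `json.dump` once training has written its outputs.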
Lineage Structure
The lineage file (artifacts/lineage.json) records complete provenance:
artifacts/lineage.json
Reproducibility Verification
The reproducibility check script validates that artifacts match their lineage hashes.
Running the Check
Implementation
scripts/reproducibility_check.py
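A check of this kind could be sketched as follows: recompute each artifact's SHA256 and compare it with the recorded value. The manifest layout (`inputs` and `outputs` sections mapping file paths to hex digests), the `verify_lineage` function, and the report keys are all hypothetical and stand in for whatever the real script uses.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_lineage(lineage_path: Path) -> dict:
    """Recompute every recorded hash and report per-file match status."""
    recorded = json.loads(lineage_path.read_text())
    results = {}
    for section in ("inputs", "outputs"):
        for name, expected in recorded.get(section, {}).items():
            path = Path(name)
            # A missing file counts as a mismatch rather than raising.
            actual = sha256_file(path) if path.exists() else None
            results[name] = {
                "expected": expected,
                "actual": actual,
                "match": actual == expected,
            }
    return {
        "all_match": all(entry["match"] for entry in results.values()),
        "files": results,
    }
```

Returning a structured result rather than exiting immediately lets the caller both write a report file and choose the process exit code.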
Verification Report
The script generates artifacts/reproducibility_report.json:
artifacts/reproducibility_report.json
Environment Tracking
The script also captures environment metadata:
scripts/reproducibility_check.py
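Environment capture can be done with the standard library alone. The sketch below is an assumption about what such metadata might include; the field names and the `capture_environment` and `package_versions` helpers are hypothetical, not the script's actual output schema.

```python
import platform
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional


def capture_environment() -> Dict[str, str]:
    """Collect interpreter and OS details that can affect reproducibility."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }


def package_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Look up installed versions; packages that are not installed map to None."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Recording library versions alongside the hashes helps explain a mismatch that comes from an upgraded dependency rather than from changed data or config.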
Reproducibility Guarantees
Given identical inputs (dataset + config), training produces:
- Identical model parameters (verified by SHA256)
- Identical threshold values
- Identical cross-validation splits
- Identical metric values
Best Practices
CI Integration
Add reproducibility checks to CI pipelines:
.github/workflows/ci.yml