Reproducibility is enforced through three mechanisms: global seed control, configuration-driven training, and cryptographic lineage tracking. Every training run produces verifiable artifacts that can be reproduced bit-for-bit in an identical environment (see Limitations).

Seed Control

All random operations are controlled by a global seed defined in config.yaml.

Configuration

config.yaml
seed: 42

Implementation

The seed is set globally before any random operations:
src/data.py
import random

import numpy as np

def set_global_seed(seed: int) -> None:
    """Seed both Python's and NumPy's global RNGs."""
    random.seed(seed)
    np.random.seed(seed)
All training operations initialize from this seed:
src/train.py
def main() -> None:
    config = load_config()
    set_global_seed(int(config["seed"]))
    
    # All subsequent operations use seeded randomness
    df = load_dataset(config)
    X_train, X_test, y_train, y_test = split_data(df, config)
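The effect of global seeding can be sanity-checked in isolation. A minimal sketch using only the standard library's random module (the helper mirrors the project's set_global_seed, reduced to the Python RNG):

```python
import random

def set_global_seed(seed: int) -> None:
    # Mirrors the project's helper for Python's built-in RNG.
    random.seed(seed)

# Two independently seeded "runs" must produce identical draws.
set_global_seed(42)
run_a = [random.random() for _ in range(5)]

set_global_seed(42)
run_b = [random.random() for _ in range(5)]

assert run_a == run_b  # bit-identical floats, run after run
```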

Seeded Components

The seed propagates to every stochastic operation. The train/test split, for example, draws its random_state from the configured seed:
src/data.py
return train_test_split(
    X,
    y,
    test_size=float(config["data"]["test_size"]),
    random_state=int(config["seed"]),
    stratify=y,
)
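The same principle can be illustrated without scikit-learn: when the RNG driving the shuffle is seeded, the split becomes a pure function of (data, seed). A stdlib-only sketch (split_indices is hypothetical, not project code):

```python
import random

def split_indices(n: int, test_size: float, seed: int) -> tuple[list[int], list[int]]:
    # Shuffle deterministically with a seeded, private RNG instance.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_size)
    return idx[n_test:], idx[:n_test]  # (train, test)

# Identical seed -> identical split, every run.
assert split_indices(100, 0.2, 42) == split_indices(100, 0.2, 42)
```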

Configuration-Driven Training

All training behavior is controlled by declarative configuration, eliminating hard-coded parameters.

Complete Configuration

config.yaml
seed: 42

data:
  path: ml_datasource.csv
  target: purchased
  test_size: 0.2

features:
  epsilon: 1.0e-06
  engagement:
    minutes_watched_weight: 0.6
    days_on_platform_weight: 0.3
    courses_started_weight: 10.0

preprocessing:
  outlier_factor: 1.5
  numeric_imputer: median
  categorical_imputer: most_frequent

models:
  logistic_regression:
    max_iter: 2000
  knn:
    n_neighbors: 7
  svm:
    C: 1.0
    kernel: rbf
    gamma: scale
  decision_tree:
    max_depth: 8
    min_samples_leaf: 10
  random_forest:
    n_estimators: 400
    min_samples_leaf: 2

cv:
  n_splits: 5

business:
  target_precision: 0.9

artifacts:
  model_dir: artifacts
  model_file: best_model.joblib
  threshold_file: threshold.txt
  metrics_file: metrics.json
  drift_baseline_file: drift_baseline.json
  lineage_file: lineage.json

Configuration Loading

Configuration is loaded once and passed through the pipeline:
src/data.py
def load_config(path: str | Path = "config.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

Configuration Validation

There is no schema validation step; values are cast at the point of use (e.g. int(config["seed"]), float(config["data"]["test_size"])), and the scikit-learn estimators validate types and bounds when the models are initialized.
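If earlier failure is preferred, an explicit validator can check required keys and bounds right after loading. A hypothetical sketch (validate_config is not project code):

```python
def validate_config(config: dict) -> None:
    # Fail fast on missing keys or out-of-range values.
    int(config["seed"])  # raises KeyError/ValueError if absent or non-numeric
    test_size = float(config["data"]["test_size"])
    if not 0.0 < test_size < 1.0:
        raise ValueError(f"data.test_size must be in (0, 1), got {test_size}")
    n_splits = int(config["cv"]["n_splits"])
    if n_splits < 2:
        raise ValueError(f"cv.n_splits must be >= 2, got {n_splits}")

validate_config({"seed": 42, "data": {"test_size": 0.2}, "cv": {"n_splits": 5}})
```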

Lineage Tracking

Every training run generates a lineage manifest with SHA256 hashes of all inputs and outputs.

Lineage Generation

The training script computes hashes of all artifacts:
src/train.py
def _sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def _sha256_file(path: Path) -> str:
    return _sha256_bytes(path.read_bytes())

# After training completes
run_id = str(uuid.uuid4())
config_hash = _sha256_file(Path("config.yaml"))
dataset_hash = _sha256_file(Path(config["data"]["path"]))
model_hash = _sha256_file(model_path)

lineage = {
    "run_id": run_id,
    "dataset": {
        "path": config["data"]["path"],
        "sha256": dataset_hash,
    },
    "config": {
        "path": "config.yaml",
        "sha256": config_hash,
    },
    "model": {
        "path": str(model_path),
        "sha256": model_hash,
    },
    "threshold": {
        "path": str(threshold_path),
        "sha256": _sha256_file(threshold_path),
    },
}
lineage_path.write_text(json.dumps(lineage, indent=2), encoding="utf-8")
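The manifest round-trip can be exercised end to end in isolation. A self-contained sketch using a temporary directory and stand-in files in place of the real artifacts:

```python
import hashlib
import json
import tempfile
import uuid
from pathlib import Path

def _sha256_file(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Stand-ins for the real config and model artifacts.
    (root / "config.yaml").write_text("seed: 42\n", encoding="utf-8")
    (root / "model.bin").write_bytes(b"\x00model-bytes")

    lineage = {
        "run_id": str(uuid.uuid4()),
        "config": {"path": "config.yaml", "sha256": _sha256_file(root / "config.yaml")},
        "model": {"path": "model.bin", "sha256": _sha256_file(root / "model.bin")},
    }
    (root / "lineage.json").write_text(json.dumps(lineage, indent=2), encoding="utf-8")

    # Re-reading the manifest and re-hashing the files must agree.
    loaded = json.loads((root / "lineage.json").read_text(encoding="utf-8"))
    for entry in ("config", "model"):
        assert _sha256_file(root / loaded[entry]["path"]) == loaded[entry]["sha256"]
```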

Lineage Structure

The lineage file (artifacts/lineage.json) records complete provenance:
artifacts/lineage.json
{
  "run_id": "a3f2b891-4c5d-4e2f-9a1b-8c3d5e6f7a8b",
  "dataset": {
    "path": "ml_datasource.csv",
    "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
  },
  "config": {
    "path": "config.yaml",
    "sha256": "d4f5e6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5"
  },
  "model": {
    "path": "artifacts/best_model.joblib",
    "sha256": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2"
  },
  "threshold": {
    "path": "artifacts/threshold.txt",
    "sha256": "b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3"
  }
}

Reproducibility Verification

The reproducibility check script validates that artifacts match their lineage hashes.

Running the Check

python scripts/reproducibility_check.py

Implementation

scripts/reproducibility_check.py
def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def main() -> None:
    lineage_path = Path("artifacts/lineage.json")
    if not lineage_path.exists():
        raise FileNotFoundError("Missing artifacts/lineage.json. Run training first.")
    
    lineage = json.loads(lineage_path.read_text(encoding="utf-8"))
    checks = {
        "dataset": (Path(lineage["dataset"]["path"]), lineage["dataset"]["sha256"]),
        "config": (Path(lineage["config"]["path"]), lineage["config"]["sha256"]),
        "model": (Path(lineage["model"]["path"]), lineage["model"]["sha256"]),
        "threshold": (Path(lineage["threshold"]["path"]), lineage["threshold"]["sha256"]),
    }
    
    report = {"run_id": lineage.get("run_id"), "checks": {}}
    all_passed = True
    for name, (path, expected) in checks.items():
        actual = sha256(path)
        passed = actual == expected
        all_passed &= passed
        report["checks"][name] = {
            "path": str(path),
            "expected": expected,
            "actual": actual,
            "passed": passed,
        }
    
    report["passed"] = all_passed
    
    Path("artifacts/reproducibility_report.json").write_text(json.dumps(report, indent=2), encoding="utf-8")
    print(json.dumps(report, indent=2))
    
    if not all_passed:
        raise SystemExit(1)
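The failure path is worth exercising too: if an artifact is modified after training, the recomputed hash must diverge from the recorded one. A stdlib-only sketch simulating tampering:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "threshold.txt"
    artifact.write_text("0.730000", encoding="utf-8")
    expected = sha256(artifact)           # hash recorded at "training time"

    assert sha256(artifact) == expected   # untouched artifact passes

    artifact.write_text("0.500000", encoding="utf-8")  # simulate tampering
    tampered_matches = sha256(artifact) == expected
    assert not tampered_matches           # the check now fails, as it should
```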

Verification Report

The script generates artifacts/reproducibility_report.json:
artifacts/reproducibility_report.json
{
  "run_id": "a3f2b891-4c5d-4e2f-9a1b-8c3d5e6f7a8b",
  "checks": {
    "dataset": {
      "path": "ml_datasource.csv",
      "expected": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "actual": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "passed": true
    },
    "config": {
      "path": "config.yaml",
      "expected": "d4f5e6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5",
      "actual": "d4f5e6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5",
      "passed": true
    },
    "model": {
      "path": "artifacts/best_model.joblib",
      "expected": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2",
      "actual": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2",
      "passed": true
    },
    "threshold": {
      "path": "artifacts/threshold.txt",
      "expected": "b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3",
      "actual": "b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3",
      "passed": true
    }
  },
  "passed": true
}

Environment Tracking

The script also captures environment metadata:
scripts/reproducibility_check.py
environment = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "cwd": str(Path.cwd()),
}

Path("artifacts/reproducibility_environment.json").write_text(json.dumps(environment, indent=2), encoding="utf-8")
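Library versions are part of the environment too, since the Limitations below note that differing versions break reproducibility. One way to extend the snapshot is Python's importlib.metadata (the package list and function name here are illustrative, not project code):

```python
import json
import platform
import sys
from importlib import metadata

def capture_environment(packages: list[str]) -> dict:
    # Record interpreter, OS, and installed versions of key dependencies.
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python_version": sys.version,
        "platform": platform.platform(),
        "packages": versions,
    }

env = capture_environment(["numpy", "scikit-learn"])
print(json.dumps(env, indent=2))
```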

Reproducibility Guarantees

Given identical inputs (dataset + config), training produces:
  • Identical model parameters (verified by SHA256)
  • Identical threshold values
  • Identical cross-validation splits
  • Identical metric values

Best Practices

1. Version Configuration: Commit config.yaml to version control alongside code.
2. Track Dataset Versions: Store dataset hashes in config/datasets.yaml and validate them before training.
3. Verify Lineage: Run python scripts/reproducibility_check.py in CI to detect accidental changes.
4. Archive Artifacts: Store lineage manifests with model artifacts for audit trails.
5. Document Seed Changes: Changing the seed produces different models; document why in commit messages.

CI Integration

Add reproducibility checks to CI pipelines:
.github/workflows/ci.yml
- name: Train model
  run: python -m src.train

- name: Verify reproducibility
  run: python scripts/reproducibility_check.py

- name: Archive lineage
  uses: actions/upload-artifact@v3
  with:
    name: lineage-${{ github.sha }}
    path: artifacts/lineage.json

Limitations

Reproducibility is not guaranteed when:
  • Using non-deterministic hardware (GPU with non-deterministic ops)
  • Parallel execution order varies (set n_jobs=1 for strict reproducibility)
  • Python/library versions differ
  • System-level randomness is not controlled
