Reproducibility is enforced through three mechanisms: global seed control, configuration-driven training, and cryptographic lineage tracking. Every training run produces verifiable artifacts that can be reproduced bit-for-bit in an identical environment (see Limitations).

Seed Control

All random operations are controlled by a global seed defined in config.yaml.

Configuration

config.yaml
seed: 42

Implementation

The seed is set globally before any random operations:
src/data.py
import random

import numpy as np

def set_global_seed(seed: int) -> None:
    """Seed both Python's and NumPy's global RNGs."""
    random.seed(seed)
    np.random.seed(seed)
All training operations initialize from this seed:
src/train.py
def main() -> None:
    config = load_config()
    set_global_seed(int(config["seed"]))
    
    # All subsequent operations use seeded randomness
    df = load_dataset(config)
    X_train, X_test, y_train, y_test = split_data(df, config)
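The effect of global seeding can be sanity-checked in isolation. A minimal sketch using only the standard library's random module (the helper mirrors the project's set_global_seed, reduced to the Python RNG):

```python
import random

def set_global_seed(seed: int) -> None:
    # Mirrors the project's helper for Python's built-in RNG.
    random.seed(seed)

# Two independently seeded "runs" must produce identical draws.
set_global_seed(42)
run_a = [random.random() for _ in range(5)]

set_global_seed(42)
run_b = [random.random() for _ in range(5)]

assert run_a == run_b  # bit-identical floats, run after run
```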

Seeded Components

The seed propagates to every stochastic operation. The train/test split, for example, draws its random_state from the configured seed:
src/data.py
return train_test_split(
    X,
    y,
    test_size=float(config["data"]["test_size"]),
    random_state=int(config["seed"]),
    stratify=y,
)
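The same principle can be illustrated without scikit-learn: when the RNG driving the shuffle is seeded, the split becomes a pure function of (data, seed). A stdlib-only sketch (split_indices is hypothetical, not project code):

```python
import random

def split_indices(n: int, test_size: float, seed: int) -> tuple[list[int], list[int]]:
    # Shuffle deterministically with a seeded, private RNG instance.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_size)
    return idx[n_test:], idx[:n_test]  # (train, test)

# Identical seed -> identical split, every run.
assert split_indices(100, 0.2, 42) == split_indices(100, 0.2, 42)
```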

Configuration-Driven Training

All training behavior is controlled by declarative configuration, eliminating hard-coded parameters.

Complete Configuration

config.yaml
seed: 42

data:
  path: ml_datasource.csv
  target: purchased
  test_size: 0.2

features:
  epsilon: 1.0e-06
  engagement:
    minutes_watched_weight: 0.6
    days_on_platform_weight: 0.3
    courses_started_weight: 10.0

preprocessing:
  outlier_factor: 1.5
  numeric_imputer: median
  categorical_imputer: most_frequent

models:
  logistic_regression:
    max_iter: 2000
  knn:
    n_neighbors: 7
  svm:
    C: 1.0
    kernel: rbf
    gamma: scale
  decision_tree:
    max_depth: 8
    min_samples_leaf: 10
  random_forest:
    n_estimators: 400
    min_samples_leaf: 2

cv:
  n_splits: 5

business:
  target_precision: 0.9

artifacts:
  model_dir: artifacts
  model_file: best_model.joblib
  threshold_file: threshold.txt
  metrics_file: metrics.json
  drift_baseline_file: drift_baseline.json
  lineage_file: lineage.json

Configuration Loading

Configuration is loaded once and passed through the pipeline:
src/data.py
def load_config(path: str | Path = "config.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

Configuration Validation

There is no schema validation step; values are cast at the point of use (e.g. int(config["seed"]), float(config["data"]["test_size"])), and the scikit-learn estimators validate types and bounds when the models are initialized.
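If earlier failure is preferred, an explicit validator can check required keys and bounds right after loading. A hypothetical sketch (validate_config is not project code):

```python
def validate_config(config: dict) -> None:
    # Fail fast on missing keys or out-of-range values.
    int(config["seed"])  # raises KeyError/ValueError if absent or non-numeric
    test_size = float(config["data"]["test_size"])
    if not 0.0 < test_size < 1.0:
        raise ValueError(f"data.test_size must be in (0, 1), got {test_size}")
    n_splits = int(config["cv"]["n_splits"])
    if n_splits < 2:
        raise ValueError(f"cv.n_splits must be >= 2, got {n_splits}")

validate_config({"seed": 42, "data": {"test_size": 0.2}, "cv": {"n_splits": 5}})
```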

Lineage Tracking

Every training run generates a lineage manifest with SHA256 hashes of all inputs and outputs.

Lineage Generation

The training script computes hashes of all artifacts:
src/train.py
def _sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def _sha256_file(path: Path) -> str:
    return _sha256_bytes(path.read_bytes())

# After training completes
run_id = str(uuid.uuid4())
config_hash = _sha256_file(Path("config.yaml"))
dataset_hash = _sha256_file(Path(config["data"]["path"]))
model_hash = _sha256_file(model_path)

lineage = {
    "run_id": run_id,
    "dataset": {
        "path": config["data"]["path"],
        "sha256": dataset_hash,
    },
    "config": {
        "path": "config.yaml",
        "sha256": config_hash,
    },
    "model": {
        "path": str(model_path),
        "sha256": model_hash,
    },
    "threshold": {
        "path": str(threshold_path),
        "sha256": _sha256_file(threshold_path),
    },
}
lineage_path.write_text(json.dumps(lineage, indent=2), encoding="utf-8")
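The manifest round-trip can be exercised end to end in isolation. A self-contained sketch using a temporary directory and stand-in files in place of the real artifacts:

```python
import hashlib
import json
import tempfile
import uuid
from pathlib import Path

def _sha256_file(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Stand-ins for the real config and model artifacts.
    (root / "config.yaml").write_text("seed: 42\n", encoding="utf-8")
    (root / "model.bin").write_bytes(b"\x00model-bytes")

    lineage = {
        "run_id": str(uuid.uuid4()),
        "config": {"path": "config.yaml", "sha256": _sha256_file(root / "config.yaml")},
        "model": {"path": "model.bin", "sha256": _sha256_file(root / "model.bin")},
    }
    (root / "lineage.json").write_text(json.dumps(lineage, indent=2), encoding="utf-8")

    # Re-reading the manifest and re-hashing the files must agree.
    loaded = json.loads((root / "lineage.json").read_text(encoding="utf-8"))
    for entry in ("config", "model"):
        assert _sha256_file(root / loaded[entry]["path"]) == loaded[entry]["sha256"]
```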

Lineage Structure

The lineage file (artifacts/lineage.json) records complete provenance:
artifacts/lineage.json
{
  "run_id": "a3f2b891-4c5d-4e2f-9a1b-8c3d5e6f7a8b",
  "dataset": {
    "path": "ml_datasource.csv",
    "sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
  },
  "config": {
    "path": "config.yaml",
    "sha256": "d4f5e6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5"
  },
  "model": {
    "path": "artifacts/best_model.joblib",
    "sha256": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2"
  },
  "threshold": {
    "path": "artifacts/threshold.txt",
    "sha256": "b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3"
  }
}

Reproducibility Verification

The reproducibility check script validates that artifacts match their lineage hashes.

Running the Check

python scripts/reproducibility_check.py

Implementation

scripts/reproducibility_check.py
def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def main() -> None:
    lineage_path = Path("artifacts/lineage.json")
    if not lineage_path.exists():
        raise FileNotFoundError("Missing artifacts/lineage.json. Run training first.")
    
    lineage = json.loads(lineage_path.read_text(encoding="utf-8"))
    checks = {
        "dataset": (Path(lineage["dataset"]["path"]), lineage["dataset"]["sha256"]),
        "config": (Path(lineage["config"]["path"]), lineage["config"]["sha256"]),
        "model": (Path(lineage["model"]["path"]), lineage["model"]["sha256"]),
        "threshold": (Path(lineage["threshold"]["path"]), lineage["threshold"]["sha256"]),
    }
    
    report = {"run_id": lineage.get("run_id"), "checks": {}}
    all_passed = True
    for name, (path, expected) in checks.items():
        actual = sha256(path)
        passed = actual == expected
        all_passed &= passed
        report["checks"][name] = {
            "path": str(path),
            "expected": expected,
            "actual": actual,
            "passed": passed,
        }
    
    report["passed"] = all_passed
    
    Path("artifacts/reproducibility_report.json").write_text(json.dumps(report, indent=2), encoding="utf-8")
    print(json.dumps(report, indent=2))
    
    if not all_passed:
        raise SystemExit(1)
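The failure path is worth exercising too: if an artifact is modified after training, the recomputed hash must diverge from the recorded one. A stdlib-only sketch simulating tampering:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "threshold.txt"
    artifact.write_text("0.730000", encoding="utf-8")
    expected = sha256(artifact)           # hash recorded at "training time"

    assert sha256(artifact) == expected   # untouched artifact passes

    artifact.write_text("0.500000", encoding="utf-8")  # simulate tampering
    tampered_matches = sha256(artifact) == expected
    assert not tampered_matches           # the check now fails, as it should
```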

Verification Report

The script generates artifacts/reproducibility_report.json:
artifacts/reproducibility_report.json
{
  "run_id": "a3f2b891-4c5d-4e2f-9a1b-8c3d5e6f7a8b",
  "checks": {
    "dataset": {
      "path": "ml_datasource.csv",
      "expected": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "actual": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      "passed": true
    },
    "config": {
      "path": "config.yaml",
      "expected": "d4f5e6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5",
      "actual": "d4f5e6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5",
      "passed": true
    },
    "model": {
      "path": "artifacts/best_model.joblib",
      "expected": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2",
      "actual": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2",
      "passed": true
    },
    "threshold": {
      "path": "artifacts/threshold.txt",
      "expected": "b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3",
      "actual": "b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3",
      "passed": true
    }
  },
  "passed": true
}

Environment Tracking

The script also captures environment metadata:
scripts/reproducibility_check.py
environment = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "cwd": str(Path.cwd()),
}

Path("artifacts/reproducibility_environment.json").write_text(json.dumps(environment, indent=2), encoding="utf-8")
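Library versions are part of the environment too, since the Limitations below note that differing versions break reproducibility. One way to extend the snapshot is Python's importlib.metadata (the package list and function name here are illustrative, not project code):

```python
import json
import platform
import sys
from importlib import metadata

def capture_environment(packages: list[str]) -> dict:
    # Record interpreter, OS, and installed versions of key dependencies.
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python_version": sys.version,
        "platform": platform.platform(),
        "packages": versions,
    }

env = capture_environment(["numpy", "scikit-learn"])
print(json.dumps(env, indent=2))
```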

Reproducibility Guarantees

Given identical inputs (dataset + config), training produces:
  • Identical model parameters (verified by SHA256)
  • Identical threshold values
  • Identical cross-validation splits
  • Identical metric values

Best Practices

1. Version Configuration: Commit config.yaml to version control alongside code.
2. Track Dataset Versions: Store dataset hashes in config/datasets.yaml and validate them before training.
3. Verify Lineage: Run python scripts/reproducibility_check.py in CI to detect accidental changes.
4. Archive Artifacts: Store lineage manifests with model artifacts for audit trails.
5. Document Seed Changes: Changing the seed produces different models; document why in commit messages.

CI Integration

Add reproducibility checks to CI pipelines:
.github/workflows/ci.yml
- name: Train model
  run: python -m src.train

- name: Verify reproducibility
  run: python scripts/reproducibility_check.py

- name: Archive lineage
  uses: actions/upload-artifact@v3
  with:
    name: lineage-${{ github.sha }}
    path: artifacts/lineage.json

Limitations

Reproducibility is not guaranteed when:
  • Using non-deterministic hardware (GPU with non-deterministic ops)
  • Parallel execution order varies (set n_jobs=1 for strict reproducibility)
  • Python/library versions differ
  • System-level randomness is not controlled
