Overview
The Hospital Data Analysis Platform implements comprehensive reproducibility controls to ensure consistent results across different executions and environments. This is critical for validating model behavior, debugging issues, and satisfying regulatory requirements.
Seed Management
Setting the Global Seed
The set_global_seed function (defined in utils/reproducibility.py:16) initializes all random number generators with a consistent seed.
Implementation Details
- Sets threading environment variables to ensure deterministic numeric library behavior
- Seeds Python’s random module for standard library randomness
- Seeds NumPy’s random generator for array operations
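- Sets PYTHONHASHSEED so that string hashing, and therefore set iteration order, is deterministic (dicts preserve insertion order since Python 3.7 regardless of the hash seed). Python reads this variable at interpreter startup, so a value set at runtime applies only to subprocesses; export it before launching Python to cover the current process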
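The steps listed above can be sketched as follows. This is a minimal sketch, not the code in utils/reproducibility.py: the `seed` parameter, the `_THREAD_VARS` tuple, and the optional NumPy import are assumptions made to keep the example self-contained.

```python
import os
import random

try:
    import numpy as np  # optional here so the sketch also runs without NumPy
except ImportError:
    np = None

# Assumed list of variables, taken from the reference table in this page.
_THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def _set_default_threading_env() -> None:
    """Pin numeric-library thread pools to one thread unless already set."""
    for name in _THREAD_VARS:
        os.environ.setdefault(name, "1")


def set_global_seed(seed: int) -> None:
    """Seed the standard-library and NumPy generators (assumed signature)."""
    _set_default_threading_env()
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)
    # Inherited by subprocesses; the running interpreter fixed its own
    # hash seed at startup and is not affected by this assignment.
    os.environ["PYTHONHASHSEED"] = str(seed)


set_global_seed(42)
first = [random.random() for _ in range(3)]
set_global_seed(42)
second = [random.random() for _ in range(3)]
assert first == second  # same seed, same sequence
```

Reseeding with the same value replays the same sequence, which is the property the rest of this page relies on.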
Environment Variables
Threading Controls
The _set_default_threading_env function (defined in utils/reproducibility.py:10) configures numeric library threading to eliminate non-determinism.
Setting each thread count to 1 ensures deterministic execution.
Environment Variable Reference
| Variable | Purpose | Default Value |
|---|---|---|
| OMP_NUM_THREADS | OpenMP thread pool size | "1" |
| MKL_NUM_THREADS | Intel MKL thread count | "1" |
| OPENBLAS_NUM_THREADS | OpenBLAS thread count | "1" |
| PYTHONHASHSEED | Hash randomization seed | Set to CONFIG.random_seed |
The threading variables are set with setdefault, so user-provided values are respected.
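To illustrate why setdefault respects user-provided values, the sketch below applies the defaults from the table to a plain dictionary that stands in for the process environment. The helper name `apply_thread_defaults` is hypothetical and is not part of the platform.

```python
# Defaults copied from the reference table above.
THREAD_DEFAULTS = {
    "OMP_NUM_THREADS": "1",
    "MKL_NUM_THREADS": "1",
    "OPENBLAS_NUM_THREADS": "1",
}


def apply_thread_defaults(environ):
    """Fill in missing threading variables without overwriting existing ones."""
    for name, value in THREAD_DEFAULTS.items():
        environ.setdefault(name, value)
    return {name: environ[name] for name in THREAD_DEFAULTS}


# Simulate a user who exported OMP_NUM_THREADS=4 before launching the script.
env = {"OMP_NUM_THREADS": "4"}
effective = apply_thread_defaults(env)
print(effective)
# {'OMP_NUM_THREADS': '4', 'MKL_NUM_THREADS': '1', 'OPENBLAS_NUM_THREADS': '1'}
```

The user's value of 4 survives, while the two unset variables receive the default. Passing os.environ instead of the dictionary would apply the same logic to the real environment.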
Reproducibility Context
Capturing Execution Context
The reproducibility_context function (defined in utils/reproducibility.py:23) captures a snapshot of the execution environment.
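The fields the platform records are not listed on this page, so the following is only a sketch of what such a snapshot might contain. The `seed` parameter, the field names, and the `_package_version` helper are all assumptions.

```python
import json
import os
import platform
from importlib import metadata

# Variables from the environment reference table on this page.
TRACKED_ENV_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "PYTHONHASHSEED",
)


def _package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def reproducibility_context(seed: int) -> dict:
    """Collect a JSON-serializable snapshot (hypothetical field names)."""
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "numpy_version": _package_version("numpy"),
        "env": {name: os.environ.get(name) for name in TRACKED_ENV_VARS},
    }


context = reproducibility_context(seed=42)
print(json.dumps(context, indent=2, sort_keys=True))
```

Keeping every value JSON-serializable makes the snapshot easy to log, diff, and store next to artifacts.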
Example Output
Use Cases
- Logging: Include context in experiment logs for audit trails
- Debugging: Compare contexts when results differ across environments
- Artifacts: Save context alongside model checkpoints
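The logging and artifact use cases can be combined in one small helper. This is a sketch under stated assumptions: `save_context_sidecar`, the `.context.json` suffix, and the stand-in `context` dictionary are hypothetical, and the checkpoint is a dummy file written to a temporary directory.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("experiment")


def save_context_sidecar(artifact_path: Path, context: dict) -> Path:
    """Write the context next to an artifact and record it in the log."""
    sidecar = artifact_path.with_name(artifact_path.name + ".context.json")
    sidecar.write_text(json.dumps(context, indent=2, sort_keys=True), encoding="utf-8")
    logger.info("reproducibility context: %s", json.dumps(context, sort_keys=True))
    return sidecar


# Stand-in for the dictionary returned by reproducibility_context.
context = {"seed": 42, "env": {"OMP_NUM_THREADS": "1"}}

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "model.ckpt"
    checkpoint.write_bytes(b"fake-checkpoint")  # placeholder artifact
    sidecar = save_context_sidecar(checkpoint, context)
    restored = json.loads(sidecar.read_text(encoding="utf-8"))

assert restored == context
```

A sidecar file keeps the context discoverable by anyone who finds the checkpoint, without requiring access to the original logs.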
Complete Reproducibility Workflow
Example: Reproducible Training Script
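A minimal sketch of such a script is shown below. It assumes nothing about the platform's models: the data is synthetic, the "training" step is a closed-form least-squares fit, and `set_global_seed` is a local standard-library stand-in for the function in utils/reproducibility.py.

```python
import os
import random

THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def set_global_seed(seed: int) -> None:
    """Stand-in for the platform function (standard library only)."""
    for name in THREAD_VARS:
        os.environ.setdefault(name, "1")
    random.seed(seed)


def train(seed: int, n_samples: int = 200):
    """Fit y = slope * x + intercept on synthetic data by least squares."""
    set_global_seed(seed)  # seed before any data is generated
    xs = [random.uniform(0.0, 10.0) for _ in range(n_samples)]
    ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.5) for x in xs]
    mean_x = sum(xs) / n_samples
    mean_y = sum(ys) / n_samples
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


run_a = train(seed=42)
run_b = train(seed=42)
assert run_a == run_b  # identical seed gives bit-identical parameters
print("slope=%.4f intercept=%.4f" % run_a)
```

Seeding inside `train` rather than at the call site means every entry point into training gets the same starting state.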
Example: Cross-Environment Validation
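One way to approach cross-environment validation is to diff the contexts captured on each machine before comparing results. The sketch below assumes contexts are nested dictionaries; `flatten`, `diff_contexts`, and the two sample contexts are hypothetical and the values are made up for illustration.

```python
def flatten(mapping, prefix=""):
    """Turn nested dictionaries into a flat {dotted.path: value} mapping."""
    flat = {}
    for key, value in mapping.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def diff_contexts(left, right):
    """Return {path: (left_value, right_value)} for every differing entry."""
    flat_left, flat_right = flatten(left), flatten(right)
    paths = sorted(flat_left.keys() | flat_right.keys())
    return {
        path: (flat_left.get(path), flat_right.get(path))
        for path in paths
        if flat_left.get(path) != flat_right.get(path)
    }


# Illustrative contexts only; real ones would come from each environment.
workstation = {"seed": 42, "python_version": "3.11.9", "env": {"OMP_NUM_THREADS": "1"}}
server = {"seed": 42, "python_version": "3.12.4", "env": {"OMP_NUM_THREADS": "8"}}

print(diff_contexts(workstation, server))
# {'env.OMP_NUM_THREADS': ('1', '8'), 'python_version': ('3.11.9', '3.12.4')}
```

An empty diff does not prove the environments are equivalent, but a non-empty one gives a concrete starting point when results disagree.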
Best Practices
1. Set Seed Early
Always call set_global_seed before any operations that use randomness.
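As an illustration, the sketch below seeds before a random train/test split and shows that the split repeats exactly. `set_global_seed` here is a standard-library stand-in, and `split_patients` is a hypothetical helper, not part of the platform.

```python
import random


def set_global_seed(seed: int) -> None:
    """Stand-in for the platform function; seeds only the stdlib generator."""
    random.seed(seed)


def split_patients(patient_ids, test_fraction=0.2):
    """Shuffle IDs with the global generator and split into train and test."""
    shuffled = list(patient_ids)
    random.shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


patient_ids = range(100)

set_global_seed(2024)  # seed first, then consume randomness
train_a, test_a = split_patients(patient_ids)

set_global_seed(2024)
train_b, test_b = split_patients(patient_ids)

assert (train_a, test_a) == (train_b, test_b)
```

Any random draw made before the seed call shifts the generator into an unknown state, so the seed call belongs at the very top of the entry point. Threading variables are typically read when a numeric library is first loaded, which is a further reason to configure the environment before heavy imports.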
2. Capture Context with Artifacts
Always save the reproducibility context alongside model artifacts.
3. Document Environment Changes
If you need to override threading for performance reasons, document it clearly.
4. Version Control Configuration
Store configuration files in version control to track changes.
Limitations
Known Sources of Non-Determinism
- Floating-point operations: Different CPU architectures may produce slightly different results due to precision differences
- Multi-threading: Even with thread count = 1, some libraries may still use threading internally
- System libraries: OS-level randomness (e.g., /dev/urandom) is not controlled by these utilities
- External services: API calls, database queries, and network operations are inherently non-deterministic
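The floating-point and threading items above share one root cause: floating-point addition is not associative, so any change in the order of a reduction can change the last bits of the result. A small standard-library demonstration:

```python
import math

# The same three numbers, summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)  # 0.6000000000000001
print(right_to_left)  # 0.6

# Exact equality fails even though the values agree to about 16 digits.
assert left_to_right != right_to_left
assert math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
```

When bit-identical results across machines are not achievable, comparing with a tolerance such as `math.isclose` is a common fallback; the tolerance used here is an arbitrary illustrative choice.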
Verification Strategy
To verify reproducibility:
- Run the same script twice with the same seed
- Compare outputs using checksums or exact equality checks
- If results differ, investigate numeric precision, library versions, or system dependencies
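The steps above can be sketched as a checksum comparison. In this sketch `run_experiment` is a hypothetical placeholder for the real script, and both runs happen in one process; running the script in two fresh interpreter processes is the stronger check, because it also exercises import order and environment variables.

```python
import hashlib
import json
import random


def run_experiment(seed: int) -> dict:
    """Placeholder workload: returns a few seeded pseudo-random scores."""
    rng = random.Random(seed)
    return {"seed": seed, "scores": [rng.random() for _ in range(5)]}


def checksum(result: dict) -> str:
    """Hash a canonical JSON encoding so equal results give equal digests."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first = checksum(run_experiment(seed=42))
second = checksum(run_experiment(seed=42))
other = checksum(run_experiment(seed=43))

assert first == second  # same seed: identical output
assert first != other   # different seed: output changes
print("reproducible:", first == second)
```

Sorting keys before hashing keeps the digest independent of dictionary construction order, so a mismatch points to a real difference in values.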
See Also
- Configuration - System configuration and random seed settings
- Benchmarking - Reproducible performance measurements
- Troubleshooting - Debugging reproducibility issues