Overview
The Hospital Data Analysis Platform uses a centralized configuration system based on the SystemConfig dataclass. This approach provides type safety, default values, and clear documentation of all configurable parameters.
SystemConfig Dataclass
The SystemConfig dataclass is defined in config.py:6 and serves as the single source of truth for all system parameters.
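The actual definition lives in config.py and is not reproduced here. As a rough sketch of what such a dataclass could look like, the block below uses the field names and defaults documented in the Configuration Parameters section. How the two paths resolve against the project root, the use of default_factory for list fields, and the module-level CONFIG instance are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SystemConfig:
    # Reproducibility
    random_seed: int = 42
    # Data splitting
    test_size: float = 0.25
    # File paths (resolution against the project root is an assumption)
    data_dir: Path = Path("../test")
    output_dir: Path = Path("artifacts")
    # Streaming
    stream_chunk_size: int = 16
    stream_interval_ms: int = 10
    # Hardware constraints
    hardware_memory_limit_mb: int = 256
    hardware_compute_budget: int = 10_000
    # Benchmarking
    benchmark_runs: int = 5
    confidence_level: float = 0.95
    # Features and targets (lists need default_factory so instances don't share them)
    feature_columns: list[str] = field(
        default_factory=lambda: ["age", "height", "weight", "bmi", "children", "months"]
    )
    target_risk: str = "diagnosis"
    target_outcome: str = "blood_test"
    # Experiment ranges
    experiment_memory_limits_mb: list[int] = field(default_factory=lambda: [64, 128, 256])
    experiment_compute_budgets: list[int] = field(default_factory=lambda: [2_000, 5_000, 10_000])
    experiment_stream_speeds_ms: list[int] = field(default_factory=lambda: [5, 10, 20])


# Assumed shared instance; the Best Practices section refers to a global CONFIG.
CONFIG = SystemConfig()
```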
Configuration Parameters
Reproducibility
random_seed (int, default: 42)
Global random seed for reproducible experiments. This seed is propagated to Python’s random module, NumPy’s random number generator, and the PYTHONHASHSEED environment variable.
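A minimal sketch of propagating one seed to those three places might look like the following. The helper name set_global_seed is hypothetical and is not the platform's reproducibility utility; NumPy is imported optionally so the sketch runs without it.

```python
import os
import random

try:
    import numpy as np  # optional here so the sketch runs without NumPy installed
except ImportError:
    np = None


def set_global_seed(seed: int = 42) -> None:
    """Hypothetical helper: seed the three sources named above."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)  # seeds NumPy's legacy global generator
    # Only affects hash randomization in interpreters started after this point
    # (for example subprocesses); the running interpreter's hash seed is fixed.
    os.environ["PYTHONHASHSEED"] = str(seed)


set_global_seed(42)
first = random.random()
set_global_seed(42)
assert first == random.random()  # same seed, same sequence
```

Because PYTHONHASHSEED is read at interpreter startup, setting it from inside a running process only helps child processes; for the current process it has to be set before Python launches.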
Data Splitting
test_size (float, default: 0.25)
Fraction of the dataset reserved for testing. The remaining data is used for training.
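How the platform performs the split is not shown in this section. As an illustration of what test_size means in practice, the sketch below shuffles row indices with a seeded generator and reserves the requested fraction; split_indices is a hypothetical helper.

```python
import random


def split_indices(n_rows: int, test_size: float = 0.25, seed: int = 42):
    """Hypothetical helper: shuffle row indices and reserve a test fraction."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # (train, test)


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))  # 75 25
```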
File Paths
data_dir (Path, default: ../test)
Directory containing input CSV datasets. Defaults to the test directory relative to the project root.
output_dir (Path, default: artifacts/)
Directory for storing outputs including trained models, benchmark results, and diagnostic reports. The directory is automatically created if it doesn’t exist.
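Creating the directory on demand can be done with pathlib alone. The sketch below assumes a hypothetical helper named ensure_output_dir and runs inside a temporary directory so it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_output_dir(output_dir: Path) -> Path:
    """Hypothetical helper: create output_dir (and any parents) if missing."""
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


# Demonstrated in a temporary directory so the sketch has no lasting side effects.
with tempfile.TemporaryDirectory() as tmp:
    out = ensure_output_dir(Path(tmp) / "artifacts")
    (out / "report.txt").write_text("ok")
    assert out.is_dir()
```

Calling the helper a second time is harmless because exist_ok=True suppresses the error for an existing directory.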
Streaming Parameters
stream_chunk_size (int, default: 16)
Number of records processed per streaming batch. Smaller values reduce memory footprint but increase I/O overhead.
stream_interval_ms (int, default: 10)
Delay in milliseconds between streaming chunks. Used to simulate real-time data ingestion.
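To show how the two streaming parameters interact, here is a small generator that yields fixed-size chunks and pauses between them. It is a sketch under the assumption that the delay is applied after each full chunk; stream_chunks is a hypothetical name, not the platform's streaming API.

```python
import time
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def stream_chunks(
    records: Iterable[T], chunk_size: int = 16, interval_ms: int = 10
) -> Iterator[List[T]]:
    """Yield lists of up to chunk_size records, sleeping between full chunks."""
    chunk: List[T] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
            time.sleep(interval_ms / 1000)  # simulated ingestion delay
    if chunk:
        yield chunk  # final partial chunk, no trailing delay


sizes = [len(c) for c in stream_chunks(range(40))]
print(sizes)  # [16, 16, 8]
```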
Hardware Constraints
hardware_memory_limit_mb (int, default: 256)
Memory budget in megabytes for resource-constrained execution. Used for deployment sizing and constraint validation.
hardware_compute_budget (int, default: 10_000)
Computational budget representing the number of operations or inference cycles allowed. Used for performance profiling.
Benchmarking
benchmark_runs (int, default: 5)
Number of repeated benchmark iterations for statistical confidence. Higher values improve estimate reliability but increase runtime.
confidence_level (float, default: 0.95)
Confidence level for statistical intervals (e.g., 0.95 = 95% confidence). Supported values: 0.90, 0.95, 0.99.
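As an illustration of how benchmark_runs and confidence_level combine, the sketch below computes a two-sided interval for the mean of repeated timings using a normal approximation from the standard library. This is an assumption about the method, not the platform's implementation; with only a handful of runs, a Student's t quantile (not available in the statistics module) would give a somewhat wider interval.

```python
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, confidence_level=0.95):
    """Normal-approximation interval for the mean of repeated measurements."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    # Two-sided critical value, e.g. about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half_width = z * stdev(samples) / len(samples) ** 0.5
    center = mean(samples)
    return center - half_width, center + half_width


timings_ms = [12.1, 11.8, 12.4, 12.0, 11.9]  # made-up values for illustration
low, high = confidence_interval(timings_ms, 0.95)
print(f"{low:.2f} .. {high:.2f}")
```

Raising benchmark_runs shrinks the interval roughly with the square root of the number of runs, which is the trade-off against runtime described above.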
Features and Targets
feature_columns (list[str], default: ["age", "height", "weight", "bmi", "children", "months"])
List of column names used as input features for model training.
target_risk (str, default: "diagnosis")
Column name for the primary risk classification target.
target_outcome (str, default: "blood_test")
Column name for the secondary outcome prediction target.
Experiment Ranges
experiment_memory_limits_mb (list[int], default: [64, 128, 256])
Memory limits to test during resource constraint experiments.
experiment_compute_budgets (list[int], default: [2_000, 5_000, 10_000])
Compute budgets to test during performance experiments.
experiment_stream_speeds_ms (list[int], default: [5, 10, 20])
Streaming intervals to test during latency experiments.
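This section does not say whether the three ranges are swept jointly or one at a time. The sketch below assumes the simpler reading, where each experiment varies one parameter and holds the others at their documented defaults; the dictionary keys and the one_axis_sweeps helper are hypothetical.

```python
# Documented defaults and ranges; the key names are illustrative only.
defaults = {"memory_limit_mb": 256, "compute_budget": 10_000, "stream_interval_ms": 10}
ranges = {
    "memory_limit_mb": [64, 128, 256],
    "compute_budget": [2_000, 5_000, 10_000],
    "stream_interval_ms": [5, 10, 20],
}


def one_axis_sweeps(defaults, ranges):
    """Yield (varied_name, settings) pairs, changing one parameter per run."""
    for name, values in ranges.items():
        for value in values:
            yield name, {**defaults, name: value}


runs = list(one_axis_sweeps(defaults, ranges))
print(len(runs))  # 9
```

A full joint sweep would instead use itertools.product over the three lists, giving 27 combinations.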
Usage Example
Using Default Configuration
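The original example for this heading is not available. A minimal sketch of reading defaults might look like this, using a trimmed stand-in for SystemConfig so the block runs on its own; in the real project the class and the shared CONFIG instance would presumably be imported from config.py.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SystemConfig:  # trimmed stand-in for the class defined in config.py
    random_seed: int = 42
    test_size: float = 0.25
    output_dir: Path = Path("artifacts")


CONFIG = SystemConfig()  # stand-in for the shared instance

print(CONFIG.random_seed)  # 42
print(CONFIG.test_size)    # 0.25
print(CONFIG.output_dir)   # artifacts
```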
Creating Custom Configuration
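The original example for this heading is not available. One plausible approach, sketched with a trimmed stand-in class, is to pass overrides as keyword arguments or to derive a variant with dataclasses.replace, which copies an instance without touching the original.

```python
from dataclasses import dataclass, replace


@dataclass
class SystemConfig:  # trimmed stand-in for the class defined in config.py
    random_seed: int = 42
    stream_chunk_size: int = 16
    benchmark_runs: int = 5


# Override selected fields at construction; the rest keep their defaults.
custom = SystemConfig(random_seed=7, benchmark_runs=20)

# Derive a variant from an existing instance; `custom` itself is unchanged.
variant = replace(custom, stream_chunk_size=64)

print(custom)   # SystemConfig(random_seed=7, stream_chunk_size=16, benchmark_runs=20)
print(variant)  # SystemConfig(random_seed=7, stream_chunk_size=64, benchmark_runs=20)
```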
Modifying Global Configuration
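The original example for this heading is not available. Assuming the shared CONFIG instance is a plain mutable dataclass, attribute assignment changes it for every module that imported it. The sketch below shows that, plus a hypothetical temporary_override helper that restores the previous values afterwards to limit the blast radius.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class SystemConfig:  # trimmed stand-in for the class defined in config.py
    benchmark_runs: int = 5
    confidence_level: float = 0.95


CONFIG = SystemConfig()  # stand-in for the shared instance

# Direct mutation: visible to every module holding a reference to CONFIG.
CONFIG.benchmark_runs = 10


@contextmanager
def temporary_override(config, **changes):
    """Hypothetical helper: apply changes, then restore the previous values."""
    previous = {name: getattr(config, name) for name in changes}
    for name, value in changes.items():
        setattr(config, name, value)
    try:
        yield config
    finally:
        for name, value in previous.items():
            setattr(config, name, value)


with temporary_override(CONFIG, confidence_level=0.99):
    assert CONFIG.confidence_level == 0.99
assert CONFIG.confidence_level == 0.95  # restored on exit
```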
Best Practices
- Immutable Defaults: Avoid modifying the global CONFIG object across modules. Instead, pass configuration objects as function parameters.
- Path Validation: Always ensure output_dir exists before writing artifacts.
- Seed Propagation: Use the reproducibility utilities to properly set the random seed.
- Environment-Specific Overrides: Use environment variables or configuration files for deployment-specific settings rather than hardcoding values.
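As one way to apply environment-specific overrides without hardcoding values, the sketch below builds a configuration from environment variables. The HDAP_ prefix, the config_from_env helper, and the trimmed stand-in class are all hypothetical; the conversion only handles scalar fields such as int and float.

```python
import os
from dataclasses import dataclass, fields, replace


@dataclass
class SystemConfig:  # trimmed stand-in for the class defined in config.py
    random_seed: int = 42
    test_size: float = 0.25
    benchmark_runs: int = 5


def config_from_env(base, prefix="HDAP_", environ=None):
    """Return a copy of base with fields overridden by PREFIX + FIELD_NAME variables."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for f in fields(base):
        raw = environ.get(prefix + f.name.upper())
        if raw is not None:
            # Convert using the type of the current value (int or float here).
            overrides[f.name] = type(getattr(base, f.name))(raw)
    return replace(base, **overrides)


# A dict stands in for os.environ so the example does not depend on the shell.
cfg = config_from_env(SystemConfig(), environ={"HDAP_BENCHMARK_RUNS": "20"})
print(cfg.benchmark_runs)  # 20
print(cfg.random_seed)     # 42
```

Returning a new instance rather than mutating the shared one keeps this consistent with the advice to pass configuration objects explicitly.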
See Also
- Reproducibility - Seed management and environment controls
- Benchmarking - Using benchmark configuration parameters
- Troubleshooting - Common configuration issues