Loading Hospital Data
The load_hospital_data() function loads CSV files from three hospital departments: general, prenatal, and sports.
Function Signature
Parameters:
- data_dir (Path): Directory containing the hospital CSV files
Returns:
- Dictionary mapping hospital names to DataFrames
Supported Hospital Files
The loader expects these CSV files in the data directory:
- general.csv - General hospital data
- prenatal.csv - Prenatal care data
- sports.csv - Sports medicine data
Usage Example
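The following is a minimal sketch of how a call might look, not the project's actual implementation. It defines an illustrative stand-in for load_hospital_data() that assumes each file is read with pandas.read_csv and keyed by its file stem. The sample rows and the temporary directory are made up for the demonstration.

```python
from pathlib import Path
import tempfile

import pandas as pd

# File stems taken from the list of supported hospital files.
HOSPITALS = ("general", "prenatal", "sports")


def load_hospital_data(data_dir: Path) -> dict[str, pd.DataFrame]:
    """Illustrative stand-in: read <name>.csv for each hospital."""
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in HOSPITALS}


# Demo with small made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    for name in HOSPITALS:
        (data_dir / f"{name}.csv").write_text(f"hospital,age\n{name},30\n")
    datasets = load_hospital_data(data_dir)

for name, df in datasets.items():
    print(name, df.shape)
```

In real use, data_dir would point at the directory that holds the three CSV files.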
Merging Datasets
The merge_hospital_data() function combines multiple hospital datasets into a single DataFrame by aligning column names and concatenating rows.
Function Signature
Parameters:
- datasets (dict[str, pd.DataFrame]): Dictionary of hospital DataFrames from load_hospital_data()
Returns:
- Merged DataFrame with aligned columns and reset index
Merging Behavior
- Uses the general hospital’s columns as the reference schema
- Renames all other datasets’ columns to match the general schema
- Concatenates all datasets vertically with ignore_index=True
- Removes the Unnamed: 0 column if present (common pandas artifact)
Usage Example
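The merging behavior described above can be sketched as follows. This is an illustrative stand-in for merge_hospital_data(), not the project's code. It assumes that columns are renamed by position, so every dataset must have the same number of columns in the same order as the general dataset. The sample frames and their column names are made up.

```python
import pandas as pd


def merge_hospital_data(datasets: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Illustrative stand-in for the merging behavior described above."""
    # The general hospital's columns act as the reference schema.
    reference = list(datasets["general"].columns)
    # Positional rename (an assumption): raises ValueError if a dataset
    # has a different number of columns than the general dataset.
    aligned = [df.set_axis(reference, axis=1) for df in datasets.values()]
    # Stack the rows and rebuild a clean 0..n-1 index.
    merged = pd.concat(aligned, ignore_index=True)
    # Drop the leftover index column if it exists.
    return merged.drop(columns="Unnamed: 0", errors="ignore")


# Made-up sample data with inconsistent column names.
datasets = {
    "general": pd.DataFrame(
        {"Unnamed: 0": [0, 1], "hospital": ["general", "general"], "gender": ["m", "f"]}
    ),
    "prenatal": pd.DataFrame(
        {"Unnamed: 0": [0], "HOSPITAL": ["prenatal"], "Sex": ["f"]}
    ),
    "sports": pd.DataFrame(
        {"Unnamed: 0": [0], "Hospital": ["sports"], "Male/female": ["m"]}
    ),
}

merged = merge_hospital_data(datasets)
print(merged)
```

Renaming by position is only safe when the files share a column layout. If the layouts differed, an explicit name mapping per dataset would be the more careful choice.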
Dataset Versioning
The versioning module provides dataset integrity tracking through SHA-256 file hashing and JSON manifests.
Creating a Dataset Manifest
Parameters:
- data_dir (Path): Directory containing CSV files to track
- output_path (Path): Path where the manifest JSON will be saved
Returns:
- Dictionary containing dataset metadata
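As a rough sketch of how such a manifest could be produced, the example below hashes every CSV file in a directory and writes the result as JSON. The function name create_manifest and the manifest fields (created_at, files, sha256, size_bytes) are hypothetical, chosen only for illustration. The module's real function name and manifest layout may differ.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def create_manifest(data_dir: Path, output_path: Path) -> dict:
    """Hypothetical sketch: hash each CSV in data_dir and save a JSON manifest."""
    files = {}
    for csv_path in sorted(data_dir.glob("*.csv")):
        files[csv_path.name] = {
            # Field names here are assumptions, not the module's schema.
            "sha256": hashlib.sha256(csv_path.read_bytes()).hexdigest(),
            "size_bytes": csv_path.stat().st_size,
        }
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }
    output_path.write_text(json.dumps(manifest, indent=2))
    return manifest


# Demo with one made-up CSV file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "general.csv").write_text("hospital,age\ngeneral,30\n")
    manifest = create_manifest(data_dir, data_dir / "manifest.json")
    saved = json.loads((data_dir / "manifest.json").read_text())

print(json.dumps(manifest, indent=2))
```

Comparing a freshly computed hash against the stored one is what makes it possible to detect that a tracked file has changed.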
Manifest Structure
The generated manifest includes:
File Hashing
The hash_file() function computes SHA-256 checksums:
Parameters:
- path (Path): Path to the file to hash
Returns:
- Hexadecimal SHA-256 hash string
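One plausible way to implement such a function uses hashlib from the standard library and reads the file in chunks, so large CSV files need not fit in memory. This is a sketch under that assumption. The chunk_size parameter is not part of the documented signature and is added here only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 65536) -> str:
    """Sketch: return the hexadecimal SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small made-up file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"abc")
    checksum = hash_file(sample)

print(checksum)
```

The result is a 64-character hexadecimal string, and identical file contents always produce the same value.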