## Overview

The ingestion module handles loading hospital data from CSV files and creating versioned dataset manifests for reproducibility.

## Data Loading
### load_hospital_data

Loads hospital data from multiple CSV files and returns a dictionary of DataFrames.

**Parameters:**

- `data_dir`: Directory path containing the hospital CSV files (`general.csv`, `prenatal.csv`, `sports.csv`)

**Returns:** A dictionary mapping hospital names to their respective DataFrames:

- `"general"`: General hospital patient data
- `"prenatal"`: Prenatal care patient data
- `"sports"`: Sports medicine patient data

**Files read:**

- `data_dir/general.csv`
- `data_dir/prenatal.csv`
- `data_dir/sports.csv`
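A minimal sketch of what this loader might look like, assuming each hospital's data lives in a same-named CSV under `data_dir` and is readable with pandas defaults:

```python
import pandas as pd
from pathlib import Path

def load_hospital_data(data_dir):
    """Load the three hospital CSV files into a dict keyed by hospital name.

    Illustrative sketch only; assumes comma-separated files with a header row.
    """
    data_dir = Path(data_dir)
    return {
        name: pd.read_csv(data_dir / f"{name}.csv")
        for name in ("general", "prenatal", "sports")
    }
```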
### merge_hospital_data

Merges multiple hospital datasets into a single DataFrame with aligned column schemas.

**Parameters:**

- Dictionary of DataFrames to merge, typically the output of `load_hospital_data()`

**Returns:** Merged DataFrame with:

- All records from all hospitals concatenated vertically
- Column names aligned to the general hospital schema
- `"Unnamed: 0"` column removed if present
- Reset index (`ignore_index=True`)

**Behavior:**

- Uses the general hospital's column names as the canonical schema
- All other datasets have their columns renamed to match
- Concatenation uses `pd.concat()` with `ignore_index=True`
- Automatically removes index columns created by pandas
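The behavior above can be sketched as follows. This is an illustration, not the module's actual code; in particular, the positional column rename (assigning the general schema's names by position) is an assumption about how "renamed to match" is implemented, and it requires every frame to have the same column count:

```python
import pandas as pd

def merge_hospital_data(datasets):
    """Concatenate hospital DataFrames, aligning columns to the general schema.

    Sketch: `datasets` is a dict like the one load_hospital_data() returns.
    """
    canonical = list(datasets["general"].columns)
    frames = []
    for df in datasets.values():
        df = df.copy()
        df.columns = canonical  # positional rename to the general schema (assumed)
        frames.append(df)
    merged = pd.concat(frames, ignore_index=True)
    # Drop the index column pandas creates when a CSV was saved with its index
    if "Unnamed: 0" in merged.columns:
        merged = merged.drop(columns=["Unnamed: 0"])
    return merged
```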
## Dataset Versioning

### create_dataset_manifest

Generates a versioned manifest of dataset files with cryptographic checksums.

**Parameters:**

- Directory containing the CSV files to catalog
- Path where the manifest JSON file will be written (e.g., `output_dir/dataset_manifest.json`)

**Returns:** Manifest dictionary containing:

- `dataset_dir` (str): Path to the dataset directory
- `files` (list): List of file metadata objects, each with:
  - `name` (str): Filename
  - `sha256` (str): SHA-256 checksum hex digest
  - `size` (int): File size in bytes

The manifest (`dataset_manifest.json`) supports:

- Dataset versioning for reproducibility
- Detecting data changes between pipeline runs
- Validating data integrity during transfers
- Tracking dataset lineage in experiments
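A sketch of a manifest builder matching the fields described above. Sorting filenames (an assumption here) keeps the output deterministic across runs, which matters when diffing manifests to detect data changes:

```python
import hashlib
import json
from pathlib import Path

def create_dataset_manifest(data_dir, manifest_path):
    """Write a JSON manifest with SHA-256 checksums for every CSV in data_dir.

    Illustrative sketch of the structure documented above.
    """
    data_dir = Path(data_dir)
    files = []
    for csv_path in sorted(data_dir.glob("*.csv")):
        files.append({
            "name": csv_path.name,
            "sha256": hashlib.sha256(csv_path.read_bytes()).hexdigest(),
            "size": csv_path.stat().st_size,
        })
    manifest = {"dataset_dir": str(data_dir), "files": files}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```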
### hash_file

Computes the SHA-256 checksum of a file.

**Parameters:**

- Path to the file to hash

**Returns:** SHA-256 checksum as a hexadecimal string
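A minimal sketch of such a helper. Reading in fixed-size chunks (the `chunk_size` parameter is an assumption, not part of the documented signature) keeps memory use bounded for large files:

```python
import hashlib

def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files never have to fit in memory at once
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```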