Data Cleaning
The clean_hospital_data() function cleans the merged hospital dataset: it standardizes categorical values, fills missing data, and coerces columns to numeric types.
Function Signature
Parameters:
- df (pd.DataFrame): Raw hospital data from the ingestion pipeline
Returns:
- Cleaned DataFrame with standardized values and proper data types
Cleaning Operations
The cleaning process applies several transformations:
1. Gender Standardization
Normalizes gender values to single-letter codes:
- Maps “male” and “man” → “m”
- Maps “female” and “woman” → “f”
- Fills missing values with “f”
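The mapping above can be sketched as follows. This is a minimal illustration, not the function's actual code: the column name `gender`, the helper name `standardize_gender`, and the exact-match (case-sensitive) lookup are all assumptions.

```python
import pandas as pd

# Hypothetical helper; the real function may implement this step differently.
GENDER_MAP = {"male": "m", "man": "m", "female": "f", "woman": "f"}

def standardize_gender(df: pd.DataFrame) -> pd.DataFrame:
    """Map long-form gender labels to single-letter codes, then fill gaps with "f"."""
    out = df.copy()
    # Values not in the mapping (e.g. an existing "f") pass through unchanged.
    out["gender"] = out["gender"].replace(GENDER_MAP).fillna("f")
    return out

sample = pd.DataFrame({"gender": ["male", "woman", None, "f", "man"]})
result = standardize_gender(sample)
```

Note that filling missing values with “f” changes the gender distribution of the dataset, so downstream analyses that depend on it should take the imputation into account.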
2. Numeric Column Imputation
Fills missing values with 0 for specific numeric columns:
- bmi: Body Mass Index
- children: Number of children
- months: Duration in months
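A short sketch of this step, assuming the three columns are present under exactly these names. The helper name `fill_numeric_gaps` is hypothetical.

```python
import pandas as pd

NUMERIC_FILL_COLUMNS = ["bmi", "children", "months"]

def fill_numeric_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Replace missing values with 0 in the listed columns only."""
    out = df.copy()
    out[NUMERIC_FILL_COLUMNS] = out[NUMERIC_FILL_COLUMNS].fillna(0)
    return out

nan = float("nan")
sample = pd.DataFrame({
    "bmi": [22.5, nan],
    "children": [nan, 2.0],
    "months": [nan, 3.0],
    "age": [30.0, nan],  # not in the list, so its gap is left alone
})
result = fill_numeric_gaps(sample)
```

Columns outside the list, such as `age` in this sample, keep their missing values at this stage.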
3. Medical Test Results
Fills missing test results with “unknown”:
- blood_test
- ecg (electrocardiogram)
- ultrasound
- mri (magnetic resonance imaging)
- xray
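The same pattern applies to the test-result columns. The sketch below assumes the columns hold string labels; the sample values "t" and "f" and the helper name `fill_test_results` are illustrative only.

```python
import pandas as pd

TEST_COLUMNS = ["blood_test", "ecg", "ultrasound", "mri", "xray"]

def fill_test_results(df: pd.DataFrame) -> pd.DataFrame:
    """Mark tests with no recorded result as "unknown"."""
    out = df.copy()
    out[TEST_COLUMNS] = out[TEST_COLUMNS].fillna("unknown")
    return out

sample = pd.DataFrame({
    "blood_test": ["t", None],
    "ecg": [None, "f"],
    "ultrasound": ["f", None],
    "mri": [None, "t"],
    "xray": ["t", None],
})
result = fill_test_results(sample)
```

Using an explicit “unknown” label keeps "test not performed or not recorded" distinct from a negative result.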
4. Diagnosis Imputation
Fills missing diagnosis values.
5. Type Conversion and Coercion
Converts columns to numeric types with error handling:
- Uses errors="coerce" to convert invalid values to NaN
- Fills resulting NaN values with 0
- Ensures all values are numeric (float or int)
Columns converted:
- age
- height
- weight
- bmi
- children
- months
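The coercion step can be illustrated with `pd.to_numeric`, which accepts `errors="coerce"`. This is a sketch under the assumption that the six columns above are converted one at a time; the helper name `coerce_numeric` and the sample values are hypothetical.

```python
import pandas as pd

NUMERIC_COLUMNS = ["age", "height", "weight", "bmi", "children", "months"]

def coerce_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Convert each column to a numeric dtype; unparseable values become 0."""
    out = df.copy()
    for col in NUMERIC_COLUMNS:
        # Invalid entries turn into NaN first, then NaN is replaced with 0.
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0)
    return out

sample = pd.DataFrame({
    "age": ["34", "n/a", 51],
    "height": ["1.7", "unknown", "1.82"],
    "weight": [70, "seventy", 82.5],
    "bmi": ["24.2", "25.1", "bad"],
    "children": [0, 1, 2],
    "months": ["3", "x", "12"],
})
result = coerce_numeric(sample)
```

Because invalid entries end up as 0, a 0 in these columns can mean either a true zero or a value that failed to parse.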