Overview
The preprocessing module provides functions for cleaning hospital data, handling missing values, standardizing categorical variables, and ensuring consistent data types.
Functions
clean_hospital_data
Cleans and standardizes hospital patient data by handling missing values, normalizing gender values, and ensuring numeric columns are properly typed.
Input: the raw merged hospital DataFrame to clean, typically the output of merge_hospital_data().
Returns: a cleaned DataFrame with:
- Standardized gender values (“m” or “f”)
- Missing numeric values filled with 0
- Missing test results filled with “unknown”
- Missing diagnoses filled with “unknown”
- All numeric columns converted to proper numeric types
Data Transformations
The clean_hospital_data function applies the following transformations:
Gender Normalization
Standardizes gender values to single-character codes:

| Original | Normalized |
|---|---|
| "male" | "m" |
| "female" | "f" |
| "man" | "m" |
| "woman" | "f" |
| NaN | "f" |
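The mapping in the table can be sketched with pandas as follows. This is a minimal illustration, not the module's actual code: the column name `gender`, the constant `GENDER_MAP`, and the helper `normalize_gender` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Mapping copied from the table above (GENDER_MAP is a hypothetical name).
GENDER_MAP = {"male": "m", "female": "f", "man": "m", "woman": "f"}


def normalize_gender(df: pd.DataFrame, column: str = "gender") -> pd.DataFrame:
    """Return a copy of df with the gender column reduced to "m" or "f".

    The column name "gender" is an assumption; the source does not name it.
    """
    out = df.copy()
    # Replace the long forms first, then default missing values to "f".
    out[column] = out[column].replace(GENDER_MAP).fillna("f")
    return out


patients = pd.DataFrame({"gender": ["male", "woman", np.nan, "f"]})
print(normalize_gender(patients)["gender"].tolist())  # ['m', 'f', 'f', 'f']
```

Because missing values default to "f", a record with no recorded gender cannot be told apart from a female patient after cleaning, which is worth keeping in mind for any downstream analysis by gender.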
Numeric Column Filling
Fills missing values with 0 for the following numeric columns:
- bmi: Body Mass Index
- children: Number of children
- months: Duration in months
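A minimal sketch of this step, assuming NUMERIC_FILL_COLUMNS holds exactly the three columns listed above. The helper name `fill_numeric` is hypothetical, and skipping absent columns is a choice made for this example rather than documented behavior.

```python
import numpy as np
import pandas as pd

# Assumed contents, taken from the list above.
NUMERIC_FILL_COLUMNS = ["bmi", "children", "months"]


def fill_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing values in the listed columns set to 0."""
    out = df.copy()
    for col in NUMERIC_FILL_COLUMNS:
        # Skip columns that are absent so partial frames do not raise KeyError.
        if col in out.columns:
            out[col] = out[col].fillna(0)
    return out


raw = pd.DataFrame({"bmi": [22.5, np.nan], "children": [np.nan, 2.0]})
print(fill_numeric(raw))
```

Note that a filled 0 is indistinguishable from a genuine 0 (for example, a patient with no children), and a BMI of 0 is not a plausible measurement, so aggregates over these columns should be read with that in mind.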
Test Result Filling
Fills missing test results with “unknown” for:
- blood_test: Blood test results
- ecg: Electrocardiogram results
- ultrasound: Ultrasound results
- mri: MRI scan results
- xray: X-ray results
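The same pattern applies to the test result columns. The sketch below assumes TEST_COLUMNS contains the five columns listed above; the helper `fill_tests` and the sample values ("normal", "abnormal") are illustrative only and do not come from the source data.

```python
import pandas as pd

# Assumed contents, taken from the list above.
TEST_COLUMNS = ["blood_test", "ecg", "ultrasound", "mri", "xray"]


def fill_tests(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing test results replaced by "unknown"."""
    out = df.copy()
    for col in TEST_COLUMNS:
        if col in out.columns:
            out[col] = out[col].fillna("unknown")
    return out


raw = pd.DataFrame({"ecg": ["normal", None], "mri": [None, "abnormal"]})
print(fill_tests(raw))
```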
Diagnosis Filling
Fills missing diagnoses with “unknown”.
Numeric Type Conversion
Converts the following columns to numeric types with coercion, filling any conversion errors with 0:
- age: Patient age in years
- height: Height measurement
- weight: Weight measurement
- bmi: Body Mass Index
- children: Number of children
- months: Duration in months
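One way to express conversion with coercion is `pd.to_numeric(..., errors="coerce")`, which turns unparseable values into NaN so they can then be filled with 0. The sketch below is an assumption about how this step could look: `NUMERIC_COLUMNS` and `coerce_numeric` are hypothetical names, and the final cast to float64 is added here so the result has a uniform dtype.

```python
import pandas as pd

# Hypothetical constant; the columns are those listed above.
NUMERIC_COLUMNS = ["age", "height", "weight", "bmi", "children", "months"]


def coerce_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the listed columns converted to float64."""
    out = df.copy()
    for col in NUMERIC_COLUMNS:
        if col in out.columns:
            # Unparseable entries become NaN, then 0.
            converted = pd.to_numeric(out[col], errors="coerce").fillna(0)
            # to_numeric can return int64 when every value is an integer,
            # so cast explicitly to keep one dtype across columns.
            out[col] = converted.astype("float64")
    return out


raw = pd.DataFrame({"age": ["34", "n/a", 51]})
print(coerce_numeric(raw)["age"].tolist())  # [34.0, 0.0, 51.0]
```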
Constants
NUMERIC_FILL_COLUMNS
List of numeric columns that should be filled with 0 when missing.
TEST_COLUMNS
List of test result columns that should be filled with “unknown” when missing.
Data Quality
After cleaning, the DataFrame will have:
- No missing gender values: All filled with “f” as default
- No missing numeric values: All filled with 0
- No missing test results: All filled with “unknown”
- No missing diagnoses: All filled with “unknown”
- Consistent data types: All numeric columns properly typed as float64
- Standardized categories: Gender values normalized to “m” or “f”
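These guarantees can be verified after cleaning with a small check like the one below. It is a sketch under stated assumptions: the helper `find_quality_problems`, the constants `NUMERIC_COLUMNS` and `TEXT_COLUMNS`, and the column names `gender` and `diagnosis` are hypothetical, since the source does not name them.

```python
import pandas as pd

# Hypothetical constants built from the column lists in this document.
NUMERIC_COLUMNS = ["age", "height", "weight", "bmi", "children", "months"]
TEXT_COLUMNS = ["blood_test", "ecg", "ultrasound", "mri", "xray", "diagnosis"]


def find_quality_problems(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; empty if df looks clean."""
    problems = []
    # isin() is False for NaN, so missing genders are reported as well.
    if "gender" in df.columns and not df["gender"].isin(["m", "f"]).all():
        problems.append("gender: values other than 'm' or 'f'")
    for col in NUMERIC_COLUMNS:
        if col not in df.columns:
            continue
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col}: not a numeric dtype")
        elif df[col].isna().any():
            problems.append(f"{col}: missing values")
    for col in TEXT_COLUMNS:
        if col in df.columns and df[col].isna().any():
            problems.append(f"{col}: missing values")
    return problems


cleaned = pd.DataFrame(
    {"gender": ["m", "f"], "age": [34.0, 51.0], "ecg": ["normal", "unknown"]}
)
print(find_quality_problems(cleaned))  # []
```

Running a check like this in a test or at the end of a pipeline makes it easier to notice when new input columns or unexpected values slip past the cleaning step.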