Functions
load_data
Path to the CSV file containing the credit risk dataset
DataFrame containing the loaded dataset
preprocess_data
Input DataFrame containing the credit risk data
Name of the target column. The function expects values “good” (encoded as 1) and “bad” (encoded as 0)
Path to save the fitted preprocessor pipeline. If provided, the preprocessor will be saved as a joblib file
Returns a tuple (X_train, X_test, y_train, y_test) containing the preprocessed and split data:
- X_train: Training features (80% of data)
- X_test: Test features (20% of data)
- y_train: Training labels
- y_test: Test labels
The function automatically handles:
- Removing residual index columns (Unnamed: 0)
- Encoding target variable: good → 1, bad → 0
- Imputing missing values (mean for numerical, “unknown” for categorical)
- Standardizing numerical features
- One-hot encoding categorical features
- 80/20 train-test split with random_state=42
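The cleaning and splitting steps listed above can be sketched roughly as follows. This is an illustration rather than the project's implementation: the helper name clean_and_split, the target column name "Risk", and the toy data are all assumptions made for the example, and the imputation, scaling, and encoding steps are left out.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def clean_and_split(df: pd.DataFrame, target_col: str = "Risk"):
    """Hypothetical helper mirroring the documented cleaning and split steps."""
    # Drop the residual index column if present (left over from a CSV export).
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")
    # Encode the target: good -> 1, bad -> 0.
    y = df[target_col].map({"good": 1, "bad": 0})
    X = df.drop(columns=[target_col])
    # 80/20 split with a fixed seed so the split is reproducible.
    return train_test_split(X, y, test_size=0.2, random_state=42)


# Toy data (made up for illustration; "Risk" is an assumed column name).
toy = pd.DataFrame({
    "Unnamed: 0": range(10),
    "Age": [22, 35, 47, 51, 29, 33, 61, 40, 26, 38],
    "Risk": ["good", "bad"] * 5,
})
X_train, X_test, y_train, y_test = clean_and_split(toy)
print(X_train.shape, X_test.shape)
```

Fixing random_state means repeated runs produce the same train and test rows, which keeps evaluation results comparable between experiments.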
Preprocessing Pipeline
The preprocessing pipeline applies different transformations to numerical and categorical features.
Numerical Features
Processed features: Age, Credit amount, Duration
- SimpleImputer: Imputes missing values using the mean
- StandardScaler: Normalizes features by removing the mean and scaling to unit variance
Categorical Features
Processed features: Sex, Job, Housing, Saving accounts, Checking account, Purpose
- SimpleImputer: Imputes missing values with “unknown”
- OneHotEncoder: Converts categorical variables into binary vectors with handle_unknown="ignore"
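A minimal sketch of how the two branches described above could be assembled with scikit-learn's ColumnTransformer. The step names ("imputer", "scaler", "encoder", "num", "cat") and the toy rows are assumptions for illustration; the actual implementation in processing/preprocessor.py may be structured differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERICAL = ["Age", "Credit amount", "Duration"]
CATEGORICAL = ["Sex", "Job", "Housing", "Saving accounts",
               "Checking account", "Purpose"]

# Numerical branch: mean imputation, then zero mean and unit variance.
numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
# Categorical branch: fill gaps with "unknown", then one-hot encode.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numerical_pipeline, NUMERICAL),
    ("cat", categorical_pipeline, CATEGORICAL),
])

# Toy rows, made up for illustration (Job is given as strings here).
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Credit amount": [1000, 2500, 1800, 3200],
    "Duration": [12, 24, 6, 36],
    "Sex": ["male", "female", "female", "male"],
    "Job": ["skilled", "unskilled", "skilled", "skilled"],
    "Housing": ["own", "rent", "own", "free"],
    "Saving accounts": ["little", np.nan, "rich", "little"],
    "Checking account": [np.nan, "moderate", "little", np.nan],
    "Purpose": ["car", "education", "car", "radio/TV"],
})
Xt = preprocessor.fit_transform(toy)
# The transformer may return a sparse matrix; densify for inspection.
if hasattr(Xt, "toarray"):
    Xt = Xt.toarray()
Xt = np.asarray(Xt)
print(Xt.shape)
```

Because the imputers and scaler are fitted objects, the same fitted preprocessor should be reused at prediction time (for example by saving it with joblib, as the save path option described earlier allows) rather than refitted on new data.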
Source Code Reference
View the complete implementation at processing/preprocessor.py:22-106