The preprocessor module handles loading and preprocessing credit risk data for model training and inference.

Functions

load_data

```python
def load_data(filepath: str) -> pd.DataFrame
```
Loads a dataset from the specified file path.
Parameters:
  • filepath (str, required): Path to the CSV file containing the credit risk dataset

Returns:
  • pd.DataFrame: DataFrame containing the loaded dataset
```python
import pandas as pd
from processing.preprocessor import load_data

# Load the dataset
df = load_data("datasets/credit_score_dataset/german_credit_risk.csv")
print(df.head())
```
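Internally, `load_data` is presumably a thin wrapper around `pandas.read_csv`; the sketch below is an assumption based on the documented signature, not the actual implementation:

```python
import pandas as pd

def load_data(filepath: str) -> pd.DataFrame:
    # Read the CSV into a DataFrame; pandas infers each column's dtype
    return pd.read_csv(filepath)
```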

preprocess_data

```python
def preprocess_data(
    df: pd.DataFrame,
    target_column: str = "Risk",
    save_path: str = None
)
```
Preprocesses the credit risk dataset by handling missing values, encoding categorical features, and splitting data into training and test sets.
Parameters:
  • df (pd.DataFrame, required): Input DataFrame containing the credit risk data
  • target_column (str, default "Risk"): Name of the target column. The function expects the values “good” (encoded as 1) and “bad” (encoded as 0)
  • save_path (str, default None): Path at which to save the fitted preprocessor pipeline. If provided, the preprocessor is saved as a joblib file

Returns:
  • tuple: (X_train, X_test, y_train, y_test) containing the preprocessed and split data:
      • X_train: Training features (80% of data)
      • X_test: Test features (20% of data)
      • y_train: Training labels
      • y_test: Test labels
The function automatically handles:
  • Removing residual index columns (Unnamed: 0)
  • Encoding target variable: good → 1, bad → 0
  • Imputing missing values (mean for numerical, “unknown” for categorical)
  • Standardizing numerical features
  • One-hot encoding categorical features
  • 80/20 train-test split with random_state=42
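The target encoding and split steps above can be sketched as follows. This is an illustrative reconstruction, not the actual source; it assumes scikit-learn's `train_test_split`, and the helper name `encode_and_split` is hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def encode_and_split(df: pd.DataFrame, target_column: str = "Risk"):
    # Drop a residual index column if present
    df = df.drop(columns=["Unnamed: 0"], errors="ignore")
    # Encode the target: good -> 1, bad -> 0
    y = df[target_column].map({"good": 1, "bad": 0})
    X = df.drop(columns=[target_column])
    # 80/20 split with a fixed seed for reproducibility
    return train_test_split(X, y, test_size=0.2, random_state=42)
```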

Preprocessing Pipeline

The preprocessing pipeline applies different transformations to numerical and categorical features:

Numerical Features

Processed features: Age, Credit amount, Duration
  1. SimpleImputer: Imputes missing values using the mean
  2. StandardScaler: Normalizes features by removing the mean and scaling to unit variance

Categorical Features

Processed features: Sex, Job, Housing, Saving accounts, Checking account, Purpose
  1. SimpleImputer: Imputes missing values with “unknown”
  2. OneHotEncoder: Converts categorical variables into binary vectors with handle_unknown="ignore"
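The two branches described above can be combined into a single scikit-learn `ColumnTransformer`. The following is a minimal sketch of how such a pipeline might be assembled, based on the documented transformations (it is not the module's actual code):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_features = ["Age", "Credit amount", "Duration"]
categorical_features = ["Sex", "Job", "Housing", "Saving accounts",
                        "Checking account", "Purpose"]

# Numerical branch: mean imputation, then standardization
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical branch: fill missing values with "unknown", then one-hot encode
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numerical_features),
    ("cat", categorical_pipeline, categorical_features),
])
```

`handle_unknown="ignore"` makes the encoder emit all-zero vectors for categories seen at inference time but not during fitting, rather than raising an error.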
```python
import pandas as pd
from processing.preprocessor import load_data, preprocess_data

# Load the dataset
df = load_data("datasets/credit_score_dataset/german_credit_risk.csv")

# Preprocess and split the data
X_train, X_test, y_train, y_test = preprocess_data(
    df,
    target_column="Risk",
    save_path="models/preprocessor.joblib"
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
```
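A pipeline persisted via `save_path` can later be reloaded with `joblib` and applied to new data without refitting. The snippet below demonstrates the round trip on a small stand-in pipeline; the behaviour of `preprocess_data`'s own save step is assumed to match:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A small stand-in pipeline, fitted on toy data
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
train = pd.DataFrame({"Age": [25.0, 40.0, 55.0]})
pipeline.fit(train)

# Persist and reload, mirroring what save_path does inside preprocess_data
path = os.path.join(tempfile.mkdtemp(), "preprocessor.joblib")
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline transforms new data, imputing the missing value
new_rows = pd.DataFrame({"Age": [30.0, None]})
print(restored.transform(new_rows))
```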

Source Code Reference

View the complete implementation at processing/preprocessor.py:22-106
