Data preprocessing, feature scaling, encoding, and train/test splitting
The Preprocessing module provides tools for preparing data before model training: feature scaling, categorical-variable encoding, and data-splitting utilities whose API follows scikit-learn conventions.
Standardize features by removing the mean and scaling to unit variance:
    import { StandardScaler } from 'deepbox/preprocess';
    import { tensor } from 'deepbox/ndarray';

    const scaler = new StandardScaler();
    const X_train = tensor([
      [1, 2],
      [2, 3],
      [3, 4],
      [4, 5]
    ]);

    // Fit and transform training data
    scaler.fit(X_train);
    const X_train_scaled = scaler.transform(X_train);

    // Transform test data using training statistics
    const X_test = tensor([[5, 6]]);
    const X_test_scaled = scaler.transform(X_test);

    // Or fit and transform in one step
    const X_scaled = scaler.fitTransform(X_train);
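To make the arithmetic behind standardization concrete, here is a minimal plain-TypeScript sketch that works on nested arrays rather than deepbox tensors. The helpers `fitColumnStats` and `applyColumnStats` are hypothetical names used only for illustration, and the sketch assumes the population standard deviation (dividing by n); deepbox's own `StandardScaler` may differ in such details.

```typescript
// Hypothetical helpers for illustration; these are not deepbox APIs.
type ColumnStats = { mean: number[]; std: number[] };

// Learn per-column mean and population standard deviation (divide by n).
function fitColumnStats(X: number[][]): ColumnStats {
  const n = X.length;
  const mean: number[] = X[0].map(() => 0);
  const std: number[] = X[0].map(() => 0);
  for (const row of X) {
    for (let j = 0; j < row.length; j++) mean[j] += row[j] / n;
  }
  for (const row of X) {
    for (let j = 0; j < row.length; j++) {
      const diff = row[j] - mean[j];
      std[j] += (diff * diff) / n;
    }
  }
  for (let j = 0; j < std.length; j++) {
    // Fall back to 1 for constant columns to avoid dividing by zero.
    std[j] = Math.sqrt(std[j]) || 1;
  }
  return { mean, std };
}

// Apply previously learned statistics to any data set.
function applyColumnStats(X: number[][], s: ColumnStats): number[][] {
  return X.map(row => row.map((v, j) => (v - s.mean[j]) / s.std[j]));
}

const trainRows = [[1, 2], [2, 3], [3, 4], [4, 5]];
const trainStats = fitColumnStats(trainRows);
const trainScaled = applyColumnStats(trainRows, trainStats);
// The test row reuses the training mean and std; nothing is re-estimated.
const testScaled = applyColumnStats([[5, 6]], trainStats);
```

Keeping the statistics in a separate object mirrors the fit/transform split: the test data is scaled with values learned from the training data only, which avoids leaking test information into preprocessing.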
Scale features using statistics robust to outliers:
    import { RobustScaler } from 'deepbox/preprocess';

    // Uses median and IQR instead of mean and std
    const scaler = new RobustScaler({
      quantileRange: [0.25, 0.75] // IQR
    });

    scaler.fit(X_train);
    const X_scaled = scaler.transform(X_train);
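The following plain-TypeScript sketch shows why median and IQR are less sensitive to outliers than mean and standard deviation. The helpers `quantile` and `robustScaleColumn` are hypothetical, and the linear-interpolation quantile is an assumption; the library may compute quantiles differently.

```typescript
// Hypothetical helpers for illustration; these are not deepbox APIs.

// Quantile of an ascending-sorted array, using linear interpolation.
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Center on the median and divide by the interquantile range.
function robustScaleColumn(values: number[], qLow = 0.25, qHigh = 0.75): number[] {
  const sorted = values.slice().sort((a, b) => a - b);
  const median = quantile(sorted, 0.5);
  // Fall back to 1 when the range collapses to zero.
  const range = (quantile(sorted, qHigh) - quantile(sorted, qLow)) || 1;
  return values.map(v => (v - median) / range);
}

const columnWithOutlier = [1, 2, 3, 4, 100];
const robustScaled = robustScaleColumn(columnWithOutlier);
// Median is 3 and IQR is 2, so the result is [-1, -0.5, 0, 0.5, 48.5].
```

The outlier 100 still maps to a large value, but it does not distort the scaling of the other four points, because neither the median nor the IQR depends on it.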
Scale by maximum absolute value (preserves sparsity):
    import { MaxAbsScaler } from 'deepbox/preprocess';

    const scaler = new MaxAbsScaler();
    scaler.fit(X_train);
    const X_scaled = scaler.transform(X_train);
    // All values will be in [-1, 1]
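As a rough illustration of why max-abs scaling preserves sparsity, the sketch below divides each column by its largest absolute value without shifting the data, so zeros stay zero. `maxAbsScale` is a hypothetical helper on plain arrays, not a deepbox function.

```typescript
// Hypothetical helper for illustration; this is not a deepbox API.
function maxAbsScale(X: number[][]): number[][] {
  // Largest absolute value seen in each column.
  const maxAbs: number[] = X[0].map(() => 0);
  for (const row of X) {
    for (let j = 0; j < row.length; j++) {
      maxAbs[j] = Math.max(maxAbs[j], Math.abs(row[j]));
    }
  }
  // No centering step: only a division, so zero entries remain zero.
  return X.map(row => row.map((v, j) => (maxAbs[j] === 0 ? 0 : v / maxAbs[j])));
}

const sparseRows = [[0, -4], [2, 0], [-1, 2]];
const maxAbsScaled = maxAbsScale(sparseRows);
// Column maxima are 2 and 4, so the result is [[0, -1], [1, 0], [-0.5, 0.5]].
```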
Normalize each sample (row) to unit norm:

    import { Normalizer } from 'deepbox/preprocess';

    const normalizer = new Normalizer({
      norm: 'l2' // 'l1', 'l2', or 'max'
    });

    const X_normalized = normalizer.transform(X);
    // Each row has unit L2 norm
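Unlike the scalers above, normalization works per row rather than per column. The sketch below shows the three norms named in the option comment on plain arrays. `rowNorm` and `normalizeRows` are hypothetical helpers, and leaving all-zero rows unchanged is an assumption about edge-case handling.

```typescript
// Hypothetical helpers for illustration; these are not deepbox APIs.
type NormKind = 'l1' | 'l2' | 'max';

function rowNorm(row: number[], kind: NormKind): number {
  if (kind === 'l1') return row.reduce((s, v) => s + Math.abs(v), 0);
  if (kind === 'l2') return Math.sqrt(row.reduce((s, v) => s + v * v, 0));
  return row.reduce((m, v) => Math.max(m, Math.abs(v)), 0);
}

function normalizeRows(X: number[][], kind: NormKind = 'l2'): number[][] {
  return X.map(row => {
    const n = rowNorm(row, kind);
    // Assumption: rows with zero norm are returned unchanged.
    return n === 0 ? row.slice() : row.map(v => v / n);
  });
}

const l2Rows = normalizeRows([[3, 4], [0, 0]], 'l2');
// The first row has L2 norm 5, so it becomes [0.6, 0.8].
const l1Rows = normalizeRows([[1, -3]], 'l1');
// The L1 norm is 4, so the row becomes [0.25, -0.75].
```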
Stratified K-fold cross-validation preserves the class distribution in each fold:

    import { StratifiedKFold } from 'deepbox/preprocess';

    const skfold = new StratifiedKFold({
      nSplits: 5,
      shuffle: true,
      randomState: 42
    });

    for (const { train, test } of skfold.split(X, y)) {
      // Each fold has same class distribution as original data
      const X_train_fold = X.gather(train);
      const y_train_fold = y.gather(train);
      // ...
    }
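One simple way to obtain folds with matching class proportions is to deal each class's sample indices round-robin across the folds. The sketch below does this without shuffling; `stratifiedFolds` is a hypothetical helper, and the library's actual assignment strategy may differ.

```typescript
// Hypothetical helper for illustration; this is not a deepbox API.
type Fold = { train: number[]; test: number[] };

function stratifiedFolds(y: number[], nSplits: number): Fold[] {
  // Collect sample indices per class label.
  const byClass: { [label: string]: number[] } = {};
  y.forEach((label, i) => {
    const key = String(label);
    if (!byClass[key]) byClass[key] = [];
    byClass[key].push(i);
  });
  // Deal each class's indices round-robin so every fold gets a similar share.
  const foldOf: number[] = y.map(() => 0);
  Object.keys(byClass).forEach(key => {
    byClass[key].forEach((idx, k) => {
      foldOf[idx] = k % nSplits;
    });
  });
  const folds: Fold[] = [];
  for (let f = 0; f < nSplits; f++) {
    const train: number[] = [];
    const test: number[] = [];
    foldOf.forEach((fold, i) => {
      if (fold === f) test.push(i);
      else train.push(i);
    });
    folds.push({ train, test });
  }
  return folds;
}

const classLabels = [0, 0, 0, 0, 1, 1];
const stratified = stratifiedFolds(classLabels, 2);
// stratified[0].test is [0, 2, 4]: two samples of class 0 and one of class 1,
// the same 2:1 ratio as the full label array.
```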
Group K-fold keeps every group entirely in either the training set or the test set:

    import { GroupKFold } from 'deepbox/preprocess';

    const groups = [0, 0, 1, 1, 2, 2, 3, 3]; // Group labels
    const gkfold = new GroupKFold({ nSplits: 4 });

    for (const { train, test } of gkfold.split(X, y, groups)) {
      // Groups in train and test are disjoint
    }
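The disjointness guarantee can be sketched by assigning whole groups, rather than individual samples, to folds. The hypothetical helper `groupFolds` below uses a simple round-robin over distinct group labels; a real implementation may instead balance fold sizes.

```typescript
// Hypothetical helper for illustration; this is not a deepbox API.
type GroupFold = { train: number[]; test: number[] };

function groupFolds(groups: number[], nSplits: number): GroupFold[] {
  // Distinct group labels in order of first appearance.
  const unique = groups.filter((g, i) => groups.indexOf(g) === i);
  const folds: GroupFold[] = [];
  for (let f = 0; f < nSplits; f++) {
    const train: number[] = [];
    const test: number[] = [];
    groups.forEach((g, i) => {
      // A whole group goes to the fold given by its position modulo nSplits.
      if (unique.indexOf(g) % nSplits === f) test.push(i);
      else train.push(i);
    });
    folds.push({ train, test });
  }
  return folds;
}

const groupLabels = [0, 0, 1, 1, 2, 2, 3, 3];
const grouped = groupFolds(groupLabels, 4);
// grouped[0] holds out group 0: test is [0, 1], train is [2, 3, 4, 5, 6, 7].
```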
Leave-one-out holds out a single sample per split:

    import { LeaveOneOut } from 'deepbox/preprocess';

    const loo = new LeaveOneOut();

    for (const { train, test } of loo.split(X.shape[0])) {
      // test contains exactly one sample
      console.log(train.length, test.length); // n-1, 1
    }
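For reference, the index sets that leave-one-out produces can be enumerated in a few lines. `leaveOneOutSplits` is a hypothetical helper that returns an array, rather than the iterator used above.

```typescript
// Hypothetical helper for illustration; this is not a deepbox API.
function leaveOneOutSplits(n: number): { train: number[]; test: number[] }[] {
  const splits: { train: number[]; test: number[] }[] = [];
  for (let i = 0; i < n; i++) {
    const train: number[] = [];
    for (let j = 0; j < n; j++) {
      if (j !== i) train.push(j);
    }
    // Sample i is the only test sample in split i.
    splits.push({ train, test: [i] });
  }
  return splits;
}

const looSplits = leaveOneOutSplits(4);
// Four splits in total; looSplits[1] is { train: [0, 2, 3], test: [1] }.
```

Because the number of splits equals the number of samples, the cost of leave-one-out grows linearly with the data set size, with one model fit per sample.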
Scale features before fitting a model:

    import { StandardScaler } from 'deepbox/preprocess';
    import { LogisticRegression } from 'deepbox/ml';

    const scaler = new StandardScaler();
    const X_scaled = scaler.fitTransform(X_train);

    const model = new LogisticRegression();
    model.fit(X_scaled, y_train);
Categorical Variable Encoding
Convert categories to numbers:
    import { OneHotEncoder } from 'deepbox/preprocess';

    const encoder = new OneHotEncoder();
    const X_encoded = encoder.fitTransform(X_categorical);
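To show what a one-hot encoding looks like, the sketch below encodes a single categorical column held in a plain string array. `oneHot` is a hypothetical helper, and sorting the categories alphabetically is an assumption; the library's column order and output type may differ.

```typescript
// Hypothetical helper for illustration; this is not a deepbox API.
function oneHot(values: string[]): { categories: string[]; encoded: number[][] } {
  // Distinct categories, sorted so the column order is deterministic.
  const categories = values.filter((v, i) => values.indexOf(v) === i).sort();
  // One output column per category; exactly one 1 per row.
  const encoded = values.map(v => categories.map(c => (c === v ? 1 : 0)));
  return { categories, encoded };
}

const colorResult = oneHot(['red', 'green', 'red', 'blue']);
// colorResult.categories is ['blue', 'green', 'red'].
// colorResult.encoded is [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]].
```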
Robust to Outliers
Use robust scaling for data with outliers:
    import { RobustScaler } from 'deepbox/preprocess';

    const scaler = new RobustScaler();
    const X_scaled = scaler.fitTransform(X_with_outliers);