Encoders transform categorical features (strings, labels) into numeric representations suitable for machine learning.

LabelEncoder

Encode target labels with values between 0 and n_classes-1.

Time Complexity:
  • fit: O(n) where n is number of samples
  • transform: O(n) with O(1) lookup per sample
Space Complexity: O(k) where k is number of unique classes

import { LabelEncoder } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const y = tensor(['cat', 'dog', 'cat', 'bird']);
const encoder = new LabelEncoder();
encoder.fit(y);
const yEncoded = encoder.transform(y);  // [1, 2, 1, 0]
const yDecoded = encoder.inverseTransform(yEncoded); // ['cat', 'dog', 'cat', 'bird']

Methods

fit
(y: Tensor | Array) => this
Learn unique classes from labels.
Parameters:
  • y - Target labels (1D tensor or array of strings/numbers)
Returns: self for method chaining
transform
(y: Tensor | Array) => Tensor
Transform labels to normalized encoding [0, n_classes-1].
Returns: Integer tensor with encoded labels
Throws: InvalidParameterError if a label was not seen during fit
fitTransform
(y: Tensor | Array) => Tensor
Fit to data and transform in one step.
inverseTransform
(y: Tensor | Array) => Tensor
Transform integer labels back to the original class labels.

Attributes

After fitting:
  • classes_ - Unique classes in sorted order
  • classToIndex_ - Map from class to integer index
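
The two attributes above suggest how the encoding can work: a sorted class list for decoding and a map for O(1) encoding. The following is a minimal sketch of that idea in plain TypeScript, not deepbox's actual implementation; `fitLabels`, `encodeLabels`, and `decodeLabels` are hypothetical names, and the string-based sort is an assumption.

```typescript
// Illustrative sketch only; these helper names are hypothetical, not deepbox APIs.
type Label = string | number;

// Learn the classes: deduplicate, then sort so the indices are deterministic.
function fitLabels(y: readonly Label[]): { classes: Label[]; classToIndex: Map<Label, number> } {
  const classes = Array.from(new Set(y)).sort((a, b) => {
    const sa = String(a);
    const sb = String(b);
    return sa < sb ? -1 : sa > sb ? 1 : 0;
  });
  const classToIndex = new Map<Label, number>();
  classes.forEach((c, i) => classToIndex.set(c, i));
  return { classes, classToIndex };
}

// Encode each label with a single map lookup; unseen labels raise an error.
function encodeLabels(y: readonly Label[], classToIndex: Map<Label, number>): number[] {
  return y.map((label) => {
    const index = classToIndex.get(label);
    if (index === undefined) {
      throw new Error(`Label not seen during fit: ${String(label)}`);
    }
    return index;
  });
}

// Decode by indexing back into the sorted class list.
function decodeLabels(encoded: readonly number[], classes: readonly Label[]): Label[] {
  return encoded.map((i) => {
    const label = classes[i];
    if (label === undefined) {
      throw new Error(`Index out of range: ${i}`);
    }
    return label;
  });
}

const fitted = fitLabels(['cat', 'dog', 'cat', 'bird']);
const codes = encodeLabels(['cat', 'dog', 'cat', 'bird'], fitted.classToIndex);
console.log(fitted.classes); // ['bird', 'cat', 'dog']
console.log(codes); // [1, 2, 1, 0]
```

The map costs O(k) space for k classes, which matches the space complexity stated for LabelEncoder.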

OneHotEncoder

Encode categorical features as one-hot numeric array.

Time Complexity:
  • fit: O(n*m) where n=samples, m=features
  • transform: O(n*m*k) where k=avg categories per feature
Space Complexity:
  • Dense: O(n * Σk_i) where k_i is unique categories for feature i
  • Sparse: O(nnz) number of non-zero elements

import { OneHotEncoder } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([['red', 'S'], ['blue', 'M'], ['red', 'L']]);
const encoder = new OneHotEncoder({ sparse: false });
encoder.fit(X);
const encoded = encoder.transform(X);
// Result: [[1,0,1,0,0], [0,1,0,1,0], [1,0,0,0,1]]

Constructor

new OneHotEncoder(options?: {
  sparse?: boolean;                     // Return CSRMatrix if true (default: false)
  sparseOutput?: boolean;               // Alias for sparse
  handleUnknown?: 'error' | 'ignore';   // How to handle unknown categories (default: 'error')
  drop?: 'first' | 'if_binary' | null;  // Drop policy to avoid collinearity
  categories?: 'auto' | Category[][];   // Explicit categories per feature
})
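
To make the drop option concrete, here is a small sketch of which output columns would survive under each policy. It assumes the scikit-learn-style meaning of the values ('first' drops the first category of every feature, 'if_binary' drops it only for features with exactly two categories); the source does not confirm that deepbox follows these semantics, and `columnsToKeep` is a hypothetical helper.

```typescript
// Illustrative sketch only; assumes scikit-learn-style drop semantics.
type DropPolicy = 'first' | 'if_binary' | null;

// Given the number of categories per feature, return the indices of the
// one-hot columns that remain after applying the drop policy.
function columnsToKeep(categoryCounts: readonly number[], drop: DropPolicy): number[] {
  const keep: number[] = [];
  let offset = 0; // first column of the current feature's block
  categoryCounts.forEach((count) => {
    const dropFirst = drop === 'first' || (drop === 'if_binary' && count === 2);
    for (let k = 0; k < count; k++) {
      if (!(dropFirst && k === 0)) {
        keep.push(offset + k);
      }
    }
    offset += count;
  });
  return keep;
}

// Two features: one with 2 categories, one with 3 (5 columns before dropping).
console.log(columnsToKeep([2, 3], null)); // [0, 1, 2, 3, 4]
console.log(columnsToKeep([2, 3], 'first')); // [1, 3, 4]
console.log(columnsToKeep([2, 3], 'if_binary')); // [1, 2, 3, 4]
```

Dropping one column per feature removes the redundancy where the columns of a feature always sum to 1, which is the collinearity the option refers to.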

Methods

fit
(X: Tensor | Array[][]) => this
Learn unique categories for each feature.
Parameters:
  • X - Training data (2D tensor or array)
Returns: self for method chaining
transform
(X: Tensor | Array[][]) => Tensor | CSRMatrix
Transform categorical features to one-hot encoding.
Returns: Binary matrix (dense Tensor or sparse CSRMatrix)
  • sparse=false: Returns dense Tensor
  • sparse=true: Returns CSRMatrix for memory efficiency
fitTransform
(X: Tensor | Array[][]) => Tensor | CSRMatrix
Fit and transform in one step.
inverseTransform
(X: Tensor | CSRMatrix) => Tensor
Transform one-hot encoding back to original categories.

Attributes

After fitting:
  • categories_ - Unique categories for each feature
  • dropIndices_ - Index of dropped category per feature (if drop is set)
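
The dense layout can be pictured as one contiguous block of columns per feature, so the output width is the sum of the per-feature category counts. The sketch below shows that layout in plain TypeScript. It is not deepbox's implementation: `oneHotDense` is a hypothetical helper, and the category lists are passed in explicitly so the sketch makes no claim about the order the library learns during fit.

```typescript
// Illustrative sketch only; oneHotDense is a hypothetical helper, not a deepbox API.
type Category = string | number;

function oneHotDense(
  X: readonly (readonly Category[])[],
  categories: readonly (readonly Category[])[], // one category list per feature
  handleUnknown: 'error' | 'ignore' = 'error'
): number[][] {
  // offsets[j] is the first output column that belongs to feature j.
  const offsets: number[] = [];
  let width = 0;
  categories.forEach((cats) => {
    offsets.push(width);
    width += cats.length;
  });

  return X.map((row) => {
    const out = new Array<number>(width).fill(0);
    row.forEach((value, j) => {
      const cats = categories[j];
      const offset = offsets[j];
      if (cats === undefined || offset === undefined) {
        throw new Error(`No categories for feature ${j}`);
      }
      const k = cats.indexOf(value);
      if (k >= 0) {
        out[offset + k] = 1;
      } else if (handleUnknown === 'error') {
        throw new Error(`Unknown category in feature ${j}: ${String(value)}`);
      }
      // With 'ignore', the block for this feature stays all zeros.
    });
    return out;
  });
}

const cats = [['red', 'blue'], ['S', 'M', 'L']];
const dense = oneHotDense([['red', 'S'], ['blue', 'M'], ['red', 'L']], cats);
console.log(dense); // [[1,0,1,0,0], [0,1,0,1,0], [1,0,0,0,1]]
console.log(oneHotDense([['green', 'M']], cats, 'ignore')); // [[0,0,0,1,0]]
```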

OrdinalEncoder

Encode categorical features as integer array.

Time Complexity:
  • fit: O(n*m*log(k)) where n=samples, m=features, k=avg categories
  • transform: O(n*m) with O(1) map lookup
Space Complexity: O(m*k) where m=features, k=avg categories per feature

import { OrdinalEncoder } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([['low', 'red'], ['high', 'blue'], ['medium', 'red']]);
const encoder = new OrdinalEncoder();
encoder.fit(X);
const encoded = encoder.transform(X);
// Result: [[1, 1], [0, 0], [2, 1]] (alphabetically sorted)

Constructor

new OrdinalEncoder(options?: {
  handleUnknown?: 'error' | 'useEncodedValue';  // How to handle unknown categories
  unknownValue?: number;                        // Value for unknown (default: -1)
  categories?: 'auto' | Category[][];           // Explicit categories per feature
})

Methods

fit
(X: Tensor | Array[][]) => this
Learn unique categories and their ordering for each feature.
transform
(X: Tensor | Array[][]) => Tensor
Transform categorical features to ordinal integers [0, n_categories-1].
fitTransform
(X: Tensor | Array[][]) => Tensor
Fit and transform in one step.
inverseTransform
(X: Tensor | Array[][]) => Tensor
Transform ordinal integers back to original categories.

Attributes

After fitting:
  • categories_ - Sorted unique categories for each feature
  • categoryToIndex_ - Map from category to index for each feature
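
As a rough illustration of how sorted categories, per-feature maps, and an unknown value fit together, here is a minimal sketch in plain TypeScript. It is not deepbox's implementation; `fitOrdinal` and `transformOrdinal` are hypothetical names, and only string categories are handled for brevity.

```typescript
// Illustrative sketch only; these helpers are hypothetical, not deepbox APIs.

// Collect the unique values of each column and sort them.
function fitOrdinal(X: readonly (readonly string[])[]): string[][] {
  const seen: Set<string>[] = [];
  X.forEach((row) =>
    row.forEach((value, j) => {
      let columnSet = seen[j];
      if (columnSet === undefined) {
        columnSet = new Set<string>();
        seen[j] = columnSet;
      }
      columnSet.add(value);
    })
  );
  return seen.map((s) => Array.from(s).sort());
}

// Map each value to its index; unknown values become unknownValue if one
// is supplied, otherwise they raise an error.
function transformOrdinal(
  X: readonly (readonly string[])[],
  categories: readonly (readonly string[])[],
  unknownValue?: number
): number[][] {
  const maps = categories.map((cats) => {
    const m = new Map<string, number>();
    cats.forEach((c, i) => m.set(c, i));
    return m;
  });
  return X.map((row) =>
    row.map((value, j) => {
      const m = maps[j];
      const index = m === undefined ? undefined : m.get(value);
      if (index !== undefined) return index;
      if (unknownValue === undefined) {
        throw new Error(`Unknown category in feature ${j}: ${value}`);
      }
      return unknownValue;
    })
  );
}

const learned = fitOrdinal([['low', 'red'], ['high', 'blue'], ['medium', 'red']]);
console.log(learned); // [['high', 'low', 'medium'], ['blue', 'red']]
console.log(transformOrdinal([['medium', 'red'], ['tiny', 'blue']], learned, -1));
// [[2, 1], [-1, 0]]
```

The sort during fit is where the log(k) factor in the fit complexity would come from; transform only performs map lookups.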

LabelBinarizer

Binarize labels in a one-vs-all fashion.

Time Complexity:
  • fit: O(n) where n is number of samples
  • transform: O(n*k) where k is number of classes
Space Complexity: O(n*k) for the output matrix

import { LabelBinarizer } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const y = tensor([0, 1, 2, 0, 1]);
const binarizer = new LabelBinarizer();
const yBin = binarizer.fitTransform(y);
// Result shape: [5, 3] with one-hot encoding

Constructor

new LabelBinarizer(options?: {
  posLabel?: number;       // Value for positive class (default: 1)
  negLabel?: number;       // Value for negative class (default: 0)
  sparse?: boolean;        // Return CSRMatrix if true (default: false)
  sparseOutput?: boolean;  // Alias for sparse
})

Methods

fit
(y: Tensor | Array) => this
Learn unique classes from labels.
transform
(y: Tensor | Array) => Tensor | CSRMatrix
Transform labels to binary matrix. Each label is converted to a binary vector with:
  • posLabel (default 1) at the class position
  • negLabel (default 0) elsewhere
fitTransform
(y: Tensor | Array) => Tensor | CSRMatrix
Fit and transform in one step.
inverseTransform
(Y: Tensor | CSRMatrix) => Tensor
Transform binary matrix back to labels. Finds the column with maximum value for each row.

Attributes

After fitting:
  • classes_ - Unique classes in sorted order
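
The transform and inverse described above can be sketched in a few lines: write posLabel at the class position and negLabel elsewhere, then recover the label by taking the column with the maximum value. This is a plain TypeScript illustration with hypothetical helper names (`binarize`, `argmaxDecode`), not deepbox's implementation.

```typescript
// Illustrative sketch only; these helpers are hypothetical, not deepbox APIs.

// One row per label: posLabel at the matching class column, negLabel elsewhere.
function binarize(
  y: readonly number[],
  classes: readonly number[],
  posLabel = 1,
  negLabel = 0
): number[][] {
  return y.map((label) => classes.map((c) => (c === label ? posLabel : negLabel)));
}

// Invert by picking, for each row, the class whose column holds the largest value.
function argmaxDecode(Y: readonly (readonly number[])[], classes: readonly number[]): number[] {
  return Y.map((row) => {
    let best = -1;
    let bestValue = -Infinity;
    row.forEach((v, j) => {
      if (v > bestValue) {
        bestValue = v;
        best = j;
      }
    });
    const label = classes[best];
    if (label === undefined) {
      throw new Error('Row is empty or wider than the class list');
    }
    return label;
  });
}

const sortedClasses = [0, 1, 2];
const yBinSketch = binarize([0, 1, 2, 0, 1], sortedClasses);
console.log(yBinSketch); // [[1,0,0], [0,1,0], [0,0,1], [1,0,0], [0,1,0]]
console.log(binarize([2], sortedClasses, 1, -1)); // [[-1, -1, 1]]
console.log(argmaxDecode(yBinSketch, sortedClasses)); // [0, 1, 2, 0, 1]
```

Because the inverse uses the maximum rather than an exact match, the same decoding also works on rows of scores or probabilities.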

MultiLabelBinarizer

Transform multi-label classification data to binary format. Handles cases where each sample can belong to multiple classes simultaneously.

Time Complexity:
  • fit: O(n*k) where n=samples, k=avg labels per sample
  • transform: O(n*k*c) where c=total unique classes
Space Complexity: O(n*c) for the output matrix

import { MultiLabelBinarizer } from 'deepbox/preprocess';

const y = [['sci-fi', 'action'], ['comedy'], ['action', 'drama']];
const binarizer = new MultiLabelBinarizer();
const yBin = binarizer.fitTransform(y);
// Each row can have multiple 1s

Constructor

new MultiLabelBinarizer(options?: {
  sparse?: boolean;             // Return CSRMatrix if true (default: false)
  sparseOutput?: boolean;       // Alias for sparse
  classes?: Category[];         // Explicit class ordering
})

Methods

fit
(y: Category[][]) => this
Learn all unique classes across all samples.
Parameters:
  • y - Array of label sets (each element is an array of labels)
transform
(y: Category[][]) => Tensor | CSRMatrix
Transform label sets to binary matrix. Each row can have multiple 1s (one per active label).
fitTransform
(y: Category[][]) => Tensor | CSRMatrix
Fit and transform in one step.
inverseTransform
(Y: Tensor | CSRMatrix) => Category[][]
Transform binary matrix back to label sets. Finds all active (1) columns for each row.
Returns: Array of label sets (one per sample)

Attributes

After fitting:
  • classes_ - All unique classes in sorted order
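
The multi-label case can be sketched the same way: collect every label seen in any sample, sort them into columns, and mark each label that is present in a row. The helpers below (`fitClasses`, `toIndicator`, `fromIndicator`) are hypothetical plain-TypeScript illustrations, not deepbox's implementation.

```typescript
// Illustrative sketch only; these helpers are hypothetical, not deepbox APIs.

// Gather all labels across samples and sort them into a column order.
function fitClasses(y: readonly (readonly string[])[]): string[] {
  const seen = new Set<string>();
  y.forEach((labels) => labels.forEach((label) => seen.add(label)));
  return Array.from(seen).sort();
}

// One row per sample, with a 1 in every column whose class is in the label set.
function toIndicator(y: readonly (readonly string[])[], classes: readonly string[]): number[][] {
  return y.map((labels) => classes.map((c) => (labels.indexOf(c) >= 0 ? 1 : 0)));
}

// Invert by collecting the classes of all active columns in each row.
function fromIndicator(Y: readonly (readonly number[])[], classes: readonly string[]): string[][] {
  return Y.map((row) => classes.filter((_, j) => row[j] === 1));
}

const genres = [['sci-fi', 'action'], ['comedy'], ['action', 'drama']];
const genreClasses = fitClasses(genres);
const indicator = toIndicator(genres, genreClasses);
console.log(genreClasses); // ['action', 'comedy', 'drama', 'sci-fi']
console.log(indicator); // [[1,0,0,1], [0,1,0,0], [1,0,1,0]]
console.log(fromIndicator(indicator, genreClasses));
// [['action', 'sci-fi'], ['comedy'], ['action', 'drama']]
```

In this sketch the decoded label sets come back in column order, so the order of labels within a sample is not preserved, only the membership.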

Type Definitions

// Category values can be strings, numbers, or bigints
type Category = string | number | bigint;

// Encoder input types
type EncoderInput1D = Tensor | readonly (string | number | bigint | boolean)[];
type EncoderInput2D = Tensor | readonly (readonly (string | number | bigint)[])[];

Examples

Text Label Encoding

import { LabelEncoder } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const labels = tensor(['positive', 'negative', 'neutral', 'positive', 'negative']);
const encoder = new LabelEncoder();
const encoded = encoder.fitTransform(labels);
// [2, 0, 1, 2, 0] (alphabetically sorted)

const decoded = encoder.inverseTransform(encoded);
// ['positive', 'negative', 'neutral', 'positive', 'negative']

Multi-Feature One-Hot Encoding

import { OneHotEncoder } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const features = tensor([
  ['red', 'small'],
  ['blue', 'large'],
  ['red', 'medium'],
  ['green', 'small']
]);

const encoder = new OneHotEncoder({ sparse: false });
const encoded = encoder.fitTransform(features);
// Shape: [4, 6] - one column per unique value of each feature (3 colors + 3 sizes)

const original = encoder.inverseTransform(encoded);
// Returns original categorical data

Multi-Label Classification

import { MultiLabelBinarizer } from 'deepbox/preprocess';

const movieGenres = [
  ['action', 'sci-fi'],
  ['comedy', 'romance'],
  ['action', 'thriller'],
  ['sci-fi']
];

const binarizer = new MultiLabelBinarizer();
const encoded = binarizer.fitTransform(movieGenres);
// Shape: [4, 5] - columns for: action, comedy, romance, sci-fi, thriller
// Each row can have multiple 1s

const decoded = binarizer.inverseTransform(encoded);
// Returns original label sets

Ordinal Encoding with Unknown Handling

import { OrdinalEncoder } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const sizes = tensor([['S'], ['M'], ['L'], ['XL']]);
const encoder = new OrdinalEncoder({
  handleUnknown: 'useEncodedValue',
  unknownValue: -1
});
encoder.fit(sizes);

const testSizes = tensor([['M'], ['XXL'], ['S']]);
const encoded = encoder.transform(testSizes);
// [[1], [-1], [2]] - sorted categories are L, M, S, XL; 'XXL' is encoded as -1 (unknown)

Sparse Encoding for Memory Efficiency

import { OneHotEncoder } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

// High cardinality categorical data
const userIds = tensor([["user_1"], ["user_2"], ["user_3"]]);

const encoder = new OneHotEncoder({ sparse: true });
const encoded = encoder.fitTransform(userIds);
// Returns CSRMatrix instead of dense tensor
// Much more memory efficient for high cardinality features
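
To show where the memory saving comes from, here is a small sketch of the compressed sparse row (CSR) layout, which stores only the non-zero entries. The field names (`data`, `indices`, `indptr`) follow the common CSR convention and are an assumption; the source does not describe the internal layout of deepbox's CSRMatrix, and `denseToCsr` is a hypothetical helper.

```typescript
// Illustrative sketch only; the Csr shape and denseToCsr are hypothetical.
interface Csr {
  data: number[]; // non-zero values, row by row
  indices: number[]; // column index of each stored value
  indptr: number[]; // row i occupies data[indptr[i] .. indptr[i + 1])
  shape: [number, number];
}

function denseToCsr(dense: readonly (readonly number[])[]): Csr {
  const data: number[] = [];
  const indices: number[] = [];
  const indptr: number[] = [0];
  let nCols = 0;
  dense.forEach((row) => {
    nCols = Math.max(nCols, row.length);
    row.forEach((value, j) => {
      if (value !== 0) {
        data.push(value);
        indices.push(j);
      }
    });
    indptr.push(data.length); // marks where the next row starts
  });
  return { data, indices, indptr, shape: [dense.length, nCols] };
}

// One-hot rows for three distinct user IDs: exactly one non-zero per row.
const csr = denseToCsr([
  [1, 0, 0],
  [0, 1, 0],
  [0, 0, 1],
]);
console.log(csr.data); // [1, 1, 1]
console.log(csr.indices); // [0, 1, 2]
console.log(csr.indptr); // [0, 1, 2, 3]
```

A one-hot matrix for a single feature has n non-zeros out of n * k entries, so storage grows with the number of samples rather than with samples times categories.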

When to Use Each Encoder

LabelEncoder

Use for: Target labels in classification. Creates simple integer mapping [0, n_classes-1].

OneHotEncoder

Use for: Categorical features with no ordinal relationship. Creates binary columns (can return sparse matrices).

OrdinalEncoder

Use for: Categorical features with ordinal relationship (e.g., low/medium/high). Maintains single column per feature.
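
One caveat worth illustrating: if categories are sorted alphabetically, the resulting integers do not necessarily follow the real-world order of the levels. The sketch below uses plain TypeScript with a hypothetical `encodeWithOrder` helper to contrast alphabetical order with an explicit order. The OrdinalEncoder constructor lists a `categories` option for explicit categories per feature; that it fixes the integer order in this way is an assumption.

```typescript
// Illustrative sketch only; encodeWithOrder is a hypothetical helper.

// Encode each value as its position in the given category order.
function encodeWithOrder(values: readonly string[], order: readonly string[]): number[] {
  return values.map((v) => order.indexOf(v));
}

const levels = ['low', 'medium', 'high'];

// Alphabetical order puts 'high' first, so the codes do not reflect rank.
const alphabetical = levels.slice().sort();
console.log(alphabetical); // ['high', 'low', 'medium']
console.log(encodeWithOrder(levels, alphabetical)); // [1, 2, 0]

// An explicit order keeps low < medium < high.
const ranked = ['low', 'medium', 'high'];
console.log(encodeWithOrder(levels, ranked)); // [0, 1, 2]
```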

LabelBinarizer

Use for: Single-label classification targets. Creates binary matrix representation.

MultiLabelBinarizer

Use for: Multi-label classification (samples can have multiple labels). Each row can have multiple active columns.
