Data splitting utilities for creating training and test sets, with support for stratification and cross-validation.
trainTestSplit
Split arrays into random train and test subsets.
import { trainTestSplit } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1, 2], [3, 4], [5, 6], [7, 8]]);
const y = tensor([0, 1, 0, 1]);
const [XTrain, XTest, yTrain, yTest] = trainTestSplit(X, y, { testSize: 0.25 });
Signature
function trainTestSplit(
  X: Tensor,
  y?: Tensor,
  options?: {
    testSize?: number;     // Proportion (0-1) or absolute count
    trainSize?: number;    // Proportion (0-1) or absolute count
    randomState?: number;  // Random seed for reproducibility
    shuffle?: boolean;     // Shuffle before splitting (default: true)
    stratify?: Tensor;     // Stratify split using these labels
  }
): Tensor[]
Parameters
X - Feature matrix (2D tensor).
y - Optional target labels (1D tensor). If provided, returns 4 tensors [XTrain, XTest, yTrain, yTest]; if not provided, returns 2 tensors [XTrain, XTest].
testSize - Size of the test set:
Float (0-1): proportion of the dataset
Integer ≥ 1: absolute number of samples
Default: 0.25 (25% of the data)
trainSize - Size of the training set:
Float (0-1): proportion of the dataset
Integer ≥ 1: absolute number of samples
If not specified, the complement of testSize.
randomState - Random seed for reproducible splits.
shuffle - Whether to shuffle the data before splitting (default: true).
stratify - If provided, the data is split in a stratified fashion, preserving the percentage of samples for each class. Must be a 1D tensor with the same length as X.
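To make the testSize semantics concrete, here is a hypothetical helper (not part of the deepbox API; the flooring behavior is an assumption, and the library may round differently) showing how a proportion or an absolute count resolves to a number of test samples:

```typescript
// Hypothetical sketch, not deepbox's internal code.
// Math.floor rounding is an assumption.
function resolveTestSize(nSamples: number, testSize = 0.25): number {
  if (testSize >= 1) return Math.floor(testSize); // absolute sample count
  return Math.floor(nSamples * testSize);         // proportion of the dataset
}
```

Under this reading, `resolveTestSize(8, 0.25)` yields 2 test samples, while `resolveTestSize(8, 3)` treats 3 as an absolute count.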
Returns
Without y: [XTrain, XTest]
With y: [XTrain, XTest, yTrain, yTest]
Examples
Basic Split
import { trainTestSplit } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1, 2], [3, 4], [5, 6], [7, 8]]);
const [XTrain, XTest] = trainTestSplit(X, undefined, { testSize: 0.5 });
// XTrain: 2 samples, XTest: 2 samples
Stratified Split
const X = tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]]);
const y = tensor([0, 0, 1, 1, 1, 0]);
const [XTrain, XTest, yTrain, yTest] = trainTestSplit(X, y, {
  testSize: 0.5,
  stratify: y // Preserve class distribution
});
// yTrain and yTest will have the same proportion of 0s and 1s as y
Reproducible Split
const [XTrain, XTest, yTrain, yTest] = trainTestSplit(X, y, {
  testSize: 0.3,
  randomState: 42 // Same seed = same split
});
KFold
K-Folds cross-validator.
Provides train/test indices to split data into train/test sets. The dataset is split into k consecutive folds; each fold is used once as the test set while the remaining k - 1 folds form the training set.
import { KFold } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]);
const kfold = new KFold({ nSplits: 5 });
const splits = kfold.split(X);
for (const { trainIndex, testIndex } of splits) {
  // Each fold: trainIndex and testIndex are arrays of sample indices
}
Constructor
new KFold(options?: {
  nSplits?: number;      // Number of folds (default: 5)
  shuffle?: boolean;     // Shuffle before splitting (default: false)
  randomState?: number;  // Random seed if shuffle=true
})
Methods
split
(X: Tensor) => SplitResult[]
Generate train/test indices for k-fold cross-validation.
Parameters:
X - Data tensor (only X.shape[0] is used, for the sample count)
Returns: Array of split objects with { trainIndex, testIndex }
Returns the number of splits/folds.
Example
import { KFold } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]);
const y = tensor([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]);
const kfold = new KFold({ nSplits: 5, shuffle: true, randomState: 42 });
const splits = kfold.split(X);
for (let i = 0; i < splits.length; i++) {
  const { trainIndex, testIndex } = splits[i];
  console.log(`Fold ${i + 1}:`);
  console.log('Train indices:', trainIndex);
  console.log('Test indices:', testIndex);
}
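Fold sizes follow from the sample count alone. As a standalone illustration (assuming the common convention that the first n % k folds receive one extra sample; deepbox's exact assignment may differ), consecutive test-fold indices can be generated like this:

```typescript
// Standalone sketch of consecutive k-fold test indices (no shuffling).
// Assumption: the first n % k folds each get one extra sample.
function kfoldTestIndices(n: number, nSplits: number): number[][] {
  const base = Math.floor(n / nSplits);
  const extra = n % nSplits;
  const folds: number[][] = [];
  let start = 0;
  for (let f = 0; f < nSplits; f++) {
    const size = base + (f < extra ? 1 : 0);
    folds.push(Array.from({ length: size }, (_, i) => start + i));
    start += size;
  }
  return folds;
}
```

For 10 samples and 3 folds this yields test folds of sizes 4, 3, and 3, covering every index exactly once.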
StratifiedKFold
Stratified K-Folds cross-validator.
Provides train/test indices while preserving class distribution in each fold.
import { StratifiedKFold } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1, 2], [3, 4], [5, 6], [7, 8]]);
const y = tensor([0, 0, 1, 1]);
const skfold = new StratifiedKFold({ nSplits: 2 });
const splits = skfold.split(X, y);
Constructor
new StratifiedKFold(options?: {
  nSplits?: number;      // Number of folds (default: 5)
  shuffle?: boolean;     // Shuffle before splitting (default: false)
  randomState?: number;  // Random seed if shuffle=true
})
Methods
split
(X: Tensor, y: Tensor) => SplitResult[]
Generate stratified train/test indices.
Parameters:
X - Data tensor
y - Target labels (1D tensor)
Returns: Array of split objects with { trainIndex, testIndex }
Note: Each class must have at least nSplits samples.
Returns the number of splits/folds.
Example
import { StratifiedKFold } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1], [2], [3], [4], [5], [6], [7], [8]]);
const y = tensor([0, 0, 0, 0, 1, 1, 1, 1]);
const skfold = new StratifiedKFold({ nSplits: 4 });
const splits = skfold.split(X, y);
// Each fold maintains the 50/50 class distribution
for (const { trainIndex, testIndex } of splits) {
  // Use indices to create train/test sets
}
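To see why every fold keeps the class mix, the core idea can be sketched in isolation (an illustrative round-robin assignment, not deepbox's implementation): group the indices by class, then deal each class's indices across the folds in turn.

```typescript
// Illustrative sketch of stratified fold assignment (not deepbox's code):
// indices of each class are dealt round-robin across folds, so every
// fold's class mix approximates the overall distribution.
function stratifiedFolds(y: number[], nSplits: number): number[][] {
  const byClass = new Map<number, number[]>();
  y.forEach((label, i) => {
    const bucket = byClass.get(label) ?? [];
    bucket.push(i);
    byClass.set(label, bucket);
  });
  const folds: number[][] = Array.from({ length: nSplits }, () => []);
  for (const indices of byClass.values()) {
    indices.forEach((idx, k) => folds[k % nSplits].push(idx));
  }
  return folds; // folds[f] = test indices for fold f
}
```

With y = [0, 0, 0, 0, 1, 1, 1, 1] and 4 folds, each fold receives one sample of each class, mirroring the 50/50 distribution.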
GroupKFold
Group K-Fold cross-validator.
Ensures that the same group is not in both training and test sets.
import { GroupKFold } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1], [2], [3], [4], [5], [6]]);
const y = tensor([0, 0, 1, 1, 1, 0]);
const groups = tensor([1, 1, 2, 2, 3, 3]);
const gkfold = new GroupKFold({ nSplits: 3 });
const splits = gkfold.split(X, y, groups);
Constructor
new GroupKFold(options?: {
  nSplits?: number; // Number of folds (default: 5)
})
Methods
split
(X: Tensor, y: Tensor | undefined, groups: Tensor) => SplitResult[]
Generate group-aware train/test indices.
Parameters:
X - Data tensor
y - Target labels (can be undefined)
groups - Group labels (1D tensor, same length as X)
Returns: Array of split objects with { trainIndex, testIndex }
Note: The number of unique groups must be ≥ nSplits.
Returns the number of splits/folds.
Example
import { GroupKFold } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

// Patient data - the same patient should not appear in both train and test
const X = tensor([[1, 2], [3, 4], [5, 6], [7, 8]]);
const y = tensor([0, 1, 0, 1]);
const patientIds = tensor([1, 1, 2, 2]); // Two samples per patient
const gkfold = new GroupKFold({ nSplits: 2 });
const splits = gkfold.split(X, y, patientIds);
// Each fold will have different patients in train vs test
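The guarantee is that all of a group's samples land in exactly one fold, so no group can straddle train and test. A minimal standalone sketch using greedy balancing by group size (one common strategy; deepbox may balance folds differently):

```typescript
// Illustrative sketch of group-to-fold assignment (not deepbox's code).
// Each whole group goes to the currently smallest fold, largest groups first.
function groupFolds(groups: number[], nSplits: number): number[][] {
  const byGroup = new Map<number, number[]>();
  groups.forEach((g, i) => {
    const bucket = byGroup.get(g) ?? [];
    bucket.push(i);
    byGroup.set(g, bucket);
  });
  if (byGroup.size < nSplits) throw new Error('need at least nSplits groups');
  const sorted = [...byGroup.values()].sort((a, b) => b.length - a.length);
  const folds: number[][] = Array.from({ length: nSplits }, () => []);
  for (const idxs of sorted) {
    let smallest = 0;
    for (let f = 1; f < nSplits; f++) {
      if (folds[f].length < folds[smallest].length) smallest = f;
    }
    folds[smallest].push(...idxs); // whole group stays together
  }
  return folds; // folds[f] = test indices for fold f
}
```

Because each group is pushed as a unit, whichever fold serves as the test set contains either all of a group's samples or none of them.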
LeaveOneOut
Leave-One-Out cross-validator.
Each sample is used once as the test set (a singleton) while the remaining samples form the training set.
import { LeaveOneOut } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1], [2], [3], [4]]);
const loo = new LeaveOneOut();
const splits = loo.split(X);
// Generates 4 splits (one per sample)
Methods
split
(X: Tensor) => SplitResult[]
Generate leave-one-out train/test indices.
Parameters:
X - Data tensor
Returns: Array of n splits, where n = number of samples
Returns the number of splits (equal to number of samples).
Example
import { LeaveOneOut } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1, 2], [3, 4], [5, 6]]);
const y = tensor([0, 1, 0]);
const loo = new LeaveOneOut();
const splits = loo.split(X);
for (let i = 0; i < splits.length; i++) {
  const { trainIndex, testIndex } = splits[i];
  console.log(`Iteration ${i + 1}:`);
  console.log('Train:', trainIndex); // All indices except i
  console.log('Test:', testIndex);   // [i]
}
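The index pattern is simple enough to sketch standalone (illustrative only, operating on a sample count rather than a tensor):

```typescript
// Standalone sketch of leave-one-out index generation: split i holds out
// sample i as the test set and trains on everything else.
function leaveOneOutSplits(n: number): { trainIndex: number[]; testIndex: number[] }[] {
  const all = Array.from({ length: n }, (_, j) => j);
  return all.map(i => ({
    trainIndex: all.filter(j => j !== i), // every index except i
    testIndex: [i],                       // the single held-out sample
  }));
}
```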
LeavePOut
Leave-P-Out cross-validator.
Generates all possible train/test splits by leaving out p samples.
import { LeavePOut } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1], [2], [3], [4]]);
const lpo = new LeavePOut(2); // Leave 2 samples out
const splits = lpo.split(X);
// Generates C(4, 2) = 6 splits
Constructor
new LeavePOut(p: number)
Parameters:
p - Number of samples to leave out (must be a positive integer)
Methods
split
(X: Tensor) => SplitResult[]
Generate all leave-p-out train/test combinations.
Parameters:
X - Data tensor
Returns: Array of C(n, p) splits, where n = number of samples
Warning: The number of splits grows combinatorially; it is limited to 100,000 splits for memory safety.
Returns the number of splits: C(n, p) = n! / (p! * (n-p)!)
Example
import { LeavePOut } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1], [2], [3], [4], [5]]);
const lpo = new LeavePOut(2);
const splits = lpo.split(X);
console.log(`Number of splits: ${splits.length}`); // C(5,2) = 10
for (const { trainIndex, testIndex } of splits) {
  console.log('Train:', trainIndex, 'Test:', testIndex);
}
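The split counts quoted here come from the binomial coefficient. A small helper (illustrative, not part of the deepbox API) makes it easy to check whether a dataset would exceed the 100,000-split cap before calling split:

```typescript
// Binomial coefficient C(n, p), computed iteratively to avoid the
// overflow-prone factorial formula.
function choose(n: number, p: number): number {
  if (p < 0 || p > n) return 0;
  let result = 1;
  for (let i = 1; i <= p; i++) {
    result = (result * (n - p + i)) / i; // partial result is always integral
  }
  return Math.round(result);
}
```

For example, choose(50, 5) confirms the 2,118,760 figure above, well past the 100,000-split limit.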
Warning: LeavePOut can generate a very large number of splits:
C(10, 2) = 45
C(20, 2) = 190
C(50, 5) = 2,118,760 ❌ (exceeds limit)
Type Definitions
// Result of a single train/test split
type SplitResult = {
  readonly trainIndex: number[]; // Indices for the training set
  readonly testIndex: number[];  // Indices for the test set
};
Cross-Validation Examples
Model Evaluation with K-Fold
import { KFold } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]);
const y = tensor([0, 1, 0, 1, 0]);
const kfold = new KFold({ nSplits: 5, shuffle: true, randomState: 42 });
const splits = kfold.split(X);

const scores: number[] = [];
for (const { trainIndex, testIndex } of splits) {
  // Extract train/test data using the fold indices
  const XTrain = extractRows(X, trainIndex);
  const yTrain = extractRows(y, trainIndex);
  const XTest = extractRows(X, testIndex);
  const yTest = extractRows(y, testIndex);
  // Train the model and evaluate it on the held-out fold
  // const model = trainModel(XTrain, yTrain);
  // const score = evaluateModel(model, XTest, yTest);
  // scores.push(score);
}
// Once the model code above is filled in and scores are collected:
const avgScore = scores.reduce((a, b) => a + b, 0) / scores.length;
console.log('Average cross-validation score:', avgScore);
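The example above relies on an extractRows helper that is not part of the documented API. On plain nested arrays the idea looks like this (a sketch only; a real version would gather rows from a deepbox Tensor using the library's own indexing operations):

```typescript
// Hypothetical row-gathering helper for plain arrays (not the deepbox API).
// Returns the rows at the given indices, in the order the indices appear.
function extractRows<T>(rows: readonly T[], indices: readonly number[]): T[] {
  return indices.map(i => rows[i]);
}
```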
Stratified Cross-Validation
import { StratifiedKFold } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

// Imbalanced dataset
const X = tensor([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]);
const y = tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 1]); // 70% class 0, 30% class 1
// nSplits cannot exceed the size of the smallest class (3 here)
const skfold = new StratifiedKFold({ nSplits: 3 });
const splits = skfold.split(X, y);
// Each fold approximately maintains the 70/30 class distribution
for (let i = 0; i < splits.length; i++) {
  const { trainIndex, testIndex } = splits[i];
  console.log(`Fold ${i + 1}: ${trainIndex.length} train, ${testIndex.length} test`);
}
Time Series Cross-Validation
import { KFold } from 'deepbox/preprocess';
import { tensor } from 'deepbox/ndarray';

// For ordered data, use KFold with shuffle=false so each fold covers a
// block of consecutive time periods. Note that standard k-fold can still
// place later periods in the training set of an earlier test fold; for
// strict forecasting evaluation, prefer an expanding-window scheme.
const timeSeries = tensor([[1], [2], [3], [4], [5], [6], [7], [8]]);
const kfold = new KFold({ nSplits: 4, shuffle: false });
const splits = kfold.split(timeSeries);
for (const { trainIndex, testIndex } of splits) {
  console.log('Train periods:', trainIndex);
  console.log('Test periods:', testIndex);
}
When to Use Each Splitter
trainTestSplit Use for: Simple train/test splits. Good for large datasets.
Supports stratification and reproducible splits.
KFold Use for: Standard cross-validation. Works well with balanced data.
Each sample appears in test set exactly once.
StratifiedKFold Use for: Imbalanced classification. Preserves class distribution.
Ensures each fold has representative class proportions.
GroupKFold Use for: Data with groups (e.g., patients, users).
Prevents data leakage by keeping groups separate.
LeaveOneOut Use for: Very small datasets. Maximum training data per fold.
Computationally expensive (n splits for n samples).
LeavePOut Use for: Small datasets needing exhaustive testing.
Warning: Combinatorial explosion for large n or p.