The Datasets module provides built-in datasets for machine learning experimentation, along with synthetic data generators and efficient data-loading utilities.

Overview

The datasets module offers:
  • Classic Datasets: Iris, Boston Housing, MNIST-like digits, and more
  • Synthetic Generators: Create custom datasets for classification and regression
  • Data Loaders: Efficient batch loading for training neural networks
  • Preprocessing: Pre-split train/test sets ready for modeling

Key Features

Ready-to-Use

Pre-loaded classic datasets for quick experiments.

Data Generators

Create synthetic data with custom properties.

Efficient Loading

DataLoader for batching and shuffling.

Standardized Format

All datasets return consistent data structures.

Built-in Datasets

Dataset Structure

All dataset loaders return a Dataset object with this structure:
interface Dataset {
  data: Tensor;        // Features (X)
  target: Tensor;      // Labels (y)
  featureNames?: string[];  // Feature names
  targetNames?: string[];   // Class names
  description?: string;     // Dataset description
}
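
Because every loader returns this same shape, one piece of consuming code works across datasets. A minimal sketch (the loop below is illustrative, using the shapes documented later on this page):

import { loadIris, loadDiabetes } from 'deepbox/datasets';

// The same destructuring works for any loader, since each returns { data, target, ... }
for (const [name, ds] of Object.entries({ iris: loadIris(), diabetes: loadDiabetes() })) {
  const [nSamples, nFeatures] = ds.data.shape;
  console.log(`${name}: ${nSamples} samples x ${nFeatures} features`);
  // iris: 150 samples x 4 features
  // diabetes: 442 samples x 10 features
}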

Classification Datasets

Iris Dataset

import { loadIris } from 'deepbox/datasets';

const { data, target, featureNames, targetNames } = loadIris();

console.log(data.shape);      // [150, 4] - 150 samples, 4 features
console.log(target.shape);    // [150] - 3 classes (0, 1, 2)
console.log(featureNames);    // ['sepal length', 'sepal width', ...]
console.log(targetNames);     // ['setosa', 'versicolor', 'virginica']

// Use with ML model
import { LogisticRegression } from 'deepbox/ml';

const model = new LogisticRegression();
model.fit(data, target);

Breast Cancer Dataset

import { loadBreastCancer } from 'deepbox/datasets';

const { data, target, featureNames, targetNames } = loadBreastCancer();

console.log(data.shape);   // [569, 30] - 30 features
console.log(targetNames);  // ['malignant', 'benign']

Digits Dataset

import { loadDigits } from 'deepbox/datasets';

// Handwritten digits (0-9)
const { data, target } = loadDigits();

console.log(data.shape);    // [1797, 64] - 8x8 images flattened
console.log(target.shape);  // [1797] - 10 classes (0-9)

// Reshape the first sample into an 8x8 image for visualization
const image = data.slice([0]).reshape([8, 8]);

Regression Datasets

Diabetes Dataset

import { loadDiabetes } from 'deepbox/datasets';

const { data, target } = loadDiabetes();

console.log(data.shape);    // [442, 10] - 10 features
console.log(target.shape);  // [442] - Disease progression

// Use with regression model
import { LinearRegression } from 'deepbox/ml';

const model = new LinearRegression();
model.fit(data, target);

Housing Dataset

import { loadHousingMini } from 'deepbox/datasets';

const { data, target, featureNames } = loadHousingMini();

console.log(data.shape);   // Housing features
console.log(featureNames); // Feature descriptions

More Datasets

import { 
  loadLinnerud,
  loadSensorStates,
  loadStudentPerformance,
  loadWeatherOutcomes,
  loadFruitQuality,
  loadFlowersExtended,
  loadLeafShapes,
  loadSeedMorphology
} from 'deepbox/datasets';

// Exercise dataset
const linnerud = loadLinnerud();

// Sensor readings
const sensors = loadSensorStates();

// Educational data
const students = loadStudentPerformance();

// Weather classification
const weather = loadWeatherOutcomes();

Synthetic Data Generators

Classification Data

import { makeClassification } from 'deepbox/datasets';

const { data, target } = makeClassification({
  nSamples: 1000,
  nFeatures: 20,
  nInformative: 15,
  nRedundant: 5,
  nClasses: 3,
  nClustersPerClass: 2,
  weights: [0.5, 0.3, 0.2],  // Class distribution
  flipY: 0.01,                // Label noise
  randomState: 42
});

console.log(data.shape);    // [1000, 20]
console.log(target.shape);  // [1000]

Regression Data

import { makeRegression } from 'deepbox/datasets';

const { data, target } = makeRegression({
  nSamples: 1000,
  nFeatures: 10,
  nInformative: 5,
  noise: 10.0,
  bias: 0.0,
  randomState: 42
});

console.log(data.shape);    // [1000, 10]
console.log(target.shape);  // [1000]

Clustering Data

Blobs

import { makeBlobs } from 'deepbox/datasets';

const { data, target, centers } = makeBlobs({
  nSamples: 1000,
  nFeatures: 2,
  centers: 3,
  clusterStd: 1.0,
  randomState: 42
});

// Use with K-Means
import { KMeans } from 'deepbox/ml';
const kmeans = new KMeans({ nClusters: 3 });
kmeans.fit(data);

Moons

import { makeMoons } from 'deepbox/datasets';

// Two interleaving half circles
const { data, target } = makeMoons({
  nSamples: 1000,
  noise: 0.1,
  randomState: 42
});

// Good for testing non-linear classifiers
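
As a quick check, a non-linear model such as the RandomForestClassifier used elsewhere on this page should separate the two moons easily, where a purely linear boundary cannot. A minimal sketch:

import { makeMoons } from 'deepbox/datasets';
import { RandomForestClassifier } from 'deepbox/ml';
import { trainTestSplit } from 'deepbox/preprocess';
import { accuracy } from 'deepbox/metrics';

const moons = makeMoons({ nSamples: 1000, noise: 0.1, randomState: 42 });
const { XTrain, XTest, yTrain, yTest } = trainTestSplit(moons.data, moons.target, {
  testSize: 0.2
});

// A tree ensemble can carve out the curved boundary between the two moons
const forest = new RandomForestClassifier({ nEstimators: 100 });
forest.fit(XTrain, yTrain);
console.log(accuracy(yTest, forest.predict(XTest)));  // typically close to 1.0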

Circles

import { makeCircles } from 'deepbox/datasets';

// Concentric circles
const { data, target } = makeCircles({
  nSamples: 1000,
  noise: 0.05,
  factor: 0.5,  // Scale factor between circles
  randomState: 42
});

Advanced Generators

import { 
  makeGaussianQuantiles,
  loadGaussianIslands,
  loadConcentricRings,
  loadSpiralArms,
  loadMoonsMulti,
  loadPerfectlySeparable
} from 'deepbox/datasets';

// Gaussian quantiles for multi-class
const gaussianData = makeGaussianQuantiles({
  nSamples: 1000,
  nFeatures: 2,
  nClasses: 3,
  randomState: 42
});

// Complex geometric patterns
const islands = loadGaussianIslands();
const rings = loadConcentricRings();
const spirals = loadSpiralArms();
const moons = loadMoonsMulti();

Data Loaders

Basic DataLoader

import { DataLoader } from 'deepbox/datasets';
import { tensor } from 'deepbox/ndarray';

const X = tensor([[...], [...], ...]);  // Features
const y = tensor([...]);                // Labels

// Create data loader
const loader = new DataLoader(X, y, {
  batchSize: 32,
  shuffle: true,
  dropLast: false
});

// Iterate over batches
for (const { input, target } of loader) {
  console.log(input.shape);   // [32, nFeatures]
  console.log(target.shape);  // [32]
  
  // Train model on batch
  const output = model.forward(input);
  const loss = criterion(output, target);
  // ...
}
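
The dropLast option controls what happens when the sample count is not an exact multiple of batchSize. A minimal sketch, assuming dropLast follows the usual convention of discarding the final incomplete batch:

import { DataLoader } from 'deepbox/datasets';
import { tensor } from 'deepbox/ndarray';

// 100 samples with 2 features each, and 100 binary labels
const X = tensor(Array.from({ length: 100 }, (_, i) => [i, i * 2]));
const y = tensor(Array.from({ length: 100 }, (_, i) => i % 2));

const loader = new DataLoader(X, y, {
  batchSize: 32,
  shuffle: false,
  dropLast: true   // assumed to discard the final incomplete batch
});

let batches = 0;
for (const { input } of loader) {
  console.log(input.shape);  // [32, 2] for every full batch
  batches += 1;
}
console.log(batches);  // 3 - the 4 leftover samples are skipped when dropLast is true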

Training with DataLoader

import { DataLoader } from 'deepbox/datasets';
import { trainTestSplit } from 'deepbox/preprocess';
import { Sequential, Linear, ReLU } from 'deepbox/nn';
import { Adam } from 'deepbox/optim';
import { crossEntropyLoss } from 'deepbox/nn';

// Load and split data
const { data, target } = loadIris();
const { XTrain, XTest, yTrain, yTest } = trainTestSplit(data, target, {
  testSize: 0.2
});

// Create data loaders
const trainLoader = new DataLoader(XTrain, yTrain, {
  batchSize: 16,
  shuffle: true
});

const testLoader = new DataLoader(XTest, yTest, {
  batchSize: 16,
  shuffle: false
});

// Define model
const model = new Sequential([
  new Linear(4, 10),
  new ReLU(),
  new Linear(10, 3)
]);

const optimizer = new Adam(model.parameters(), { lr: 0.01 });

// Training loop
for (let epoch = 0; epoch < 100; epoch++) {
  model.train();
  
  for (const { input, target } of trainLoader) {
    optimizer.zeroGrad();
    
    const output = model.forward(input);
    const loss = crossEntropyLoss(output, target);
    
    loss.backward();
    optimizer.step();
  }
  
  // Validation
  model.eval();
  let correct = 0;
  let total = 0;
  
  for (const { input, target } of testLoader) {
    const output = model.forward(input);
    const predicted = output.argmax(1);
    
    total += target.size;
    correct += predicted.equal(target).sum().item();
  }
  
  const accuracy = correct / total;
  console.log(`Epoch ${epoch}: Accuracy = ${(accuracy * 100).toFixed(2)}%`);
}

Use Cases

Test algorithms on standard datasets:
import { loadIris } from 'deepbox/datasets';
import { RandomForestClassifier } from 'deepbox/ml';
import { accuracy } from 'deepbox/metrics';
import { trainTestSplit } from 'deepbox/preprocess';

const { data, target } = loadIris();
const { XTrain, XTest, yTrain, yTest } = trainTestSplit(data, target);

const model = new RandomForestClassifier();
model.fit(XTrain, yTrain);

const yPred = model.predict(XTest);
const acc = accuracy(yTest, yPred);
Compare models on the same dataset:
import { loadBreastCancer } from 'deepbox/datasets';
import { LogisticRegression, RandomForestClassifier } from 'deepbox/ml';
import { trainTestSplit } from 'deepbox/preprocess';
import { accuracy } from 'deepbox/metrics';

const { data, target } = loadBreastCancer();
const { XTrain, XTest, yTrain, yTest } = trainTestSplit(data, target, { testSize: 0.2 });

const models = [
  new LogisticRegression(),
  new RandomForestClassifier({ nEstimators: 100 })
];

for (const model of models) {
  // Train and evaluate each model on the same train/test split
  model.fit(XTrain, yTrain);
  const acc = accuracy(yTest, model.predict(XTest));
  console.log(`${model.constructor.name}: ${(acc * 100).toFixed(2)}%`);
}
Generate data with known properties:
import { makeClassification } from 'deepbox/datasets';

// Easy problem (high separability)
const easy = makeClassification({
  nSamples: 1000,
  nFeatures: 20,
  nInformative: 18,
  nRedundant: 2,
  nClasses: 2,
  randomState: 42
});

// Hard problem (low separability, noise)
const hard = makeClassification({
  nSamples: 1000,
  nFeatures: 20,
  nInformative: 5,
  nRedundant: 10,
  nClasses: 5,
  flipY: 0.1,  // 10% label noise
  randomState: 42
});
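
To confirm the difficulty gap, the same model can be scored on both datasets. A rough sketch reusing trainTestSplit, LogisticRegression, and accuracy from the examples above (the evaluate helper is illustrative, not part of the library):

import { trainTestSplit } from 'deepbox/preprocess';
import { LogisticRegression } from 'deepbox/ml';
import { accuracy } from 'deepbox/metrics';

// Hypothetical helper: fit on a train split, score on the held-out split
const evaluate = (ds: ReturnType<typeof makeClassification>) => {
  const { XTrain, XTest, yTrain, yTest } = trainTestSplit(ds.data, ds.target, { testSize: 0.2 });
  const model = new LogisticRegression();
  model.fit(XTrain, yTrain);
  return accuracy(yTest, model.predict(XTest));
};

console.log(evaluate(easy));  // expect a high score
console.log(evaluate(hard));  // expect a noticeably lower score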

Dataset Catalog

Classification

  • loadIris() - 3 classes, 150 samples, 4 features
  • loadBreastCancer() - 2 classes, 569 samples, 30 features
  • loadDigits() - 10 classes, 1797 samples, 64 features (8x8 images)
  • loadWeatherOutcomes() - Weather classification
  • loadSensorStates() - Sensor reading classification

Regression

  • loadDiabetes() - 442 samples, 10 features
  • loadHousingMini() - Housing price prediction
  • loadLinnerud() - Exercise physiological data
  • loadStudentPerformance() - Educational data

Clustering & Synthetic

  • makeBlobs() - Isotropic Gaussian blobs
  • makeMoons() - Two interleaving half circles
  • makeCircles() - Concentric circles
  • makeClassification() - General classification
  • makeRegression() - General regression

Best Practices

  • Use the randomState parameter in data generators for reproducible results (see the sketch after this list).
  • Start with small, well-understood datasets like Iris or Digits before moving to larger datasets.
  • Use synthetic data generators to create datasets with specific properties for testing algorithms.
  • Always split data into train/test sets before evaluation; never test on training data.
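
For instance, the same randomState should make a generator deterministic. A minimal sketch (assuming omitted generator options fall back to library defaults):

import { makeClassification } from 'deepbox/datasets';

// Two calls with the same seed should produce identical data and targets,
// so experiments can be rerun and compared exactly.
const first = makeClassification({ nSamples: 200, nFeatures: 10, randomState: 7 });
const second = makeClassification({ nSamples: 200, nFeatures: 10, randomState: 7 });

console.log(first.data.shape);   // [200, 10]
console.log(second.data.shape);  // [200, 10] - same values as `first`
// Omit randomState (or change it) to get a different dataset on each call.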

Related Modules

Preprocessing

Split and scale dataset features

Machine Learning

Train models on datasets

Plotting

Visualize dataset distributions

Learn More

API Reference

Complete API documentation

Examples

Dataset usage examples
