Optimizers update model parameters based on computed gradients during training. Deepbox provides implementations of popular optimization algorithms: SGD, Adam, AdamW, Nadam, RMSprop, Adagrad, and AdaDelta.

Base Optimizer

All optimizers extend the abstract Optimizer base class, which provides common functionality for parameter management, gradient zeroing, and state persistence.

Common Methods

All optimizers implement these methods:
  • step(closure?) - Perform a single optimization step (parameter update)
  • zeroGrad() - Zero out all parameter gradients
  • addParamGroup(group) - Add a new parameter group with custom hyperparameters
  • stateDict() - Get optimizer state for checkpointing
  • loadStateDict(state) - Load optimizer state from checkpoint
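
For example, stateDict() and loadStateDict() support a checkpointing round-trip. A minimal sketch (whether the returned state is directly JSON-serializable is an assumption; adapt the serialization to your storage format):

import { SGD } from 'deepbox/optim';

const optimizer = new SGD(model.parameters(), { lr: 0.01 });

// Save optimizer state alongside your model checkpoint.
const saved = JSON.stringify(optimizer.stateDict());

// ...later, after rebuilding the model and an identical optimizer...
optimizer.loadStateDict(JSON.parse(saved));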

SGD

Stochastic Gradient Descent optimizer with optional momentum, weight decay, and Nesterov acceleration.

import { SGD } from 'deepbox/optim';

const optimizer = new SGD(model.parameters(), {
  lr: 0.01,
  momentum: 0.9,
  weightDecay: 5e-4,
  nesterov: true
});
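
For intuition, one momentum step in scalar form looks like this (a sketch of the standard SGD-with-momentum formulas, not Deepbox's internals):

// One SGD step for a single scalar parameter. With nesterov: true the
// update adds a momentum-scaled look-ahead on top of the velocity.
function sgdStep(
  theta: number, grad: number, velocity: number,
  lr = 0.01, momentum = 0.9, weightDecay = 5e-4, nesterov = true
): { theta: number; velocity: number } {
  const g = grad + weightDecay * theta; // L2 weight decay enters the gradient
  velocity = momentum * velocity + g;
  const step = nesterov ? g + momentum * velocity : velocity;
  return { theta: theta - lr * step, velocity };
}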

Constructor

params: Iterable<GradTensor> | ReadonlyArray<ParamGroup<SGDOptions>> (required)
  Model parameters to optimize, or array of parameter groups with per-group options
options: object
  Optimization hyperparameters

Training Loop Example

const optimizer = new SGD(model.parameters(), {
  lr: 0.01,
  momentum: 0.9,
  weightDecay: 5e-4
});

for (let epoch = 0; epoch < numEpochs; epoch++) {
  for (const [inputs, targets] of dataLoader) {
    optimizer.zeroGrad();
    const outputs = model.forward(inputs);
    const loss = criterion(outputs, targets);
    loss.backward();
    optimizer.step();
  }
}

Methods

getLearningRate: (groupIdx?: number) => number
  Get the current learning rate for a parameter group (default: group 0)
setLearningRate: (lr: number) => void
  Set the learning rate for all parameter groups
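
These accessors make manual learning-rate schedules straightforward, e.g. a step decay inside the training loop:

for (let epoch = 0; epoch < numEpochs; epoch++) {
  // Decay the learning rate by 10x every 30 epochs.
  if (epoch > 0 && epoch % 30 === 0) {
    optimizer.setLearningRate(optimizer.getLearningRate() * 0.1);
  }
  // ...batch loop as in the training example above...
}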

Adam

Adaptive Moment Estimation optimizer that computes adaptive learning rates for each parameter using running averages of gradients and their squared values.

import { Adam } from 'deepbox/optim';

const optimizer = new Adam(model.parameters(), {
  lr: 0.001,
  beta1: 0.9,
  beta2: 0.999
});
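
For intuition, one Adam step in scalar form looks like this (a sketch of the standard Adam formulas, not Deepbox's internals):

// One Adam step for a single scalar parameter; t is the 1-based step
// count. m and v are running averages of the gradient and squared
// gradient, bias-corrected before the update.
function adamStep(
  theta: number, grad: number, m: number, v: number, t: number,
  lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8
): { theta: number; m: number; v: number } {
  m = beta1 * m + (1 - beta1) * grad;
  v = beta2 * v + (1 - beta2) * grad * grad;
  const mHat = m / (1 - Math.pow(beta1, t)); // bias correction
  const vHat = v / (1 - Math.pow(beta2, t));
  return { theta: theta - (lr * mHat) / (Math.sqrt(vHat) + eps), m, v };
}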

Constructor

params: Iterable<GradTensor> | ReadonlyArray<ParamGroup<AdamOptions>> (required)
  Model parameters to optimize, or array of parameter groups
options: object
  Optimization hyperparameters

Example with Parameter Groups

const optimizer = new Adam([
  { params: model.backbone.parameters(), lr: 0.001 },
  { params: model.classifier.parameters(), lr: 0.01 }
], { beta1: 0.9, beta2: 0.999 });
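
Each group keeps its own learning rate, which can be read back by group index right after construction:

console.log(optimizer.getLearningRate(0)); // 0.001 (backbone group)
console.log(optimizer.getLearningRate(1)); // 0.01  (classifier group)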

Methods

getLearningRate: (groupIdx?: number) => number
  Get the current learning rate for a parameter group (default: group 0)
setLearningRate: (lr: number) => void
  Set the learning rate for all parameter groups

AdamW

Adam with decoupled weight decay. Fixes Adam's weight decay by applying it directly to the parameters rather than including it in the gradient-based update. This leads to better generalization and is the recommended variant for most applications.

import { AdamW } from 'deepbox/optim';

const optimizer = new AdamW(model.parameters(), {
  lr: 0.001,
  weightDecay: 0.01  // Typical value for AdamW
});
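
The decoupling is visible in scalar form (a sketch of the standard AdamW formula, not Deepbox's internals; adamUpdate stands for the bias-corrected mHat / (sqrt(vHat) + eps) term from the Adam step above):

// One AdamW step: the decay term acts on the parameter directly and
// never flows through the m/v moment estimates.
function adamWStep(
  theta: number, adamUpdate: number, lr = 0.001, weightDecay = 0.01
): number {
  return theta - lr * (adamUpdate + weightDecay * theta);
}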

Constructor

params: Iterable<GradTensor> | ReadonlyArray<ParamGroup<AdamWOptions>> (required)
  Model parameters to optimize
options: object
  Optimization hyperparameters

Training Example

const optimizer = new AdamW(model.parameters(), {
  lr: 0.001,
  weightDecay: 0.01,
  beta1: 0.9,
  beta2: 0.999
});

for (let epoch = 0; epoch < numEpochs; epoch++) {
  optimizer.zeroGrad();
  const loss = computeLoss();
  loss.backward();
  optimizer.step();
}

Methods

getLearningRate: (groupIdx?: number) => number
  Get the current learning rate for a parameter group (default: group 0)
setLearningRate: (lr: number) => void
  Set the learning rate for all parameter groups

Nadam

Nesterov-accelerated Adam optimizer. Combines Adam’s adaptive learning rates with Nesterov momentum for potentially faster convergence by applying “look-ahead” gradients.

import { Nadam } from 'deepbox/optim';

const optimizer = new Nadam(model.parameters(), {
  lr: 0.002,
  beta1: 0.9,
  beta2: 0.999
});

Constructor

params: Iterable<GradTensor> | ReadonlyArray<ParamGroup<NadamOptions>> (required)
  Model parameters to optimize
options: object
  Optimization hyperparameters
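
Training Example

Usage follows the same pattern as the other optimizers (computeLoss() is a placeholder, as in the examples above):

for (let epoch = 0; epoch < numEpochs; epoch++) {
  optimizer.zeroGrad();
  const loss = computeLoss();
  loss.backward();
  optimizer.step();
}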

Methods

getLearningRate: (groupIdx?: number) => number
  Get the current learning rate for a parameter group (default: group 0)
setLearningRate: (lr: number) => void
  Set the learning rate for all parameter groups

RMSprop

Root Mean Square Propagation optimizer. Adapts the learning rate for each parameter by dividing by a running average of recent gradient magnitudes. Particularly effective for RNNs and non-stationary objectives.

import { RMSprop } from 'deepbox/optim';

const optimizer = new RMSprop(model.parameters(), {
  lr: 0.01,
  alpha: 0.99,
  momentum: 0.9,
  centered: true
});
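
The centered: true option subtracts a running mean of the gradients, so the step is scaled by an estimate of the gradient variance rather than the raw second moment. A scalar sketch of the standard formulas (not Deepbox's internals):

// Accumulators for one parameter: sqAvg tracks E[g^2], gradAvg tracks
// E[g]; the denominator sqrt(E[g^2] - E[g]^2) + eps approximates the
// gradient's standard deviation.
function centeredDenominator(
  sqAvg: number, gradAvg: number, grad: number, alpha = 0.99, eps = 1e-8
): { sqAvg: number; gradAvg: number; denom: number } {
  sqAvg = alpha * sqAvg + (1 - alpha) * grad * grad;
  gradAvg = alpha * gradAvg + (1 - alpha) * grad;
  const denom = Math.sqrt(sqAvg - gradAvg * gradAvg) + eps;
  return { sqAvg, gradAvg, denom };
}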

Constructor

params: Iterable<GradTensor> | ReadonlyArray<ParamGroup<RMSpropOptions>> (required)
  Model parameters to optimize
options: object
  Optimization hyperparameters

Training Example

const optimizer = new RMSprop(model.parameters(), {
  lr: 0.01,
  alpha: 0.99,
  momentum: 0.9,
  centered: true
});

for (let epoch = 0; epoch < numEpochs; epoch++) {
  optimizer.zeroGrad();
  const loss = computeLoss();
  loss.backward();
  optimizer.step();
}

Methods

getLearningRate: (groupIdx?: number) => number
  Get the current learning rate for a parameter group (default: group 0)
setLearningRate: (lr: number) => void
  Set the learning rate for all parameter groups

Adagrad

Adaptive Gradient Algorithm. Adapts the learning rate for each parameter based on the historical sum of squared gradients. Parameters with larger gradients receive smaller effective learning rates.

import { Adagrad } from 'deepbox/optim';

const optimizer = new Adagrad(model.parameters(), {
  lr: 0.01,
  eps: 1e-10
});
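
For intuition, the scalar update looks like this (standard Adagrad formula, not Deepbox's internals); because the accumulator only grows, the effective learning rate shrinks monotonically:

// One Adagrad step for a single scalar parameter.
function adagradStep(
  theta: number, grad: number, sumSq: number, lr = 0.01, eps = 1e-10
): { theta: number; sumSq: number } {
  sumSq += grad * grad; // lifetime accumulation, never decays
  return { theta: theta - (lr * grad) / (Math.sqrt(sumSq) + eps), sumSq };
}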

Constructor

params: Iterable<GradTensor> | ReadonlyArray<ParamGroup<AdagradOptions>> (required)
  Model parameters to optimize
options: object
  Optimization hyperparameters

Training Example

const optimizer = new Adagrad(model.parameters(), {
  lr: 0.01,
  eps: 1e-10
});

for (let epoch = 0; epoch < numEpochs; epoch++) {
  optimizer.zeroGrad();
  const loss = computeLoss();
  loss.backward();
  optimizer.step();
}

Methods

getLearningRate: (groupIdx?: number) => number
  Get the current learning rate for a parameter group (default: group 0)
setLearningRate: (lr: number) => void
  Set the learning rate for all parameter groups

AdaDelta

Adaptive learning rate method that seeks to reduce Adagrad’s aggressive, monotonically decreasing learning rate. Uses a moving window of gradient updates rather than accumulating all past gradients.

import { AdaDelta } from 'deepbox/optim';

const optimizer = new AdaDelta(model.parameters(), {
  lr: 1.0,
  rho: 0.9,
  eps: 1e-6
});
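
The moving-window behavior comes from two exponential averages, one over squared gradients and one over squared updates, which is why the method needs little learning-rate tuning. A scalar sketch of the standard AdaDelta formulas (not Deepbox's internals):

// One AdaDelta step for a single scalar parameter. rho controls the
// decay of both running averages; lr = 1.0 just rescales the step.
function adaDeltaStep(
  theta: number, grad: number, accGrad: number, accDelta: number,
  lr = 1.0, rho = 0.9, eps = 1e-6
): { theta: number; accGrad: number; accDelta: number } {
  accGrad = rho * accGrad + (1 - rho) * grad * grad;
  const delta = (Math.sqrt(accDelta + eps) / Math.sqrt(accGrad + eps)) * grad;
  accDelta = rho * accDelta + (1 - rho) * delta * delta;
  return { theta: theta - lr * delta, accGrad, accDelta };
}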

Constructor

params: Iterable<GradTensor> | ReadonlyArray<ParamGroup<AdaDeltaOptions>> (required)
  Model parameters to optimize
options: object
  Optimization hyperparameters

Training Example

const optimizer = new AdaDelta(model.parameters(), {
  lr: 1.0,
  rho: 0.9,
  eps: 1e-6
});

for (let epoch = 0; epoch < numEpochs; epoch++) {
  optimizer.zeroGrad();
  const loss = computeLoss();
  loss.backward();
  optimizer.step();
}

Methods

getLearningRate: (groupIdx?: number) => number
  Get the current learning rate for a parameter group (default: group 0)
setLearningRate: (lr: number) => void
  Set the learning rate for all parameter groups

Choosing an Optimizer

  • SGD with momentum - Classic choice, works well with proper tuning, good for fine-tuning
  • Adam - Good default choice, works well out-of-the-box for many problems
  • AdamW - Preferred over Adam for better generalization, especially for transformers
  • Nadam - Try when Adam plateaus, combines adaptive learning with Nesterov momentum
  • RMSprop - Good for RNNs and online learning scenarios
  • Adagrad - Useful for sparse data and features
  • AdaDelta - Alternative to Adagrad that largely removes the need to tune the learning rate (lr=1.0 typically works)
