The Optimization (optim) module provides gradient-based optimization algorithms and learning rate scheduling for training neural networks. It includes popular optimizers such as SGD, Adam, and RMSprop, along with a variety of learning rate schedulers.
Overview
The optim module offers:
Optimizers: SGD, Adam, AdamW, RMSprop, Adagrad, Adadelta, Nadam
Learning Rate Schedulers: Step, exponential, cosine annealing, plateau-based
Parameter Groups: Different learning rates for different layers
State Management: Save and restore optimizer state
Key Features
PyTorch-Compatible API: Familiar optimizer interface with parameter groups.
Adaptive Methods: Adam, AdamW, and RMSprop for faster convergence.
LR Scheduling: Dynamic learning rate adjustment during training.
State Persistence: Save and restore optimizer state for checkpointing.
Optimizers
Stochastic Gradient Descent (SGD)
import { SGD } from 'deepbox/optim';
import { Sequential, Linear, ReLU } from 'deepbox/nn';

const model = new Sequential([
  new Linear(10, 20),
  new ReLU(),
  new Linear(20, 1)
]);

// Basic SGD
const optimizer = new SGD(model.parameters(), {
  lr: 0.01,
  momentum: 0.9,
  dampening: 0,
  weightDecay: 0,
  nesterov: false
});

// Training step
optimizer.zeroGrad();
const loss = computeLoss();
loss.backward();
optimizer.step();
Adam Optimizer
import { Adam } from 'deepbox/optim';

// Adam with default parameters
const optimizer = new Adam(model.parameters(), {
  lr: 0.001,
  betas: [0.9, 0.999],
  eps: 1e-8,
  weightDecay: 0
});

// Training loop
for (let epoch = 0; epoch < 100; epoch++) {
  for (const batch of dataLoader) {
    optimizer.zeroGrad();
    const output = model.forward(batch.input);
    const loss = criterion(output, batch.target);
    loss.backward();
    optimizer.step();
  }
}
AdamW (Adam with Decoupled Weight Decay)
import { AdamW } from 'deepbox/optim';

// AdamW for better weight decay
const optimizer = new AdamW(model.parameters(), {
  lr: 0.001,
  betas: [0.9, 0.999],
  eps: 1e-8,
  weightDecay: 0.01 // Decoupled weight decay
});
RMSprop
import { RMSprop } from 'deepbox/optim';

const optimizer = new RMSprop(model.parameters(), {
  lr: 0.01,
  alpha: 0.99,
  eps: 1e-8,
  weightDecay: 0,
  momentum: 0
});
Adagrad
import { Adagrad } from 'deepbox/optim';

const optimizer = new Adagrad(model.parameters(), {
  lr: 0.01,
  lrDecay: 0,
  weightDecay: 0,
  eps: 1e-10
});
Adadelta
import { AdaDelta } from 'deepbox/optim';

const optimizer = new AdaDelta(model.parameters(), {
  lr: 1.0,
  rho: 0.9,
  eps: 1e-6,
  weightDecay: 0
});
Nadam
import { Nadam } from 'deepbox/optim';

// Nesterov-accelerated Adam
const optimizer = new Nadam(model.parameters(), {
  lr: 0.002,
  betas: [0.9, 0.999],
  eps: 1e-8,
  weightDecay: 0
});
Parameter Groups
Train different parts of the model with different hyperparameters:
import { Adam } from 'deepbox/optim';

const model = new MyModel();

// Different learning rates for different layers
const optimizer = new Adam([
  {
    params: model.backbone.parameters(),
    lr: 0.0001 // Lower LR for pretrained backbone
  },
  {
    params: model.head.parameters(),
    lr: 0.001 // Higher LR for new head
  }
], {
  lr: 0.001 // Default LR
});
Learning Rate Schedulers
Step LR
import { SGD, StepLR } from 'deepbox/optim';

const optimizer = new SGD(model.parameters(), { lr: 0.1 });

// Decay LR by gamma every stepSize epochs
const scheduler = new StepLR(optimizer, {
  stepSize: 30,
  gamma: 0.1
});

// Training loop
for (let epoch = 0; epoch < 100; epoch++) {
  train(/* ... */);
  scheduler.step(); // Update learning rate
}
Multi-Step LR
import { MultiStepLR } from 'deepbox/optim';

// Decay at specific epochs
const scheduler = new MultiStepLR(optimizer, {
  milestones: [30, 60, 90],
  gamma: 0.1
});
Exponential LR
import { ExponentialLR } from 'deepbox/optim';

// Exponential decay
const scheduler = new ExponentialLR(optimizer, {
  gamma: 0.95 // LR *= 0.95 each epoch
});
Cosine Annealing
import { CosineAnnealingLR } from 'deepbox/optim';

// Cosine annealing schedule
const scheduler = new CosineAnnealingLR(optimizer, {
  tMax: 100, // Period of cosine cycle
  etaMin: 0.0001 // Minimum learning rate
});
Reduce LR on Plateau
import { ReduceLROnPlateau } from 'deepbox/optim';

// Reduce LR when metric plateaus
const scheduler = new ReduceLROnPlateau(optimizer, {
  mode: 'min',       // Minimize metric
  factor: 0.1,       // Reduce by 10x
  patience: 10,      // Wait 10 epochs
  threshold: 0.0001, // Minimum improvement
  cooldown: 0,
  minLr: 0.00001
});

// After each validation
for (let epoch = 0; epoch < 100; epoch++) {
  train(/* ... */);
  const valLoss = validate(/* ... */);
  scheduler.step(valLoss); // Pass metric
}
One Cycle LR
import { OneCycleLR } from 'deepbox/optim';

// One-cycle policy (for super-convergence)
const scheduler = new OneCycleLR(optimizer, {
  maxLr: 0.1,
  totalSteps: 1000,
  pctStart: 0.3, // Warmup for 30% of steps
  annealStrategy: 'cos',
  divFactor: 25.0, // Initial LR = maxLr / divFactor
  finalDivFactor: 10000.0
});

// Call after each batch
for (let step = 0; step < 1000; step++) {
  trainBatch(/* ... */);
  scheduler.step();
}
Warmup LR
import { WarmupLR } from 'deepbox/optim';

// Linear warmup
const scheduler = new WarmupLR(optimizer, {
  warmupSteps: 1000,
  warmupFactor: 0.1 // Start at 10% of base LR
});
Linear LR
import { LinearLR } from 'deepbox/optim';

// Linear interpolation
const scheduler = new LinearLR(optimizer, {
  startFactor: 0.1,
  endFactor: 1.0,
  totalIters: 100
});
Complete Training Example
import { Sequential, Linear, ReLU, Dropout, crossEntropyLoss } from 'deepbox/nn';
import { Adam, CosineAnnealingLR } from 'deepbox/optim';
import { GradTensor } from 'deepbox/ndarray';

// Define model
const model = new Sequential([
  new Linear(784, 256),
  new ReLU(),
  new Dropout(0.2),
  new Linear(256, 128),
  new ReLU(),
  new Dropout(0.2),
  new Linear(128, 10)
]);

// Create optimizer
const optimizer = new Adam(model.parameters(), {
  lr: 0.001,
  weightDecay: 1e-4
});

// Create scheduler
const scheduler = new CosineAnnealingLR(optimizer, {
  tMax: 100,
  etaMin: 1e-6
});

// Training loop
const numEpochs = 100;
for (let epoch = 0; epoch < numEpochs; epoch++) {
  let totalLoss = 0;
  let numBatches = 0;

  // Training phase
  model.train();
  for (const { images, labels } of trainLoader) {
    // Zero gradients
    optimizer.zeroGrad();

    // Forward pass
    const output = model.forward(images);
    const loss = crossEntropyLoss(output, labels);

    // Backward pass
    loss.backward();

    // Update weights
    optimizer.step();

    totalLoss += loss.item();
    numBatches++;
  }

  // Update learning rate
  scheduler.step();

  const avgLoss = totalLoss / numBatches;
  console.log(`Epoch ${epoch}: Loss = ${avgLoss.toFixed(4)}, LR = ${scheduler.getLr()}`);

  // Validation phase
  model.eval();
  let correct = 0;
  let total = 0;
  for (const { images, labels } of valLoader) {
    const output = model.forward(images);
    const predicted = output.argmax(1);
    total += labels.size;
    correct += predicted.equal(labels).sum().item();
  }

  const accuracy = correct / total;
  console.log(`Validation Accuracy: ${(accuracy * 100).toFixed(2)}%`);
}
State Management
Save Optimizer State
import { Adam } from 'deepbox/optim';

const optimizer = new Adam(model.parameters(), { lr: 0.001 });

// Train for a while...

// Save state
const state = optimizer.stateDict();
const checkpoint = {
  epoch: 50,
  modelState: model.stateDict(),
  optimizerState: state
};

// Save to disk
await saveCheckpoint(checkpoint, 'checkpoint.json');
Load Optimizer State
// Load checkpoint
const checkpoint = await loadCheckpoint('checkpoint.json');

// Restore model
model.loadStateDict(checkpoint.modelState);

// Restore optimizer
const optimizer = new Adam(model.parameters(), { lr: 0.001 });
optimizer.loadStateDict(checkpoint.optimizerState);

// Continue training from epoch 50
for (let epoch = checkpoint.epoch; epoch < 100; epoch++) {
  // ...
}
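The saveCheckpoint and loadCheckpoint helpers used above are not part of deepbox; they stand in for whatever persistence layer you use. In Node.js, a minimal sketch (assuming the objects returned by stateDict() are plain, JSON-serializable data) could look like this:

import { writeFile, readFile } from 'node:fs/promises';

// Hypothetical helpers: persist a checkpoint object as JSON on disk
async function saveCheckpoint(checkpoint: unknown, path: string): Promise<void> {
  await writeFile(path, JSON.stringify(checkpoint));
}

async function loadCheckpoint(path: string): Promise<any> {
  return JSON.parse(await readFile(path, 'utf8'));
}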
Optimizer Selection Guide
Use SGD with momentum when:
You have a large dataset
You want better generalization
You’re training on a proven architecture
You can afford careful LR tuning
const optimizer = new SGD(model.parameters(), {
  lr: 0.1,
  momentum: 0.9,
  nesterov: true
});
Use Adam when:
You want fast convergence
You’re prototyping or experimenting
You have sparse gradients
Default hyperparameters work well
const optimizer = new Adam(model.parameters(), {
  lr: 0.001 // Usually works well
});
Use AdamW when:
You want weight decay that is properly decoupled from the adaptive update
Training transformers or large models
Adam is overfitting
const optimizer = new AdamW(model.parameters(), {
  lr: 0.001,
  weightDecay: 0.01 // Decoupled weight decay
});
Use RMSprop when:
Training recurrent neural networks
You have non-stationary objectives
Adam is unstable
const optimizer = new RMSprop(model.parameters(), {
  lr: 0.001,
  alpha: 0.99
});
Best Practices
Start with Adam for initial experiments. It’s robust and requires minimal tuning.
Use learning rate schedulers to improve convergence. Start with ReduceLROnPlateau or CosineAnnealingLR.
For fine-tuning, use smaller learning rates (1e-5 to 1e-4) and AdamW with weight decay; see the sketch after this list.
Always call optimizer.zeroGrad() before each backward pass to clear previous gradients.
When using parameter groups, ensure all parameters are included in exactly one group.
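Putting several of these recommendations together, here is a minimal fine-tuning sketch. It assumes a model with backbone and head submodules (as in the parameter-group example above) and that AdamW accepts parameter groups the same way Adam does; computeLoss and dataLoader stand in for your own loss and data pipeline:

import { AdamW, CosineAnnealingLR } from 'deepbox/optim';

// Small LR for the pretrained backbone, larger LR for the new head,
// decoupled weight decay, and a cosine schedule on top
const optimizer = new AdamW([
  { params: model.backbone.parameters(), lr: 1e-5 },
  { params: model.head.parameters(), lr: 1e-4 }
], {
  lr: 1e-4,
  weightDecay: 0.01
});

const scheduler = new CosineAnnealingLR(optimizer, { tMax: 10, etaMin: 1e-6 });

for (let epoch = 0; epoch < 10; epoch++) {
  for (const batch of dataLoader) {
    optimizer.zeroGrad(); // Clear previous gradients before each backward pass
    const loss = computeLoss(batch);
    loss.backward();
    optimizer.step();
  }
  scheduler.step(); // One scheduler step per epoch
}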
Common Patterns
Gradient Clipping
import { Adam } from 'deepbox/optim';

const optimizer = new Adam(model.parameters(), { lr: 0.001 });

for (const batch of dataLoader) {
  optimizer.zeroGrad();
  const loss = computeLoss(batch);
  loss.backward();

  // Clip gradients to prevent exploding gradients
  for (const param of model.parameters()) {
    if (param.grad) {
      param.grad.clipByNorm(1.0);
    }
  }

  optimizer.step();
}
Learning Rate Warmup + Decay
import { Adam, WarmupLR, CosineAnnealingLR } from 'deepbox/optim';

const optimizer = new Adam(model.parameters(), { lr: 0.001 });

// Warmup for first 1000 steps
const warmup = new WarmupLR(optimizer, { warmupSteps: 1000 });

// Then cosine decay
const scheduler = new CosineAnnealingLR(optimizer, { tMax: 10000 });

for (let step = 0; step < 11000; step++) {
  trainStep(/* ... */);
  if (step < 1000) {
    warmup.step();
  } else {
    scheduler.step();
  }
}
Learn More
Neural Networks: Layers and models to optimize
NDArray: GradTensor and automatic differentiation
Metrics: Track training progress
API Reference: Complete API documentation
Training Guide: Learn optimization techniques