Overview
The torch.optim package provides implementations of various optimization algorithms commonly used for training neural networks. All optimizers inherit from torch.optim.Optimizer and implement a step() method that updates the parameters.
Basic Usage
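The usage example for this section did not survive extraction; below is a minimal training-loop sketch. The model, data, and hyperparameters are illustrative choices, not from the original:

```python
import torch
import torch.nn as nn

# Placeholder model and synthetic batch, chosen for illustration.
model = nn.Linear(3, 1)
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

for _ in range(5):
    optimizer.zero_grad()                   # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)  # forward pass
    loss.backward()                         # populate .grad on each parameter
    optimizer.step()                        # update parameters in place
```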
SGD
Stochastic Gradient Descent optimizer with optional momentum.
Parameters
params: Iterable of parameters to optimize or dicts defining parameter groups
lr: Learning rate
momentum: Momentum factor. Accelerates SGD in the relevant direction and dampens oscillations
dampening: Dampening for momentum. Reduces the momentum's contribution to the velocity
weight_decay: Weight decay (L2 penalty) coefficient
nesterov: Whether to use Nesterov momentum. Only applicable when momentum is non-zero
maximize: Maximize the objective with respect to the params, instead of minimizing
foreach: Whether to use the foreach (multi-tensor) implementation. Faster for many small tensors
differentiable: Whether autograd should occur through the optimizer step. Enables higher-order gradients
fused: Whether to use a fused kernel implementation. Generally faster on CUDA
Example
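The example block here is empty in the source; a minimal sketch, with values chosen so the single update can be checked by hand:

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([p], lr=0.1)

loss = (p ** 2).sum()   # d(loss)/dp = 2p = 2.0 at p = 1.0
loss.backward()
opt.step()              # p <- p - lr * grad = 1.0 - 0.1 * 2.0 = 0.8
```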
Algorithm
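SGD with momentum maintains a velocity v ← momentum · v + g and updates p ← p − lr · v. A pure-Python sketch of that rule, ignoring dampening, Nesterov, and weight decay (an illustration, not the PyTorch implementation):

```python
def sgd_momentum_step(p, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update on a scalar parameter."""
    velocity = momentum * velocity + grad   # accumulate velocity
    p = p - lr * velocity                   # descend along the velocity
    return p, velocity

p, v = 1.0, 0.0
p, v = sgd_momentum_step(p, grad=2.0, velocity=v)   # v = 2.0, p = 0.8
p, v = sgd_momentum_step(p, grad=2.0, velocity=v)   # v = 3.8, p = 0.42
```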
The update rule with momentum, for velocity v, gradient g, and learning rate lr: v ← momentum · v + g, then p ← p − lr · v (dampening, Nesterov, and weight decay modify g and v as configured).
Adam
Adaptive Moment Estimation optimizer.
Parameters
params: Iterable of parameters to optimize or dicts defining parameter groups
lr: Learning rate. A tensor lr is only supported with fused=True or capturable=True
betas: Coefficients (beta1, beta2) for computing running averages of the gradient and its square
eps: Term added to the denominator to improve numerical stability
weight_decay: Weight decay (L2 penalty) coefficient
amsgrad: Whether to use the AMSGrad variant from “On the Convergence of Adam and Beyond”
foreach: Whether to use the foreach (multi-tensor) implementation
maximize: Maximize the objective with respect to the params, instead of minimizing
capturable: Whether this optimizer can be safely captured in a CUDA graph
differentiable: Whether autograd should occur through the optimizer step
fused: Whether to use a fused kernel implementation
decoupled_weight_decay: If True, this optimizer is equivalent to AdamW
Example
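The example block is empty in the source; a minimal sketch minimizing a simple quadratic (values chosen here for illustration):

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.Adam([p], lr=0.1)

for _ in range(100):
    opt.zero_grad()
    loss = (p ** 2).sum()
    loss.backward()
    opt.step()
# p has moved close to the minimum of p**2 at 0
```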
Algorithm
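Adam keeps exponential moving averages of the gradient (m) and its square (v), bias-corrects both, and scales the step by m̂ / (√v̂ + eps). A pure-Python sketch of a single step (an illustration, not the PyTorch implementation):

```python
def adam_step(p, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)
    return p, m, v

p, m, v = adam_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
# at t = 1 the corrected step is ~lr * sign(grad), so p ≈ 1.0 - 0.1 = 0.9
```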
AdamW
Adam with decoupled weight decay regularization.
Parameters
params: Iterable of parameters to optimize or dicts defining parameter groups
lr: Learning rate
betas: Coefficients for computing running averages of the gradient and its square
eps: Term added to the denominator for numerical stability
weight_decay: Weight decay coefficient (decoupled from the gradient-based update)
amsgrad: Whether to use the AMSGrad variant
Example
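The example block is empty in the source; a minimal sketch performing one step (values chosen here for illustration):

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.AdamW([p], lr=1e-3, weight_decay=1e-2)

opt.zero_grad()
loss = (p ** 2).sum()
loss.backward()
opt.step()
# weight decay was applied directly to p, then the Adam-style step subtracted ~lr
```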
Key Difference from Adam
AdamW decouples weight decay from the gradient-based update: the decay term lr · weight_decay · p is subtracted directly from each parameter, whereas Adam adds weight_decay · p to the gradient before the adaptive scaling.
RMSprop
Root Mean Square Propagation optimizer.
Parameters
params: Iterable of parameters to optimize or dicts defining parameter groups
lr: Learning rate
alpha: Smoothing constant for the squared-gradient moving average
eps: Term added to the denominator for numerical stability
weight_decay: Weight decay (L2 penalty) coefficient
momentum: Momentum factor
centered: If True, compute centered RMSprop (gradient normalized by an estimate of its variance)
Example
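The example block is empty in the source; a minimal sketch with values chosen so the single update can be checked by hand:

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.RMSprop([p], lr=0.01, alpha=0.99)

opt.zero_grad()
loss = (p ** 2).sum()   # gradient at p = 1.0 is 2.0
loss.backward()
opt.step()
# square_avg = (1 - alpha) * 2.0**2 = 0.04, so the step is
# lr * 2.0 / sqrt(0.04) = 0.01 * 2.0 / 0.2 = 0.1  ->  p ≈ 0.9
```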
Adagrad
Adaptive Gradient Algorithm optimizer.
Parameters
params: Iterable of parameters to optimize or dicts defining parameter groups
lr: Learning rate
lr_decay: Learning rate decay
weight_decay: Weight decay (L2 penalty) coefficient
initial_accumulator_value: Initial value for the gradient accumulator
eps: Term added to the denominator for numerical stability
Example
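The example block is empty in the source; a minimal sketch (values chosen for illustration) showing Adagrad's signature behavior: the accumulated squared gradients shrink the effective step over time:

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.Adagrad([p], lr=0.1)

steps = []
for _ in range(2):
    opt.zero_grad()
    loss = (p ** 2).sum()
    loss.backward()
    before = p.item()
    opt.step()
    steps.append(before - p.item())   # record the size of each update
# the second step is smaller than the first: the accumulator has grown
```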
Other Optimizers
Adadelta
An adaptive learning rate method that uses running averages of squared gradients.
Adamax
Variant of Adam based on infinity norm.
ASGD
Averaged Stochastic Gradient Descent.
LBFGS
Limited-memory BFGS algorithm (quasi-Newton method).
NAdam
Nesterov-accelerated Adam.
RAdam
Rectified Adam.
Rprop
Resilient Backpropagation.
SparseAdam
Lazy version of Adam for sparse tensors.
Optimizer Methods
All optimizers implement the following methods:
step(closure=None)
Performs a single optimization step.
closure: A closure that reevaluates the model and returns the loss. Required for some optimizers like LBFGS
zero_grad(set_to_none=True)
Resets the gradients of all optimized parameters.
set_to_none: If True, sets gradients to None instead of zeroing them. More memory efficient
state_dict()
Returns the optimizer state as a dict with two entries: state (per-parameter state) and param_groups.
load_state_dict(state_dict)
Loads the optimizer state from a dict previously returned by state_dict()
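A checkpointing sketch using these two methods (the optimizer and values here are illustrative choices):

```python
import torch

p = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([p], lr=0.1, momentum=0.9)

(p ** 2).sum().backward()
opt.step()                      # builds the momentum buffer

checkpoint = opt.state_dict()   # plain dict: hyperparameters + per-parameter state

q = torch.tensor([1.0], requires_grad=True)
new_opt = torch.optim.SGD([q], lr=0.1, momentum=0.9)
new_opt.load_state_dict(checkpoint)   # momentum buffer is restored
```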
Parameter Groups
Optimizers can handle different learning rates and options for different parameter groups:
- model.base parameters use lr=1e-2
- model.classifier parameters use lr=1e-3
- Both use momentum=0.9
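The grouping described above can be set up as follows; model here is a hypothetical stand-in with base and classifier submodules:

```python
import torch
import torch.nn as nn

# Hypothetical two-part model standing in for model.base / model.classifier.
model = nn.ModuleDict({
    "base": nn.Linear(4, 4),
    "classifier": nn.Linear(4, 2),
})

optimizer = torch.optim.SGD(
    [
        {"params": model["base"].parameters(), "lr": 1e-2},
        {"params": model["classifier"].parameters(), "lr": 1e-3},
    ],
    momentum=0.9,   # applies to every group that does not override it
)
```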
See Also
- Learning Rate Schedulers - Adjust learning rates during training
- Optimizer Tutorial