
Overview

The torch.optim package provides implementations of various optimization algorithms commonly used for training neural networks. All optimizers inherit from torch.optim.Optimizer and implement the step() method to update parameters.

Basic Usage

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Toy dataset of (input, target) pairs
dataset = [(torch.randn(10), torch.randn(1)) for _ in range(100)]

# Training loop
for input, target in dataset:
    optimizer.zero_grad()           # clear gradients from the previous step
    output = model(input)           # forward pass
    loss = loss_fn(output, target)  # compute the loss
    loss.backward()                 # backpropagate
    optimizer.step()                # update the parameters

SGD

Stochastic Gradient Descent optimizer with optional momentum.
torch.optim.SGD(params, lr=1e-3, momentum=0, dampening=0,
                weight_decay=0, nesterov=False, *, maximize=False,
                foreach=None, differentiable=False, fused=None)

Parameters

params (iterable, required)
    Iterable of parameters to optimize or dicts defining parameter groups.
lr (float | Tensor, default: 1e-3)
    Learning rate.
momentum (float, default: 0)
    Momentum factor. Accelerates SGD in the relevant direction and dampens oscillations.
dampening (float, default: 0)
    Dampening for momentum. Reduces the momentum's contribution to the velocity.
weight_decay (float, default: 0)
    Weight decay (L2 penalty) coefficient.
nesterov (bool, default: False)
    Whether to use Nesterov momentum. Only applicable when momentum is non-zero.
maximize (bool, default: False)
    Maximize the objective with respect to the params, instead of minimizing.
foreach (bool | None, default: None)
    Whether to use the foreach (multi-tensor) implementation. Typically faster when there are many small tensors.
differentiable (bool, default: False)
    Whether autograd should occur through the optimizer step. Enables higher-order gradients.
fused (bool | None, default: None)
    Whether to use a fused kernel implementation. Generally faster on CUDA.

Example

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
optimizer.zero_grad()
loss_fn(model(input), target).backward()
optimizer.step()

Algorithm

The update rule with momentum:
v_{t+1} = momentum * v_{t} + (1 - dampening) * g_{t+1}
p_{t+1} = p_{t} - lr * v_{t+1}
With Nesterov momentum:
v_{t+1} = momentum * v_{t} + (1 - dampening) * g_{t+1}
p_{t+1} = p_{t} - lr * (g_{t+1} + momentum * v_{t+1})
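
As a quick check of the rule above, here is a minimal sketch (assuming a constant gradient of 0.5, supplied by hand rather than by a backward pass) comparing a manual computation against the implementation:

import torch

# Hand computation with lr=0.1, momentum=0.9, dampening=0, gradient fixed at 0.5:
#   step 1: v = 0.5,             p = 1.0  - 0.1 * 0.5  = 0.95
#   step 2: v = 0.9*0.5 + 0.5,   p = 0.95 - 0.1 * 0.95 = 0.855
p = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([p], lr=0.1, momentum=0.9)

for expected in (0.95, 0.855):
    p.grad = torch.tensor([0.5])   # stand-in for a real backward pass
    optimizer.step()
    print(p.item(), expected)      # printed values should agree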

Adam

Adaptive Moment Estimation optimizer.
torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, amsgrad=False, *, foreach=None,
                 maximize=False, capturable=False, differentiable=False,
                 fused=None, decoupled_weight_decay=False)

Parameters

params (iterable, required)
    Iterable of parameters to optimize or dicts defining parameter groups.
lr (float | Tensor, default: 1e-3)
    Learning rate. A Tensor learning rate requires fused=True or capturable=True.
betas (tuple[float, float], default: (0.9, 0.999))
    Coefficients (beta1, beta2) for computing running averages of the gradient and its square.
eps (float, default: 1e-8)
    Term added to the denominator to improve numerical stability.
weight_decay (float, default: 0)
    Weight decay (L2 penalty) coefficient.
amsgrad (bool, default: False)
    Whether to use the AMSGrad variant from “On the Convergence of Adam and Beyond”.
foreach (bool | None, default: None)
    Whether to use the foreach (multi-tensor) implementation.
maximize (bool, default: False)
    Maximize the objective instead of minimizing.
capturable (bool, default: False)
    Whether this optimizer can be safely captured in a CUDA graph.
differentiable (bool, default: False)
    Whether autograd should occur through the optimizer step.
fused (bool | None, default: None)
    Whether to use a fused kernel implementation.
decoupled_weight_decay (bool, default: False)
    If True, this optimizer is equivalent to AdamW.

Example

optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
optimizer.zero_grad()
loss_fn(model(input), target).backward()
optimizer.step()

Algorithm

m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
m_hat_t = m_t / (1 - beta1^t)
v_hat_t = v_t / (1 - beta2^t)
p_t = p_{t-1} - lr * m_hat_t / (sqrt(v_hat_t) + eps)
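
One consequence of the bias correction: on the first step, m_hat = g and v_hat = g^2, so the update is lr * g / (|g| + eps), roughly lr * sign(g) regardless of the gradient's magnitude. A minimal sketch illustrating this (gradients supplied by hand):

import torch

# The first Adam step has magnitude ~lr, whatever the gradient's scale.
for scale in (1e-3, 1.0, 1e3):
    p = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([p], lr=1e-3)
    p.grad = torch.tensor([scale])  # stand-in for a real backward pass
    optimizer.step()
    print(scale, p.item())          # approximately -1e-3 in every case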

AdamW

Adam with decoupled weight decay regularization.
torch.optim.AdamW(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                  weight_decay=1e-2, amsgrad=False)

Parameters

params (iterable, required)
    Iterable of parameters to optimize or dicts defining parameter groups.
lr (float | Tensor, default: 1e-3)
    Learning rate.
betas (tuple[float, float], default: (0.9, 0.999))
    Coefficients for computing running averages of the gradient and its square.
eps (float, default: 1e-8)
    Term added to the denominator for numerical stability.
weight_decay (float, default: 1e-2)
    Weight decay coefficient (decoupled from the gradient-based update).
amsgrad (bool, default: False)
    Whether to use the AMSGrad variant.

Example

optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

Key Difference from Adam

AdamW decouples weight decay from the gradient-based update:
# AdamW: decay applied directly to the parameters
p_t = p_{t-1} - lr * weight_decay * p_{t-1}
p_t = p_t - lr * m_hat_t / (sqrt(v_hat_t) + eps)  # Adam update

# Adam (coupled): decay folded into the gradient
g_t = g_t + weight_decay * p_{t-1}
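
Given the decoupled_weight_decay flag documented above for Adam, the equivalence can be checked directly. A sketch, assuming both optimizers receive identical gradients at every step:

import torch

p1 = torch.ones(3, requires_grad=True)
p2 = torch.ones(3, requires_grad=True)
adamw = torch.optim.AdamW([p1], lr=1e-3, weight_decay=1e-2)
adam  = torch.optim.Adam([p2], lr=1e-3, weight_decay=1e-2,
                         decoupled_weight_decay=True)

for _ in range(5):
    g = torch.randn(3)
    p1.grad, p2.grad = g.clone(), g.clone()  # same gradient for both
    adamw.step()
    adam.step()

print(torch.allclose(p1, p2))  # expected: True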

RMSprop

Root Mean Square Propagation optimizer.
torch.optim.RMSprop(params, lr=1e-2, alpha=0.99, eps=1e-8,
                    weight_decay=0, momentum=0, centered=False)

Parameters

params (iterable, required)
    Iterable of parameters to optimize or dicts defining parameter groups.
lr (float | Tensor, default: 1e-2)
    Learning rate.
alpha (float, default: 0.99)
    Smoothing constant for the squared-gradient moving average.
eps (float, default: 1e-8)
    Term added to the denominator for numerical stability.
weight_decay (float, default: 0)
    Weight decay (L2 penalty) coefficient.
momentum (float, default: 0)
    Momentum factor.
centered (bool, default: False)
    If True, compute centered RMSprop, in which the gradient is normalized by an estimate of its variance.

Example

optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)
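
The centered variant normalizes the gradient by a running estimate of its variance and can be combined with momentum, using the flags documented above:

optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99,
                          momentum=0.9, centered=True)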

Adagrad

Adaptive Gradient Algorithm optimizer.
torch.optim.Adagrad(params, lr=1e-2, lr_decay=0, weight_decay=0,
                    initial_accumulator_value=0, eps=1e-10)

Parameters

params (iterable, required)
    Iterable of parameters to optimize or dicts defining parameter groups.
lr (float | Tensor, default: 1e-2)
    Learning rate.
lr_decay (float, default: 0)
    Learning rate decay.
weight_decay (float, default: 0)
    Weight decay (L2 penalty) coefficient.
initial_accumulator_value (float, default: 0)
    Initial value for the gradient accumulator.
eps (float, default: 1e-10)
    Term added to the denominator for numerical stability.

Example

optimizer = optim.Adagrad(model.parameters(), lr=0.01)

Other Optimizers

Adadelta: an adaptive learning rate method based on running averages of squared gradients.
torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-6, weight_decay=0)

Adamax: a variant of Adam based on the infinity norm.
torch.optim.Adamax(params, lr=2e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0)

ASGD: Averaged Stochastic Gradient Descent.
torch.optim.ASGD(params, lr=1e-2, lambd=1e-4, alpha=0.75, t0=1e6, weight_decay=0)

LBFGS: limited-memory BFGS algorithm (a quasi-Newton method).
torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-7,
                  tolerance_change=1e-9, history_size=100, line_search_fn=None)

NAdam: Nesterov-accelerated Adam.
torch.optim.NAdam(params, lr=2e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0,
                  momentum_decay=4e-3)

RAdam: Rectified Adam.
torch.optim.RAdam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0)

Rprop: resilient backpropagation.
torch.optim.Rprop(params, lr=1e-2, etas=(0.5, 1.2), step_sizes=(1e-6, 50))

SparseAdam: a lazy version of Adam suited to sparse tensors.
torch.optim.SparseAdam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
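
SparseAdam pairs with layers that emit sparse gradients. A minimal sketch, assuming an embedding layer built with sparse=True:

import torch.nn as nn
import torch.optim as optim

# nn.Embedding(..., sparse=True) produces sparse gradients; SparseAdam
# updates lazily, touching only the rows that actually received gradients.
embedding = nn.Embedding(10000, 128, sparse=True)
optimizer = optim.SparseAdam(embedding.parameters(), lr=1e-3)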

Optimizer Methods

All optimizers implement the following methods:

step()

optimizer.step(closure=None)
Performs a single optimization step.
closure (callable, optional)
    A closure that reevaluates the model and returns the loss. Required for some optimizers, such as LBFGS.
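
LBFGS, for example, reevaluates the objective several times per step, so it must be given a closure. A sketch reusing the names from Basic Usage (model, loss_fn, input, target):

optimizer = optim.LBFGS(model.parameters(), lr=1.0)

def closure():
    optimizer.zero_grad()                 # clear stale gradients
    loss = loss_fn(model(input), target)  # forward pass
    loss.backward()                       # recompute gradients
    return loss

optimizer.step(closure)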

zero_grad()

optimizer.zero_grad(set_to_none=True)
Clears the gradients of all optimized parameters.
set_to_none (bool, default: True)
    If True (the default), sets gradients to None instead of zeroing them in place. This is generally more memory efficient and can modestly improve performance.

state_dict()

state_dict = optimizer.state_dict()
Returns the state of the optimizer as a dict with two entries: state, holding per-parameter optimization state, and param_groups, listing the parameter groups and their options.

load_state_dict()

optimizer.load_state_dict(state_dict)
Loads the optimizer state.
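
A common pattern is to checkpoint the optimizer state alongside the model so training can resume exactly where it left off (the filename and dict keys here are illustrative):

# Save both states to one checkpoint file
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, 'checkpoint.pt')

# Restore both when resuming training
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])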

Parameter Groups

Optimizers can handle different learning rates and options for different parameter groups:
optimizer = optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
In this example:
  • model.base parameters use lr=1e-2
  • model.classifier parameters use lr=1e-3
  • Both use momentum=0.9
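
Group options can also be adjusted after construction by mutating optimizer.param_groups, for example a simple manual learning-rate decay:

# Decay every group's learning rate by 10x
for group in optimizer.param_groups:
    group['lr'] *= 0.1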
