ggml provides two built-in optimizers: AdamW and SGD. Both are configured through the ggml_opt_optimizer_params struct and supplied to the optimizer context via a callback.
Optimizer types
```c
enum ggml_opt_optimizer_type {
    GGML_OPT_OPTIMIZER_TYPE_ADAMW,
    GGML_OPT_OPTIMIZER_TYPE_SGD,
};
```
AdamW is the recommended default for most deep learning tasks. It maintains per-parameter first and second moment estimates and applies decoupled weight decay:

```c
struct ggml_opt_optimizer_params params;
params.adamw.alpha = 0.001f; // learning rate
params.adamw.beta1 = 0.9f;   // first moment decay (momentum)
params.adamw.beta2 = 0.999f; // second moment decay
params.adamw.eps   = 1e-8f;  // epsilon for numerical stability
params.adamw.wd    = 0.1f;   // weight decay (0.0f to disable)
```
| Field | Description |
|---|---|
| alpha | Learning rate. Controls the step size applied to each parameter update. |
| beta1 | Exponential decay rate for the first moment (mean of gradients). Typical value: 0.9. |
| beta2 | Exponential decay rate for the second moment (uncentered variance of gradients). Typical value: 0.999. |
| eps | Small constant added to the denominator to prevent division by zero. Typical value: 1e-8. |
| wd | Weight decay coefficient. Applied directly to parameters (decoupled from the gradient update). Set to 0.0f to disable. |
AdamW requires two additional momentum tensors (m and v) per trainable parameter tensor. This increases memory usage relative to SGD.
SGD (stochastic gradient descent) is a simpler optimizer with lower memory overhead. It applies a scaled gradient update with optional weight decay:

```c
struct ggml_opt_optimizer_params params;
params.sgd.alpha = 0.01f; // learning rate
params.sgd.wd    = 0.0f;  // weight decay (0.0f to disable)
```
| Field | Description |
|---|---|
| alpha | Learning rate. |
| wd | Weight decay coefficient. Set to 0.0f to disable. |
Optimizer params callbacks
The optimizer does not read ggml_opt_optimizer_params directly. Instead, it calls a ggml_opt_get_optimizer_params callback before each backward pass, allowing you to change hyperparameters dynamically during training (for example, to implement a learning rate schedule).
```c
// Callback signature
typedef struct ggml_opt_optimizer_params (*ggml_opt_get_optimizer_params)(void * userdata);
```
The userdata pointer carries arbitrary context to the callback. When using ggml_opt_fit, userdata is a pointer to the current epoch number (int64_t *).
Built-in callbacks
```c
// Returns hard-coded default values. userdata is ignored.
struct ggml_opt_optimizer_params ggml_opt_get_default_optimizer_params(void * userdata);

// Casts userdata to ggml_opt_optimizer_params * and returns the pointed-to struct.
struct ggml_opt_optimizer_params ggml_opt_get_constant_optimizer_params(void * userdata);
```
Use ggml_opt_get_constant_optimizer_params when you want to supply fixed hyperparameters without writing a custom callback:
```c
struct ggml_opt_optimizer_params my_params;
my_params.adamw.alpha = 3e-4f;
my_params.adamw.beta1 = 0.9f;
my_params.adamw.beta2 = 0.999f;
my_params.adamw.eps   = 1e-8f;
my_params.adamw.wd    = 0.01f;

ggml_opt_fit(
    sched, ctx_compute, inputs, outputs, dataset,
    GGML_OPT_LOSS_TYPE_CROSS_ENTROPY,
    GGML_OPT_OPTIMIZER_TYPE_ADAMW,
    ggml_opt_get_constant_optimizer_params, // callback
    &my_params,                             // passed as userdata
    nepoch, nbatch_logical, val_split, silent
);
```
Custom learning rate schedule
Because ggml_opt_fit passes a pointer to the current epoch as userdata, you can implement epoch-dependent schedules:
```c
struct ggml_opt_optimizer_params lr_schedule(void * userdata) {
    int64_t epoch = *(int64_t *) userdata;

    // Linear warmup for the first 5 epochs, then constant
    float base_lr = 1e-3f;
    float lr = (epoch < 5) ? base_lr * ((float)(epoch + 1) / 5.0f) : base_lr;

    struct ggml_opt_optimizer_params params;
    params.adamw.alpha = lr;
    params.adamw.beta1 = 0.9f;
    params.adamw.beta2 = 0.999f;
    params.adamw.eps   = 1e-8f;
    params.adamw.wd    = 0.1f;
    return params;
}
```
```c
// Pass the callback to ggml_opt_fit
ggml_opt_fit(
    sched, ctx_compute, inputs, outputs, dataset,
    GGML_OPT_LOSS_TYPE_CROSS_ENTROPY,
    GGML_OPT_OPTIMIZER_TYPE_ADAMW,
    lr_schedule, // custom callback
    NULL,        // userdata: ggml_opt_fit supplies the epoch pointer automatically
    nepoch, nbatch_logical, val_split, silent
);
```
When using ggml_opt_epoch directly (instead of ggml_opt_fit), you are responsible for calling your callback and passing userdata. The epoch pointer convention only applies to ggml_opt_fit.
ggml_opt_params struct
ggml_opt_params configures the full optimization context, including backend, loss, build type, and optimizer.
```c
struct ggml_opt_params {
    ggml_backend_sched_t backend_sched; // backend scheduler for compute graphs

    // static graph allocation: set all three, or leave all NULL for dynamic allocation
    struct ggml_context * ctx_compute;
    struct ggml_tensor  * inputs;
    struct ggml_tensor  * outputs;

    enum ggml_opt_loss_type  loss_type;
    enum ggml_opt_build_type build_type;

    int32_t opt_period; // perform one optimizer step after this many gradient accumulation steps

    ggml_opt_get_optimizer_params get_opt_pars;    // optimizer params callback
    void *                        get_opt_pars_ud; // userdata for the callback

    enum ggml_opt_optimizer_type optimizer;
};
```
Use ggml_opt_default_params to get a struct with sensible defaults, then override individual fields:
```c
struct ggml_opt_params params = ggml_opt_default_params(
    backend_sched,
    GGML_OPT_LOSS_TYPE_CROSS_ENTROPY
);

params.optimizer    = GGML_OPT_OPTIMIZER_TYPE_ADAMW;
params.opt_period   = 4; // accumulate 4 batches before each optimizer step
params.get_opt_pars = lr_schedule;
```
| Field | Description |
|---|---|
| backend_sched | Defines which backends are used to construct and execute compute graphs. |
| ctx_compute | Compute context for static graph allocation. Leave NULL for dynamic allocation. |
| inputs / outputs | Input and output tensors for static graph allocation. Leave NULL for dynamic allocation. |
| loss_type | Loss function to minimize during training. |
| build_type | Controls which graphs are built: FORWARD, GRAD, or OPT. Default for training is OPT. |
| opt_period | Number of gradient accumulation micro-steps between optimizer parameter updates. |
| get_opt_pars | Callback to retrieve optimizer hyperparameters before each backward pass. |
| get_opt_pars_ud | Arbitrary pointer passed as userdata to get_opt_pars. |
| optimizer | Optimizer algorithm: ADAMW or SGD. |
Context lifecycle
```c
// Initialize an optimizer context from params
ggml_opt_context_t opt_ctx = ggml_opt_init(params);

// Reset gradients and loss; pass true to also reset optimizer state
// (e.g. clear Adam momentum accumulators between training runs)
ggml_opt_reset(opt_ctx, /*optimizer=*/false);

// Free all resources associated with the context
ggml_opt_free(opt_ctx);
```
ggml_opt_reset with optimizer = false clears accumulated gradients and resets the loss scalar without discarding the optimizer’s internal momentum state. Pass true to perform a full reset, which is equivalent to starting a fresh training run with the same graph.