All operations take a struct ggml_context * as their first argument and return a struct ggml_tensor * representing the result. Operations do not perform any computation — they record a node in the computation graph. Computation only happens when ggml_graph_compute() or ggml_graph_compute_with_ctx() is called.
Most operations have an _inplace variant that writes results back into the first tensor operand, returning a view of it.
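The record-then-compute flow looks like the following sketch (assumes ggml.h is available; the memory size, tensor shapes, and thread count are placeholder values, not recommendations):

```c
#include "ggml.h"

void example(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);

    // Records a node in the graph; nothing is computed yet.
    struct ggml_tensor * c = ggml_add(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // Computation happens here.
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/ 1);

    ggml_free(ctx);
}
```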
Arithmetic

ggml_add
a + b. b is broadcast to the shape of a when necessary.

ggml_add1
Adds b to every element of a.

ggml_sub
a - b.

ggml_mul
a * b (Hadamard product). b is broadcast to the shape of a.

ggml_div
a / b.

ggml_sqr
a².

ggml_sqrt
√a.

ggml_abs
|a|.

ggml_neg
-a.

ggml_log
ln(a).

ggml_exp
eᵃ.

ggml_sin / ggml_cos
Element-wise sine / cosine of a.

ggml_scale
Multiplies a by the scalar s. Equivalent to a * s.

ggml_clamp
Clamps a to [min, max]. Operates in-place and returns a view of a.
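The broadcast in ggml_add can be pictured in plain C: a single row b is added to every row of a. This is an illustration of the semantics only, not the ggml implementation:

```c
#include <stddef.h>

/* b (one row of n elements) is added to every one of a's m rows. */
void add_broadcast_rows(float *dst, const float *a, const float *b,
                        size_t m, size_t n) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            dst[i*n + j] = a[i*n + j] + b[j];  /* b's index wraps: broadcast */
}
```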
Matrix operations
ggml_mul_mat
Matrix multiplication. a is the weight matrix (k columns, n rows) and b is the input (k columns, m rows — transposed internally). The result is n columns by m rows.
- a: [ne03, ne02, n, k]
- b: [ne03*x, ne02*y, m, k]
- result: [ne03*x, ne02*y, m, n]
a may be quantized; b must be F32 or F16.

ggml_mul_mat_set_prec
Sets the compute precision of a ggml_mul_mat result tensor. Set to GGML_PREC_F32 for higher-precision accumulation (useful for models like Phi-2).

ggml_mul_mat_id
Selects expert matrices from as using the row indices in ids, then multiplies by b. Used in mixture-of-experts routing.

ggml_out_prod
Outer product: a is [m, n], b is [p, n], result is [m, p].
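The ggml_mul_mat layout is easy to get wrong, so here is the same contraction in plain C: both a and b store k values per row (b arrives transposed), and each result row holds the dot products of one row of b against every row of a. A reference illustration, not ggml code:

```c
#include <stddef.h>

/* a: n rows of k values; b: m rows of k values; dst: m rows of n values. */
void mul_mat_ref(float *dst, const float *a, const float *b,
                 size_t n, size_t m, size_t k) {
    for (size_t j = 0; j < m; j++)          /* rows of b / rows of result */
        for (size_t i = 0; i < n; i++) {    /* rows of a / cols of result */
            float sum = 0.0f;
            for (size_t c = 0; c < k; c++)
                sum += a[i*k + c] * b[j*k + c];
            dst[j*n + i] = sum;
        }
}
```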
Activation functions
ggml_relu
max(0, a) element-wise.

ggml_leaky_relu
a >= 0 ? a : negative_slope * a.

ggml_gelu
GELU activation, approximated with tanh.

ggml_gelu_erf
Exact GELU (using erf) when available. Some backends may fall back to the Abramowitz and Stegun approximation.

ggml_gelu_quick
Faster GELU approximation: a * sigmoid(1.702 * a).

ggml_silu
a * sigmoid(a).

ggml_silu_back
Backward pass of SiLU: computes dx given x and dy.

ggml_sigmoid
1 / (1 + exp(-a)).

ggml_tanh
tanh(a) element-wise.

ggml_elu
a >= 0 ? a : exp(a) - 1.

ggml_hardswish / ggml_hardsigmoid
hardswish(x) = x * relu6(x + 3) / 6
hardsigmoid(x) = relu6(x + 3) / 6
Gated linear units
ggml provides fused GLU variants that split or gate the activation in a single op.
Normalization
ggml_norm
Layer normalization along rows. eps is added to the variance before taking the square root for numerical stability.

ggml_rms_norm
RMS normalization: each row is divided by sqrt(mean(a²) + eps), without subtracting the mean.

ggml_l2_norm
Normalizes each row to unit L2 norm.

ggml_group_norm
Group normalization over groups of ne0 * ne1 / n_groups channels. Commonly used in image models such as Stable Diffusion.
- n_groups: number of channel groups to normalize over.
- eps: small constant added to the variance for numerical stability.
Attention
ggml_flash_attn_ext
- q: [n_embd_k, n_batch, n_head, ne3]
- k: [n_embd_k, n_kv, n_head_kv, ne3]
- v: [n_embd_v, n_kv, n_head_kv, ne3] — not pre-transposed
- mask: [n_kv, n_batch, ne32, ne33] — F16 or F32, optional
- result: [n_embd_v, n_head, n_batch, ne3] — permuted
Parameters:
- scale: attention scaling factor applied before softmax. Typically 1/sqrt(head_dim).
- max_bias: maximum ALiBi slope. Set to 0.0 to disable ALiBi bias.
- logit_softcap: soft-cap applied to logits as tanh(logit / cap) * cap. Set to 0.0 to disable.
Accumulation precision can be raised with ggml_flash_attn_ext_set_prec (e.g. GGML_PREC_F32).

ggml_soft_max_ext
softmax(a * scale + mask * alibi_slope).
Reshape and view
ggml_reshape_1d / _2d / _3d / _4d
Returns a view of a with the specified shape. Total element count must match. a must be contiguous.

ggml_view_1d / _2d / _3d / _4d
Creates a view into a starting at offset bytes. Strides can differ from a, enabling sub-matrix and strided views without copying.

ggml_transpose
Transposes the first two dimensions of a. Equivalent to ggml_permute(ctx, a, 1, 0, 2, 3). Returns a view; no data is copied.

ggml_permute
Permutes the dimensions of a. For example, ggml_permute(ctx, a, 2, 1, 0, 3) moves dimension 2 to position 0. Returns a non-contiguous view; no data is copied.

ggml_cont
Makes a contiguous copy of a if it is not already contiguous. Variants ggml_cont_1d through ggml_cont_4d also reshape while making contiguous.
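The view mechanism boils down to one idea: a tensor is data plus a shape and per-dimension strides, and a transpose just swaps shape and stride entries without touching the data. A minimal 2D illustration (element strides instead of ggml's byte strides; the view2d type is hypothetical, not a ggml struct):

```c
#include <stddef.h>

typedef struct { const float *data; size_t ne[2]; size_t st[2]; } view2d;

static float view_get(view2d v, size_t i0, size_t i1) {
    return v.data[i0*v.st[0] + i1*v.st[1]];
}
/* No copy: swap shape and strides. */
static view2d transpose(view2d v) {
    view2d t = { v.data, { v.ne[1], v.ne[0] }, { v.st[1], v.st[0] } };
    return t;
}
```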
Reduction
ggml_sum
Sums all elements of a, producing a single-element tensor.

ggml_sum_rows
Sums along rows: input shape [a, b, c, d] → output shape [1, b, c, d].

ggml_mean
Mean along rows; same output shape as ggml_sum_rows.

ggml_argmax
Index of the maximum element in each row.

ggml_top_k
Selects the top k elements per row. The returned indices are not in sorted order. Use ggml_argsort if you need fully sorted rows.

ggml_argsort
Returns the indices that sort each row in ascending or descending order.

ggml_cumsum
Cumulative sum along each row.
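Two of these row-wise reductions written out in plain C, to make the shapes concrete (illustration only, not ggml kernels):

```c
#include <stddef.h>

/* ggml_sum_rows semantics: collapse dimension 0 to length 1 per row. */
void sum_rows(float *dst, const float *a, size_t rows, size_t n) {
    for (size_t r = 0; r < rows; r++) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += a[r*n + i];
        dst[r] = s;
    }
}
/* ggml_argmax semantics: index of the maximum in one row. */
int argmax_row(const float *row, size_t n) {
    int best = 0;
    for (size_t i = 1; i < n; i++) if (row[i] > row[best]) best = (int)i;
    return best;
}
```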
Convolution
ggml_conv_1d
Convolves b with kernel a.
- a: convolution kernel tensor.
- b: input data tensor.
- s0: stride along dimension 0.
- p0: padding along dimension 0.
- d0: dilation along dimension 0.

ggml_conv_2d
2D convolution, implemented as ggml_im2col + ggml_mul_mat.
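The output length that follows from the stride, padding, and dilation parameters above is the standard convolution arithmetic (the helper name below is ours, not a ggml function):

```c
/* Standard 1D convolution output length for input len, kernel size k,
 * stride s0, padding p0, dilation d0. */
static long conv1d_out_len(long len, long kernel, long s0, long p0, long d0) {
    return (len + 2*p0 - d0*(kernel - 1) - 1) / s0 + 1;
}
```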
Embedding and positional encoding
ggml_get_rows
Gathers rows of a by the integer indices stored in b. Used for token embedding lookup. Result shape: [n_embd, n_rows, ne2, ne3].

ggml_rope
Applies rotary position embedding (RoPE) to a. b is a 1D tensor of position indices.

ggml_rope_ext
Extended RoPE; supersedes ggml_rope_custom.
- c: optional per-dimension frequency scaling factors. Pass NULL to use default RoPE frequencies.
- n_ctx_orig: original training context length. Used to compute YaRN correction dimensions.
- freq_base: base frequency for the sinusoidal position encoding (e.g. 10000.0).
- ext_factor: YaRN extrapolation factor. Set to 0.0 to disable YaRN.
Loss functions
ggml_cross_entropy_loss
Cross-entropy loss between logits a and ground-truth labels b. The result is a scalar tensor. Mark it with ggml_set_loss() to use it as the optimization objective.
Concatenation and repetition
Diagonal and masking
