For LLM inference the dominant cost is not floating-point compute — it is memory bandwidth. A 7B-parameter model stored as F32 occupies ~28 GB. Moving that data from DRAM to compute cores is the primary bottleneck on every platform. Quantization maps 32-bit floats to smaller integer representations, reducing:
  • Model size — 4-bit quantization cuts weights from 4 bytes to 0.5 bytes per element.
  • Memory bandwidth — fewer bytes move from memory to the compute units; the GPU or CPU dequantizes weights on the fly during matrix multiplication.
  • Load time — smaller files load faster from disk.
Activations (the intermediate tensors produced during inference) are typically kept at F16 or F32 to preserve accuracy.

Quantization types

ggml’s quantization types are defined in ggml_type. The naming convention is:
  • Q prefix — classic block quantization
  • K suffix — “k-quant” (improved quantization with multiple scales per block)
  • IQ prefix — “i-quant” (importance-aware quantization; benefits from, and in its lowest-bit variants requires, an importance matrix)

Classic quantization

The original ggml quantization formats. Each block stores a shared scale (and optionally a minimum) for a fixed number of elements. Bits/weight figures are effective rates, including the per-block scale and minimum overhead.

| Type           | Bits/weight | Notes                      |
|----------------|-------------|----------------------------|
| GGML_TYPE_Q4_0 | 4.5         | Scale per 32-element block |
| GGML_TYPE_Q4_1 | 5.0         | Scale + minimum per block  |
| GGML_TYPE_Q5_0 | 5.5         | Scale per block            |
| GGML_TYPE_Q5_1 | 6.0         | Scale + minimum per block  |
| GGML_TYPE_Q8_0 | 8.5         | Scale per 32-element block |
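Concretely, a Q4_0 block packs 32 4-bit values behind one F16 scale. A standalone sketch of the layout (field names mirror ggml's block_q4_0; uint16_t stands in for ggml_fp16_t, and the layout may differ across ggml versions):

```c
#include <stdint.h>

#define QK4_0 32  // elements per Q4_0 block

// Mirrors ggml's block_q4_0: one F16 scale plus 32 packed 4-bit values.
typedef struct {
    uint16_t d;              // block scale (ggml_fp16_t in ggml)
    uint8_t  qs[QK4_0 / 2];  // 32 x 4-bit quantized values, two per byte
} block_q4_0_sketch;

// 18 bytes per 32 elements -> 4.5 effective bits per weight.
static double q4_0_bits_per_weight(void) {
    return 8.0 * sizeof(block_q4_0_sketch) / QK4_0;
}
```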

K-quants

K-quants use a hierarchy of super-blocks and sub-blocks with multiple scales, giving significantly better accuracy at the same bit-width.

| Type           | Approx. bits/weight | Notes                               |
|----------------|---------------------|-------------------------------------|
| GGML_TYPE_Q2_K | 2.5                 | 256-element super-blocks            |
| GGML_TYPE_Q3_K | 3.4                 |                                     |
| GGML_TYPE_Q4_K | 4.5                 | Preferred 4-bit format              |
| GGML_TYPE_Q5_K | 5.5                 |                                     |
| GGML_TYPE_Q6_K | 6.6                 | Near-lossless for most models       |
| GGML_TYPE_Q8_K | 8                   | Internal use (dot product accumulation) |

The _S (small) and _M (medium) suffixes used in llama.cpp refer to mixed-precision strategies built on top of these types, not separate ggml_type values.
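To see where Q4_K's 4.5 bits/weight comes from, here is the super-block byte budget as a standalone sketch (field sizes follow ggml's block_q4_K layout; treat them as an assumption if your ggml version differs):

```c
#define QK_K 256  // elements per k-quant super-block

// Byte budget of one Q4_K super-block, following ggml's block_q4_K layout.
static double q4_k_bits_per_weight(void) {
    const int d      = 2;         // F16 super-block scale
    const int dmin   = 2;         // F16 super-block minimum
    const int scales = 12;        // packed 6-bit scales/mins for 8 sub-blocks
    const int qs     = QK_K / 2;  // 256 x 4-bit values
    return 8.0 * (d + dmin + scales + qs) / QK_K;  // 144 bytes -> 4.5 bpw
}
```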

I-quants

I-quants use non-uniform (importance-weighted) quantization grids. They achieve better perplexity than equivalent k-quants at the same bit-width, and the lowest-bit variants require an importance matrix during quantization.

| Type              | Approx. bits/weight |
|-------------------|---------------------|
| GGML_TYPE_IQ1_S   | 1.56                |
| GGML_TYPE_IQ1_M   | 1.75                |
| GGML_TYPE_IQ2_XXS | 2.06                |
| GGML_TYPE_IQ2_XS  | 2.31                |
| GGML_TYPE_IQ2_S   | 2.5                 |
| GGML_TYPE_IQ3_XXS | 3.06                |
| GGML_TYPE_IQ3_S   | 3.44                |
| GGML_TYPE_IQ4_NL  | 4.5                 |
| GGML_TYPE_IQ4_XS  | 4.25                |

Checking whether a type is quantized

bool ggml_is_quantized(enum ggml_type type);

// Example
if (ggml_is_quantized(tensor->type)) {
    printf("quantized: %s\n", ggml_type_name(tensor->type));
}

Quantizing data with ggml_quantize_chunk

ggml_quantize_chunk is the primary entry point for converting F32 data to a quantized format:
size_t ggml_quantize_chunk(
    enum ggml_type   type,       // target quantization type
    const float    * src,        // source F32 data
    void           * dst,        // destination buffer (pre-allocated)
    int64_t          start,      // starting element offset (a multiple of n_per_row)
    int64_t          nrows,      // number of rows to quantize
    int64_t          n_per_row,  // number of elements per row
    const float    * imatrix);   // importance matrix (NULL if not needed)
The function returns the number of bytes written to dst.
Some quantization types (the lowest-bit i-quants, such as IQ2_XXS, IQ2_XS, and IQ1_S) require a non-NULL imatrix. Call ggml_quantize_requires_imatrix(type) to check before passing NULL.

Example: quantize a weight matrix

const int64_t n_rows      = 4096;
const int64_t n_per_row   = 4096;
const int64_t total_elems = n_rows * n_per_row;

float * src = ...; // F32 weights, shape [n_per_row, n_rows]

// Allocate destination buffer using ggml_row_size to get the correct size
size_t dst_size = ggml_row_size(GGML_TYPE_Q4_K, n_per_row) * n_rows;
void * dst = malloc(dst_size);

size_t written = ggml_quantize_chunk(
    GGML_TYPE_Q4_K,
    src,
    dst,
    /*start=*/0,
    n_rows,
    n_per_row,
    /*imatrix=*/NULL);

// written is the number of bytes produced; here it should equal dst_size.
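The start/nrows pair exists so that large tensors can be quantized in parallel: each worker handles a contiguous range of rows and passes start = first_row * n_per_row (an element offset). A standalone sketch of that partitioning, with chunk_t and make_chunk as hypothetical helper names:

```c
#include <stdint.h>

typedef struct {
    int64_t first_row;  // first row this chunk covers
    int64_t nrows;      // number of rows in this chunk
    int64_t start;      // element offset to pass to ggml_quantize_chunk
} chunk_t;

// Split n_rows as evenly as possible across n_chunks workers.
static chunk_t make_chunk(int i, int n_chunks,
                          int64_t n_rows, int64_t n_per_row) {
    const int64_t per = (n_rows + n_chunks - 1) / n_chunks;  // ceiling division
    chunk_t c;
    c.first_row = (int64_t) i * per;
    c.nrows     = c.first_row + per <= n_rows ? per : n_rows - c.first_row;
    c.start     = c.first_row * n_per_row;
    return c;
}
```

Each worker would then call ggml_quantize_chunk with its chunk's start and nrows, all writing into the same pre-allocated destination buffer.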

Initialization and cleanup

ggml_quantize_chunk calls ggml_quantize_init internally. If you need explicit control over when quantization tables are loaded:
ggml_quantize_init(GGML_TYPE_Q4_K);  // load tables once
// ... quantize many chunks ...
ggml_quantize_free();                 // release table memory
Both functions are thread-safe.

Importance matrices (imatrix)

An importance matrix calibrates which weight values have the most impact on model outputs. Providing one during quantization allows the quantizer to allocate more precision to high-importance values. The imatrix has shape [n_per_row] — one importance score per column of the weight matrix:
float * imatrix = ...; // n_per_row entries; higher = more important

ggml_quantize_chunk(
    GGML_TYPE_IQ4_XS,
    src, dst,
    /*start=*/0, n_rows, n_per_row,
    imatrix);  // required for the lowest-bit i-quants, beneficial for k-quants
In practice, imatrices are computed by running a calibration dataset through the unquantized model and collecting activation statistics.
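A minimal, self-contained sketch of that calibration step — mean squared activation per input column. The real imatrix tooling in llama.cpp accumulates this kind of statistic over a calibration dataset; compute_imatrix is a hypothetical name:

```c
#include <stddef.h>

// Mean squared activation per column over n_samples activation vectors,
// each of length n_per_row. Output: one importance score per column.
static void compute_imatrix(const float * acts,  // [n_samples][n_per_row]
                            size_t n_samples, size_t n_per_row,
                            float * imatrix) {   // out: [n_per_row]
    for (size_t j = 0; j < n_per_row; j++) imatrix[j] = 0.0f;
    for (size_t s = 0; s < n_samples; s++) {
        for (size_t j = 0; j < n_per_row; j++) {
            const float a = acts[s * n_per_row + j];
            imatrix[j] += a * a;
        }
    }
    for (size_t j = 0; j < n_per_row; j++) imatrix[j] /= (float) n_samples;
}
```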

Mixed precision

In a typical LLM deployment:
TensorTypeReason
Weight matrices (large 2-D)Q4_K, Q5_K, Q6_K, IQ4_XSMemory savings
Embedding tablesQ4_K or F16Rows accessed sparsely
Normalization weightsF32Small, sensitive to precision
KV cacheF16 or Q8_0Balances bandwidth and accuracy
ActivationsF16 or F32Full precision avoids error accumulation
ggml_mul_mat handles dequantization internally: the left-hand operand can be any quantized type while the right-hand operand is typically F16 or F32.
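A toy version of such a per-tensor policy (the name matching below is a hypothetical heuristic keyed on common GGUF tensor-name fragments, not llama.cpp's actual logic):

```c
#include <string.h>

// Hypothetical per-tensor type policy mirroring the table above.
static const char * choose_type(const char * tensor_name) {
    if (strstr(tensor_name, "norm"))   return "F32";   // small, precision-sensitive
    if (strstr(tensor_name, "embd"))   return "Q4_K";  // rows accessed sparsely
    if (strstr(tensor_name, "weight")) return "Q4_K";  // bulk 2-D matmul weights
    return "F16";                                      // everything else
}
```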

Type traits

You can inspect quantization properties at runtime:
const struct ggml_type_traits * tt = ggml_get_type_traits(GGML_TYPE_Q4_K);

printf("name:         %s\n",  tt->type_name);
printf("block size:   %lld\n", (long long) tt->blck_size);  // elements per block
printf("bytes/block:  %zu\n",  tt->type_size);
printf("is_quantized: %d\n",   tt->is_quantized);
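Dividing these two fields gives the effective bits per weight. A standalone helper (the Q4_0 and Q8_0 numbers in the comments are ggml's block sizes, stated here as assumptions):

```c
#include <stddef.h>

// Effective bits per weight from a type's bytes-per-block (type_size)
// and elements-per-block (blck_size), as reported by ggml_get_type_traits.
static double bits_per_weight(size_t type_size, long long blck_size) {
    return 8.0 * (double) type_size / (double) blck_size;
}
```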
