- Model size — 4-bit quantization cuts weights from 4 bytes to 0.5 bytes per element.
- Memory bandwidth — fewer bytes move from memory per matrix multiplication; the GPU or CPU dequantizes weights on the fly inside the kernel.
- Load time — smaller files load faster from disk.
Quantization types
ggml’s quantization types are defined in ggml_type. The naming convention is:
- Q prefix — classic block quantization
- K suffix — “k-quant” (improved quantization with multiple scales per block)
- IQ prefix — “i-quant” (importance-aware quantization, requires an importance matrix)
Legacy Q-types (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0)
The original ggml quantization formats. Each block stores a shared scale (and optionally a minimum) for a fixed number of elements.
| Type | Bits/weight (effective) | Notes |
|---|---|---|
| GGML_TYPE_Q4_0 | 4.5 | Scale per 32-element block |
| GGML_TYPE_Q4_1 | 5.0 | Scale + minimum per block |
| GGML_TYPE_Q5_0 | 5.5 | Scale per block |
| GGML_TYPE_Q5_1 | 6.0 | Scale + minimum per block |
| GGML_TYPE_Q8_0 | 8.5 | Scale per 32-element block |
K-quants (Q2_K … Q8_K)
K-quants use a hierarchy of super-blocks and sub-blocks with multiple scales, giving significantly better accuracy at the same bit-width.
| Type | Bits/weight | Notes |
|---|---|---|
| GGML_TYPE_Q2_K | 2.5 | 256-element super-blocks |
| GGML_TYPE_Q3_K | 3.4 | |
| GGML_TYPE_Q4_K | 4.5 | Preferred 4-bit format |
| GGML_TYPE_Q5_K | 5.5 | |
| GGML_TYPE_Q6_K | 6.6 | Near-lossless for most models |
| GGML_TYPE_Q8_K | 8 | Internal use (dot product accumulation) |
_S (small) and _M (medium) suffixes used in llama.cpp refer to mixed-precision strategies built on top of these types, not separate ggml_type values.
I-quants (IQ1_S, IQ2_XXS, IQ3_S, IQ4_NL, …)
I-quants use non-uniform (importance-weighted) quantization grids. They achieve better perplexity than equivalent k-quants at the same bit-width, but require an importance matrix during quantization.
| Type | Approx. bits/weight |
|---|---|
| GGML_TYPE_IQ1_S | 1.56 |
| GGML_TYPE_IQ1_M | 1.75 |
| GGML_TYPE_IQ2_XXS | 2.06 |
| GGML_TYPE_IQ2_XS | 2.31 |
| GGML_TYPE_IQ2_S | 2.5 |
| GGML_TYPE_IQ3_XXS | 3.06 |
| GGML_TYPE_IQ3_S | 3.44 |
| GGML_TYPE_IQ4_NL | 4.5 |
| GGML_TYPE_IQ4_XS | 4.25 |
Checking whether a type is quantized
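ggml_is_quantized reports whether a ggml_type is one of the block-quantized formats, as opposed to a plain float or integer element type. A minimal sketch, assuming ggml.h is on the include path:

```c
#include "ggml.h"
#include <assert.h>

int main(void) {
    // Plain element types are not quantized.
    assert(!ggml_is_quantized(GGML_TYPE_F32));
    assert(!ggml_is_quantized(GGML_TYPE_F16));

    // All block formats (legacy, k-quants, i-quants) report true.
    assert(ggml_is_quantized(GGML_TYPE_Q4_0));
    assert(ggml_is_quantized(GGML_TYPE_Q4_K));
    assert(ggml_is_quantized(GGML_TYPE_IQ4_XS));
    return 0;
}
```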
Quantizing data with ggml_quantize_chunk
ggml_quantize_chunk is the primary entry point for converting F32 data to a quantized format. It returns the number of bytes written to dst.
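Its signature, as of recent ggml versions (check ggml.h for your revision):

```c
size_t ggml_quantize_chunk(
        enum ggml_type   type,       // target quantized type
        const float    * src,        // F32 source data
        void           * dst,        // destination buffer
        int64_t          start,      // starting element index (multiple of n_per_row)
        int64_t          nrows,      // number of rows to quantize
        int64_t          n_per_row,  // elements per row
        const float    * imatrix);   // optional importance matrix, may be NULL
```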
Example: quantize a weight matrix
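A sketch under the assumption that n_per_row is a multiple of the target type’s block size (256 for k-quants), using ggml_row_size to size the destination buffer:

```c
#include "ggml.h"
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int64_t nrows = 128, n_per_row = 256;  // multiple of QK_K for Q4_K

    // Fill a dummy F32 weight matrix.
    float *src = malloc(nrows * n_per_row * sizeof(float));
    for (int64_t i = 0; i < nrows * n_per_row; i++)
        src[i] = (float)i / 1000.0f;

    // Allocate the destination using the row size reported by ggml.
    size_t row_sz = ggml_row_size(GGML_TYPE_Q4_K, n_per_row);
    void  *dst    = malloc(nrows * row_sz);

    size_t written = ggml_quantize_chunk(GGML_TYPE_Q4_K, src, dst,
                                         /*start=*/0, nrows, n_per_row,
                                         /*imatrix=*/NULL);

    printf("wrote %zu bytes (%.2f bits/weight)\n",
           written, 8.0 * written / (nrows * n_per_row));

    free(src);
    free(dst);
    return 0;
}
```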
Initialization and cleanup
ggml_quantize_chunk calls ggml_quantize_init internally. If you need explicit control over when quantization tables are loaded:
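A minimal sketch of the explicit form:

```c
#include "ggml.h"

int main(void) {
    // Preload the lookup tables for a type up front, e.g. during a
    // worker pool's setup phase rather than on the first quantize call.
    ggml_quantize_init(GGML_TYPE_IQ2_XS);

    // ... calls to ggml_quantize_chunk(GGML_TYPE_IQ2_XS, ...) ...

    // Release all quantization tables once no more quantization will run.
    ggml_quantize_free();
    return 0;
}
```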
Importance matrices (imatrix)
An importance matrix calibrates which weight values have the most impact on model outputs. Providing one during quantization allows the quantizer to allocate more precision to high-importance values. The imatrix has shape [n_per_row] — one importance score per column of the weight matrix:
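A sketch of passing an imatrix through to ggml_quantize_chunk (the helper name is hypothetical; only the ggml_quantize_chunk call is the real API):

```c
#include "ggml.h"

// Hypothetical helper: quantize one tensor with an importance matrix.
// `imatrix` holds one accumulated importance score per column, so its
// length must equal n_per_row.
size_t quantize_with_imatrix(const float *src, void *dst,
                             int64_t nrows, int64_t n_per_row,
                             const float *imatrix) {
    // IQ4_XS accepts a NULL imatrix, but the lowest-bit i-quants
    // (e.g. IQ1_S, IQ2_XXS) generally require a non-NULL one.
    return ggml_quantize_chunk(GGML_TYPE_IQ4_XS, src, dst,
                               /*start=*/0, nrows, n_per_row, imatrix);
}
```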
Mixed precision
In a typical LLM deployment, different tensor classes use different types:

| Tensor | Type | Reason |
|---|---|---|
| Weight matrices (large 2-D) | Q4_K, Q5_K, Q6_K, IQ4_XS | Memory savings |
| Embedding tables | Q4_K or F16 | Rows accessed sparsely |
| Normalization weights | F32 | Small, sensitive to precision |
| KV cache | F16 or Q8_0 | Balances bandwidth and accuracy |
| Activations | F16 or F32 | Full precision avoids error accumulation |
ggml_mul_mat handles dequantization internally: the left-hand operand can be any quantized type while the right-hand operand is typically F16 or F32.
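A sketch of building such a node (the dimensions and arena size are illustrative):

```c
#include "ggml.h"

int main(void) {
    // Scratch arena large enough for the tensors in this sketch.
    struct ggml_init_params params = {
        .mem_size   = 32 * 1024 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context *ctx = ggml_init(params);

    // Quantized weight matrix on the left, F32 activations on the right;
    // ggml dequantizes the Q4_K weights inside the mul_mat kernel.
    struct ggml_tensor *w = ggml_new_tensor_2d(ctx, GGML_TYPE_Q4_K, 4096, 4096);
    struct ggml_tensor *x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32,  4096);

    struct ggml_tensor *y = ggml_mul_mat(ctx, w, x);  // F32 result
    (void)y;

    ggml_free(ctx);
    return 0;
}
```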
