
Model Quantization

Quantization reduces the precision of model weights from high-precision formats (32-bit or 16-bit floats) to lower-precision formats (2-bit to 8-bit integers). This dramatically reduces model size and speeds up inference with minimal quality loss.

Why Quantization?

Smaller Model Size

A 70B model goes from 280GB to just 43GB with Q4_K_M quantization - an 85% reduction

Faster Inference

Lower precision means faster computation and higher throughput

Lower Memory

Run larger models on consumer hardware with limited RAM/VRAM

Minimal Quality Loss

Modern quantization methods preserve 95-99% of original model quality

Quantization Types

llama.cpp supports multiple quantization methods, each with different size/quality tradeoffs.

K-Quants

K-quants use mixed precision within each tensor block for optimal quality:
  • Q2_K: 2.5-3.5 bits per weight, smallest size, noticeable quality loss
  • Q3_K_S, Q3_K_M, Q3_K_L: 3-4 bits per weight, good for smaller models
  • Q4_K_S, Q4_K_M: 4-5 bits per weight, recommended for most users
  • Q5_K_S, Q5_K_M: 5-6 bits per weight, very good quality
  • Q6_K: 6-7 bits per weight, excellent quality, minimal loss
  • Q8_K: 8 bits per weight, near-original quality
Q4_K_M provides the best balance of size, speed, and quality for most use cases.
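As a sanity check on these numbers, a quantized file's size can be estimated as parameter count × bits per weight ÷ 8. The sketch below uses Llama 3.1 8B (~8.03B parameters) and Q4_K_M's average of ~4.89 bits/weight; the calculation is illustrative, not a substitute for the actual file size.

```shell
# Back-of-envelope size estimate: params * bits_per_weight / 8 bytes,
# converted to GiB. Compare with the Q4_K_M row in the benchmark table.
awk 'BEGIN { printf "%.2f GiB\n", 8.03e9 * 4.89 / 8 / (1024^3) }'
```

This lands very close to the 4.58 GiB measured for the real Q4_K_M file.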

I-Quants (Importance Matrix)

I-quants use an importance matrix to preserve the most critical weights:
  • IQ1_S, IQ1_M: ~1.5-2 bits per weight, experimental ultra-low precision
  • IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M: 2-3 bits per weight, better quality than Q2_K
  • IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M: 3-4 bits per weight, better quality than Q3_K
  • IQ4_XS, IQ4_NL: 4-5 bits per weight, competitive with Q4_K
I-quants require an importance matrix file (imatrix) for optimal results. Generate one using llama-imatrix on representative data.

Legacy Quants

Older quantization methods, still supported for compatibility:
  • Q4_0, Q4_1: Simple 4-bit quantization
  • Q5_0, Q5_1: Simple 5-bit quantization
  • Q8_0: Simple 8-bit quantization

Full Precision

  • F32: Full 32-bit floating point (original precision)
  • F16: 16-bit floating point
  • BF16: 16-bit bfloat16 format

Performance Comparison

Real benchmark data from Llama 3.1 8B on Apple M2 Ultra:
Quantization   Bits/Weight   Size (GiB)   Prompt Speed (t/s)   Generation Speed (t/s)
IQ1_S          2.00          1.87         859 ± 1              80 ± 1
IQ2_XXS        2.38          2.23         852 ± 1              80 ± 0
IQ2_XS         2.59          2.42         827 ± 13             78 ± 0
Q2_K           3.16          2.95         784 ± 8              80 ± 0
IQ3_XS         3.50          3.27         709 ± 1              72 ± 1
Q3_K_M         4.00          3.74         783 ± 10             72 ± 0
Q4_K_M         4.89          4.58         822 ± 21             72 ± 2
Q5_K_M         5.70          5.33         759 ± 7              67 ± 1
Q6_K           6.56          6.14         812 ± 11             59 ± 3
Q8_0           8.50          7.95         865 ± 8              51 ± 0
F16            16.00         14.96        923 ± 1              29 ± 0
Key Takeaways:
  • Q4_K_M offers excellent size reduction (4.58 GiB vs 14.96 GiB) with strong performance
  • Smaller quants (IQ1_S, IQ2_XXS) achieve 80 tokens/s generation at under 2.5 GiB
  • Q8_0 provides near-original quality while still reducing size by ~50%
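The reduction percentages in the takeaways follow directly from the table; a quick check, using the measured GiB sizes:

```shell
# Percent size reduction vs F16 (14.96 GiB), from the benchmark table.
awk 'BEGIN {
  f16 = 14.96
  printf "Q4_K_M: %.1f%% smaller\n", (1 - 4.58 / f16) * 100
  printf "Q8_0:   %.1f%% smaller\n", (1 - 7.95 / f16) * 100
}'
```

Q4_K_M comes out roughly 69% smaller than F16, and Q8_0 roughly 47% smaller, consistent with the "~50%" figure above.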

Quality vs Size Tradeoffs

Choosing the right quantization depends on your priorities:
For maximum quality: Q6_K or Q8_0
  • Best for production applications requiring high accuracy
  • Good for models with complex reasoning capabilities
  • Still 50-60% smaller than F16
  • Example: 70B model = 100-110 GB instead of 280 GB
For general use: Q4_K_M or Q5_K_M
  • Most popular choice for general use
  • 70-85% size reduction with minimal perceptible quality loss
  • Great speed/quality balance
  • Example: 70B model = 43-53 GB
For tight memory budgets: IQ2_XXS, IQ2_XS, or Q3_K_M
  • For resource-constrained environments
  • Acceptable for chat applications and simpler tasks
  • May show quality degradation on complex reasoning
  • Example: 70B model = 22-30 GB
For extreme compression: IQ1_S or IQ1_M
  • Cutting-edge ultra-low precision
  • For experimentation or very constrained devices
  • Noticeable quality loss on most tasks
  • Example: 70B model = 14-17 GB
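The tiers above can be condensed into a rough rule of thumb. The helper below maps available memory (in GB) to a tier for a ~70B model, using the example sizes listed above; the function name and thresholds are this page's illustration, not part of any tool.

```shell
# Illustrative: pick a quantization tier for a ~70B model given
# available memory in GB. Thresholds come from the example sizes above.
pick_quant_70b() {
  if   [ "$1" -ge 110 ]; then echo "Q6_K or Q8_0"
  elif [ "$1" -ge 53 ];  then echo "Q4_K_M or Q5_K_M"
  elif [ "$1" -ge 30 ];  then echo "IQ2_XXS/IQ2_XS or Q3_K_M"
  else                        echo "IQ1_S or IQ1_M"
  fi
}

pick_quant_70b 64    # prints "Q4_K_M or Q5_K_M"
```

In practice, also leave headroom for the KV cache and the operating system rather than filling memory with weights alone.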

How to Quantize Models

Using llama-quantize

# Convert HuggingFace model to F16 GGUF
python3 convert_hf_to_gguf.py ./models/Llama-3.1-8B/

# Quantize to Q4_K_M (recommended)
llama-quantize ./models/Llama-3.1-8B/ggml-model-f16.gguf \
               ./models/Llama-3.1-8B/ggml-model-Q4_K_M.gguf \
               Q4_K_M

With Importance Matrix (I-Quants)

# Generate importance matrix from calibration data
llama-imatrix -m model-f16.gguf \
              -f calibration-data.txt \
              -o imatrix.gguf

# Quantize using importance matrix
llama-quantize --imatrix imatrix.gguf \
               model-f16.gguf \
               model-IQ3_XS.gguf \
               IQ3_XS
Importance matrices help preserve quality during aggressive quantization by identifying and protecting critical weights.

Advanced Options

# Keep output tensor unquantized for better quality
llama-quantize --leave-output-tensor model-f16.gguf model-q4.gguf Q4_K_M

# Pure quantization (no mixed precision)
llama-quantize --pure model-f16.gguf model-q4.gguf Q4_K_M

# Custom quantization for specific tensors
llama-quantize --imatrix imatrix.gguf \
               --output-tensor-type Q5_K \
               --token-embedding-type Q3_K \
               model-f16.gguf model-mixed.gguf Q4_K_M

# Quantize only specific tensor patterns
llama-quantize --tensor-type "attn_v=Q5_K" \
               --tensor-type "ffn_down=Q5_K" \
               model-f16.gguf model-custom.gguf Q4_K_M

Memory Requirements

Llama 3.1 Size Examples

Model   Original (F32)   Q4_K_M     Q2_K
8B      32.1 GB          4.9 GB     2.9 GB
70B     280.9 GB         43.1 GB    26 GB
405B    1,625.1 GB       249.1 GB   150 GB
During quantization, you need RAM equal to at least the original model size plus the quantized output size. For a 70B model, have at least 350GB available.
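Applying that rule of thumb to the 70B numbers in the table:

```shell
# RAM needed to quantize a 70B model from F32 to Q4_K_M:
# original size + quantized output size, per the rule of thumb above.
awk 'BEGIN { printf "%.0f GB\n", 280.9 + 43.1 }'
```

That works out to about 324 GB, which is why 350 GB is suggested: it leaves headroom for the OS and other processes.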

Understanding the Numbers

What “bits per weight” means

  • Original F32: Each weight = 32 bits (4 bytes)
  • Q4_K_M: Each weight ≈ 4.89 bits (0.6 bytes)
  • Compression ratio: 32 / 4.89 = 6.5x smaller
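The arithmetic above, worked out explicitly:

```shell
# Q4_K_M averages ~4.89 bits per weight; F32 uses 32 bits (4 bytes).
awk 'BEGIN {
  bpw = 4.89
  printf "bytes/weight: %.2f\n", bpw / 8      # vs 4.00 bytes for F32
  printf "ratio vs F32: %.1fx\n", 32 / bpw
}'
```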

Mixed Precision

Most quantization methods are “mostly” quantized:
  • 1D tensors (layer norms, biases) stay in higher precision
  • Multi-dimensional tensors (weights) get quantized
  • Some methods keep output/embedding layers in higher precision

Quality Metrics

Quantization quality is measured by two main metrics.

Perplexity (PPL): Lower is better
  • Measures how well the model predicts text
  • Q4_K_M typically adds <3% to perplexity
  • Q2_K may add 10-20% to perplexity
Kullback-Leibler Divergence (KLD): Lower is better
  • Measures difference from original model distribution
  • More sensitive metric than perplexity
Use llama-perplexity to measure:
llama-perplexity -m model.gguf -f test-data.txt
# Output: PPL = 5.4007 +/- 0.67339
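To relate two perplexity measurements to the percentage figures above, compute the relative increase over the F16 baseline. The PPL values below are illustrative placeholders, not measured results:

```shell
# Relative perplexity increase of a quantized model over its F16
# baseline. Both values here are hypothetical examples.
awk 'BEGIN {
  ppl_f16 = 5.40    # assumed F16 baseline PPL
  ppl_q4  = 5.52    # assumed Q4_K_M PPL
  printf "%.1f%% increase\n", (ppl_q4 - ppl_f16) / ppl_f16 * 100
}'
```

A Q4_K_M result in this range would fall within the "<3%" rule of thumb stated above.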

Using Hugging Face Spaces

No local setup required:

GGUF-my-repo

Convert any HuggingFace model to GGUF and apply quantization in your browser

GGUF-my-LoRA

Convert LoRA adapters to GGUF format

Best Practices

  1. Start with Q4_K_M - Best balance for most use cases
  2. Use importance matrices for Q3 and below for better quality
  3. Test perplexity on your specific use case
  4. Keep output tensors unquantized for critical applications
  5. Consider Q6_K or Q8_0 for production systems requiring high accuracy
  6. Match quantization to hardware - smaller quants may not be faster on all devices

Further Reading