
Model Quantization

Quantization reduces the precision of model weights from high-precision formats (32-bit or 16-bit floats) to lower-precision formats (2-bit to 8-bit integers). This dramatically reduces model size and speeds up inference with minimal quality loss.

Why Quantization?

Smaller Model Size

A 70B model goes from 280GB to just 43GB with Q4_K_M quantization - an 85% reduction

Faster Inference

Lower precision means faster computation and higher throughput

Lower Memory

Run larger models on consumer hardware with limited RAM/VRAM

Minimal Quality Loss

Modern quantization methods preserve 95-99% of original model quality

Quantization Types

llama.cpp supports multiple quantization methods, each with different size/quality tradeoffs.

K-Quants

K-quants use mixed precision within each tensor block for optimal quality:
  • Q2_K: 2.5-3.5 bits per weight, smallest size, noticeable quality loss
  • Q3_K_S, Q3_K_M, Q3_K_L: 3-4 bits per weight, good for smaller models
  • Q4_K_S, Q4_K_M: 4-5 bits per weight, recommended for most users
  • Q5_K_S, Q5_K_M: 5-6 bits per weight, very good quality
  • Q6_K: 6-7 bits per weight, excellent quality, minimal loss
  • Q8_K: 8 bits per weight, near-original quality
Q4_K_M provides the best balance of size, speed, and quality for most use cases.
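As a sanity check on these numbers, a quantized file's size can be estimated as parameter count × bits per weight ÷ 8. The sketch below uses Llama 3.1 8B (~8.03B parameters) and Q4_K_M's average of ~4.89 bits/weight; the calculation is illustrative, not a substitute for the actual file size.

```shell
# Back-of-envelope size estimate: params * bits_per_weight / 8 bytes,
# converted to GiB. Compare with the Q4_K_M row in the benchmark table.
awk 'BEGIN { printf "%.2f GiB\n", 8.03e9 * 4.89 / 8 / (1024^3) }'
```

This lands very close to the 4.58 GiB measured for the real Q4_K_M file.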

I-Quants (Importance Matrix)

I-quants use an importance matrix to preserve the most critical weights:
  • IQ1_S, IQ1_M: ~1.5-2 bits per weight, experimental ultra-low precision
  • IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M: 2-3 bits per weight, better quality than Q2_K
  • IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M: 3-4 bits per weight, better quality than Q3_K
  • IQ4_XS, IQ4_NL: 4-5 bits per weight, competitive with Q4_K
I-quants require an importance matrix file (imatrix) for optimal results. Generate one using llama-imatrix on representative data.

Legacy Quants

Older quantization methods, still supported for compatibility:
  • Q4_0, Q4_1: Simple 4-bit quantization
  • Q5_0, Q5_1: Simple 5-bit quantization
  • Q8_0: Simple 8-bit quantization

Full Precision

  • F32: Full 32-bit floating point (original precision)
  • F16: 16-bit floating point
  • BF16: 16-bit bfloat16 format

Performance Comparison

Real benchmark data from Llama 3.1 8B on Apple M2 Ultra:
Quantization   Bits/Weight   Size (GiB)   Prompt Speed (t/s)   Generation Speed (t/s)
IQ1_S          2.00          1.87         859 ± 1              80 ± 1
IQ2_XXS        2.38          2.23         852 ± 1              80 ± 0
IQ2_XS         2.59          2.42         827 ± 13             78 ± 0
Q2_K           3.16          2.95         784 ± 8              80 ± 0
IQ3_XS         3.50          3.27         709 ± 1              72 ± 1
Q3_K_M         4.00          3.74         783 ± 10             72 ± 0
Q4_K_M         4.89          4.58         822 ± 21             72 ± 2
Q5_K_M         5.70          5.33         759 ± 7              67 ± 1
Q6_K           6.56          6.14         812 ± 11             59 ± 3
Q8_0           8.50          7.95         865 ± 8              51 ± 0
F16            16.00         14.96        923 ± 1              29 ± 0
Key Takeaways:
  • Q4_K_M offers excellent size reduction (4.58 GiB vs 14.96 GiB) with strong performance
  • Smaller quants (IQ1_S, IQ2_XXS) achieve 80 tokens/s generation at under 2.5 GiB
  • Q8_0 provides near-original quality while still reducing size by ~50%
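The reduction percentages in the takeaways follow directly from the table; a quick check, using the measured GiB sizes:

```shell
# Percent size reduction vs F16 (14.96 GiB), from the benchmark table.
awk 'BEGIN {
  f16 = 14.96
  printf "Q4_K_M: %.1f%% smaller\n", (1 - 4.58 / f16) * 100
  printf "Q8_0:   %.1f%% smaller\n", (1 - 7.95 / f16) * 100
}'
```

Q4_K_M comes out roughly 69% smaller than F16, and Q8_0 roughly 47% smaller, consistent with the "~50%" figure above.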

Quality vs Size Tradeoffs

Choosing the right quantization depends on your priorities:
For maximum quality: Q6_K or Q8_0
  • Best for production applications requiring high accuracy
  • Good for models with complex reasoning capabilities
  • Still 50-60% smaller than F16
  • Example: 70B model = 100-110 GB instead of 280 GB
For general use: Q4_K_M or Q5_K_M
  • Most popular choice for general use
  • 70-85% size reduction with minimal perceptible quality loss
  • Great speed/quality balance
  • Example: 70B model = 43-53 GB
For tight memory budgets: IQ2_XXS, IQ2_XS, or Q3_K_M
  • For resource-constrained environments
  • Acceptable for chat applications and simpler tasks
  • May show quality degradation on complex reasoning
  • Example: 70B model = 22-30 GB
For extreme compression: IQ1_S or IQ1_M
  • Cutting-edge ultra-low precision
  • For experimentation or very constrained devices
  • Noticeable quality loss on most tasks
  • Example: 70B model = 14-17 GB
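The tiers above can be condensed into a rough rule of thumb. The helper below maps available memory (in GB) to a tier for a ~70B model, using the example sizes listed above; the function name and thresholds are this page's illustration, not part of any tool.

```shell
# Illustrative: pick a quantization tier for a ~70B model given
# available memory in GB. Thresholds come from the example sizes above.
pick_quant_70b() {
  if   [ "$1" -ge 110 ]; then echo "Q6_K or Q8_0"
  elif [ "$1" -ge 53 ];  then echo "Q4_K_M or Q5_K_M"
  elif [ "$1" -ge 30 ];  then echo "IQ2_XXS/IQ2_XS or Q3_K_M"
  else                        echo "IQ1_S or IQ1_M"
  fi
}

pick_quant_70b 64    # prints "Q4_K_M or Q5_K_M"
```

In practice, also leave headroom for the KV cache and the operating system rather than filling memory with weights alone.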

How to Quantize Models

Using llama-quantize

# Convert HuggingFace model to F16 GGUF
python3 convert_hf_to_gguf.py ./models/Llama-3.1-8B/

# Quantize to Q4_K_M (recommended)
llama-quantize ./models/Llama-3.1-8B/ggml-model-f16.gguf \
               ./models/Llama-3.1-8B/ggml-model-Q4_K_M.gguf \
               Q4_K_M

With Importance Matrix (I-Quants)

# Generate importance matrix from calibration data
llama-imatrix -m model-f16.gguf \
              -f calibration-data.txt \
              -o imatrix.gguf

# Quantize using importance matrix
llama-quantize --imatrix imatrix.gguf \
               model-f16.gguf \
               model-IQ3_XS.gguf \
               IQ3_XS
Importance matrices help preserve quality during aggressive quantization by identifying and protecting critical weights.

Advanced Options

# Keep output tensor unquantized for better quality
llama-quantize --leave-output-tensor model-f16.gguf model-q4.gguf Q4_K_M

# Pure quantization (no mixed precision)
llama-quantize --pure model-f16.gguf model-q4.gguf Q4_K_M

# Custom quantization for specific tensors
llama-quantize --imatrix imatrix.gguf \
               --output-tensor-type Q5_K \
               --token-embedding-type Q3_K \
               model-f16.gguf model-mixed.gguf Q4_K_M

# Quantize only specific tensor patterns
llama-quantize --tensor-type "attn_v=Q5_K" \
               --tensor-type "ffn_down=Q5_K" \
               model-f16.gguf model-custom.gguf Q4_K_M

Memory Requirements

Llama 3.1 Size Examples

Model   Original (F32)   Q4_K_M     Q2_K
8B      32.1 GB          4.9 GB     2.9 GB
70B     280.9 GB         43.1 GB    26 GB
405B    1,625.1 GB       249.1 GB   150 GB
During quantization, you need RAM equal to at least the original model size plus the quantized output size. For a 70B model, have at least 350GB available.
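Applying that rule of thumb to the 70B numbers in the table:

```shell
# RAM needed to quantize a 70B model from F32 to Q4_K_M:
# original size + quantized output size, per the rule of thumb above.
awk 'BEGIN { printf "%.0f GB\n", 280.9 + 43.1 }'
```

That works out to about 324 GB, which is why 350 GB is suggested: it leaves headroom for the OS and other processes.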

Understanding the Numbers

What “bits per weight” means

  • Original F32: Each weight = 32 bits (4 bytes)
  • Q4_K_M: Each weight ≈ 4.89 bits (0.6 bytes)
  • Compression ratio: 32 / 4.89 = 6.5x smaller
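The arithmetic above, worked out explicitly:

```shell
# Q4_K_M averages ~4.89 bits per weight; F32 uses 32 bits (4 bytes).
awk 'BEGIN {
  bpw = 4.89
  printf "bytes/weight: %.2f\n", bpw / 8      # vs 4.00 bytes for F32
  printf "ratio vs F32: %.1fx\n", 32 / bpw
}'
```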

Mixed Precision

Most quantization methods are “mostly” quantized:
  • 1D tensors (layer norms, biases) stay in higher precision
  • Multi-dimensional tensors (weights) get quantized
  • Some methods keep output/embedding layers in higher precision

Quality Metrics

Quantization quality is measured by two main metrics.

Perplexity (PPL): Lower is better
  • Measures how well the model predicts text
  • Q4_K_M typically adds <3% to perplexity
  • Q2_K may add 10-20% to perplexity
Kullback-Leibler Divergence (KLD): Lower is better
  • Measures difference from original model distribution
  • More sensitive metric than perplexity
Use llama-perplexity to measure:
llama-perplexity -m model.gguf -f test-data.txt
# Output: PPL = 5.4007 +/- 0.67339
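To relate two perplexity measurements to the percentage figures above, compute the relative increase over the F16 baseline. The PPL values below are illustrative placeholders, not measured results:

```shell
# Relative perplexity increase of a quantized model over its F16
# baseline. Both values here are hypothetical examples.
awk 'BEGIN {
  ppl_f16 = 5.40    # assumed F16 baseline PPL
  ppl_q4  = 5.52    # assumed Q4_K_M PPL
  printf "%.1f%% increase\n", (ppl_q4 - ppl_f16) / ppl_f16 * 100
}'
```

A Q4_K_M result in this range would fall within the "<3%" rule of thumb stated above.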

Using Hugging Face Spaces

No local setup required:

GGUF-my-repo

Convert any HuggingFace model to GGUF and apply quantization in your browser

GGUF-my-LoRA

Convert LoRA adapters to GGUF format

Best Practices

  1. Start with Q4_K_M - Best balance for most use cases
  2. Use importance matrices for Q3 and below for better quality
  3. Test perplexity on your specific use case
  4. Keep output tensors unquantized for critical applications
  5. Consider Q6_K or Q8_0 for production systems requiring high accuracy
  6. Match quantization to hardware - smaller quants may not be faster on all devices

Further Reading