# Model Quantization

Quantization reduces the precision of model weights from high-precision formats (32-bit or 16-bit floats) to lower-precision formats (2-bit to 8-bit integers). This dramatically reduces model size and speeds up inference with minimal quality loss.

## Why Quantization?
- **Smaller model size**: a 70B model shrinks from 280 GB (F32) to just 43 GB with Q4_K_M quantization, an ~85% reduction
- **Faster inference**: lower precision means faster computation and higher throughput
- **Lower memory use**: run larger models on consumer hardware with limited RAM/VRAM
- **Minimal quality loss**: modern quantization methods preserve 95-99% of the original model quality
## Quantization Types

llama.cpp supports multiple quantization methods, each with different size/quality tradeoffs.

### K-Quants (Recommended)

K-quants use mixed precision within each tensor block for optimal quality:

- Q2_K: 2.5-3.5 bits per weight, smallest size, noticeable quality loss
- Q3_K_S, Q3_K_M, Q3_K_L: 3-4 bits per weight, good for smaller models
- Q4_K_S, Q4_K_M: 4-5 bits per weight, recommended for most users
- Q5_K_S, Q5_K_M: 5-6 bits per weight, very good quality
- Q6_K: 6-7 bits per weight, excellent quality, minimal loss
- Q8_K: 8 bits per weight, near-original quality
Q4_K_M provides the best balance of size, speed, and quality for most use cases.
### I-Quants (Importance Matrix)

I-quants use an importance matrix to preserve the most critical weights:

- IQ1_S, IQ1_M: ~1.5-2 bits per weight, experimental ultra-low precision
- IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M: 2-3 bits per weight, better quality than Q2_K
- IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M: 3-4 bits per weight, better quality than Q3_K
- IQ4_XS, IQ4_NL: 4-5 bits per weight, competitive with Q4_K
### Legacy Quants

Older quantization methods, still supported for compatibility:

- Q4_0, Q4_1: Simple 4-bit quantization
- Q5_0, Q5_1: Simple 5-bit quantization
- Q8_0: Simple 8-bit quantization
### Full Precision
- F32: Full 32-bit floating point (original precision)
- F16: 16-bit floating point
- BF16: 16-bit bfloat16 format
## Performance Comparison

Real benchmark data from Llama 3.1 8B on an Apple M2 Ultra:

| Quantization | Bits/Weight | Size (GiB) | Prompt Speed (t/s) | Generation Speed (t/s) |
|---|---|---|---|---|
| IQ1_S | 2.00 | 1.87 | 859 ± 1 | 80 ± 1 |
| IQ2_XXS | 2.38 | 2.23 | 852 ± 1 | 80 ± 0 |
| IQ2_XS | 2.59 | 2.42 | 827 ± 13 | 78 ± 0 |
| Q2_K | 3.16 | 2.95 | 784 ± 8 | 80 ± 0 |
| IQ3_XS | 3.50 | 3.27 | 709 ± 1 | 72 ± 1 |
| Q3_K_M | 4.00 | 3.74 | 783 ± 10 | 72 ± 0 |
| Q4_K_M | 4.89 | 4.58 | 822 ± 21 | 72 ± 2 |
| Q5_K_M | 5.70 | 5.33 | 759 ± 7 | 67 ± 1 |
| Q6_K | 6.56 | 6.14 | 812 ± 11 | 59 ± 3 |
| Q8_0 | 8.50 | 7.95 | 865 ± 8 | 51 ± 0 |
| F16 | 16.00 | 14.96 | 923 ± 1 | 29 ± 0 |
**Key takeaways:**

- Q4_K_M offers excellent size reduction (4.58 GiB vs 14.96 GiB) with strong performance
- Smaller quants (IQ1_S, IQ2_XXS) achieve 80 tokens/s generation at under 2.5 GiB
- Q8_0 provides near-original quality while still reducing size by ~50%
## Quality vs Size Tradeoffs

Choosing the right quantization depends on your priorities.

### Maximum Quality (Minimal Loss)

**Recommended:** Q6_K or Q8_0
- Best for production applications requiring high accuracy
- Good for models with complex reasoning capabilities
- Still 50-60% smaller than F16
- Example: 70B model = 57-75 GB instead of 280 GB
### Balanced (Best Overall)

**Recommended:** Q4_K_M or Q5_K_M
- Most popular choice for general use
- 70-85% size reduction with minimal perceptible quality loss
- Great speed/quality balance
- Example: 70B model = 43-53 GB
### Maximum Compression

**Recommended:** IQ2_XXS, IQ2_XS, or Q3_K_M
- For resource-constrained environments
- Acceptable for chat applications and simpler tasks
- May show quality degradation on complex reasoning
- Example: 70B model = 22-30 GB
### Extreme Compression (Experimental)

**Recommended:** IQ1_S or IQ1_M
- Cutting-edge ultra-low precision
- For experimentation or very constrained devices
- Noticeable quality loss on most tasks
- Example: 70B model = 14-17 GB
## How to Quantize Models
### Using llama-quantize
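A typical invocation converts a full-precision GGUF file into a quantized one. The file names below are placeholders; `llama-quantize` is built as part of llama.cpp, and the input is usually an F16/F32 GGUF produced by the `convert_hf_to_gguf.py` script:

```sh
# Basic usage: llama-quantize <input.gguf> <output.gguf> <type>
./llama-quantize Meta-Llama-3.1-8B-F16.gguf Meta-Llama-3.1-8B-Q4_K_M.gguf Q4_K_M
```

Run `llama-quantize --help` to list every supported quantization type.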
### With Importance Matrix (I-Quants)
Importance matrices help preserve quality during aggressive quantization by identifying and protecting critical weights.
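A sketch of the two-step workflow (file names and the calibration text are placeholders; the imatrix quality depends on how representative the calibration data is):

```sh
# 1. Generate an importance matrix from calibration text
./llama-imatrix -m model-F16.gguf -f calibration.txt -o imatrix.dat

# 2. Quantize using the importance matrix
./llama-quantize --imatrix imatrix.dat model-F16.gguf model-IQ2_XS.gguf IQ2_XS
```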
### Advanced Options
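`llama-quantize` exposes flags for finer control over which tensors get quantized and how. A few commonly used ones are sketched below (check `llama-quantize --help` for the full, current list on your build):

```sh
# Keep the output tensor in full precision (better quality, slightly larger file)
./llama-quantize --leave-output-tensor model-F16.gguf model-Q4_K_M.gguf Q4_K_M

# Override the type used for the output and token-embedding tensors
./llama-quantize --output-tensor-type Q8_0 --token-embedding-type Q8_0 \
    model-F16.gguf model-Q4_K_M.gguf Q4_K_M

# Requantize an already-quantized model (quality suffers; prefer starting from F16)
./llama-quantize --allow-requantize model-Q8_0.gguf model-Q4_K_M.gguf Q4_K_M
```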
## Memory Requirements
### Llama 3.1 Size Examples
| Model | Original (F32) | Q4_K_M | Q2_K |
|---|---|---|---|
| 8B | 32.1 GB | 4.9 GB | 2.9 GB |
| 70B | 280.9 GB | 43.1 GB | 26 GB |
| 405B | 1,625.1 GB | 249.1 GB | 150 GB |
## Understanding the Numbers

### What “bits per weight” means
- Original F32: Each weight = 32 bits (4 bytes)
- Q4_K_M: Each weight ≈ 4.89 bits (0.6 bytes)
- Compression ratio: 32 / 4.89 = 6.5x smaller
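The arithmetic above can be checked with a quick sketch: estimated file size is simply parameters × bits-per-weight / 8 bytes. The ~8.03 billion parameter count for Llama 3.1 8B and the effective bits-per-weight values come from the benchmark table; real GGUF files differ slightly because some tensors stay in higher precision.

```shell
# Estimated file size = parameters * bits-per-weight / 8 bytes, reported in GiB.
params=8030000000   # ~8.03B parameters (Llama 3.1 8B)
for entry in "F16 16.0" "Q8_0 8.5" "Q4_K_M 4.89" "Q2_K 3.16"; do
  set -- $entry
  awk -v name="$1" -v bpw="$2" -v n="$params" \
      'BEGIN { printf "%7s: %5.2f GiB\n", name, n * bpw / 8 / 1024^3 }'
done
```

The results (14.96, 7.95, 4.57, and 2.95 GiB) land within rounding distance of the measured sizes in the benchmark table.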
### Mixed Precision

Most quantization methods are “mostly” quantized:

- 1D tensors (layer norms, biases) stay in higher precision
- Multi-dimensional tensors (weights) get quantized
- Some methods keep output/embedding layers in higher precision
## Quality Metrics

Quantization quality is typically measured by two metrics:

**Perplexity (PPL)**: lower is better
- Measures how well the model predicts text
- Q4_K_M typically adds <3% to perplexity
- Q2_K may add 10-20% to perplexity

**KL Divergence**: lower is better
- Measures the difference from the original model's output distribution
- More sensitive metric than perplexity

Use the llama-perplexity tool to measure these metrics.
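Typical `llama-perplexity` runs might look like this (the `wiki.test.raw` file name is just a common example corpus; any representative text file works):

```sh
# Perplexity over a test corpus
./llama-perplexity -m model-Q4_K_M.gguf -f wiki.test.raw

# KL divergence vs. the full-precision model: first save the base logits,
# then compare the quantized model against them
./llama-perplexity -m model-F16.gguf -f wiki.test.raw --kl-divergence-base logits.bin
./llama-perplexity -m model-Q4_K_M.gguf --kl-divergence-base logits.bin --kl-divergence
```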
## Using Hugging Face Spaces

No local setup required:

**GGUF-my-repo**: Convert any Hugging Face model to GGUF and apply quantization in your browser

**GGUF-my-LoRA**: Convert LoRA adapters to GGUF format
## Best Practices
- **Start with Q4_K_M**: best balance for most use cases
- **Use importance matrices** for Q3 and below for better quality
- **Test perplexity** on your specific use case
- **Keep output tensors unquantized** for critical applications
- **Consider Q6_K or Q8_0** for production systems requiring high accuracy
- **Match quantization to hardware**: smaller quants may not be faster on all devices

