What is Quantization?
Quantization converts high-precision model weights (32-bit or 16-bit floats) to lower-precision formats (2-8 bits). For example:
- Original F32: 26 GB for a 7B model
- F16: 14 GB (50% reduction)
- Q4_K_M: ~4.5 GB (83% reduction)
- Q2_K: ~3 GB (88% reduction)
Quick Start
The llama-quantize tool converts GGUF models from high precision to quantized formats:
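A minimal invocation, assuming a llama.cpp build in ./build/bin and an F16 GGUF already on disk (paths are illustrative):

```shell
# Usage: llama-quantize <input.gguf> <output.gguf> <type> [nthreads]
./build/bin/llama-quantize models/model-f16.gguf models/model-Q4_K_M.gguf Q4_K_M
```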
Quantization Types
llama.cpp supports many quantization methods. Here are the most important ones:
Recommended Quantization Levels
Q4_K_M - Best Balance (Recommended)
Size: ~4.5 GB for 7B model
Quality: +0.18 ppl @ Llama-3-8B
Speed: Fast inference
Best for: Most users, production deployments, good quality-to-size ratio
This is the default recommendation for most use cases.
Q5_K_M - Higher Quality
Size: ~5.3 GB for 7B model
Quality: +0.06 ppl @ Llama-3-8B
Speed: Slightly slower than Q4
Best for: When quality is more important than size, users with more RAM
Noticeably better quality than Q4 with only a ~20% size increase.
Q8_0 - Near Original
Size: ~8 GB for 7B model
Quality: +0.003 ppl @ Llama-3-8B
Speed: Moderate
Best for: When maximum quality is needed and enough RAM is available
Minimal quality loss compared to F16; good for validation.
Q2_K / Q3_K - Smallest Size
Q2_K Size: ~3 GB for 7B model (+3.5 ppl)
Q3_K_M Size: ~3.7 GB for 7B model (+0.7 ppl)
Speed: Very fast
Best for: Very limited RAM, mobile devices, when size is critical
Noticeable quality degradation, but still functional.
Complete Quantization List
For reference, here’s the complete list of supported quantization types:
| Type | Bits/Weight | Size (7B) | Perplexity | Description |
|---|---|---|---|---|
| IQ1_S | 1.56 | ~1.5 GB | - | Experimental 1-bit |
| IQ1_M | 1.75 | ~1.7 GB | - | 1-bit variant |
| IQ2_XXS | 2.06 | ~2.0 GB | - | Ultra-compressed |
| IQ2_XS | 2.31 | ~2.2 GB | - | 2-bit extra-small |
| IQ2_S | 2.50 | ~2.4 GB | - | 2-bit small |
| IQ2_M | 2.70 | ~2.6 GB | - | 2-bit medium |
| Q2_K | 2.96 | ~2.8 GB | +3.52 | 2-bit k-quant |
| Q2_K_S | 2.96 | ~2.8 GB | +3.18 | 2-bit k-quant small |
| IQ3_XXS | 3.06 | ~2.9 GB | - | 3-bit ultra-small |
| IQ3_XS | 3.30 | ~3.1 GB | - | 3-bit extra-small |
| IQ3_S | 3.44 | ~3.2 GB | - | 3-bit small |
| Q3_K_S | 3.41 | ~3.2 GB | +1.63 | 3-bit k-quant small |
| IQ3_M | 3.66 | ~3.5 GB | - | 3-bit medium mix |
| Q3_K_M | 3.74 | ~3.5 GB | +0.66 | 3-bit balanced |
| Q3_K_L | 4.03 | ~3.8 GB | +0.56 | 3-bit large |
| IQ4_XS | 4.25 | ~4.0 GB | - | 4-bit extra-small |
| Q4_0 | 4.34 | ~4.1 GB | +0.47 | Legacy 4-bit |
| IQ4_NL | 4.50 | ~4.3 GB | - | 4-bit non-linear |
| Q4_1 | 4.78 | ~4.5 GB | +0.45 | Legacy 4-bit variant |
| Q4_K_S | 4.37 | ~4.1 GB | +0.27 | 4-bit k-quant small |
| Q4_K_M | 4.58 | ~4.3 GB | +0.18 | 4-bit balanced ⭐ |
| Q5_0 | 5.21 | ~4.9 GB | +0.13 | Legacy 5-bit |
| Q5_1 | 5.65 | ~5.3 GB | +0.11 | Legacy 5-bit variant |
| Q5_K_S | 5.21 | ~4.9 GB | +0.10 | 5-bit k-quant small |
| Q5_K_M | 5.33 | ~5.0 GB | +0.06 | 5-bit balanced |
| Q6_K | 6.14 | ~5.8 GB | +0.02 | 6-bit k-quant |
| Q8_0 | 8.50 | ~8.0 GB | +0.003 | 8-bit quantization |
| F16 | 16.00 | ~14 GB | +0.002 | Half precision |
| BF16 | 16.00 | ~14 GB | -0.005 | BFloat16 |
| F32 | 32.00 | ~26 GB | baseline | Full precision |
Perplexity values are from Llama-3-8B benchmarks; a lower perplexity increase means better quality. The “K” variants (Q4_K_M, Q5_K_M) use k-quant super-blocks with per-block scales, which gives better quality at a given bit budget.
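The Size column follows directly from the Bits/Weight column. A quick back-of-the-envelope check (real files run somewhat larger because some tensors stay at higher precision and GGUF stores metadata):

```shell
# Estimated file size ≈ parameter count × bits per weight / 8 bytes.
# For a 7B model at Q4_K_M's 4.58 bits/weight:
awk 'BEGIN { printf "%.1f GB\n", 7e9 * 4.58 / 8 / 1e9 }'
# → 4.0 GB
```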
Advanced Quantization
Using Importance Matrix (imatrix)
Importance matrix quantization uses statistical data from real prompts to minimize quality loss.
Generate Importance Matrix
First, create an imatrix file by running text through the model:
The calibration data should be representative of your actual use case.
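A sketch of the generation step, assuming a llama.cpp build in ./build/bin (the output flag default and filename vary between builds; older builds write imatrix.dat):

```shell
# Collect activation statistics by running calibration text through the model.
# Model path and calibration file are illustrative.
./build/bin/llama-imatrix -m models/model-f16.gguf -f calibration.txt -o imatrix.gguf
```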
Using an importance matrix is highly recommended for quantization levels below Q5_K_M, as it significantly improves quality.
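Once generated, the matrix is passed to llama-quantize via --imatrix (paths and the target type are illustrative):

```shell
# Quantize to a low-bit type using the importance matrix.
./build/bin/llama-quantize --imatrix imatrix.gguf \
    models/model-f16.gguf models/model-IQ2_M.gguf IQ2_M
```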
Advanced Options
Selective Tensor Quantization
Quantize different parts of the model to different levels:
Useful for preserving quality in critical layers while saving size elsewhere.
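Newer llama-quantize builds expose a per-tensor override; the flag and pattern below are assumptions about your build, so confirm against llama-quantize --help:

```shell
# Hypothetical override: keep attention value tensors at 6-bit
# while the rest of the model uses Q4_K_M.
./build/bin/llama-quantize --tensor-type attn_v=q6_k \
    models/model-f16.gguf models/model-Q4_K_M.gguf Q4_K_M
```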
Output Tensor Control
Control quantization of the output projection:
The output tensor significantly affects generation quality.
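For example (paths are illustrative):

```shell
# Keep the output projection at 8 bits while the rest uses Q4_K_M.
./build/bin/llama-quantize --output-tensor-type q8_0 \
    models/model-f16.gguf models/model-Q4_K_M.gguf Q4_K_M
```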
Token Embedding Control
Special quantization for token embeddings:
Embeddings can often be more aggressively quantized.
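For example (paths are illustrative):

```shell
# Quantize token embeddings down to 2 bits while the rest uses Q4_K_M.
./build/bin/llama-quantize --token-embedding-type q2_k \
    models/model-f16.gguf models/model-Q4_K_M.gguf Q4_K_M
```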
Pure Quantization
Quantize all tensors to the exact same type:
By default, some tensors use different quantization for quality.
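A sketch with the --pure flag (paths are illustrative):

```shell
# Force every tensor to the base type of Q4_K_M, with no per-tensor mixing.
./build/bin/llama-quantize --pure \
    models/model-f16.gguf models/model-pure.gguf Q4_K_M
```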
--pure disables this.
Requantization
You can requantize an already-quantized model, though quality loss accumulates.
Warning: Requantization severely degrades quality. Always quantize from F16 or F32 when possible.
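Requantization requires an explicit opt-in flag (paths are illustrative):

```shell
# Requantize an existing Q8_0 file down to Q4_K_M.
./build/bin/llama-quantize --allow-requantize \
    models/model-Q8_0.gguf models/model-Q4_K_M.gguf Q4_K_M
```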
Complete Workflow Example
Here’s a complete example from raw model to optimized GGUF:
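A sketch of the full pipeline, assuming a llama.cpp checkout with a build in ./build/bin (model paths are illustrative):

```shell
# 1. Convert a Hugging Face checkpoint to an F16 GGUF
#    (convert_hf_to_gguf.py ships in the llama.cpp repository).
python convert_hf_to_gguf.py path/to/hf-model --outtype f16 --outfile model-f16.gguf
# 2. Quantize to the recommended Q4_K_M level.
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
# 3. Smoke-test the quantized model with a short generation.
./build/bin/llama-cli -m model-Q4_K_M.gguf -p "Hello" -n 32
```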
Memory and Disk Requirements
Quantization requires enough memory and disk space for both input and output files:
| Model Size | F16 Input | Q4_K_M Output | RAM Needed | Time (approx) |
|---|---|---|---|---|
| 1B | 2 GB | 0.7 GB | 4 GB | <1 min |
| 7B | 14 GB | 4.5 GB | 16 GB | 2-5 min |
| 13B | 26 GB | 8 GB | 32 GB | 5-10 min |
| 34B | 68 GB | 21 GB | 80 GB | 15-30 min |
| 70B | 140 GB | 43 GB | 160 GB | 30-60 min |
| 405B | 810 GB | 249 GB | 1 TB | 2-4 hours |
You need enough disk space for both the input and output files simultaneously. RAM usage is typically close to the output file size.
Online Quantization
If you don’t have sufficient hardware, use the GGUF-my-repo Hugging Face space:
- Visit https://huggingface.co/spaces/ggml-org/gguf-my-repo
- Enter your model repository
- Select quantization levels (multiple at once)
- The space converts and quantizes automatically
- Results are published to your Hugging Face account
Choosing the Right Quantization
Decision Tree
Determine Your Constraints
RAM/VRAM available?
- <8 GB: Use Q2_K or Q3_K_M
- 8-16 GB: Use Q4_K_M
- 16-32 GB: Use Q5_K_M or Q6_K
- 32+ GB: Use Q8_0 or F16
Assess Quality Needs
How important is quality?
- Maximum quality: Q8_0 or F16
- High quality: Q5_K_M or Q6_K
- Balanced: Q4_K_M ⭐
- Size-constrained: Q3_K_M
- Extreme compression: Q2_K
Recommendations by Model Size
Small Models (1B-3B)
Recommended: Q4_K_M or Q5_K_M
Small models are already compact, so use a higher-precision quantization level to preserve quality. The size savings from aggressive quantization aren’t as meaningful.
Medium Models (7B-13B)
Recommended: Q4_K_M
This is the sweet spot for Q4_K_M quantization: you get ~75% size reduction with minimal quality loss.
Large Models (30B-70B)
Recommended: Q3_K_M or Q4_K_M
Size becomes critical for large models. Q3_K_M provides good compression while maintaining usable quality. Use Q4_K_M if you have the RAM.
Huge Models (100B+)
Recommended: Q2_K or Q3_K_M
For models this large, aggressive quantization is often necessary just to fit in memory. Consider using an importance matrix to improve Q2_K quality.
Evaluating Quality
Measure quantization quality using perplexity:
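A sketch using the llama-perplexity tool, assuming a llama.cpp build in ./build/bin (the test file path is illustrative; wikitext-2 is a common choice):

```shell
# Run the same test text through the F16 original and the quantized model,
# then compare the reported perplexities.
./build/bin/llama-perplexity -m models/model-f16.gguf -f wiki.test.raw
./build/bin/llama-perplexity -m models/model-Q4_K_M.gguf -f wiki.test.raw
```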
Troubleshooting
Out of memory during quantization
Solution: Use a smaller quantization level or quantize on a machine with more RAM. Alternatively, use the online GGUF-my-repo tool.
Quantized model gives nonsensical output
Possible causes:
- Quantization level too aggressive (Q2_K or lower)
- Corrupted quantization process
- Wrong model format
Cannot requantize an already quantized model
Error message:
error: quantizing already quantized model
Solution: Add the --allow-requantize flag, but note that this degrades quality. It is better to quantize from F16.
Quantization is very slow
Solution: Use more CPU threads; llama-quantize takes the thread count as its final positional argument.
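For example (paths are illustrative):

```shell
# Quantize using 8 threads (the trailing positional argument).
./build/bin/llama-quantize models/model-f16.gguf models/model-Q4_K_M.gguf Q4_K_M 8
```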
Next Steps
After quantization:
- Test the model to ensure quality is acceptable
- Benchmark performance with llama-bench
- Deploy using llama-server or integrate into your application
- Share your quantized model on Hugging Face for others
- Supported Models - Check compatibility
- Converting Models - Get models into GGUF format
- Obtaining Models - Find pre-quantized models

