Quantization reduces the precision of model weights, shrinking model size and speeding up inference with minimal quality loss. This is essential for running large language models on consumer hardware.

What is Quantization?

Quantization converts high-precision model weights (32-bit or 16-bit floats) to lower precision formats (2-8 bits). For example:
  • Original F32: 26 GB for a 7B model
  • F16: 14 GB (50% reduction)
  • Q4_K_M: ~4.5 GB (83% reduction)
  • Q2_K: ~3 GB (88% reduction)
The tradeoff is small accuracy loss, measured in perplexity (ppl). With proper quantization, this loss is often negligible.
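The size figures above follow directly from bits per weight. As a rough sanity check (a sketch, assuming size ≈ parameter count × bits per weight / 8 and ignoring GGUF metadata and mixed-precision tensors, which is why real files come out slightly larger):

```python
# Rough model-size estimate from bits per weight.
# Assumes size ≈ params * bits_per_weight / 8 bytes; real GGUF files carry
# metadata and some higher-precision tensors, so figures are approximate.

def est_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB at the given bits per weight."""
    return params * bits_per_weight / 8 / 1e9

params_7b = 7e9
for name, bpw in [("F32", 32.0), ("F16", 16.0), ("Q4_K_M", 4.58), ("Q2_K", 2.96)]:
    size = est_size_gb(params_7b, bpw)
    reduction = 1 - bpw / 32
    print(f"{name:7s} ~{size:5.1f} GB  ({reduction:.0%} smaller than F32)")
```

The bits-per-weight values here come from the complete quantization list later in this page.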

Quick Start

The llama-quantize tool converts GGUF models from high precision to quantized formats:
# Quantize to Q4_K_M (recommended for most users)
./llama-quantize model-f16.gguf model-q4.gguf Q4_K_M

# Quantize to Q5_K_M (higher quality)
./llama-quantize model-f16.gguf model-q5.gguf Q5_K_M

# Quantize to Q8_0 (near-original quality)
./llama-quantize model-f16.gguf model-q8.gguf Q8_0

Quantization Types

llama.cpp supports many quantization methods. Here are the most important ones:
Q5_K_M

  • Size: ~5.3 GB for a 7B model
  • Quality: +0.06 ppl @ Llama-3-8B
  • Speed: Slightly slower than Q4
  • Best for: when quality is more important than size, users with more RAM

./llama-quantize model-f16.gguf model-q5km.gguf Q5_K_M

Noticeably better quality than Q4 with only a ~20% size increase.

Q8_0

  • Size: ~8 GB for a 7B model
  • Quality: +0.003 ppl @ Llama-3-8B
  • Speed: Moderate
  • Best for: when maximum quality is needed and enough RAM is available

./llama-quantize model-f16.gguf model-q8.gguf Q8_0

Minimal quality loss compared to F16; good for validation.

Q2_K / Q3_K_M

  • Size: ~3 GB (Q2_K, +3.5 ppl) or ~3.7 GB (Q3_K_M, +0.7 ppl) for a 7B model
  • Speed: Very fast
  • Best for: very limited RAM, mobile devices, when size is critical

./llama-quantize model-f16.gguf model-q2k.gguf Q2_K
./llama-quantize model-f16.gguf model-q3k.gguf Q3_K_M

Noticeable quality degradation, but still functional.

Complete Quantization List

For reference, here’s the complete list of supported quantization types:
Type     Bits/Weight  Size (7B)  Perplexity  Description
IQ1_S    1.56         ~1.5 GB    -           Experimental 1-bit
IQ1_M    1.75         ~1.7 GB    -           1-bit variant
IQ2_XXS  2.06         ~2.0 GB    -           Ultra-compressed
IQ2_XS   2.31         ~2.2 GB    -           2-bit extra-small
IQ2_S    2.50         ~2.4 GB    -           2-bit small
IQ2_M    2.70         ~2.6 GB    -           2-bit medium
Q2_K     2.96         ~2.8 GB    +3.52       2-bit k-quant
Q2_K_S   2.96         ~2.8 GB    +3.18       2-bit k-quant small
IQ3_XXS  3.06         ~2.9 GB    -           3-bit ultra-small
IQ3_XS   3.30         ~3.1 GB    -           3-bit extra-small
IQ3_S    3.44         ~3.2 GB    -           3-bit small
Q3_K_S   3.41         ~3.2 GB    +1.63       3-bit k-quant small
IQ3_M    3.66         ~3.5 GB    -           3-bit medium mix
Q3_K_M   3.74         ~3.5 GB    +0.66       3-bit balanced
Q3_K_L   4.03         ~3.8 GB    +0.56       3-bit large
IQ4_XS   4.25         ~4.0 GB    -           4-bit extra-small
Q4_0     4.34         ~4.1 GB    +0.47       Legacy 4-bit
IQ4_NL   4.50         ~4.3 GB    -           4-bit non-linear
Q4_1     4.78         ~4.5 GB    +0.45       Legacy 4-bit variant
Q4_K_S   4.37         ~4.1 GB    +0.27       4-bit k-quant small
Q4_K_M   4.58         ~4.3 GB    +0.18       4-bit balanced
Q5_0     5.21         ~4.9 GB    +0.13       Legacy 5-bit
Q5_1     5.65         ~5.3 GB    +0.11       Legacy 5-bit variant
Q5_K_S   5.21         ~4.9 GB    +0.10       5-bit k-quant small
Q5_K_M   5.33         ~5.0 GB    +0.06       5-bit balanced
Q6_K     6.14         ~5.8 GB    +0.02       6-bit k-quant
Q8_0     8.50         ~8.0 GB    +0.003      8-bit quantization
F16      16.00        ~14 GB     +0.002      Half precision
BF16     16.00        ~14 GB     -0.005      BFloat16
F32      32.00        ~26 GB     baseline    Full precision
Perplexity values are from Llama-3-8B benchmarks; a lower perplexity increase means better quality. The "K" variants (Q4_K_M, Q5_K_M) are k-quants, which mix precisions within super-blocks for better quality at a given size; the "IQ" variants benefit strongly from an importance matrix (see below).

Advanced Quantization

Using Importance Matrix (imatrix)

Importance matrix quantization uses statistical data from real prompts to minimize quality loss:
Step 1: Generate Importance Matrix

First, create an imatrix file by running text through the model:
# Generate imatrix from a text file
./llama-imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat
The calibration data should be representative of your actual use case.
Step 2: Quantize with imatrix

Use the imatrix during quantization for better results:
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-q4.gguf Q4_K_M
This typically reduces the quantization-induced perplexity increase by 10-30% compared to quantizing without an imatrix.
Using an importance matrix is highly recommended for quantization levels below Q5_K_M, as it significantly improves quality.

Advanced Options

Quantize different parts of the model to different levels:
# Keep attention layers at higher precision
./llama-quantize \
  --tensor-type "attn_v=q5_k" \
  --tensor-type "attn_q=q5_k" \
  model-f16.gguf model-mixed.gguf Q4_K_M
Useful for preserving quality in critical layers while saving size elsewhere.
Control quantization of the output projection:
# Leave output tensor unquantized for better quality
./llama-quantize --leave-output-tensor model-f16.gguf model-q4.gguf Q4_K_M

# Or use specific quantization for output
./llama-quantize --output-tensor-type q6_k model-f16.gguf model-q4.gguf Q4_K_M
The output tensor significantly affects generation quality.
Special quantization for token embeddings:
# Use Q3_K for embeddings to save size
./llama-quantize --token-embedding-type q3_k model-f16.gguf model-q4.gguf Q4_K_M
Embeddings can often be more aggressively quantized.
Quantize all tensors to the exact same type:
# No mixed precision - everything becomes Q4_K
./llama-quantize --pure model-f16.gguf model-q4.gguf Q4_K
By default, some tensors use different quantization for quality. --pure disables this.

Requantization

You can requantize an already-quantized model, though quality loss accumulates:
# Requantize Q4 to Q5 (not recommended - better to start from F16)
./llama-quantize --allow-requantize model-q4.gguf model-q5.gguf Q5_K_M
Warning: Requantization severely degrades quality. Always quantize from F16 or F32 when possible.

Complete Workflow Example

Here’s a complete example from raw model to optimized GGUF:
# 1. Download model from Hugging Face
huggingface-cli download meta-llama/Llama-3.1-8B \
  --local-dir ./models/llama-3.1-8b

# 2. Install dependencies
cd llama.cpp
python3 -m pip install -r requirements.txt

# 3. Convert to GGUF F16
python3 convert_hf_to_gguf.py ../models/llama-3.1-8b/

# 4. Quantize to Q4_K_M
./llama-quantize \
  ../models/llama-3.1-8b/ggml-model-f16.gguf \
  ../models/llama-3.1-8b/ggml-model-Q4_K_M.gguf \
  Q4_K_M

# 5. Run the quantized model
./llama-cli -m ../models/llama-3.1-8b/ggml-model-Q4_K_M.gguf \
  -p "You are a helpful assistant" -cnv

Memory and Disk Requirements

Quantization requires enough memory and disk space for both input and output files:
Model Size  F16 Input  Q4_K_M Output  RAM Needed  Time (approx)
1B          2 GB       0.7 GB         4 GB        <1 min
7B          14 GB      4.5 GB         16 GB       2-5 min
13B         26 GB      8 GB           32 GB       5-10 min
34B         68 GB      21 GB          80 GB       15-30 min
70B         140 GB     43 GB          160 GB      30-60 min
405B        810 GB     249 GB         1 TB        2-4 hours
You need enough disk space for both the input and output files simultaneously. RAM usage is typically close to the output file size.
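The table above can be approximated programmatically. A small planning helper (a sketch, assuming F16 at 2 bytes per weight and Q4_K_M at ~4.58 bits per weight, consistent with the quantization list on this page; real files include some overhead):

```python
# Estimate disk space needed to quantize a model: the F16 input and the
# quantized output must both fit on disk at the same time.

def quantize_disk_needs_gb(params: float, out_bits_per_weight: float = 4.58):
    """Return (f16_input_gb, quant_output_gb, total_disk_gb) estimates."""
    f16_in = params * 2 / 1e9                        # F16 is 2 bytes per weight
    quant_out = params * out_bits_per_weight / 8 / 1e9
    return f16_in, quant_out, f16_in + quant_out

f16, out, total = quantize_disk_needs_gb(7e9)
print(f"F16 input: ~{f16:.1f} GB, Q4_K_M output: ~{out:.1f} GB, "
      f"disk needed: ~{total:.1f} GB")
```

For a 7B model this lands near the table's 14 GB input and ~4.5 GB output.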

Online Quantization

If you don’t have sufficient hardware, use the GGUF-my-repo Hugging Face space:
  1. Visit https://huggingface.co/spaces/ggml-org/gguf-my-repo
  2. Enter your model repository
  3. Select quantization levels (multiple at once)
  4. The space converts and quantizes automatically
  5. Results are published to your Hugging Face account
This is free and uses Hugging Face’s infrastructure.

Choosing the Right Quantization

Decision Tree

Step 1: Determine Your Constraints

RAM/VRAM available?
  • <8 GB: Use Q2_K or Q3_K_M
  • 8-16 GB: Use Q4_K_M
  • 16-32 GB: Use Q5_K_M or Q6_K
  • 32+ GB: Use Q8_0 or F16
Step 2: Assess Quality Needs

How important is quality?
  • Maximum quality: Q8_0 or F16
  • High quality: Q5_K_M or Q6_K
  • Balanced: Q4_K_M ⭐
  • Size-constrained: Q3_K_M
  • Extreme compression: Q2_K
Step 3: Consider Use Case

What’s your use case?
  • Production/chat: Q4_K_M or Q5_K_M
  • Development/testing: Q4_K_M
  • Mobile/edge: Q2_K or Q3_K_M
  • Research/benchmarking: Q8_0 or F16
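The decision tree above can be sketched as a small helper function. This is only a sketch: the thresholds follow the RAM buckets in step 1, the `max_quality` flag stands in for the quality tiers in step 2, and the split inside the <8 GB bucket is an assumption.

```python
# Pick a quantization level from available RAM, following the RAM buckets
# in the decision tree above.

def choose_quant(ram_gb: float, max_quality: bool = False) -> str:
    if ram_gb < 8:
        # Assumption: split the "<8 GB" bucket between Q2_K and Q3_K_M.
        return "Q3_K_M" if ram_gb >= 6 else "Q2_K"
    if ram_gb < 16:
        return "Q4_K_M"
    if ram_gb < 32:
        return "Q6_K" if max_quality else "Q5_K_M"
    return "F16" if max_quality else "Q8_0"

print(choose_quant(12))                    # mid-range machine -> Q4_K_M
print(choose_quant(24, max_quality=True))  # roomy machine, quality-first -> Q6_K
```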

Recommendations by Model Size

Small models (1-3B)
Recommended: Q4_K_M or Q5_K_M. Small models are already efficient, so use a higher-quality quantization to preserve quality; the size savings from aggressive quantization aren't as meaningful.
./llama-quantize model-1b-f16.gguf model-1b-q5.gguf Q5_K_M

Medium models (7-13B)
Recommended: Q4_K_M. This is the sweet spot: ~75% size reduction with minimal quality loss.
./llama-quantize model-7b-f16.gguf model-7b-q4.gguf Q4_K_M

Large models (34-70B)
Recommended: Q3_K_M or Q4_K_M. Size becomes critical for large models; Q3_K_M provides good compression while maintaining usable quality. Use Q4_K_M if you have the RAM.
./llama-quantize model-70b-f16.gguf model-70b-q3.gguf Q3_K_M

Very large models (405B)
Recommended: Q2_K or Q3_K_M. For models this large, aggressive quantization is often necessary just to fit in memory. Consider using an importance matrix to improve Q2_K quality.
./llama-quantize model-405b-f16.gguf model-405b-q2.gguf Q2_K

Evaluating Quality

Measure quantization quality using perplexity:
# Test on a validation dataset
./llama-perplexity -m model-q4.gguf -f validation.txt

# Compare to original
./llama-perplexity -m model-f16.gguf -f validation.txt
Lower perplexity = better quality. A small increase (0.1-0.5) is usually acceptable.
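Perplexity is just the exponential of the average negative log-likelihood per token, which is what llama-perplexity reports. A minimal illustration (the per-token log-probabilities here are made up for the example):

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).

def perplexity(log_probs):
    """Compute perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token log-probs from scoring the same validation text.
baseline = [-1.70, -1.85, -1.62, -1.90]   # e.g. the F16 model
quantized = [-1.74, -1.91, -1.66, -1.95]  # e.g. the Q4_K_M model

ppl_base, ppl_quant = perplexity(baseline), perplexity(quantized)
print(f"F16 ppl: {ppl_base:.2f}, quantized ppl: {ppl_quant:.2f}, "
      f"increase: +{ppl_quant - ppl_base:.2f}")
```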

Troubleshooting

Out of memory during quantization
Solution: Use a smaller quantization level or quantize on a machine with more RAM. Alternatively, use the online GGUF-my-repo tool.

Poor output quality after quantization
Possible causes:
  • Quantization level too aggressive (Q2_K or lower)
  • Corrupted quantization process
  • Wrong model format
Solution: Try Q4_K_M or higher, or re-quantize from the original F16.

"Already quantized" error
Error message: error: quantizing already quantized model
Solution: Add the --allow-requantize flag, but note this degrades quality. Better to quantize from F16.

Quantization is slow
Solution: Use more CPU threads:
./llama-quantize model.gguf output.gguf Q4_K_M 16
The last argument specifies the thread count.

Next Steps

After quantization:
  1. Test the model to ensure quality is acceptable
  2. Benchmark performance with llama-bench
  3. Deploy using llama-server or integrate into your application
  4. Share your quantized model on Hugging Face for others
See also: