Overview
llama-perplexity is a tool for measuring the perplexity and other quality metrics of language models over text corpora. It’s primarily used to evaluate quantization quality loss and compare model performance.
What is Perplexity?
Perplexity measures how well a model predicts the next token:
- Lower values = better prediction
- Indicates model “surprise” at seeing the actual next token
- Used to compare quantized models against FP16 baseline
- Not directly comparable between different models or tokenizers
Perplexity is a technical metric for judging quantization quality, not end-user model quality. Finetunes may have higher perplexity but better human-rated outputs.
Quick Start
Basic Usage
Measure Perplexity
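A minimal run might look like this (the model and corpus paths are placeholders; `wiki.test.raw` is the Wikitext-2 test set):

```shell
# Evaluate perplexity of a quantized model over a text corpus.
./llama-perplexity -m models/llama-3-8b-q4_k_m.gguf -f wiki.test.raw
```

Output includes: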
- Progressive perplexity per chunk
- Final mean perplexity ± uncertainty
- Uncertainty calculated via Gaussian distribution assumption
Command-Line Options
- `-m, --model`: Path to the GGUF model file.
- `-f, --file`: Text file containing the test corpus.
- `-c, --ctx-size`: Context size for evaluation.
- `-b, --batch-size`: Batch size for processing.
- `-ngl, --n-gpu-layers`: Number of layers to offload to GPU.
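Putting the options together, a sketch with illustrative paths and values:

```shell
# Evaluate with a 4096-token context, batches of 512 tokens,
# and all layers offloaded to the GPU.
./llama-perplexity \
  -m models/llama-3-8b-q4_k_m.gguf \
  -f wiki.test.raw \
  -c 4096 \
  -b 512 \
  -ngl 99
```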
Advanced Analysis: KL Divergence
Why KL Divergence?
Kullback-Leibler (KL) divergence measures how much one probability distribution differs from another:
- KL = 0: Distributions are identical
- Higher values: More difference between models
- Used to compare quantized model to FP16 reference
Two-Step Process
Record FP16 baseline
First, record logits from the FP16 model:
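A baseline recording run might look like this (the `.kld` path is a placeholder):

```shell
# Run the FP16 model over the corpus and save its logits to a file
# that later quantized runs will be compared against.
./llama-perplexity \
  -m models/llama-3-8b-f16.gguf \
  -f wiki.test.raw \
  --kl-divergence-base llama-3-8b-f16.kld
```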
The `.kld` file will be very large (for the Wikitext-2 test set):
- LLaMA 2: ~11 GiB
- LLaMA 3: ~37 GiB
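Second, run the quantized model against the recorded baseline (same placeholder paths as above):

```shell
# Compare the quantized model's logits against the saved FP16 baseline.
./llama-perplexity \
  -m models/llama-3-8b-q4_k_m.gguf \
  -f wiki.test.raw \
  --kl-divergence-base llama-3-8b-f16.kld \
  --kl-divergence
```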
KL Divergence Output
With `--kl-divergence`, you get comprehensive statistics:
Understanding Metrics
Perplexity Ratio
- Ratio of quantized to FP16 perplexity
- Closer to 1.0 = less quality loss
- Values > 1.0 indicate degradation
Mean Δp (Change in Token Probability)
- Average change in correct token probability
- Positive: Model improved (rare)
- Negative: Model degraded
- Close to 0%: Minimal impact
RMS Δp (Root Mean Square Change)
- Lower is better
- Indicates overall distribution shift
- Related to Gaussian noise assumption
Same Top p
- Percentage of time both models agree on the most likely token
- Higher is better
- Practical indicator of consistency
Percentile Analysis
The tool also reports the change in token probability at various percentiles:
- Symmetric distribution: Quantization adds random noise
- Asymmetric (more negative): Actual quality degradation
- Helps distinguish noise from systematic errors
Benchmarking Quantizations
Standard Test Setup
Llama.cpp contributors use this standard:
- Dataset: Wikitext-2 test set
- Baseline: FP16 model
- Method: KL divergence comparison
Example: Compare Q4 Quantizations
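A sketch of comparing two Q4 variants against the same recorded baseline (placeholder paths; assumes the baseline `.kld` file was recorded as described above):

```shell
# Both runs reuse the same FP16 baseline logits, so the
# reported statistics are directly comparable.
for q in q4_k_m q4_0; do
  ./llama-perplexity \
    -m models/llama-3-8b-${q}.gguf \
    -f wiki.test.raw \
    --kl-divergence-base llama-3-8b-f16.kld \
    --kl-divergence
done
```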
Interpreting Results
Lower KLD and higher “Same top p” indicate better quantization:

| Quantization | KLD | Mean Δp | Same top p |
|---|---|---|---|
| Q8_0 | 0.00136 | -0.019% | 97.67% |
| Q6_K | 0.00545 | -0.007% | 96.03% |
| Q5_K_M | 0.01076 | -0.114% | 94.35% |
| Q4_K_M | 0.03127 | -0.596% | 91.90% |
| Q4_0 | 0.07194 | -1.588% | 87.42% |
These are example numbers. Actual results vary by model architecture and content.
Importance Matrices
Some quantizations support importance matrices for better quality.
Example: LLaMA 3 8B Results
From the official llama.cpp benchmarks:

| Quantization | Model Size | PPL | KLD | Mean Δp | RMS Δp | Same top p |
|---|---|---|---|---|---|---|
| f16 | 14.97 GiB | 6.2332 ± .04 | 0.00055 | 0.001% | 0.787% | 99.87% |
| q8_0 | 7.96 GiB | 6.2343 ± .04 | 0.00136 | -0.019% | 1.198% | 97.67% |
| q6_K | 6.14 GiB | 6.2534 ± .04 | 0.00545 | -0.007% | 2.295% | 96.03% |
| q5_K_M | 5.33 GiB | 6.2886 ± .04 | 0.01076 | -0.114% | 3.160% | 94.35% |
| q4_K_M | 4.58 GiB | 6.4071 ± .04 | 0.03127 | -0.596% | 5.519% | 91.90% |
| q4_0 | 4.34 GiB | 6.7001 ± .04 | 0.07194 | -1.588% | 8.434% | 87.42% |
| q3_K_M | 3.74 GiB | 6.8885 ± .04 | 0.10191 | -1.990% | 10.203% | 83.52% |
| q2_K | 2.96 GiB | 9.7516 ± .06 | 0.44513 | -9.123% | 21.421% | 71.14% |
Practical Guidelines
Limitations & Notes
Important Limitations
- Perplexity is not comparable between different models
- Different tokenizers produce different perplexity values
- Finetunes often have higher perplexity but better quality
- Results are implementation-specific (llama.cpp vs other frameworks)
- Use the same test set for all comparisons
Performance Considerations
Perplexity calculation can be slow, especially on CPU; GPU offloading (`-ngl`) and a smaller context or test corpus reduce runtime.
See Also
- llama-bench - Speed and throughput benchmarking
- Quantization Guide
- Perplexity Documentation

