
Overview

llama-perplexity is a tool for measuring the perplexity and other quality metrics of language models over text corpora. It’s primarily used to evaluate quantization quality loss and compare model performance.

What is Perplexity?

Perplexity measures how well a model predicts the next token:
  • Lower values = better prediction
  • Indicates model “surprise” at seeing the actual next token
  • Used to compare quantized models against FP16 baseline
  • Not directly comparable between different models or tokenizers
Perplexity is a technical metric for judging quantization quality, not end-user model quality. Finetunes may have higher perplexity but better human-rated outputs.
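Formally, over a tokenized corpus of N tokens, perplexity is the exponential of the average negative log-likelihood the model assigns to each token given its preceding context:

```latex
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p\left(x_i \mid x_{<i}\right)\right)
```

A model that assigned probability 1 to every correct token would have PPL = 1; higher values mean the model is, on average, more "surprised" by the actual text.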

Quick Start

llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw

# Output:
# [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,...
# Final estimate: PPL = 5.4007 +/- 0.67339

Basic Usage

Measure Perplexity

llama-perplexity -m model.gguf -f test-corpus.txt
Outputs:
  • Progressive perplexity per chunk
  • Final mean perplexity ± uncertainty
  • Uncertainty estimated under a Gaussian distribution assumption
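As a rough sketch of what the final estimate represents (illustrative only, not llama.cpp's exact implementation): the mean perplexity is the exponential of the mean per-token negative log-likelihood, and the reported ± uncertainty follows from treating the per-token values as approximately Gaussian and propagating the standard error through the exponential:

```python
import math

def ppl_with_uncertainty(nlls):
    """Mean perplexity and an uncertainty estimate from per-token
    negative log-likelihoods (illustrative sketch, not the exact
    llama.cpp computation)."""
    n = len(nlls)
    mean = sum(nlls) / n
    var = sum((x - mean) ** 2 for x in nlls) / n
    ppl = math.exp(mean)            # PPL = exp(mean NLL)
    err = ppl * math.sqrt(var / n)  # standard error, propagated through exp()
    return ppl, err

# Toy per-token NLL values, made up for illustration
ppl, err = ppl_with_uncertainty([1.2, 1.8, 1.5, 2.1, 1.6])
```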

Command-Line Options

  • -m, --model (string): Path to the GGUF model file.
  • -f, --file (string): Text file containing the test corpus.
  • -c, --ctx-size (integer): Context size for evaluation.
  • -b, --batch-size (integer): Batch size for processing.
  • -ngl, --n-gpu-layers (integer): Number of layers to offload to GPU.

Advanced Analysis: KL Divergence

Why KL Divergence?

Kullback-Leibler (KL) divergence measures how much one probability distribution differs from another:
  • KL = 0: Distributions are identical
  • Higher values: More difference between models
  • Used to compare quantized model to FP16 reference
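Concretely, at each token position the full next-token distribution P of the FP16 reference is compared against the quantized model's distribution Q, and the reported value is the KL divergence averaged over all positions:

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t \in V} P(t)\,\ln\frac{P(t)}{Q(t)}
```

where V is the vocabulary. This is why the baseline run must store the full logits: the comparison needs P at every position.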

Two-Step Process

1. Record FP16 baseline

First, record logits from the FP16 model:
llama-perplexity -m model-f16.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base model-f16.kld
The .kld file will be very large (for the Wikitext-2 test set):
  • LLaMA 2: ~11 GiB
  • LLaMA 3: ~37 GiB
2. Compare quantized model

Then compare the quantized model against the baseline:
llama-perplexity -m model-q4_k_m.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base model-f16.kld \
  --kl-divergence

KL Divergence Output

With --kl-divergence, you get comprehensive statistics:
Mean PPL(Q)                    :      6.407115 ± 0.039119
Mean PPL(base)                 :      6.231634 ± 0.037833
Cor(ln(PPL(Q)), ln(PPL(base))) :                  99.340%
Mean ln(PPL(Q)/PPL(base))      :      0.027704 ± 0.000713
Mean PPL(Q)/PPL(base)          :      1.028160 ± 0.000723
Mean PPL(Q)-PPL(base)          :      0.175482 ± 0.004620
Mean KLD                       :  0.03127339 ± 0.00023848
Mean Δp                        :    -0.596 ± 0.014 %
RMS Δp                         :     5.519 ± 0.050 %
Same top p                     :    91.901 ± 0.072 %

Understanding Metrics

Perplexity Ratio

Mean PPL(Q)/PPL(base) = 1.028160 ± 0.000723
  • Ratio of quantized to FP16 perplexity
  • Closer to 1.0 = less quality loss
  • Values > 1.0 indicate degradation
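The ratio row and the log-ratio row in the sample output are mutually consistent: exponentiating the mean log-ratio approximately recovers the mean ratio (they differ slightly because the mean of ratios is not exactly the exponential of the mean of log-ratios):

```python
import math

# Values taken from the sample KL-divergence output above
mean_log_ratio = 0.027704  # Mean ln(PPL(Q)/PPL(base))
mean_ratio = 1.028160      # Mean PPL(Q)/PPL(base)

# exp(mean log-ratio) lands very close to the reported mean ratio
recovered = math.exp(mean_log_ratio)
```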

Mean Δp (Change in Token Probability)

Mean Δp = -0.596 ± 0.014 %
  • Average change in correct token probability
  • Positive: Model improved (rare)
  • Negative: Model degraded
  • Close to 0%: Minimal impact

RMS Δp (Root Mean Square Change)

RMS Δp = 5.519 ± 0.050 %
Think of this as “noise level” from quantization:
  • Lower is better
  • Indicates overall distribution shift
  • Related to Gaussian noise assumption

Same Top p

Same top p = 91.901 ± 0.072 %
  • Percentage of time both models agree on the most likely token
  • Higher is better
  • Practical indicator of consistency
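The Δp and top-token metrics above are straightforward to compute given, for each position, the correct-token probability and the top-token choice of both models. A toy sketch (all values below are made up for illustration):

```python
import math

# Per-position probability of the correct token under the FP16
# baseline and under the quantized model (illustrative values)
p_base = [0.62, 0.41, 0.88, 0.15, 0.73]
p_q    = [0.60, 0.44, 0.85, 0.13, 0.71]

deltas = [(q - b) * 100 for q, b in zip(p_q, p_base)]         # Δp in %
mean_dp = sum(deltas) / len(deltas)                           # Mean Δp
rms_dp = math.sqrt(sum(d * d for d in deltas) / len(deltas))  # RMS Δp

# "Same top p": fraction of positions where both models pick the
# same most-likely token (illustrative token ids)
top_base = [3, 7, 3, 1, 9]
top_q    = [3, 7, 2, 1, 9]
same_top = 100 * sum(b == q for b, q in zip(top_base, top_q)) / len(top_base)
```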

Percentile Analysis

The tool also reports change in token probability at various percentiles:
99.9% Δp    :             27.084%
99.0% Δp    :             12.084%
Median Δp   :             -0.024%
1.0% Δp     :            -19.567%
0.1% Δp     :            -56.054%
Minimum Δp  :            -98.699%
  • Symmetric distribution: Quantization adds random noise
  • Asymmetric (more negative): Actual quality degradation
  • Helps distinguish noise from systematic errors
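The symmetry check can be applied directly to the Δp samples. A simplified sketch using a nearest-rank percentile (the Δp values are made up for illustration):

```python
# Illustrative Δp samples (%), pre-sorted
dp = sorted([-9.0, -4.0, -1.5, -0.3, 0.1, 0.4, 1.8, 3.9])

def percentile(vals, pct):
    """Nearest-rank percentile on a sorted list (simplified)."""
    k = max(0, min(len(vals) - 1, round(pct / 100 * (len(vals) - 1))))
    return vals[k]

low, high = percentile(dp, 1), percentile(dp, 99)
# A much heavier negative tail suggests systematic degradation
# rather than symmetric quantization noise
heavier_negative_tail = abs(low) > 2 * abs(high)
```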

Benchmarking Quantizations

Standard Test Setup

Llama.cpp contributors use this standard:
  1. Dataset: Wikitext-2 test set
  2. Baseline: FP16 model
  3. Method: KL divergence comparison

Example: Compare Q4 Quantizations

# Record baseline once
llama-perplexity -m model-f16.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base baseline.kld

# Test Q4_0
llama-perplexity -m model-q4_0.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base baseline.kld \
  --kl-divergence

# Test Q4_K_M
llama-perplexity -m model-q4_k_m.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base baseline.kld \
  --kl-divergence

# Test Q4_K_S
llama-perplexity -m model-q4_k_s.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base baseline.kld \
  --kl-divergence

Interpreting Results

Lower KLD and higher “Same top p” indicate better quantization:
Quantization  KLD      Mean Δp  Same top p
Q8_0          0.00136  -0.019%  97.67%
Q6_K          0.00545  -0.007%  96.03%
Q5_K_M        0.01076  -0.114%  94.35%
Q4_K_M        0.03127  -0.596%  91.90%
Q4_0          0.07194  -1.588%  87.42%
These are example numbers. Actual results vary by model architecture and content.

Importance Matrices

Some quantizations support importance matrices for better quality:
# Create importance matrix from training data
llama-imatrix -m model-f16.gguf \
  -f training-data.txt \
  -o importance.dat

# Use during quantization
llama-quantize --imatrix importance.dat \
  model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Test the result
llama-perplexity -m model-q4_k_m.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base baseline.kld \
  --kl-divergence
Importance matrices can significantly improve quality for some quantization types, especially at lower bit counts.

Example: LLaMA 3 8B Results

From the official llama.cpp benchmarks:
Quantization  Model Size  PPL            KLD      Mean Δp  RMS Δp   Same top p
f16           14.97 GiB   6.2332 ± 0.04  0.00055   0.001%   0.787%  99.87%
q8_0           7.96 GiB   6.2343 ± 0.04  0.00136  -0.019%   1.198%  97.67%
q6_K           6.14 GiB   6.2534 ± 0.04  0.00545  -0.007%   2.295%  96.03%
q5_K_M         5.33 GiB   6.2886 ± 0.04  0.01076  -0.114%   3.160%  94.35%
q4_K_M         4.58 GiB   6.4071 ± 0.04  0.03127  -0.596%   5.519%  91.90%
q4_0           4.34 GiB   6.7001 ± 0.04  0.07194  -1.588%   8.434%  87.42%
q3_K_M         3.74 GiB   6.8885 ± 0.04  0.10191  -1.990%  10.203%  83.52%
q2_K           2.96 GiB   9.7516 ± 0.06  0.44513  -9.123%  21.421%  71.14%

Practical Guidelines

1. Choose baseline

Use FP16 or BF16 as your reference model.
2. Record logits

Create the baseline .kld file once:
llama-perplexity -m model-f16.gguf \
  -f wikitext-2-raw/wiki.test.raw \
  --kl-divergence-base baseline.kld
3. Test quantizations

Compare each quantized version:
for quant in q8_0 q6_k q5_k_m q4_k_m q4_0; do
  echo "Testing $quant..."
  llama-perplexity -m model-$quant.gguf \
    -f wikitext-2-raw/wiki.test.raw \
    --kl-divergence-base baseline.kld \
    --kl-divergence | tee results-$quant.txt
done
4. Compare results

Look at:
  • KL divergence: Overall distribution similarity
  • Same top p: Practical consistency
  • Mean Δp: Average quality change
  • Percentiles: Noise vs degradation
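When collecting the results-*.txt files from a loop like the one above, the headline numbers can be pulled out programmatically. A small sketch, assuming the output format shown earlier on this page:

```python
import re

def extract_metric(text, name):
    """Extract a numeric metric from llama-perplexity KL-divergence
    output lines like 'Mean KLD : 0.03127339 ± 0.00023848'."""
    m = re.search(rf"^{re.escape(name)}\s*:\s*([-+\d.]+)", text, re.MULTILINE)
    return float(m.group(1)) if m else None

sample = "Mean KLD                       :  0.03127339 ± 0.00023848"
kld = extract_metric(sample, "Mean KLD")
```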

Limitations & Notes

Important Limitations
  • Perplexity is not comparable between different models
  • Different tokenizers produce different perplexity values
  • Finetunes often have higher perplexity but better quality
  • Results are implementation-specific (llama.cpp vs other frameworks)
  • Use the same test set for all comparisons

Performance Considerations

Perplexity calculation can be slow:
# Use GPU offloading
llama-perplexity -m model.gguf -f test.txt -ngl 99

# Adjust batch size
llama-perplexity -m model.gguf -f test.txt -b 512

# Smaller context for faster testing
llama-perplexity -m model.gguf -f test.txt -c 2048

See Also