
Overview

llama-bench is a comprehensive performance testing tool for llama.cpp that measures inference speed, throughput, and resource utilization. It’s designed to help you optimize model configurations and compare performance across different settings.

Quick Start

# Run with defaults
llama-bench -m model.gguf

# Output:
# | model               |       size |     params | backend    | threads |          test |                  t/s |
# | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
# | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         pp512 |      5765.41 ± 20.55 |
# | qwen2 1.5B Q4_0     | 885.97 MiB |     1.54 B | Metal,BLAS |      16 |         tg128 |        197.71 ± 0.81 |

Test Types

llama-bench performs three types of tests:

Prompt Processing (pp)

Measures how fast the model processes a prompt in batches.
llama-bench -m model.gguf -p 512 -n 0

Text Generation (tg)

Measures token generation speed.
llama-bench -m model.gguf -p 0 -n 128

Combined (pg)

Measures prompt processing followed by text generation.
llama-bench -m model.gguf -pg 512,128

Command-Line Options

Basic Options

-h, --help (flag)
  Display help message and exit.
-r, --repetitions (integer, default: 5)
  Number of times to repeat each test for averaging.
--delay (integer, default: 0)
  Delay between each test in seconds.
-o, --output (string, default: md)
  Output format: md (markdown), csv, json, jsonl, or sql.
-oe, --output-err (string)
  Output format for stderr (same options as -o).
-v, --verbose (flag)
  Enable verbose output.
--progress (flag)
  Print test progress indicators.

Test Parameters

-m, --model (string, default: models/7B/ggml-model-q4_0.gguf)
  Path to model file. Can specify multiple models.
-p, --n-prompt (integer, default: 512)
  Number of prompt tokens for the prompt processing test.
-n, --n-gen (integer, default: 128)
  Number of tokens to generate for the text generation test.
-pg (string)
  Combined prompt processing and text generation test. Format: pp,tg (e.g., -pg 512,128)
-d, --n-depth (integer, default: 0)
  Context depth: prefill the KV cache with this many tokens before testing.

Performance Options

-b, --batch-size (integer, default: 2048)
  Logical batch size.
-ub, --ubatch-size (integer, default: 512)
  Physical batch size.
-t, --threads (integer)
  Number of CPU threads. Can specify multiple values.
-ngl, --n-gpu-layers (integer, default: 99)
  Number of layers to offload to the GPU.
-sm, --split-mode (string, default: layer)
  How to split the model across GPUs: none, layer, or row.
-fa, --flash-attn (boolean, default: 0)
  Enable Flash Attention (0 or 1).

Usage Examples

Compare Different Models

llama-bench \
  -m models/7B/ggml-model-q4_0.gguf \
  -m models/13B/ggml-model-q4_0.gguf \
  -p 0 -n 128,256,512
Output:
| model                 |     size |  params | backend | ngl |   test |           t/s |
| --------------------- | -------: | ------: | ------- | --: | -----: | ------------: |
| llama 7B mostly Q4_0  | 3.56 GiB |  6.74 B | CUDA    |  99 | tg 128 | 132.19 ± 0.55 |
| llama 7B mostly Q4_0  | 3.56 GiB |  6.74 B | CUDA    |  99 | tg 256 | 129.37 ± 0.54 |
| llama 7B mostly Q4_0  | 3.56 GiB |  6.74 B | CUDA    |  99 | tg 512 | 123.83 ± 0.25 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA    |  99 | tg 128 |  82.17 ± 0.31 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA    |  99 | tg 256 |  80.74 ± 0.23 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA    |  99 | tg 512 |  78.08 ± 0.07 |

Test Batch Size Impact

llama-bench -n 0 -p 1024 -b 128,256,512,1024
Output:
| model                |     size | params | backend | ngl | n_batch |    test |             t/s |
| -------------------- | -------: | -----: | ------- | --: | ------: | ------: | --------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |     128 | pp 1024 |  1436.51 ± 3.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |     256 | pp 1024 | 1932.43 ± 23.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |     512 | pp 1024 | 2254.45 ± 15.59 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  99 |    1024 | pp 1024 | 2498.61 ± 13.58 |

Test Thread Scaling

llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
Output:
| model                |     size | params | backend | threads |  test |          t/s |
| -------------------- | -------: | -----: | ------- | ------: | ----: | -----------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU     |       1 | pp 64 |  6.17 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU     |       1 | tg 16 |  4.05 ± 0.02 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU     |       2 | pp 64 | 12.31 ± 0.13 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU     |       2 | tg 16 |  7.80 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | pp 64 | 23.18 ± 0.06 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | tg 16 | 12.22 ± 0.07 |
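As a quick sanity check on scaling, the pp 64 numbers above can be turned into a parallel-efficiency figure: speedup over the 1-thread baseline divided by the thread count. A small sketch with the values from this run hard-coded:

```python
# Parallel efficiency: speedup over the 1-thread baseline divided by
# the thread count. Values are the pp 64 results from the run above.
pp64 = {1: 6.17, 2: 12.31, 4: 23.18}  # threads -> t/s

baseline = pp64[1]
efficiency = {t: ts / baseline / t for t, ts in pp64.items()}

for t, eff in efficiency.items():
    print(f"{t:2d} threads: {eff:.1%} efficiency")
```

In this run, prompt processing scales almost linearly to 2 threads and holds around 94% efficiency at 4, while the tg 16 numbers fall off sooner.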

Test GPU Layer Offloading

llama-bench -ngl 10,20,30,31,32,33,34,35
Output:
| model                |     size | params | backend | ngl |   test |            t/s |
| -------------------- | -------: | -----: | ------- | --: | -----: | -------------: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  10 | pp 512 |  373.36 ± 2.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  10 | tg 128 |   13.45 ± 0.93 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  20 | pp 512 |  472.65 ± 1.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  20 | tg 128 |   21.36 ± 1.94 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  30 | pp 512 | 631.87 ± 11.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA    |  30 | tg 128 |   40.04 ± 1.82 |

Test Prefilled Context

Test performance with warm KV cache:
llama-bench -d 0,512
Output:
| model                  |     size | params | backend | ngl |         test |             t/s |
| ---------------------- | -------: | -----: | ------- | --: | -----------: | --------------: |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA    |  99 |        pp512 | 7340.20 ± 23.45 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA    |  99 |        tg128 |   120.60 ± 0.59 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA    |  99 | pp512 @ d512 | 6425.91 ± 18.88 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA    |  99 | tg128 @ d512 |   116.71 ± 0.60 |
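The depth results quantify the cost of attending over an already-filled cache. For example, the relative pp512 slowdown at depth 512 follows directly from the numbers above:

```python
# pp512 throughput from the run above: cold cache vs. 512 tokens
# already in the KV cache (pp512 @ d512).
cold, warm = 7340.20, 6425.91

slowdown = 1 - warm / cold  # fractional throughput loss at depth 512
print(f"pp512 is {slowdown:.1%} slower at depth 512")
```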

Multiple Values & Ranges

You can specify multiple values in three ways:

Comma-Separated

llama-bench -t 4,8,16

Multiple Flags

llama-bench -t 4 -t 8 -t 16

Ranges

# Linear range: first-last
llama-bench -t 4-16

# With step: first-last+step
llama-bench -t 4-16+4  # 4, 8, 12, 16

# Multiplicative: first-last*mult
llama-bench -ngl 1-32*2  # 1, 2, 4, 8, 16, 32
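For reference, here is a small sketch of how these range specs expand. This is a simplified re-implementation for illustration, not llama-bench's actual parsing code:

```python
# Expand a llama-bench-style range spec into a list of values:
# "first-last" (step 1), "first-last+step" (additive),
# "first-last*mult" (multiplicative).
def expand_range(spec: str) -> list[int]:
    for sep in ("+", "*"):
        if sep in spec:
            rng, step_s = spec.split(sep)
            first, last = map(int, rng.split("-"))
            step = int(step_s)
            values, v = [], first
            while v <= last:
                values.append(v)
                v = v + step if sep == "+" else v * step
            return values
    if "-" in spec:
        first, last = map(int, spec.split("-"))
        return list(range(first, last + 1))
    return [int(spec)]

print(expand_range("4-16+4"))   # [4, 8, 12, 16]
print(expand_range("1-32*2"))   # [1, 2, 4, 8, 16, 32]
```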

Output Formats

Markdown (Default)

llama-bench -o md
Produces formatted tables suitable for documentation.

CSV

llama-bench -o csv > results.csv
Comma-separated values for spreadsheet import.

JSON

llama-bench -o json > results.json
Structured data with individual repetition samples:
[
  {
    "build_commit": "8cf427ff",
    "build_number": 5163,
    "model_type": "qwen2 7B Q4_K - Medium",
    "n_prompt": 512,
    "n_gen": 0,
    "avg_ts": 7100.002165,
    "stddev_ts": 140.341520,
    "samples_ns": [74601900, 71632900, 71745200, 71952700, 70745500],
    "samples_ts": [6863.1, 7147.55, 7136.37, 7115.79, 7237.21]
  }
]
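The samples_ns values are raw wall times per repetition, so avg_ts and stddev_ts can be recomputed from them. A sketch using the numbers from the example above (a pp512 test, so each repetition processed 512 tokens):

```python
import statistics

# Recompute avg_ts / stddev_ts from the raw samples in the JSON above.
# Each samples_ns entry is the wall time (nanoseconds) of one repetition.
n_tokens = 512  # n_prompt for this pp512 test
samples_ns = [74601900, 71632900, 71745200, 71952700, 70745500]

samples_ts = [n_tokens / (ns * 1e-9) for ns in samples_ns]
avg_ts = statistics.mean(samples_ts)
stddev_ts = statistics.stdev(samples_ts)  # sample std dev, as reported

print(f"{avg_ts:.2f} ± {stddev_ts:.2f}")  # matches avg_ts / stddev_ts above
```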

JSONL (JSON Lines)

llama-bench -o jsonl > results.jsonl
One JSON object per line, suitable for streaming processing.
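Because each JSONL record parses independently, filtering is a one-liner. A sketch that keeps only text-generation rows; the field names match the JSON example above, and the literal records here are abbreviated stand-ins:

```python
import json

# Two abbreviated stand-in records from a results.jsonl file.
lines = [
    '{"model_type": "qwen2 7B Q4_K - Medium", "n_prompt": 512, "n_gen": 0, "avg_ts": 7100.0}',
    '{"model_type": "qwen2 7B Q4_K - Medium", "n_prompt": 0, "n_gen": 128, "avg_ts": 120.6}',
]

# Keep only text-generation tests (n_gen > 0).
tg_rows = [r for r in map(json.loads, lines) if r["n_gen"] > 0]
print(tg_rows[0]["avg_ts"])  # 120.6
```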

SQL

llama-bench -o sql | sqlite3 benchmark.db
Generates SQL statements for direct database import:
CREATE TABLE IF NOT EXISTS test (
  build_commit TEXT,
  model_type TEXT,
  n_prompt INTEGER,
  n_gen INTEGER,
  avg_ts REAL,
  stddev_ts REAL,
  ...
);

INSERT INTO test (...) VALUES (...);
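Once imported, results can be queried with ordinary SQL. A sketch using Python's sqlite3 module; for illustration it builds an in-memory database with a trimmed-down version of the schema, where in practice you would connect to the benchmark.db created above:

```python
import sqlite3

# In-memory stand-in for benchmark.db, with a trimmed-down schema.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE IF NOT EXISTS test (
  build_commit TEXT, model_type TEXT,
  n_prompt INTEGER, n_gen INTEGER,
  avg_ts REAL, stddev_ts REAL)""")
con.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?, ?, ?)",
    [("8cf427ff", "qwen2 7B Q4_K - Medium", 512, 0, 7340.20, 23.45),
     ("8cf427ff", "qwen2 7B Q4_K - Medium", 0, 128, 120.60, 0.59)],
)

# Average text-generation throughput per model.
rows = con.execute(
    "SELECT model_type, AVG(avg_ts) FROM test WHERE n_gen > 0 GROUP BY model_type"
).fetchall()
print(rows)
```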

Advanced Options

NUMA Configuration

llama-bench --numa distribute  # Spread across NUMA nodes
llama-bench --numa isolate     # Isolate to single node
llama-bench --numa numactl     # Use numactl CPU map

Priority & Polling

--prio (integer, default: 0)
  Process/thread priority:
  • 0: Normal
  • 1: Medium
  • 2: High
  • 3: Realtime
--poll (integer, default: 50)
  Polling level (0-100). 0 = no polling.

Cache Types

llama-bench -ctk f16 -ctv q8_0  # KV cache types
Options: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

Understanding Results

Tokens per Second (t/s)

The primary metric showing throughput:
  • Higher is better
  • Format: mean ± std_dev
  • Example: 120.60 ± 0.59 means ~121 tokens/sec with low variance

Test Notation

  • pp512: Prompt processing with 512 tokens
  • tg128: Text generation of 128 tokens
  • pp512 @ d512: Prompt processing at context depth 512
Important: llama-bench measurements do not include tokenization and sampling time. Real-world performance will be slightly lower.

Performance Analysis Tips

1. Baseline test

   Run with default settings to establish a baseline:
   llama-bench -m model.gguf -r 10

2. Identify bottlenecks

   Test with different configurations:
   • CPU vs GPU: -ngl 0 vs -ngl 99
   • Batch sizes: -b 128,512,2048
   • Thread counts: -t 4,8,16

3. Optimize settings

   Find the sweet spot for your hardware:
   • Balance GPU layers against your VRAM
   • Adjust batch size for throughput vs latency
   • Test Flash Attention: -fa 0 vs -fa 1

Comparing Quantizations

Benchmark different quantization levels:
llama-bench \
  -m model-q4_0.gguf \
  -m model-q4_k_m.gguf \
  -m model-q5_k_m.gguf \
  -m model-q6_k.gguf \
  -m model-q8_0.gguf \
  -o csv
Compare speed vs quality trade-offs.

See Also