## Overview
llama-bench is a comprehensive performance testing tool for llama.cpp that measures inference speed, throughput, and resource utilization. It’s designed to help you optimize model configurations and compare performance across different settings.
## Quick Start
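A minimal run needs only a model path (the path below is a placeholder). By default, llama-bench runs a prompt processing test and a text generation test and prints a markdown table of results:

```shell
# Run the default benchmark suite on one model.
# By default this measures pp512 (prompt processing) and tg128 (text generation).
llama-bench -m models/7B/ggml-model-q4_0.gguf
```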
## Test Types
llama-bench performs three types of tests:
- **Prompt Processing (pp)**: measures how fast the model processes a prompt in batches.
- **Text Generation (tg)**: measures token generation speed.
- **Combined (pg)**: measures prompt processing followed by text generation.

## Command-Line Options
### Basic Options
- `-h, --help`: display help message and exit.
- `-r, --repetitions <n>`: number of times to repeat each test for averaging.
- `--delay <seconds>`: delay between each test in seconds.
- `-o, --output <md|csv|json|jsonl|sql>`: output format: `md` (markdown, the default), `csv`, `json`, `jsonl`, or `sql`.
- `-oe, --output-err <md|csv|json|jsonl|sql>`: output format for stderr (same options as `-o`).
- `-v, --verbose`: enable verbose output.
- `--progress`: print test progress indicators.
### Test Parameters
- `-m, --model <filename>`: path to the model file; can specify multiple models. Default: `models/7B/ggml-model-q4_0.gguf`.
- `-p, --n-prompt <n>`: number of prompt tokens for the prompt processing test.
- `-n, --n-gen <n>`: number of tokens to generate for the text generation test.
- `-pg <pp,tg>`: combined prompt processing and text generation test. Format: `pp,tg` (e.g., `-pg 512,128`).
- `-d, --n-depth <n>`: context depth: prefill the KV cache with this many tokens before testing.
### Performance Options
- `-b, --batch-size <n>`: logical batch size.
- `-ub, --ubatch-size <n>`: physical batch size.
- `-t, --threads <n>`: number of CPU threads; can specify multiple values.
- `-ngl, --n-gpu-layers <n>`: number of layers to offload to the GPU.
- `-sm, --split-mode <none|layer|row>`: how to split the model across GPUs.
- `-fa, --flash-attn <0|1>`: enable Flash Attention (0 or 1).
## Usage Examples
### Compare Different Models
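A command along these lines (model paths are illustrative) produces the comparison table below:

```shell
# Benchmark two models in one run at three generation lengths;
# -p 0 skips the prompt processing test so only tg results are reported
llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -p 0 -n 128,256,512
```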
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 132.19 ± 0.55 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 256 | 129.37 ± 0.54 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 512 | 123.83 ± 0.25 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 128 | 82.17 ± 0.31 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 256 | 80.74 ± 0.23 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 512 | 78.08 ± 0.07 |
### Test Batch Size Impact
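A sketch of the command behind the table below (model path is a placeholder): sweep the logical batch size while keeping the prompt length fixed.

```shell
# Sweep logical batch sizes for a fixed 1024-token prompt; -n 0 skips generation
llama-bench -m models/7B/ggml-model-q4_0.gguf -b 128,256,512,1024 -p 1024 -n 0
```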
| model | size | params | backend | ngl | n_batch | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | pp 1024 | 1436.51 ± 3.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | pp 1024 | 1932.43 ± 23.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | pp 1024 | 2254.45 ± 15.59 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | pp 1024 | 2498.61 ± 13.58 |
### Test Thread Scaling
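A run like the following (model path is a placeholder) generates the thread scaling table below:

```shell
# Thread scaling on CPU: -ngl 0 keeps all layers on the CPU
llama-bench -m models/7B/ggml-model-q4_0.gguf -ngl 0 -t 1,2,4 -p 64 -n 16
```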
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | pp 64 | 6.17 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | tg 16 | 4.05 ± 0.02 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | pp 64 | 12.31 ± 0.13 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | tg 16 | 7.80 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp 64 | 23.18 ± 0.06 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg 16 | 12.22 ± 0.07 |
### Test GPU Layer Offloading
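A sketch of the command behind the table below (model path is a placeholder), relying on the default pp512/tg128 tests:

```shell
# Offload 10, 20, then 30 layers to the GPU
llama-bench -m models/7B/ggml-model-q4_0.gguf -ngl 10,20,30
```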
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | pp 512 | 373.36 ± 2.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | tg 128 | 13.45 ± 0.93 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | pp 512 | 472.65 ± 1.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | tg 128 | 21.36 ± 1.94 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | pp 512 | 631.87 ± 11.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | tg 128 | 40.04 ± 1.82 |
### Test Prefilled Context
Test performance with a warm KV cache:

| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 | 7340.20 ± 23.45 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 | 120.60 ± 0.59 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 @ d512 | 6425.91 ± 18.88 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 @ d512 | 116.71 ± 0.60 |
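The table above can be produced with a command along these lines (the model filename is a placeholder): `-d` prefills the KV cache to the given depth before measuring.

```shell
# Measure at depth 0 (empty cache) and depth 512 (cache prefilled with 512 tokens)
llama-bench -m models/7B/qwen2-q4_k_m.gguf -d 0,512
```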
Multiple Values & Ranges
You can specify multiple values in three ways:Comma-Separated
Multiple Flags
Ranges
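Illustrative sketches of each form (model paths are placeholders; range syntax is `first-last+step`):

```shell
# Comma-separated: test three thread counts in one run
llama-bench -m model.gguf -t 4,8,16

# Repeated flags: benchmark several models back to back
llama-bench -m model-a.gguf -m model-b.gguf

# Range: 128-512+128 expands to prompt lengths 128, 256, 384, 512
llama-bench -m model.gguf -p 128-512+128
```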
## Output Formats

Select the output format with `-o`: `md` (markdown, the default), `csv`, `json`, `jsonl` (JSON Lines), or `sql`.
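A few hedged examples (model path is a placeholder):

```shell
# CSV output, convenient for spreadsheets
llama-bench -m model.gguf -o csv

# JSONL: one JSON object per test, easy to stream into other tools
llama-bench -m model.gguf -o jsonl

# Markdown on stdout and CSV on stderr, so both can be captured separately
llama-bench -m model.gguf -o md -oe csv 2> results.csv
```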
## Advanced Options
### NUMA Configuration
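llama-bench exposes llama.cpp's NUMA handling through the `--numa` flag, which matters only on multi-socket machines (model path is a placeholder):

```shell
# Distribute threads and memory across NUMA nodes
llama-bench -m model.gguf --numa distribute

# Pin execution to the NUMA node the process started on
llama-bench -m model.gguf --numa isolate
```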
### Priority & Polling
- `--prio <0|1|2|3>`: process/thread priority: 0 = normal, 1 = medium, 2 = high, 3 = realtime.
- `--poll <0...100>`: polling level (0-100); 0 = no polling.

### Cache Types
Set the KV cache data types with `-ctk` (keys) and `-ctv` (values): `f32`, `f16`, `bf16`, `q8_0`, `q4_0`, `q4_1`, `iq4_nl`, `q5_0`, `q5_1`.
## Understanding Results
### Tokens per Second (t/s)
The primary metric showing throughput:
- Higher is better
- Format: `mean ± std_dev`
- Example: `120.60 ± 0.59` means ~121 tokens/sec with low variance
### Test Notation
- `pp512`: prompt processing with 512 tokens
- `tg128`: text generation of 128 tokens
- `pp512 @ d512`: prompt processing at context depth 512
**Important**: llama-bench measurements do not include tokenization and sampling time. Real-world performance will be slightly lower.
## Performance Analysis Tips
### Identify Bottlenecks

Test with different configurations:
- CPU vs GPU: `-ngl 0` vs `-ngl 99`
- Batch sizes: `-b 128,512,2048`
- Thread counts: `-t 4,8,16`
### Comparing Quantizations

Benchmark different quantization levels by passing multiple `-m` flags, one per quantized build of the same model (e.g., a `Q4_0` file and a `Q5_K_M` file).

## See Also
- llama-cli - Interactive CLI tool
- llama-perplexity - Quality measurement tool
- Performance Tips

