## Overview
llama-bench is a comprehensive performance testing tool for llama.cpp that measures inference speed, throughput, and resource utilization. It’s designed to help you optimize model configurations and compare performance across different settings.
## Quick Start
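A minimal run needs only a model path (the path below is a placeholder). By default, llama-bench runs a prompt processing test and a text generation test and prints a markdown table of results:

```shell
# Run the default benchmark suite on one model.
# By default this measures pp512 (prompt processing) and tg128 (text generation).
llama-bench -m models/7B/ggml-model-q4_0.gguf
```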
## Test Types
llama-bench performs three types of tests:
- **Prompt Processing (pp)**: measures how fast the model processes a prompt in batches.
- **Text Generation (tg)**: measures token generation speed.
- **Combined (pg)**: measures prompt processing followed by text generation.

## Command-Line Options
### Basic Options
- `-h, --help`: display help message and exit.
- `-r, --repetitions <n>`: number of times to repeat each test for averaging.
- `--delay <seconds>`: delay between each test in seconds.
- `-o, --output <md|csv|json|jsonl|sql>`: output format: `md` (markdown, the default), `csv`, `json`, `jsonl`, or `sql`.
- `-oe, --output-err <md|csv|json|jsonl|sql>`: output format for stderr (same options as `-o`).
- `-v, --verbose`: enable verbose output.
- `--progress`: print test progress indicators.
### Test Parameters
- `-m, --model <filename>`: path to the model file; can specify multiple models. Default: `models/7B/ggml-model-q4_0.gguf`.
- `-p, --n-prompt <n>`: number of prompt tokens for the prompt processing test.
- `-n, --n-gen <n>`: number of tokens to generate for the text generation test.
- `-pg <pp,tg>`: combined prompt processing and text generation test. Format: `pp,tg` (e.g., `-pg 512,128`).
- `-d, --n-depth <n>`: context depth: prefill the KV cache with this many tokens before testing.
### Performance Options
- `-b, --batch-size <n>`: logical batch size.
- `-ub, --ubatch-size <n>`: physical batch size.
- `-t, --threads <n>`: number of CPU threads; can specify multiple values.
- `-ngl, --n-gpu-layers <n>`: number of layers to offload to the GPU.
- `-sm, --split-mode <none|layer|row>`: how to split the model across GPUs.
- `-fa, --flash-attn <0|1>`: enable Flash Attention (0 or 1).
## Usage Examples
### Compare Different Models
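A command along these lines (model paths are illustrative) produces the comparison table below:

```shell
# Benchmark two models in one run at three generation lengths;
# -p 0 skips the prompt processing test so only tg results are reported
llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -p 0 -n 128,256,512
```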
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 132.19 ± 0.55 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 256 | 129.37 ± 0.54 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 512 | 123.83 ± 0.25 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 128 | 82.17 ± 0.31 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 256 | 80.74 ± 0.23 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 512 | 78.08 ± 0.07 |
### Test Batch Size Impact
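A sketch of the command behind the table below (model path is a placeholder): sweep the logical batch size while keeping the prompt length fixed.

```shell
# Sweep logical batch sizes for a fixed 1024-token prompt; -n 0 skips generation
llama-bench -m models/7B/ggml-model-q4_0.gguf -b 128,256,512,1024 -p 1024 -n 0
```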
| model | size | params | backend | ngl | n_batch | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | pp 1024 | 1436.51 ± 3.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | pp 1024 | 1932.43 ± 23.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | pp 1024 | 2254.45 ± 15.59 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | pp 1024 | 2498.61 ± 13.58 |
### Test Thread Scaling
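A run like the following (model path is a placeholder) generates the thread scaling table below:

```shell
# Thread scaling on CPU: -ngl 0 keeps all layers on the CPU
llama-bench -m models/7B/ggml-model-q4_0.gguf -ngl 0 -t 1,2,4 -p 64 -n 16
```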
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | pp 64 | 6.17 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | tg 16 | 4.05 ± 0.02 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | pp 64 | 12.31 ± 0.13 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | tg 16 | 7.80 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp 64 | 23.18 ± 0.06 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg 16 | 12.22 ± 0.07 |
### Test GPU Layer Offloading
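A sketch of the command behind the table below (model path is a placeholder), relying on the default pp512/tg128 tests:

```shell
# Offload 10, 20, then 30 layers to the GPU
llama-bench -m models/7B/ggml-model-q4_0.gguf -ngl 10,20,30
```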
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | pp 512 | 373.36 ± 2.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | tg 128 | 13.45 ± 0.93 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | pp 512 | 472.65 ± 1.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | tg 128 | 21.36 ± 1.94 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | pp 512 | 631.87 ± 11.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | tg 128 | 40.04 ± 1.82 |
### Test Prefilled Context
Test performance with a warm KV cache:

| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 | 7340.20 ± 23.45 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 | 120.60 ± 0.59 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 @ d512 | 6425.91 ± 18.88 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 @ d512 | 116.71 ± 0.60 |
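The table above can be produced with a command along these lines (the model filename is a placeholder): `-d` prefills the KV cache to the given depth before measuring.

```shell
# Measure at depth 0 (empty cache) and depth 512 (cache prefilled with 512 tokens)
llama-bench -m models/7B/qwen2-q4_k_m.gguf -d 0,512
```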
Multiple Values & Ranges
You can specify multiple values in three ways:Comma-Separated
Multiple Flags
Ranges
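Illustrative sketches of each form (model paths are placeholders; range syntax is `first-last+step`):

```shell
# Comma-separated: test three thread counts in one run
llama-bench -m model.gguf -t 4,8,16

# Repeated flags: benchmark several models back to back
llama-bench -m model-a.gguf -m model-b.gguf

# Range: 128-512+128 expands to prompt lengths 128, 256, 384, 512
llama-bench -m model.gguf -p 128-512+128
```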
## Output Formats

Select the output format with `-o`: `md` (markdown, the default), `csv`, `json`, `jsonl` (JSON Lines), or `sql`.
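A few hedged examples (model path is a placeholder):

```shell
# CSV output, convenient for spreadsheets
llama-bench -m model.gguf -o csv

# JSONL: one JSON object per test, easy to stream into other tools
llama-bench -m model.gguf -o jsonl

# Markdown on stdout and CSV on stderr, so both can be captured separately
llama-bench -m model.gguf -o md -oe csv 2> results.csv
```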
## Advanced Options
### NUMA Configuration
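llama-bench exposes llama.cpp's NUMA handling through the `--numa` flag, which matters only on multi-socket machines (model path is a placeholder):

```shell
# Distribute threads and memory across NUMA nodes
llama-bench -m model.gguf --numa distribute

# Pin execution to the NUMA node the process started on
llama-bench -m model.gguf --numa isolate
```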
### Priority & Polling
- `--prio <0|1|2|3>`: process/thread priority: 0 = normal, 1 = medium, 2 = high, 3 = realtime.
- `--poll <0...100>`: polling level (0-100); 0 = no polling.

### Cache Types
Set the KV cache data types with `-ctk` (keys) and `-ctv` (values): `f32`, `f16`, `bf16`, `q8_0`, `q4_0`, `q4_1`, `iq4_nl`, `q5_0`, `q5_1`.
## Understanding Results
### Tokens per Second (t/s)
The primary metric showing throughput:
- Higher is better
- Format: `mean ± std_dev`
- Example: `120.60 ± 0.59` means ~121 tokens/sec with low variance
### Test Notation
- `pp512`: prompt processing with 512 tokens
- `tg128`: text generation of 128 tokens
- `pp512 @ d512`: prompt processing at context depth 512
**Important**: llama-bench measurements do not include tokenization and sampling time. Real-world performance will be slightly lower.
## Performance Analysis Tips
### Identify Bottlenecks

Test with different configurations:
- CPU vs GPU: `-ngl 0` vs `-ngl 99`
- Batch sizes: `-b 128,512,2048`
- Thread counts: `-t 4,8,16`
### Comparing Quantizations

Benchmark different quantization levels by passing multiple `-m` flags, one per quantized build of the same model (e.g., a `Q4_0` file and a `Q5_K_M` file).

## See Also
- llama-cli - Interactive CLI tool
- llama-perplexity - Quality measurement tool
- Performance Tips

