Benchmark TensorRT-LLM models for throughput and latency performance. Supports multiple backends (PyTorch, TensorRT, AutoDeploy) and various benchmarking modes.

Usage

trtllm-bench --model MODEL [OPTIONS] COMMAND

Global Options

--model
string
required
The HuggingFace name of the model to benchmark. Alias: -m
--model_path
string
Path to a HuggingFace checkpoint directory for loading model components
--workspace
string
default:"/tmp"
The directory to store benchmarking intermediate files. Alias: -w
--log_level
string
default:"info"
The logging level. Choices: verbose, info, warning, error, internal_error
--revision
string
The revision to use for the HuggingFace model (branch name, tag name, or commit id)

Commands

throughput

Run a throughput benchmark to measure maximum request processing capacity.
trtllm-bench --model MODEL throughput [OPTIONS]

Throughput Options

--backend
string
default:"pytorch"
Backend to use for benchmark. Choices: pytorch, tensorrt, _autodeploy
--engine_dir
path
Path to a serialized TRT-LLM engine (for TensorRT backend)
--dataset
path
required
Path to the dataset file for benchmarking. Use the prepare-dataset command to generate one
--num_requests
integer
default:"0"
Number of requests to cap the benchmark run at. If not specified or set to 0, the full dataset length is used
--warmup
integer
default:"2"
Number of warmup requests to run before benchmarking
--max_batch_size
integer
Maximum runtime batch size to run the engine with
--max_num_tokens
integer
Maximum number of tokens an engine can accept per batch at runtime
--max_seq_len
integer
Maximum sequence length
--beam_width
integer
default:"1"
Number of search beams
--kv_cache_free_gpu_mem_fraction
float
default:"0.9"
The fraction of free GPU memory to use for the KV cache after the model loads
--tp
integer
default:"1"
Tensor parallelism size
--pp
integer
default:"1"
Pipeline parallelism size
--ep
integer
Expert parallelism size (for MoE models)
--config
string
Path to a YAML file that overrides the default parameters. Can be specified as either --config or --extra_llm_api_options
--sampler_options
path
Path to a YAML file that sets sampler options
--no_skip_tokenizer_init
boolean
default:"false"
Do not skip tokenizer initialization when loading the model (required for vision-language models)
--eos_id
integer
default:"-1"
Set the end-of-sequence token for the benchmark. Set to -1 to disable EOS
--modality
string
Modality of the multimodal requests. Choices: image, video
--image_data_format
string
default:"pt"
Format of the image data for multimodal models. Choices: pt, pil
--data_device
string
default:"cuda"
Device to load the multimodal data on. Choices: cuda, cpu
--max_input_len
integer
default:"4096"
Maximum input sequence length to use for multimodal models
--target_input_len
integer
Target (average) input length for tuning heuristics
--target_output_len
integer
Target (average) output length for tuning heuristics
--custom_module_dirs
path
Paths to custom module directories to import
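
The --sampler_options file is a plain YAML mapping of sampling parameters. A minimal sketch, assuming the common sampling fields (temperature, top_p, top_k); verify the exact field names against the sampling parameters accepted by your TensorRT-LLM version:

```yaml
# sampler_options.yaml — illustrative only; field names are assumptions,
# check the sampling parameters supported by your TRT-LLM version
temperature: 0.8
top_p: 0.95
top_k: 40
```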

latency

Run a latency benchmark to measure per-request response times.
trtllm-bench --model MODEL latency [OPTIONS]

Latency Options

--backend
string
default:"pytorch"
Backend to use for benchmark. Choices: pytorch, tensorrt, _autodeploy
--engine_dir
path
Path to a serialized TRT-LLM engine
--dataset
path
Path to dataset file for benchmarking
--num_requests
integer
default:"0"
Number of requests to cap the benchmark run at; the effective count is the minimum of this value and the dataset length
--warmup
integer
default:"2"
Number of warmup requests to run before benchmarking
--concurrency
integer
default:"1"
Desired concurrency (number of requests processed at the same time). Set to 0 or a negative value for no concurrency limit
--beam_width
integer
default:"1"
Number of search beams
--kv_cache_free_gpu_mem_fraction
float
default:"0.9"
The fraction of free GPU memory to use for the KV cache after the model loads
--max_seq_len
integer
Maximum sequence length
--tp
integer
default:"1"
Tensor parallelism size
--pp
integer
default:"1"
Pipeline parallelism size
--ep
integer
Expert parallelism size
--config
string
Path to a YAML file that overwrites the parameters
--sampler_options
path
Path to a YAML file that sets sampler options
--modality
string
Modality of the multimodal requests. Choices: image, video
--max_input_len
integer
default:"4096"
Maximum input sequence length to use for multimodal models
--medusa_choices
path
Path to a YAML file that defines the Medusa tree (for speculative decoding)
--report_json
path
Path where the benchmark report should be written
--iteration_log
path
Path where the iteration log should be written
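
The --medusa_choices file describes the Medusa tree as a list of paths of top-k head indices. An illustrative, untuned sketch (the actual tree must match the Medusa heads of your model checkpoint):

```yaml
# medusa_choices.yaml — illustrative sketch, not a tuned tree; each entry
# is a path of top-k indices through the Medusa heads
- [0]
- [0, 0]
- [1]
- [0, 1]
- [2]
```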

prepare-dataset

Prepare a dataset for benchmarking.
trtllm-bench --model MODEL prepare-dataset [OPTIONS]

build

Build a TensorRT engine for benchmarking.
trtllm-bench --model MODEL build [OPTIONS]

Examples

Throughput Benchmark

Run a throughput benchmark with a prepared dataset:
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset dataset.json \
  --backend pytorch

Latency Benchmark

Measure single-request latency:
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  latency \
  --dataset dataset.json \
  --concurrency 1

Multi-GPU Throughput

Benchmark with tensor parallelism:
trtllm-bench --model meta-llama/Llama-3.1-70B-Instruct \
  throughput \
  --dataset dataset.json \
  --tp 4 \
  --max_batch_size 256

TensorRT Engine Benchmark

Benchmark a pre-built TensorRT engine:
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --engine_dir ./engine_outputs \
  --dataset dataset.json \
  --backend tensorrt

With Custom Configuration

Use a YAML configuration file:
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset dataset.json \
  --config bench_config.yaml
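
A sketch of what bench_config.yaml might contain. The keys below follow the LLM-API options that --config/--extra_llm_api_options accepts, but the exact key names are assumptions; verify them against your TensorRT-LLM version:

```yaml
# bench_config.yaml — illustrative overrides (key names are assumptions;
# check the LLM-API options supported by your TRT-LLM version)
enable_chunked_prefill: true
kv_cache_config:
  free_gpu_memory_fraction: 0.9
```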

Prepare Dataset

Generate a benchmark dataset:
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  prepare-dataset \
  --output dataset.json \
  --num-requests 1000

High Concurrency Latency Test

Test latency under load:
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  latency \
  --dataset dataset.json \
  --concurrency 32 \
  --num_requests 500

Multimodal Benchmark

Benchmark a vision-language model:
trtllm-bench --model llava-hf/llava-1.5-7b-hf \
  throughput \
  --dataset vlm_dataset.json \
  --modality image \
  --no_skip_tokenizer_init
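
Parallelism Sweep

To compare parallelism settings, the documented flags can be generated from a small script. This helper (not part of trtllm-bench) only prints the invocations rather than running them; the model and dataset names are placeholders:

```shell
#!/bin/sh
# Emit one throughput-benchmark invocation per tensor-parallel size.
emit_tp_sweep() {
  model="$1"; dataset="$2"; shift 2
  for tp in "$@"; do
    echo "trtllm-bench --model $model throughput --dataset $dataset --tp $tp"
  done
}

# Print the commands; pipe the output to `sh` to actually run the sweep.
emit_tp_sweep meta-llama/Llama-3.1-70B-Instruct dataset.json 1 2 4 8
```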
