Usage
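This page appears to document the trtllm-bench CLI (inferred from the related trtllm-* commands listed at the bottom). As a standard Click application, its general synopsis is the usual global-options-then-command shape; a sketch:

```shell
trtllm-bench [GLOBAL OPTIONS] COMMAND [ARGS]...
# e.g. trtllm-bench --model <hf-model-name> throughput --dataset <dataset-file>
```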
Global Options
- The HuggingFace name of the model to benchmark. Alias: -m
- Path to a HuggingFace checkpoint directory for loading model components
- The directory to store benchmarking intermediate files. Alias: -w
- The logging level. Choices: verbose, info, warning, error, internal_error
- The revision to use for the HuggingFace model (branch name, tag name, or commit id)
Commands
throughput
Run a throughput benchmark to measure maximum request processing capacity.
Throughput Options
- Backend to use for the benchmark. Choices: pytorch, tensorrt, _autodeploy
- Path to a serialized TRT-LLM engine (for the TensorRT backend)
- Path to the dataset file for benchmarking. Use the prepare-dataset command to generate one
- Number of requests to cap the benchmark run at. If not specified or set to 0, the full dataset is used
- Number of requests used to warm up the benchmark
- Maximum runtime batch size to run the engine with
- Maximum number of runtime tokens that the engine can accept
- Maximum sequence length
- Number of search beams
- The percentage of memory to use for the KV cache after model load
- Tensor parallelism size
- Pipeline parallelism size
- Expert parallelism size (for MoE models)
- Path to a YAML file that overrides the parameters. Can be specified as either --config or --extra_llm_api_options
- Path to a YAML file that sets sampler options
- Do not skip tokenizer initialization when loading the model (required for VLM models)
- Set the end-of-sequence token for the benchmark. Set to -1 to disable EOS
- Modality of the multimodal requests. Choices: image, video
- Format of the image data for multimodal models. Choices: pt, pil
- Device to load the multimodal data on. Choices: cuda, cpu
- Maximum input sequence length to use for multimodal models
- Target (average) input length for tuning heuristics
- Target (average) sequence length for tuning heuristics
- Paths to custom module directories to import
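The YAML override file mentioned above maps directly onto TensorRT-LLM's LLM API arguments. A minimal sketch, assuming current LLM API key names (verify against your installed TensorRT-LLM version):

```yaml
# extra-llm-api-config.yml — passed via --config / --extra_llm_api_options
enable_chunked_prefill: true
kv_cache_config:
  enable_block_reuse: false
```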
latency
Run a latency benchmark to measure per-request response times.
Latency Options
- Backend to use for the benchmark. Choices: pytorch, tensorrt, _autodeploy
- Path to a serialized TRT-LLM engine
- Path to the dataset file for benchmarking
- Number of requests to cap the benchmark run at; the effective count is the minimum of this value and the dataset length
- Number of requests used to warm up the benchmark
- Desired concurrency (number of requests processed at the same time); a value less than or equal to 0 means no concurrency limit
- Number of search beams
- The percentage of memory to use for the KV cache after model load
- Maximum sequence length
- Tensor parallelism size
- Pipeline parallelism size
- Expert parallelism size (for MoE models)
- Path to a YAML file that overrides the parameters
- Path to a YAML file that sets sampler options
- Modality of the multimodal requests. Choices: image, video
- Maximum input sequence length to use for multimodal models
- Path to a YAML file that defines the Medusa tree (for speculative decoding)
- Path where the report should be written
- Path where iteration logging is written
prepare-dataset
Prepare a dataset for benchmarking.
build
Build a TensorRT engine for benchmarking.
Examples
Throughput Benchmark
Run a throughput benchmark with a prepared dataset:
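A minimal throughput invocation sketch under the PyTorch backend; the model name and dataset path are placeholders:

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch
```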
Latency Benchmark
Measure single-request latency:
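A latency-mode sketch with the same placeholder model and dataset:

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  latency \
  --dataset /tmp/synthetic_128_128.txt
```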
Multi-GPU Throughput
Benchmark with tensor parallelism:
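A multi-GPU sketch; `--tp` as the tensor-parallelism flag follows upstream TensorRT-LLM documentation and should be verified against your version:

```shell
trtllm-bench --model meta-llama/Llama-3.1-70B-Instruct \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --tp 4
```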
TensorRT Engine Benchmark
Benchmark a pre-built TensorRT engine:
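A sketch for the TensorRT backend; the engine-directory flag name (`--engine_dir`) is an assumption based on upstream docs, and the path is a placeholder:

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend tensorrt \
  --engine_dir /path/to/engine
```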
With Custom Configuration
Use a YAML configuration file:
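A sketch passing an LLM API override file via `--extra_llm_api_options` (the YAML key shown follows TensorRT-LLM's LLM API arguments and is an assumption):

```shell
cat > extra-llm-api-config.yml <<'EOF'
enable_chunked_prefill: true
EOF

trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --extra_llm_api_options extra-llm-api-config.yml
```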
Prepare Dataset
Generate a benchmark dataset:
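A dataset-generation sketch; the flags shown mirror TensorRT-LLM's upstream prepare_dataset script (synthetic token-length distribution written to stdout) and are assumptions for the prepare-dataset subcommand, so check your version's help output:

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  prepare-dataset \
  --stdout token-norm-dist \
  --num-requests 1000 \
  --input-mean 128 --output-mean 128 \
  --input-stdev 0 --output-stdev 0 \
  > /tmp/synthetic_128_128.txt
```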
High Concurrency Latency Test
Test latency under load:
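A sketch of a latency run under load, using the concurrency option described above (flag name assumed to be `--concurrency`):

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  latency \
  --dataset /tmp/synthetic_128_128.txt \
  --concurrency 64
```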
Multimodal Benchmark
Benchmark a vision-language model:
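A vision-language sketch using the modality option described above; the model name and dataset path are placeholders, and `--modality` as the flag name is an assumption:

```shell
trtllm-bench --model Qwen/Qwen2-VL-7B-Instruct \
  throughput \
  --dataset /tmp/mm_dataset.jsonl \
  --backend pytorch \
  --modality image
```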
Related Commands
- trtllm-serve - Serve models via API
- trtllm-build - Build TensorRT engines
- trtllm-eval - Evaluate model accuracy