Usage
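This page appears to document the trtllm-bench CLI (inferred from the related trtllm-* commands listed at the bottom). As a standard Click application, its general synopsis is the usual global-options-then-command shape; a sketch:

```shell
trtllm-bench [GLOBAL OPTIONS] COMMAND [ARGS]...
# e.g. trtllm-bench --model <hf-model-name> throughput --dataset <dataset-file>
```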
Global Options
- The HuggingFace name of the model to benchmark. Alias: -m
- Path to a HuggingFace checkpoint directory for loading model components
- The directory to store benchmarking intermediate files. Alias: -w
- The logging level. Choices: verbose, info, warning, error, internal_error
- The revision to use for the HuggingFace model (branch name, tag name, or commit id)
Commands
throughput
Run a throughput benchmark to measure maximum request processing capacity.
Throughput Options
- Backend to use for the benchmark. Choices: pytorch, tensorrt, _autodeploy
- Path to a serialized TRT-LLM engine (for the TensorRT backend)
- Path to the dataset file for benchmarking. Use the prepare-dataset command to generate one
- Number of requests to cap the benchmark run at. If not specified or set to 0, the full dataset is used
- Number of requests used to warm up the benchmark
- Maximum runtime batch size to run the engine with
- Maximum number of runtime tokens that the engine can accept
- Maximum sequence length
- Number of search beams
- The percentage of memory to use for the KV cache after model load
- Tensor parallelism size
- Pipeline parallelism size
- Expert parallelism size (for MoE models)
- Path to a YAML file that overrides the parameters. Can be specified as either --config or --extra_llm_api_options
- Path to a YAML file that sets sampler options
- Do not skip tokenizer initialization when loading the model (required for VLM models)
- Set the end-of-sequence token for the benchmark. Set to -1 to disable EOS
- Modality of the multimodal requests. Choices: image, video
- Format of the image data for multimodal models. Choices: pt, pil
- Device to load the multimodal data on. Choices: cuda, cpu
- Maximum input sequence length to use for multimodal models
- Target (average) input length for tuning heuristics
- Target (average) sequence length for tuning heuristics
- Paths to custom module directories to import
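The YAML override file mentioned above maps directly onto TensorRT-LLM's LLM API arguments. A minimal sketch, assuming current LLM API key names (verify against your installed TensorRT-LLM version):

```yaml
# extra-llm-api-config.yml — passed via --config / --extra_llm_api_options
enable_chunked_prefill: true
kv_cache_config:
  enable_block_reuse: false
```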
latency
Run a latency benchmark to measure per-request response times.
Latency Options
- Backend to use for the benchmark. Choices: pytorch, tensorrt, _autodeploy
- Path to a serialized TRT-LLM engine
- Path to the dataset file for benchmarking
- Number of requests to cap the benchmark run at; the effective count is the minimum of this value and the dataset length
- Number of requests used to warm up the benchmark
- Desired concurrency (number of requests processed at the same time); a value less than or equal to 0 means no concurrency limit
- Number of search beams
- The percentage of memory to use for the KV cache after model load
- Maximum sequence length
- Tensor parallelism size
- Pipeline parallelism size
- Expert parallelism size (for MoE models)
- Path to a YAML file that overrides the parameters
- Path to a YAML file that sets sampler options
- Modality of the multimodal requests. Choices: image, video
- Maximum input sequence length to use for multimodal models
- Path to a YAML file that defines the Medusa tree (for speculative decoding)
- Path where the report should be written
- Path where iteration logging is written
prepare-dataset
Prepare a dataset for benchmarking.
build
Build a TensorRT engine for benchmarking.
Examples
Throughput Benchmark
Run a throughput benchmark with a prepared dataset:
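A minimal throughput invocation sketch under the PyTorch backend; the model name and dataset path are placeholders:

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch
```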
Latency Benchmark
Measure single-request latency:
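A latency-mode sketch with the same placeholder model and dataset:

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  latency \
  --dataset /tmp/synthetic_128_128.txt
```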
Multi-GPU Throughput
Benchmark with tensor parallelism:
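A multi-GPU sketch; `--tp` as the tensor-parallelism flag follows upstream TensorRT-LLM documentation and should be verified against your version:

```shell
trtllm-bench --model meta-llama/Llama-3.1-70B-Instruct \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --tp 4
```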
TensorRT Engine Benchmark
Benchmark a pre-built TensorRT engine:
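A sketch for the TensorRT backend; the engine-directory flag name (`--engine_dir`) is an assumption based on upstream docs, and the path is a placeholder:

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend tensorrt \
  --engine_dir /path/to/engine
```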
With Custom Configuration
Use a YAML configuration file:
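A sketch passing an LLM API override file via `--extra_llm_api_options` (the YAML key shown follows TensorRT-LLM's LLM API arguments and is an assumption):

```shell
cat > extra-llm-api-config.yml <<'EOF'
enable_chunked_prefill: true
EOF

trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --extra_llm_api_options extra-llm-api-config.yml
```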
Prepare Dataset
Generate a benchmark dataset:
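A dataset-generation sketch; the flags shown mirror TensorRT-LLM's upstream prepare_dataset script (synthetic token-length distribution written to stdout) and are assumptions for the prepare-dataset subcommand, so check your version's help output:

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  prepare-dataset \
  --stdout token-norm-dist \
  --num-requests 1000 \
  --input-mean 128 --output-mean 128 \
  --input-stdev 0 --output-stdev 0 \
  > /tmp/synthetic_128_128.txt
```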
High Concurrency Latency Test
Test latency under load:
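A sketch of a latency run under load, using the concurrency option described above (flag name assumed to be `--concurrency`):

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct \
  latency \
  --dataset /tmp/synthetic_128_128.txt \
  --concurrency 64
```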
Multimodal Benchmark
Benchmark a vision-language model:
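A vision-language sketch using the modality option described above; the model name and dataset path are placeholders, and `--modality` as the flag name is an assumption:

```shell
trtllm-bench --model Qwen/Qwen2-VL-7B-Instruct \
  throughput \
  --dataset /tmp/mm_dataset.jsonl \
  --backend pytorch \
  --modality image
```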
Related Commands
- trtllm-serve - Serve models via API
- trtllm-build - Build TensorRT engines
- trtllm-eval - Evaluate model accuracy