The trtllm-bench CLI is a packaged benchmarking utility designed to make it easy to reproduce officially published performance results and to measure your own workloads.
Overview
trtllm-bench offers:
- A streamlined way to build tuned engines for benchmarking across various models and platforms
- An entirely Python workflow for benchmarking
- Ability to benchmark various flows and features within TensorRT-LLM
- Support for throughput and latency benchmarking modes
Before You Begin
For rigorous benchmarking where consistent and reproducible results are critical, proper GPU configuration is essential. These settings help maximize GPU utilization, eliminate performance variability, and ensure optimal conditions for accurate measurements.
Reset GPU Clock Management
Allow the GPU to dynamically adjust its clock speeds based on workload and temperature:
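A typical way to hand clock management back to the driver is shown below; the `-i 0` GPU index is illustrative and should match your target device:

```shell
# Reset GPU and memory clocks to default, driver-managed behavior
# (requires root; -i selects the GPU index).
sudo nvidia-smi -i 0 --reset-gpu-clocks
sudo nvidia-smi -i 0 --reset-memory-clocks
```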
While locking clocks at maximum frequency might seem beneficial, it can sometimes lead to thermal throttling and reduced performance.
Set Power Limits
First query the GPU's maximum power limit, then configure the GPU to operate at that limit:
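A common sequence for both steps is sketched below; the 700 W value is a placeholder, so substitute the "Max Power Limit" your GPU actually reports:

```shell
# Query the board's power readings, including "Max Power Limit".
nvidia-smi -q -d POWER
# Set the power limit to the reported maximum (requires root).
# 700 is a placeholder -- use your GPU's reported value in watts.
sudo nvidia-smi -pl 700
```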
Validated Networks
While trtllm-bench should work with any network that TensorRT-LLM supports, the following have been extensively validated:
Llama Models
- meta-llama/Llama-2-7b-hf
- meta-llama/Llama-2-70b-hf
- meta-llama/Meta-Llama-3-8B
- meta-llama/Meta-Llama-3-70B
- meta-llama/Llama-3.1-8B
- meta-llama/Llama-3.1-70B
- meta-llama/Llama-3.1-405B
Other Models
- mistralai/Mistral-7B-v0.1
- mistralai/Mixtral-8x7B-v0.1
- tiiuae/falcon-180B
- EleutherAI/gpt-j-6b
Supported Quantization Modes
trtllm-bench supports the following quantization modes:
- None - No quantization applied
- FP8 - 8-bit floating point quantization
- NVFP4 - 4-bit NVIDIA floating point quantization
Although TensorRT-LLM supports more quantization modes, trtllm-bench currently only configures for this subset.
Preparing a Dataset
The throughput benchmark uses a fixed JSON schema to specify requests:
| Key | Required | Type | Description |
|---|---|---|---|
| task_id | Yes | String | Unique identifier for the request |
| prompt | No* | String | Input text for generation |
| input_ids | Yes* | List[Integer] | Token IDs that make up the prompt |
| output_tokens | Yes | Integer | Number of tokens to generate |

*Each request must supply either prompt or input_ids.
Example Dataset Entries
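Hypothetical JSON Lines entries conforming to the schema above (prompts, token IDs, and lengths are illustrative) can be written like so:

```shell
# Two hypothetical entries: one with a human-readable prompt,
# one with a pre-tokenized input_ids list.
cat > dataset.jsonl <<'EOF'
{"task_id": "0", "prompt": "Explain how GPUs accelerate matrix multiplication.", "output_tokens": 128}
{"task_id": "1", "input_ids": [791, 4062, 14198, 39935], "output_tokens": 64}
EOF
```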
Entries may use a human-readable prompt or pre-tokenized input_ids.
Generating Synthetic Datasets
Use the prepare-dataset command to generate synthetic data:
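A sketch of one such invocation follows. The flag names are assumptions drawn from the standalone prepare_dataset.py script that ships with TensorRT-LLM; verify them with trtllm-bench prepare-dataset --help:

```shell
# Hypothetical: 1000 requests with ~128 input and 128 output tokens,
# written to stdout and redirected into a JSON Lines file.
trtllm-bench prepare-dataset \
  --tokenizer meta-llama/Llama-3.1-8B \
  --stdout token-norm-dist \
  --num-requests 1000 \
  --input-mean 128 --input-stdev 0 \
  --output-mean 128 --output-stdev 0 \
  > synthetic_128_128.jsonl
```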
Throughput Benchmarking
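A hypothetical throughput invocation; the model ID, dataset path, and backend value are placeholders to adapt to your setup:

```shell
# Sketch only -- verify flags with `trtllm-bench throughput --help`.
trtllm-bench --model meta-llama/Llama-3.1-8B \
  throughput \
  --dataset synthetic_128_128.jsonl \
  --backend pytorch
```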
Benchmark throughput using the PyTorch backend with a prepared dataset.
Example Output
Streaming Metrics
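Streaming is assumed here to be enabled via a --streaming flag on the throughput subcommand; this flag name is an assumption, so check --help for your version:

```shell
# Hypothetical: the same throughput run with streaming enabled.
trtllm-bench --model meta-llama/Llama-3.1-8B \
  throughput \
  --dataset synthetic_128_128.jsonl \
  --streaming
```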
When streaming is enabled, time to first token (TTFT) and inter-token latency (ITL) metrics are also recorded.
Latency Benchmarking
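A hypothetical latency-mode invocation, with model and dataset path as placeholders:

```shell
# Sketch only -- verify flags with `trtllm-bench latency --help`.
trtllm-bench --model meta-llama/Llama-3.1-8B \
  latency \
  --dataset synthetic_128_128.jsonl
```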
For low-latency mode benchmarking, use the latency subcommand in place of throughput.
Benchmarking with LoRA Adapters
The PyTorch workflow supports benchmarking with LoRA (Low-Rank Adaptation) adapters.
Prepare LoRA Dataset
Generate requests with LoRA metadata:
Key options:
- --lora-dir: Parent directory containing LoRA adapter subdirectories named by task IDs (e.g., 0/, 1/)
- --rand-task-id: Range of LoRA task IDs to randomly assign
- --task-id: Fixed LoRA task ID for all requests
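Putting those options together, a hypothetical invocation might look like the following; the other flags are illustrative and the argument syntax for --rand-task-id is an assumption:

```shell
# Hypothetical: synthetic requests with LoRA task IDs drawn from the
# range 0-1, adapters located under ./loras.
trtllm-bench prepare-dataset \
  --tokenizer meta-llama/Llama-3.1-8B \
  --num-requests 100 \
  --lora-dir ./loras \
  --rand-task-id 0 1 \
  > lora_dataset.jsonl
```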
The LoRA directory structure should have task-specific subdirectories named by their task IDs. Each subdirectory should contain the LoRA adapter files for that task.
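For illustration, a layout matching that description could be created as follows (directory names and the adapter file names mentioned in the comment are hypothetical):

```shell
# Each subdirectory name is a LoRA task ID; each would hold that task's
# adapter files (e.g., an adapter config plus the adapter weights).
mkdir -p loras/0 loras/1
```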
Multimodal Benchmarking
trtllm-bench can also benchmark multimodal models (e.g., vision-language models).
Quantization
To run quantized benchmarks, use pre-quantized checkpoints. TensorRT-LLM provides FP8-quantized Llama-3.1 models:
- nvidia/Llama-3.1-8B-Instruct-FP8
- nvidia/Llama-3.1-70B-Instruct-FP8
- nvidia/Llama-3.1-405B-Instruct-FP8
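Running against a pre-quantized checkpoint is then presumably just a matter of pointing --model at it; the dataset path below is a placeholder:

```shell
# Hypothetical FP8 run using one of the checkpoints listed above.
trtllm-bench --model nvidia/Llama-3.1-8B-Instruct-FP8 \
  throughput \
  --dataset synthetic_128_128.jsonl
```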
KV Cache Quantization Mapping
When a checkpoint doesn't specify KV cache quantization, trtllm-bench applies these defaults:
| Compute Precision | Checkpoint KV Cache | Applied KV Cache |
|---|---|---|
| null | null | null (no config exists) |
| FP8 | FP8 | FP8 (matches checkpoint) |
| FP8 | null | FP8 (set by benchmark) |
| NVFP4 | null | FP8 (set by benchmark) |
Forcing KV Cache Precision
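A sketch of such an override file is below. The kv_cache_dtype key name is an assumption and may differ by version, so verify it against the current TensorRT-LLM docs:

```shell
# Write a hypothetical override file forcing an FP8 KV cache.
cat > kv-cache-fp8.yaml <<'EOF'
kv_cache_dtype: fp8
EOF
```

The file would then presumably be passed to the benchmark command via the --extra_llm_api_options flag.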
Override KV cache quantization in your YAML config.
Online Serving Benchmarking
For benchmarking the OpenAI-compatible server, see the trtllm-serve benchmarking guide. Alternatively, use AIPerf, a comprehensive benchmarking tool for OpenAI-compatible servers.
Related Resources
Optimization Guide
Learn performance tuning best practices
Profiling
Profile and analyze performance bottlenecks
Reference Configs
170+ optimized serving configurations