Usage
Global Options
Model name, HuggingFace checkpoint path, or TensorRT engine path
Path or name of the tokenizer. Specify this only when using a TensorRT engine as the model
Custom tokenizer type: alias (e.g., 'deepseek_v32') or Python import path (e.g., 'tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer')
Backend to use for evaluation. Choices:
pytorch, tensorrt
The logging level. Choices:
verbose, info, warning, error, internal_error
Model Configuration
Maximum number of requests that the engine can schedule
Maximum number of batched input tokens after padding is removed in each batch
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config
Maximum number of beams for beam search decoding
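As a sketch, the model-configuration options above might be combined like this. The flag spellings and the model name are assumptions based on common TensorRT-LLM CLI conventions; confirm them with `trtllm-eval --help`:

```shell
# Hypothetical sketch: cap batch size, token budget, and sequence length
# for an evaluation run. Flag names are assumed; verify with --help.
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
    --max_batch_size 64 \
    --max_num_tokens 8192 \
    --max_seq_len 4096 \
    mmlu
```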
Parallelism Options
Tensor parallelism size
Pipeline parallelism size
Expert parallelism size (for MoE models)
Number of GPUs per node. Defaults to automatic detection
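A sketch of the parallelism options in combination (flag names are assumptions; the product of tensor and pipeline parallelism sizes should not exceed the number of visible GPUs):

```shell
# Hypothetical sketch: shard a large model across 8 GPUs
# (tensor parallelism 4 x pipeline parallelism 2).
trtllm-eval --model meta-llama/Llama-3.1-70B-Instruct \
    --tp_size 4 \
    --pp_size 2 \
    gsm8k
```

For MoE models, an expert-parallelism size (e.g., `--ep_size 4`, assuming that spelling) can be combined with the flags above.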
Memory Options
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers
Flag for disabling KV cache reuse
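A sketch of the memory options (both flag spellings are assumptions inferred from the descriptions above; verify with `trtllm-eval --help`):

```shell
# Hypothetical sketch: reserve 85% of the free GPU memory for the
# KV cache and turn off KV cache reuse. Flag names are assumed.
trtllm-eval --model Qwen/Qwen2.5-7B-Instruct \
    --kv_cache_free_gpu_memory_fraction 0.85 \
    --disable_kv_cache_reuse \
    mmlu
```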
Advanced Options
Path to a YAML file that overrides the parameters. Can be specified as either
--config or --extra_llm_api_options
Flag for HuggingFace transformers to trust remote code
The revision to use for the HuggingFace model (branch name, tag name, or commit id)
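A minimal sketch of such a YAML override file. The key names below follow the TensorRT-LLM LLM API but are assumptions; check the LLM API reference for the exact schema:

```yaml
# overrides.yaml -- hypothetical override sketch; key names are assumed.
enable_chunked_prefill: true
kv_cache_config:
  free_gpu_memory_fraction: 0.85
  enable_block_reuse: false
```

The file would then be passed with either `--config overrides.yaml` or `--extra_llm_api_options overrides.yaml`.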
Evaluation Tasks
mmlu
Evaluate on the MMLU (Massive Multitask Language Understanding) benchmark.
gsm8k
Evaluate on the GSM8K (Grade School Math 8K) benchmark.
cnn-dailymail
Evaluate on the CNN/DailyMail summarization benchmark.
gpqa-diamond
Evaluate on the GPQA Diamond benchmark.
gpqa-main
Evaluate on the GPQA Main benchmark.
gpqa-extended
Evaluate on the GPQA Extended benchmark.
mmmu
Evaluate on the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark.
longbench-v1
Evaluate on the LongBench v1 benchmark.
longbench-v2
Evaluate on the LongBench v2 benchmark.
json-mode
Evaluate JSON mode generation capabilities.
Examples
MMLU Evaluation
Run MMLU evaluation on a model.
GSM8K with TensorRT Backend
Evaluate math reasoning with TensorRT.
Multi-GPU Evaluation
Run evaluation with tensor parallelism.
CNN/DailyMail Summarization
Evaluate summarization capabilities.
With Custom Configuration
Use a YAML configuration file.
LongBench Evaluation
Evaluate long-context capabilities.
Custom Tokenizer
Evaluate with a custom tokenizer.
High Batch Size for Speed
Maximize evaluation throughput.
GPQA Benchmark Suite
Run all GPQA variants.
TensorRT Engine Evaluation
Evaluate a pre-built TensorRT engine.
Common Evaluation Tasks
| Task | Description | Dataset |
|---|---|---|
| mmlu | Multi-task language understanding | 57 subjects, multiple choice |
| gsm8k | Grade school math word problems | 8.5K math problems |
| cnn-dailymail | Text summarization | News article summaries |
| gpqa-diamond | Graduate-level science questions | High-quality subset |
| gpqa-main | Graduate-level science questions | Main dataset |
| gpqa-extended | Graduate-level science questions | Extended dataset |
| mmmu | Multimodal understanding | Multi-discipline, multimodal |
| longbench-v1 | Long-context understanding | V1 benchmark |
| longbench-v2 | Long-context understanding | V2 benchmark |
| json-mode | JSON generation | Structured output |
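The example scenarios described earlier can be sketched as concrete invocations. These are illustrative only: the flag spellings, task names, and model paths below are assumptions inferred from the options listed on this page, so verify them with `trtllm-eval --help` before running.

```shell
# Illustrative sketches only -- flag and task spellings are assumed,
# and the model names and paths are placeholders.

# MMLU evaluation (PyTorch backend)
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu

# GSM8K with the TensorRT backend
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct --backend tensorrt gsm8k

# Multi-GPU evaluation with tensor parallelism
trtllm-eval --model meta-llama/Llama-3.1-70B-Instruct --tp_size 4 mmlu

# CNN/DailyMail summarization
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct cnn-dailymail

# Custom YAML configuration
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
    --extra_llm_api_options overrides.yaml mmlu

# Long-context evaluation on LongBench
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct longbench-v2

# High batch size for throughput
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
    --max_batch_size 256 --max_num_tokens 16384 gsm8k

# GPQA benchmark suite, one variant per run
for task in gpqa-diamond gpqa-main gpqa-extended; do
    trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct "$task"
done

# Pre-built TensorRT engine (the tokenizer must be supplied explicitly)
trtllm-eval --model /path/to/engine_dir \
    --tokenizer meta-llama/Llama-3.1-8B-Instruct \
    --backend tensorrt mmlu
```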
Related Commands
- trtllm-serve - Serve models via API
- trtllm-bench - Benchmark model performance
- trtllm-build - Build TensorRT engines