trtllm-eval

Evaluate TensorRT-LLM model accuracy on standard benchmarks, including the MMLU, GSM8K, CNN/DailyMail, GPQA, MMMU, and LongBench datasets.

Usage

trtllm-eval [OPTIONS] COMMAND

Global Options

--model
string
required
Model name, HuggingFace checkpoint path, or TensorRT engine path
--tokenizer
string
Path or name of the tokenizer. Specify this only when the model is a TensorRT engine path
--custom_tokenizer
string
Custom tokenizer type: an alias (e.g., 'deepseek_v32') or a Python import path (e.g., 'tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer')
--backend
string
default:"pytorch"
Backend to use for evaluation. Choices: pytorch, tensorrt
--log_level
string
default:"info"
The logging level. Choices: verbose, info, warning, error, internal_error
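
As an illustration of --custom_tokenizer, the alias form given in the option description can be passed directly. The model path below is a placeholder, not a real checkpoint:

```shell
# Sketch: select the built-in 'deepseek_v32' tokenizer alias
# (the model path is a placeholder).
trtllm-eval --model /path/to/deepseek-checkpoint \
  --custom_tokenizer deepseek_v32 \
  gsm8k
```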

Model Configuration

--max_batch_size
integer
default:"8"
Maximum number of requests that the engine can schedule
--max_num_tokens
integer
default:"2048"
Maximum number of batched input tokens after padding is removed in each batch
--max_seq_len
integer
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config
--max_beam_width
integer
default:"1"
Maximum number of beams for beam search decoding

Parallelism Options

--tp_size
integer
default:"1"
Tensor parallelism size
--pp_size
integer
default:"1"
Pipeline parallelism size
--ep_size
integer
Expert parallelism size (for MoE models)
--gpus_per_node
integer
Number of GPUs per node. Defaults to automatic detection
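
Parallelism options can be combined; for example, a mixture-of-experts checkpoint can use expert parallelism alongside tensor parallelism. A sketch with illustrative values (the model name and sizes are assumptions, not tuned recommendations):

```shell
# Sketch: 8-way tensor parallelism with 8-way expert parallelism
# for a MoE model (illustrative model and sizes).
trtllm-eval --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tp_size 8 \
  --ep_size 8 \
  mmlu
```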

Memory Options

--kv_cache_free_gpu_memory_fraction
float
default:"0.9"
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers
--disable_kv_cache_reuse
boolean
default:"false"
Flag for disabling KV cache reuse
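
If long-context evaluation runs out of GPU memory, lowering the fraction reserved for KV cache leaves more free memory for other allocations. A sketch with an illustrative, untuned value:

```shell
# Sketch: reserve less free GPU memory for KV cache
# (0.7 is illustrative, not a recommendation).
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --kv_cache_free_gpu_memory_fraction 0.7 \
  longbench-v1
```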

Advanced Options

--config
string
Path to a YAML file whose values override the parameters above. Can be specified as either --config or --extra_llm_api_options
--trust_remote_code
boolean
default:"false"
Flag for HuggingFace transformers to trust remote code
--revision
string
The revision to use for the HuggingFace model (branch name, tag name, or commit id)
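
The advanced options can be combined: the sketch below writes a YAML override file inline and pins a model revision. The YAML key names are assumptions about the LLM API option schema, not a verified list; check them against the installed TensorRT-LLM version:

```shell
# Sketch: write a hypothetical override file, then evaluate a pinned revision.
# YAML keys are assumed LLM API option names -- verify against your install.
cat > eval_config.yaml <<'EOF'
enable_chunked_prefill: true
kv_cache_config:
  enable_block_reuse: false
EOF

trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --revision main \
  --config eval_config.yaml \
  mmlu
```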

Evaluation Tasks

mmlu

Evaluate on the MMLU (Massive Multitask Language Understanding) benchmark.
trtllm-eval --model MODEL mmlu [OPTIONS]

gsm8k

Evaluate on the GSM8K (Grade School Math 8K) benchmark.
trtllm-eval --model MODEL gsm8k [OPTIONS]

cnn-dailymail

Evaluate on the CNN/DailyMail summarization benchmark.
trtllm-eval --model MODEL cnn-dailymail [OPTIONS]

gpqa-diamond

Evaluate on the GPQA Diamond benchmark.
trtllm-eval --model MODEL gpqa-diamond [OPTIONS]

gpqa-main

Evaluate on the GPQA Main benchmark.
trtllm-eval --model MODEL gpqa-main [OPTIONS]

gpqa-extended

Evaluate on the GPQA Extended benchmark.
trtllm-eval --model MODEL gpqa-extended [OPTIONS]

mmmu

Evaluate on the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark.
trtllm-eval --model MODEL mmmu [OPTIONS]

longbench-v1

Evaluate on the LongBench v1 benchmark.
trtllm-eval --model MODEL longbench-v1 [OPTIONS]

longbench-v2

Evaluate on the LongBench v2 benchmark.
trtllm-eval --model MODEL longbench-v2 [OPTIONS]

json-mode

Evaluate JSON mode generation capabilities.
trtllm-eval --model MODEL json-mode [OPTIONS]

Examples

MMLU Evaluation

Run MMLU evaluation on a model:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --backend pytorch \
  mmlu

GSM8K with TensorRT Backend

Evaluate math reasoning with TensorRT:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --backend tensorrt \
  gsm8k

Multi-GPU Evaluation

Run evaluation with tensor parallelism:
trtllm-eval --model meta-llama/Llama-3.1-70B-Instruct \
  --tp_size 4 \
  mmlu

CNN/DailyMail Summarization

Evaluate summarization capabilities:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  cnn-dailymail

With Custom Configuration

Use a YAML configuration file:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --config eval_config.yaml \
  mmlu

LongBench Evaluation

Evaluate long-context capabilities:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --max_seq_len 32768 \
  longbench-v2

Separate Tokenizer

Evaluate a TensorRT engine with a separately specified tokenizer:
trtllm-eval --model /path/to/engine \
  --tokenizer /path/to/tokenizer \
  mmlu

High Batch Size for Speed

Maximize evaluation throughput:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --max_batch_size 64 \
  --max_num_tokens 16384 \
  mmlu

GPQA Benchmark Suite

Run all GPQA variants:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gpqa-diamond
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gpqa-main
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gpqa-extended

TensorRT Engine Evaluation

Evaluate a pre-built TensorRT engine:
trtllm-eval --model /path/to/engine \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --backend tensorrt \
  mmlu

Common Evaluation Tasks

Task            Description                         Dataset
mmlu            Multi-task language understanding   57 subjects, multiple choice
gsm8k           Grade school math word problems     8.5K math problems
cnn-dailymail   Text summarization                  News article summaries
gpqa-diamond    Graduate-level science questions    High-quality subset
gpqa-main       Graduate-level science questions    Main dataset
gpqa-extended   Graduate-level science questions    Extended dataset
mmmu            Multimodal understanding            Multi-discipline, multimodal
longbench-v1    Long-context understanding          V1 benchmark
longbench-v2    Long-context understanding          V2 benchmark
json-mode       JSON generation                     Structured output
