trtllm-eval

Evaluate TensorRT-LLM model accuracy on standard benchmarks, including the MMLU, GSM8K, CNN/DailyMail, GPQA, MMMU, and LongBench datasets.

Usage

trtllm-eval [OPTIONS] COMMAND

Global Options

--model
string
required
Model name, HuggingFace checkpoint path, or TensorRT engine path
--tokenizer
string
Path or name of the tokenizer. Specify this only when the model is a TensorRT engine path
--custom_tokenizer
string
Custom tokenizer type: an alias (e.g., 'deepseek_v32') or a Python import path (e.g., 'tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer')
--backend
string
default:"pytorch"
Backend to use for evaluation. Choices: pytorch, tensorrt
--log_level
string
default:"info"
The logging level. Choices: verbose, info, warning, error, internal_error
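
As an illustration of --custom_tokenizer, the alias form given in the option description can be passed directly. The model path below is a placeholder, not a real checkpoint:

```shell
# Sketch: select the built-in 'deepseek_v32' tokenizer alias
# (the model path is a placeholder).
trtllm-eval --model /path/to/deepseek-checkpoint \
  --custom_tokenizer deepseek_v32 \
  gsm8k
```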

Model Configuration

--max_batch_size
integer
default:"8"
Maximum number of requests that the engine can schedule
--max_num_tokens
integer
default:"2048"
Maximum number of batched input tokens after padding is removed in each batch
--max_seq_len
integer
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config
--max_beam_width
integer
default:"1"
Maximum number of beams for beam search decoding

Parallelism Options

--tp_size
integer
default:"1"
Tensor parallelism size
--pp_size
integer
default:"1"
Pipeline parallelism size
--ep_size
integer
Expert parallelism size (for MoE models)
--gpus_per_node
integer
Number of GPUs per node. Defaults to automatic detection
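
Parallelism options can be combined; for example, a mixture-of-experts checkpoint can use expert parallelism alongside tensor parallelism. A sketch with illustrative values (the model name and sizes are assumptions, not tuned recommendations):

```shell
# Sketch: 8-way tensor parallelism with 8-way expert parallelism
# for a MoE model (illustrative model and sizes).
trtllm-eval --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tp_size 8 \
  --ep_size 8 \
  mmlu
```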

Memory Options

--kv_cache_free_gpu_memory_fraction
float
default:"0.9"
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers
--disable_kv_cache_reuse
boolean
default:"false"
Flag for disabling KV cache reuse
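
If long-context evaluation runs out of GPU memory, lowering the fraction reserved for KV cache leaves more free memory for other allocations. A sketch with an illustrative, untuned value:

```shell
# Sketch: reserve less free GPU memory for KV cache
# (0.7 is illustrative, not a recommendation).
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --kv_cache_free_gpu_memory_fraction 0.7 \
  longbench-v1
```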

Advanced Options

--config
string
Path to a YAML file whose values override the parameters above. Can be specified as either --config or --extra_llm_api_options
--trust_remote_code
boolean
default:"false"
Flag for HuggingFace transformers to trust remote code
--revision
string
The revision to use for the HuggingFace model (branch name, tag name, or commit id)
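
The advanced options can be combined: the sketch below writes a YAML override file inline and pins a model revision. The YAML key names are assumptions about the LLM API option schema, not a verified list; check them against the installed TensorRT-LLM version:

```shell
# Sketch: write a hypothetical override file, then evaluate a pinned revision.
# YAML keys are assumed LLM API option names -- verify against your install.
cat > eval_config.yaml <<'EOF'
enable_chunked_prefill: true
kv_cache_config:
  enable_block_reuse: false
EOF

trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --revision main \
  --config eval_config.yaml \
  mmlu
```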

Evaluation Tasks

mmlu

Evaluate on the MMLU (Massive Multitask Language Understanding) benchmark.
trtllm-eval --model MODEL mmlu [OPTIONS]

gsm8k

Evaluate on the GSM8K (Grade School Math 8K) benchmark.
trtllm-eval --model MODEL gsm8k [OPTIONS]

cnn-dailymail

Evaluate on the CNN/DailyMail summarization benchmark.
trtllm-eval --model MODEL cnn-dailymail [OPTIONS]

gpqa-diamond

Evaluate on the GPQA Diamond benchmark.
trtllm-eval --model MODEL gpqa-diamond [OPTIONS]

gpqa-main

Evaluate on the GPQA Main benchmark.
trtllm-eval --model MODEL gpqa-main [OPTIONS]

gpqa-extended

Evaluate on the GPQA Extended benchmark.
trtllm-eval --model MODEL gpqa-extended [OPTIONS]

mmmu

Evaluate on the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark.
trtllm-eval --model MODEL mmmu [OPTIONS]

longbench-v1

Evaluate on the LongBench v1 benchmark.
trtllm-eval --model MODEL longbench-v1 [OPTIONS]

longbench-v2

Evaluate on the LongBench v2 benchmark.
trtllm-eval --model MODEL longbench-v2 [OPTIONS]

json-mode

Evaluate JSON mode generation capabilities.
trtllm-eval --model MODEL json-mode [OPTIONS]

Examples

MMLU Evaluation

Run MMLU evaluation on a model:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --backend pytorch \
  mmlu

GSM8K with TensorRT Backend

Evaluate math reasoning with TensorRT:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --backend tensorrt \
  gsm8k

Multi-GPU Evaluation

Run evaluation with tensor parallelism:
trtllm-eval --model meta-llama/Llama-3.1-70B-Instruct \
  --tp_size 4 \
  mmlu

CNN/DailyMail Summarization

Evaluate summarization capabilities:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  cnn-dailymail

With Custom Configuration

Use a YAML configuration file:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --config eval_config.yaml \
  mmlu

LongBench Evaluation

Evaluate long-context capabilities:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --max_seq_len 32768 \
  longbench-v2

Separate Tokenizer

Evaluate a TensorRT engine with a separately specified tokenizer:
trtllm-eval --model /path/to/engine \
  --tokenizer /path/to/tokenizer \
  mmlu

High Batch Size for Speed

Maximize evaluation throughput:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
  --max_batch_size 64 \
  --max_num_tokens 16384 \
  mmlu

GPQA Benchmark Suite

Run all GPQA variants:
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gpqa-diamond
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gpqa-main
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gpqa-extended

TensorRT Engine Evaluation

Evaluate a pre-built TensorRT engine:
trtllm-eval --model /path/to/engine \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --backend tensorrt \
  mmlu

Common Evaluation Tasks

Task            Description                         Dataset
mmlu            Multi-task language understanding   57 subjects, multiple choice
gsm8k           Grade school math word problems     8.5K math problems
cnn-dailymail   Text summarization                  News article summaries
gpqa-diamond    Graduate-level science questions    High-quality subset
gpqa-main       Graduate-level science questions    Main dataset
gpqa-extended   Graduate-level science questions    Extended dataset
mmmu            Multimodal understanding            Multi-discipline, multimodal
longbench-v1    Long-context understanding          V1 benchmark
longbench-v2    Long-context understanding          V2 benchmark
json-mode       JSON generation                     Structured output
