Usage
Input Options
- --checkpoint_dir: The directory path that contains the TensorRT-LLM checkpoint
- --model_config: The file path that stores the TensorRT-LLM checkpoint config
- --build_config: The file path that stores the TensorRT-LLM build config
- --model_cls_file: The file path that defines a customized TensorRT-LLM model
- --model_cls_name: The class name of the customized TensorRT-LLM model
Output Options
- --output_dir: The directory path to save the serialized engine files and the engine config file
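For instance, a minimal build reads a converted checkpoint and writes the engine to an output directory. This is an illustrative sketch: the paths are placeholders, and the flag spellings assume the conventional trtllm-build options.

```shell
# Minimal sketch: build an engine from a converted checkpoint.
# ./tllm_checkpoint and ./engine_out are placeholder paths.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out
```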
Engine Configuration
- --max_batch_size: Maximum number of requests that the engine can schedule
- --max_input_len: Maximum input length of one request
- --max_seq_len: Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config. Alias: --max_decoder_seq_len
- --max_beam_width: Maximum number of beams for beam search decoding
- --max_num_tokens: Maximum number of batched input tokens after padding is removed in each batch. Currently, input padding is removed by default
- --opt_num_tokens: Optimal number of batched input tokens after padding is removed in each batch. It equals max_batch_size * max_beam_width by default. Set this value as close as possible to the actual number of tokens in your workload
- --max_encoder_input_len: Maximum encoder input length for encoder-decoder models. Set max_input_len to 1 to start generation from decoder_start_token_id of length 1
- --max_prompt_embedding_table_size: Maximum prompt embedding table size for prompt tuning, or maximum multimodal input size for multimodal models. Setting a value > 0 enables prompt tuning or multimodal input. Alias: --max_multimodal_len

KV Cache Options
- --kv_cache_type: Set the KV cache type. Choices: continuous, paged, disabled. When disabled, the KV cache is turned off and only the context phase is allowed
- --paged_kv_cache: Deprecated. Enabling this option is equivalent to --kv_cache_type paged for transformer-based models

Build Optimization
- --input_timing_cache: The file path to read the timing cache from. This option is ignored if the file does not exist
- --output_timing_cache: The file path to write the timing cache to
- --profiling_verbosity: The profiling verbosity for the generated TensorRT engine. Setting to detailed allows inspecting tactic choices and kernel parameters. Choices: layer_names_only, detailed, none
- --strip_plan: Enable stripping weights from the final TensorRT engine, under the assumption that the refit weights are identical to those provided at build time
- --weight_sparsity: Enable weight sparsity
- --weight_streaming: Enable offloading weights to the CPU and streaming them in at runtime
- --fast_build: Enable features for faster engine building. This may cause some performance degradation and is currently incompatible with int8/int4 quantization without plugins
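As a sketch, several of these optimization options can be combined in one invocation. The paths below are placeholders, and the flag spellings assume the conventional trtllm-build options.

```shell
# Illustrative: reuse a timing cache across builds, strip refittable
# weights from the plan, and enable fast build for development iteration.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --input_timing_cache ./timing.cache \
    --output_timing_cache ./timing.cache \
    --strip_plan \
    --fast_build
```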
Build Process Control
- --workers: The number of workers for building in parallel
- --log_level: The logging level. Choices: verbose, info, warning, error, internal_error
- --enable_debug_output: Enable debug output
- --visualize_network: The directory path to export the TensorRT network as ONNX prior to engine build, for debugging
- --dry_run: Run through the build process, except the actual engine build, for debugging
- --monitor_memory: Enable the memory monitor during engine build
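For example, a parallel build with verbose logging can first be exercised as a dry run before committing to the full compilation. This sketch uses placeholder paths and assumes the conventional trtllm-build flag spellings.

```shell
# Illustrative: 4 parallel build workers, verbose logs, and a dry run
# that walks the build steps without compiling the engine.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --workers 4 \
    --log_level verbose \
    --dry_run
```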
Logits Options
- --logits_dtype: The data type of logits. Choices: float16, float32
- --gather_context_logits: Enable gathering context logits
- --gather_generation_logits: Enable gathering generation logits (deprecated; use the runtime flag instead)
- --gather_all_token_logits: Enable both gather_context_logits and gather_generation_logits

LoRA Options
- --lora_dir: The directory of LoRA weights. If multiple directories are provided, the first one is used for configuration
- --lora_ckpt_source: The source type of the LoRA checkpoint. Choices: hf, nemo
- --lora_target_modules: The names of the target modules that LoRA is applied to. Only effective when lora_plugin is enabled. Choices: attn_qkv, attn_q, attn_k, attn_v, attn_dense, mlp_h_to_4h, mlp_4h_to_h, mlp_gate, cross_attn_qkv, cross_attn_q, cross_attn_k, cross_attn_v, cross_attn_dense, moe_h_to_4h, moe_4h_to_h, moe_gate, moe_router
- --max_lora_rank: Maximum LoRA rank across the LoRA modules. It is used to compute the workspace size of the LoRA plugin
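A LoRA-enabled build might look like the following sketch. The paths, rank, and target modules are illustrative, and the flag spellings (including passing multiple target modules as space-separated values) assume the conventional trtllm-build options.

```shell
# Illustrative: enable the LoRA plugin and register Hugging Face LoRA
# weights against the attention projections.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --lora_plugin auto \
    --lora_dir ./lora_adapter \
    --lora_ckpt_source hf \
    --lora_target_modules attn_q attn_k attn_v \
    --max_lora_rank 64
```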
Speculative Decoding Options
- --speculative_decoding_mode: Mode of speculative decoding. Choices: draft_tokens_external, lookahead_decoding, medusa, explicit_draft_tokens, eagle
- --max_draft_len: Maximum length of draft tokens for the speculative decoding target model
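For instance, a Medusa build combines the decoding mode with a draft-token budget. The checkpoint path and draft length below are illustrative, and the flag spellings assume the conventional trtllm-build options.

```shell
# Illustrative: build a target engine for Medusa speculative decoding.
trtllm-build \
    --checkpoint_dir ./medusa_checkpoint \
    --output_dir ./engine_out \
    --speculative_decoding_mode medusa \
    --max_draft_len 63
```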
Plugin Configuration
The build command supports various plugin configuration options. Use --help to see the full list of plugin options, including:

- --gpt_attention_plugin
- --gemm_plugin
- --lora_plugin
- --context_fmha
- --remove_input_padding
- And many more
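As an illustration, plugin flags typically take either a data type or an enable/disable value. The combination below is a sketch with placeholder paths, not a recommended configuration.

```shell
# Illustrative: common plugin settings. Attention and GEMM plugins take a
# dtype; fused multi-head attention and padding removal take enable/disable.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable
```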
Examples

The build command supports a range of scenarios, including:

- Basic engine build: build a basic TensorRT engine from a checkpoint
- Custom batch and sequence length: build with specific batch size and sequence length limits
- Parallel build: build with multiple workers for faster compilation
- Weight sparsity: build with weight sparsity optimization
- Timing cache: use a timing cache for faster subsequent builds
- LoRA support: build an engine with LoRA adapter support
- Fast build mode: enable fast build for quicker iteration during development
- Speculative decoding: enable Medusa speculative decoding
- Multimodal models: build an engine for vision-language models
- Detailed profiling: build with detailed profiling information
- Debug build: build with debugging and visualization
- Model config: build using separate model and build config files

Build Process
The build process involves:

- Loading checkpoint: reads model weights and configuration
- Network construction: builds the TensorRT network with optimizations
- Engine compilation: compiles the network into an optimized TensorRT engine
- Serialization: saves engine files to the output directory
Output Files
The --output_dir will contain:

- rank*.engine: serialized TensorRT engines (one per GPU rank)
- config.json: engine configuration
- Timing cache (if --output_timing_cache is specified)
Performance Tips
- Use --workers to parallelize builds across multiple GPUs
- Enable --weight_sparsity for sparse models
- Use a timing cache for faster subsequent builds
- Set --max_batch_size and --max_num_tokens based on your workload
- Enable --fast_build during development; disable it for production
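For example, sizing the engine to the workload rather than relying on defaults might look like the following sketch. The limits shown are placeholders to be replaced with values measured from your traffic.

```shell
# Illustrative: explicit scheduling and sequence limits for a workload
# with moderate batch sizes and long prompts.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_seq_len 4096 \
    --max_num_tokens 8192
```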
Related Commands
- trtllm-serve - Serve the built engines
- trtllm-bench - Benchmark engine performance
- trtllm-eval - Evaluate model accuracy