The trtllm-build command compiles TensorRT-LLM model checkpoints into optimized TensorRT engines for high-performance inference, exposing a range of build-time optimization options.

Usage

trtllm-build [OPTIONS]

Input Options

--checkpoint_dir
string
The directory containing the TensorRT-LLM checkpoint
--model_config
string
The path to the TensorRT-LLM checkpoint config file
--build_config
string
The path to the TensorRT-LLM build config file
--model_cls_file
string
The path to the file that defines a customized TensorRT-LLM model
--model_cls_name
string
The class name of the customized TensorRT-LLM model

Output Options

--output_dir
string
default:"engine_outputs"
The directory path to save the serialized engine files and engine config file

Engine Configuration

--max_batch_size
integer
default:"8"
Maximum number of requests that the engine can schedule
--max_input_len
integer
default:"1024"
Maximum input length of one request
--max_seq_len
integer
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config. Alias: --max_decoder_seq_len
--max_beam_width
integer
default:"1"
Maximum number of beams for beam search decoding
--max_num_tokens
integer
default:"2048"
Maximum number of batched input tokens after padding is removed in each batch. Currently, the input padding is removed by default
--opt_num_tokens
integer
Optimal number of batched input tokens after padding is removed in each batch. Defaults to max_batch_size * max_beam_width. Set this value as close as possible to the actual number of tokens in your workload
--max_encoder_input_len
integer
default:"1024"
Maximum encoder input length for encoder-decoder models. Set max_input_len to 1 to start generation from decoder_start_token_id of length 1
--max_prompt_embedding_table_size
integer
default:"0"
Maximum prompt embedding table size for prompt tuning, or maximum multimodal input size for multimodal models. Setting a value > 0 enables prompt tuning or multimodal input. Alias: --max_multimodal_len

KV Cache Options

--kv_cache_type
string
Set the KV cache type. Choices: continuous, paged, disabled. When disabled, the KV cache is turned off and only the context phase is allowed
--paged_kv_cache
string
Deprecated. Enabling this option is equivalent to --kv_cache_type paged for transformer-based models
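
For most transformer decoders, the paged KV cache is the usual choice for production serving. As an illustrative invocation (paths are placeholders, not from the original text):
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --kv_cache_type paged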

Build Optimization

--input_timing_cache
string
The file path to read the timing cache. This option is ignored if the file does not exist
--output_timing_cache
string
The file path to write the timing cache
--profiling_verbosity
string
default:"layer_names_only"
The profiling verbosity for the generated TensorRT engine. Setting to detailed allows inspecting tactic choices and kernel parameters. Choices: layer_names_only, detailed, none
--strip_plan
boolean
default:"false"
Enable stripping weights from the final TensorRT engine under the assumption that the refit weights are identical to those provided at build time
--weight_sparsity
boolean
default:"false"
Enable weight sparsity
--weight_streaming
boolean
default:"false"
Enable offloading weights to CPU and streaming loading at runtime
--fast_build
boolean
default:"false"
Enable features for faster engine building. This may cause some performance degradation and is currently incompatible with int8/int4 quantization without plugin

Build Process Control

--workers
integer
default:"1"
The number of workers for building in parallel
--log_level
string
default:"info"
The logging level. Choices: verbose, info, warning, error, internal_error
--enable_debug_output
boolean
default:"false"
Enable debug output
--visualize_network
string
The directory path to export the TensorRT network as ONNX before the engine build, for debugging
--dry_run
boolean
default:"false"
Run through the build process, skipping the actual engine build, for debugging
--monitor_memory
boolean
default:"false"
Enable memory monitoring during the engine build

Logits Options

--logits_dtype
string
The data type of logits. Choices: float16, float32
--gather_context_logits
boolean
default:"false"
Enable gathering context logits
--gather_generation_logits
boolean
default:"false"
Enable gathering generation logits (deprecated, use runtime flag instead)
--gather_all_token_logits
boolean
default:"false"
Enable both gather_context_logits and gather_generation_logits
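
For example, to build an engine that returns context logits (useful for scoring or perplexity evaluation), you might combine the flags above; paths are illustrative, and note that generation-logit gathering is deprecated at build time in favor of the runtime flag:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --gather_context_logits \
  --logits_dtype float32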

LoRA Options

--lora_dir
string
The directory of LoRA weights. If multiple directories are provided, the first one is used for configuration
--lora_ckpt_source
string
default:"hf"
The source type of LoRA checkpoint. Choices: hf, nemo
--lora_target_modules
string
The names of the modules that LoRA is applied to. Only effective when the LoRA plugin is enabled. Choices: attn_qkv, attn_q, attn_k, attn_v, attn_dense, mlp_h_to_4h, mlp_4h_to_h, mlp_gate, cross_attn_qkv, cross_attn_q, cross_attn_k, cross_attn_v, cross_attn_dense, moe_h_to_4h, moe_4h_to_h, moe_gate, moe_router
--max_lora_rank
integer
default:"64"
Maximum LoRA rank for different LoRA modules. It is used to compute the workspace size of LoRA plugin

Speculative Decoding Options

--speculative_decoding_mode
string
Mode of speculative decoding. Choices: draft_tokens_external, lookahead_decoding, medusa, explicit_draft_tokens, eagle
--max_draft_len
integer
default:"0"
Maximum length of draft tokens for the speculative decoding target model

Plugin Configuration

The build command supports various plugin configuration options. Use --help to see the full list of plugin options including:
  • --gpt_attention_plugin
  • --gemm_plugin
  • --lora_plugin
  • --context_fmha
  • --remove_input_padding
  • And many more
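
As a sketch, a typical float16 build might set a few of these plugin options explicitly; the exact accepted values (e.g. a dtype for --gemm_plugin, enable/disable for --context_fmha) can vary by TensorRT-LLM version, so check --help for your installation:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --gpt_attention_plugin float16 \
  --gemm_plugin float16 \
  --context_fmha enable \
  --remove_input_padding enable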

Examples

Basic Engine Build

Build a basic TensorRT engine from a checkpoint:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs

Custom Batch and Sequence Length

Build with specific batch size and sequence length limits:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --max_batch_size 64 \
  --max_input_len 2048 \
  --max_seq_len 4096

Parallel Build

Build with multiple workers for faster compilation:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --workers 4

Enable Weight Sparsity

Build with weight sparsity optimization:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --weight_sparsity

Build with Timing Cache

Use timing cache for faster subsequent builds:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --input_timing_cache ./timing_cache.bin \
  --output_timing_cache ./timing_cache.bin

Enable LoRA Support

Build engine with LoRA adapter support:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --lora_dir ./lora_weights \
  --max_lora_rank 64 \
  --lora_target_modules attn_qkv attn_dense mlp_h_to_4h mlp_4h_to_h

Fast Build Mode

Enable fast build for quicker iteration during development:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --fast_build

Build with Speculative Decoding

Enable Medusa speculative decoding:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --speculative_decoding_mode medusa \
  --max_draft_len 5

Build for Multimodal Models

Build engine for vision-language models:
trtllm-build --checkpoint_dir ./vlm_checkpoint \
  --output_dir ./engine_outputs \
  --max_prompt_embedding_table_size 4096 \
  --max_batch_size 16

Detailed Profiling

Build with detailed profiling information:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --profiling_verbosity detailed

Debug Build

Build with debugging and visualization:
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --enable_debug_output \
  --visualize_network ./network_visualization \
  --monitor_memory

Build from Model Config

Build using separate model and build config files:
trtllm-build --model_config ./model_config.json \
  --build_config ./build_config.json \
  --output_dir ./engine_outputs

Build Process

The build process involves:
  1. Loading checkpoint - Reads model weights and configuration
  2. Network construction - Builds TensorRT network with optimizations
  3. Engine compilation - Compiles network into optimized TensorRT engine
  4. Serialization - Saves engine files to output directory
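
The first two steps can be exercised without the costly compilation step by combining --dry_run with a more verbose log level, which is a quick way to validate a build configuration (paths are illustrative):
trtllm-build --checkpoint_dir ./model_checkpoint \
  --output_dir ./engine_outputs \
  --dry_run \
  --log_level verbose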

Output Files

The --output_dir will contain:
  • rank*.engine - Serialized TensorRT engines (one per GPU rank)
  • config.json - Engine configuration
  • Timing cache (if --output_timing_cache is specified)

Performance Tips

  • Use --workers to parallelize builds across multiple GPUs
  • Enable --weight_sparsity for sparse models
  • Use timing cache for faster subsequent builds
  • Set --max_batch_size and --max_num_tokens based on your workload
  • Enable --fast_build during development, disable for production
