Usage
Input Options
- --checkpoint_dir: The directory path that contains the TensorRT-LLM checkpoint
- --model_config: The file path that stores the TensorRT-LLM checkpoint config
- --build_config: The file path that stores the TensorRT-LLM build config
- --model_cls_file: The file path that defines a customized TensorRT-LLM model
- --model_cls_name: The class name of the customized TensorRT-LLM model
Output Options
- --output_dir: The directory path to save the serialized engine files and the engine config file
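For instance, a minimal build reads a converted checkpoint and writes the engine to an output directory. This is an illustrative sketch: the paths are placeholders, and the flag spellings assume the conventional trtllm-build options.

```shell
# Minimal sketch: build an engine from a converted checkpoint.
# ./tllm_checkpoint and ./engine_out are placeholder paths.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out
```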
Engine Configuration
- --max_batch_size: Maximum number of requests that the engine can schedule
- --max_input_len: Maximum input length of one request
- --max_seq_len: Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config. Alias: --max_decoder_seq_len
- --max_beam_width: Maximum number of beams for beam search decoding
- --max_num_tokens: Maximum number of batched input tokens after padding is removed in each batch. Currently, input padding is removed by default
- --opt_num_tokens: Optimal number of batched input tokens after padding is removed in each batch. It equals max_batch_size * max_beam_width by default. Set this value as close as possible to the actual number of tokens in your workload
- --max_encoder_input_len: Maximum encoder input length for encoder-decoder models. Set max_input_len to 1 to start generation from decoder_start_token_id of length 1
- --max_prompt_embedding_table_size: Maximum prompt embedding table size for prompt tuning, or maximum multimodal input size for multimodal models. Setting a value > 0 enables prompt tuning or multimodal input. Alias: --max_multimodal_len

KV Cache Options
- --kv_cache_type: Set the KV cache type. Choices: continuous, paged, disabled. When disabled, the KV cache is turned off and only the context phase is allowed
- --paged_kv_cache: Deprecated. Enabling this option is equivalent to --kv_cache_type paged for transformer-based models

Build Optimization
- --input_timing_cache: The file path to read the timing cache from. This option is ignored if the file does not exist
- --output_timing_cache: The file path to write the timing cache to
- --profiling_verbosity: The profiling verbosity for the generated TensorRT engine. Setting to detailed allows inspecting tactic choices and kernel parameters. Choices: layer_names_only, detailed, none
- --strip_plan: Enable stripping weights from the final TensorRT engine, under the assumption that the refit weights are identical to those provided at build time
- --weight_sparsity: Enable weight sparsity
- --weight_streaming: Enable offloading weights to the CPU and streaming them in at runtime
- --fast_build: Enable features for faster engine building. This may cause some performance degradation and is currently incompatible with int8/int4 quantization without plugins
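As a sketch, several of these optimization options can be combined in one invocation. The paths below are placeholders, and the flag spellings assume the conventional trtllm-build options.

```shell
# Illustrative: reuse a timing cache across builds, strip refittable
# weights from the plan, and enable fast build for development iteration.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --input_timing_cache ./timing.cache \
    --output_timing_cache ./timing.cache \
    --strip_plan \
    --fast_build
```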
Build Process Control
- --workers: The number of workers for building in parallel
- --log_level: The logging level. Choices: verbose, info, warning, error, internal_error
- --enable_debug_output: Enable debug output
- --visualize_network: The directory path to export the TensorRT network as ONNX prior to engine build, for debugging
- --dry_run: Run through the build process, except the actual engine build, for debugging
- --monitor_memory: Enable the memory monitor during engine build
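For example, a parallel build with verbose logging can first be exercised as a dry run before committing to the full compilation. This sketch uses placeholder paths and assumes the conventional trtllm-build flag spellings.

```shell
# Illustrative: 4 parallel build workers, verbose logs, and a dry run
# that walks the build steps without compiling the engine.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --workers 4 \
    --log_level verbose \
    --dry_run
```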
Logits Options
- --logits_dtype: The data type of logits. Choices: float16, float32
- --gather_context_logits: Enable gathering context logits
- --gather_generation_logits: Enable gathering generation logits (deprecated; use the runtime flag instead)
- --gather_all_token_logits: Enable both gather_context_logits and gather_generation_logits

LoRA Options
- --lora_dir: The directory of LoRA weights. If multiple directories are provided, the first one is used for configuration
- --lora_ckpt_source: The source type of the LoRA checkpoint. Choices: hf, nemo
- --lora_target_modules: The names of the target modules that LoRA is applied to. Only effective when lora_plugin is enabled. Choices: attn_qkv, attn_q, attn_k, attn_v, attn_dense, mlp_h_to_4h, mlp_4h_to_h, mlp_gate, cross_attn_qkv, cross_attn_q, cross_attn_k, cross_attn_v, cross_attn_dense, moe_h_to_4h, moe_4h_to_h, moe_gate, moe_router
- --max_lora_rank: Maximum LoRA rank across the LoRA modules. It is used to compute the workspace size of the LoRA plugin
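A LoRA-enabled build might look like the following sketch. The paths, rank, and target modules are illustrative, and the flag spellings (including passing multiple target modules as space-separated values) assume the conventional trtllm-build options.

```shell
# Illustrative: enable the LoRA plugin and register Hugging Face LoRA
# weights against the attention projections.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --lora_plugin auto \
    --lora_dir ./lora_adapter \
    --lora_ckpt_source hf \
    --lora_target_modules attn_q attn_k attn_v \
    --max_lora_rank 64
```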
Speculative Decoding Options
- --speculative_decoding_mode: Mode of speculative decoding. Choices: draft_tokens_external, lookahead_decoding, medusa, explicit_draft_tokens, eagle
- --max_draft_len: Maximum length of draft tokens for the speculative decoding target model
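For instance, a Medusa build combines the decoding mode with a draft-token budget. The checkpoint path and draft length below are illustrative, and the flag spellings assume the conventional trtllm-build options.

```shell
# Illustrative: build a target engine for Medusa speculative decoding.
trtllm-build \
    --checkpoint_dir ./medusa_checkpoint \
    --output_dir ./engine_out \
    --speculative_decoding_mode medusa \
    --max_draft_len 63
```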
Plugin Configuration
The build command supports various plugin configuration options. Use --help to see the full list of plugin options, including:

- --gpt_attention_plugin
- --gemm_plugin
- --lora_plugin
- --context_fmha
- --remove_input_padding
- And many more
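As an illustration, plugin flags typically take either a data type or an enable/disable value. The combination below is a sketch with placeholder paths, not a recommended configuration.

```shell
# Illustrative: common plugin settings. Attention and GEMM plugins take a
# dtype; fused multi-head attention and padding removal take enable/disable.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --gpt_attention_plugin float16 \
    --gemm_plugin float16 \
    --context_fmha enable \
    --remove_input_padding enable
```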
Examples

The build command supports a range of scenarios, including:

- Basic engine build: build a basic TensorRT engine from a checkpoint
- Custom batch and sequence length: build with specific batch size and sequence length limits
- Parallel build: build with multiple workers for faster compilation
- Weight sparsity: build with weight sparsity optimization
- Timing cache: use a timing cache for faster subsequent builds
- LoRA support: build an engine with LoRA adapter support
- Fast build mode: enable fast build for quicker iteration during development
- Speculative decoding: enable Medusa speculative decoding
- Multimodal models: build an engine for vision-language models
- Detailed profiling: build with detailed profiling information
- Debug build: build with debugging and visualization
- Model config: build using separate model and build config files

Build Process
The build process involves:

- Loading checkpoint: reads model weights and configuration
- Network construction: builds the TensorRT network with optimizations
- Engine compilation: compiles the network into an optimized TensorRT engine
- Serialization: saves engine files to the output directory
Output Files
The --output_dir will contain:

- rank*.engine: serialized TensorRT engines (one per GPU rank)
- config.json: engine configuration
- Timing cache (if --output_timing_cache is specified)
Performance Tips
- Use --workers to parallelize builds across multiple GPUs
- Enable --weight_sparsity for sparse models
- Use a timing cache for faster subsequent builds
- Set --max_batch_size and --max_num_tokens based on your workload
- Enable --fast_build during development; disable it for production
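For example, sizing the engine to the workload rather than relying on defaults might look like the following sketch. The limits shown are placeholders to be replaced with values measured from your traffic.

```shell
# Illustrative: explicit scheduling and sequence limits for a workload
# with moderate batch sizes and long prompts.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_out \
    --max_batch_size 64 \
    --max_input_len 2048 \
    --max_seq_len 4096 \
    --max_num_tokens 8192
```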
Related Commands
- trtllm-serve - Serve the built engines
- trtllm-bench - Benchmark engine performance
- trtllm-eval - Evaluate model accuracy