LLM Arguments Configuration
This page documents all configuration options available through the LlmArgs classes: TorchLlmArgs (PyTorch backend) and TrtLlmArgs (TensorRT backend).
Overview
LlmArgs is the main configuration class for TensorRT-LLM. It controls model loading, parallelism, quantization, KV caching, speculative decoding, and runtime behavior.
Base Arguments
These arguments are common to both the PyTorch and TensorRT backends.
Model and Tokenizer
The path to the model checkpoint or the model name from the Hugging Face Hub.
Examples:
- "meta-llama/Llama-2-7b-hf" (Hugging Face Hub)
- "/path/to/local/model" (local directory)
The path to the tokenizer checkpoint or the tokenizer name from the Hugging Face Hub. If not specified, uses the model path.
The mode to initialize the tokenizer.
- auto: Use the fast tokenizer if available
- slow: Force the slow tokenizer
Specify a custom tokenizer implementation. Accepts either:
- A built-in alias (e.g., 'deepseek_v32')
- A Python import path (e.g., 'tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer')
The custom tokenizer must implement from_pretrained(path, **kwargs) and the TokenizerBase interface.
Whether to skip the tokenizer initialization.
Whether to trust remote code when loading models from Hugging Face Hub.
The data type to use for the model.
Supported values:
- "auto": Automatically select based on GPU capability
- "float16": FP16
- "bfloat16": BF16
- "float32": FP32
The revision to use for the model when loading from Hugging Face Hub.
The revision to use for the tokenizer when loading from Hugging Face Hub.
Optional parameters overriding model config defaults.
Precedence: (1) model_kwargs, (2) model config file, (3) model config class defaults. Unknown keys are ignored.
Parallelism Configuration
The tensor parallel size. Splits each layer's weights across GPUs.
Example: For a 70B model on 4 GPUs: tensor_parallel_size=4
The pipeline parallel size. Splits the model's layers into sequential stages.
Example: For a 70B model on 8 GPUs with TP=2, PP=4: tensor_parallel_size=2, pipeline_parallel_size=4
The context parallel size. Splits attention computation across GPUs for long sequences.
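To make the arithmetic concrete, here is a minimal sketch (a hypothetical helper, not part of TensorRT-LLM) showing how the parallel sizes multiply into the number of GPUs a deployment needs:

```python
# Hypothetical helper: world size is the product of the parallel dimensions.
def required_gpus(tensor_parallel_size=1, pipeline_parallel_size=1,
                  context_parallel_size=1):
    return tensor_parallel_size * pipeline_parallel_size * context_parallel_size

# A 70B model with TP=2, PP=4 occupies 8 GPUs in total.
print(required_gpus(tensor_parallel_size=2, pipeline_parallel_size=4))  # -> 8
```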
The number of GPUs per node. Defaults to torch.cuda.device_count().
The cluster parallel size for MoE models' expert weights.
The tensor parallel size for MoE models' expert weights.
The expert parallel size for MoE models' expert weights.
Enable attention data parallelism.
Enable LM head tensor parallelism when attention data parallelism is enabled.
Pipeline parallel partition: a list giving the number of layers assigned to each rank.
Context parallel configuration.
CpConfig fields:
- cp_type: Context parallel type (default: ULYSSES)
- tokens_per_block: Number of tokens per block (used in HELIX)
- use_nccl_for_alltoall: Whether to use NCCL for alltoall (default: True)
- fifo_version: FIFO version for alltoall (default: 2)
- cp_anchor_size: Anchor size for STAR attention
- block_size: Block size for STAR attention
Runtime Limits
The maximum batch size for inference.
The maximum input length (in tokens).
The maximum sequence length (input + output). If not specified, computed from other constraints.
The maximum beam width for beam search.
The maximum number of tokens to process in a single batch.
KV Cache Configuration
KV cache configuration. See KV Cache Configuration section below.
Enable chunked prefill. Splits long prompts into chunks for better GPU utilization.
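The idea behind chunked prefill can be sketched in a few lines (illustrative code, not the TensorRT-LLM implementation): a long prompt is processed in fixed-size chunks so each scheduler iteration stays within the token budget.

```python
# Illustrative sketch of chunked prefill: split a prompt's tokens into
# fixed-size chunks; each chunk is prefilled in a separate iteration.
def split_into_chunks(prompt_tokens, chunk_size):
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

tokens = list(range(10))          # a 10-token "prompt"
chunks = split_into_chunks(tokens, 4)
print([len(c) for c in chunks])   # -> [4, 4, 2]
```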
LoRA Configuration
Enable LoRA (Low-Rank Adaptation) support.
LoRA configuration for the model.
LoraConfig fields:
- max_lora_rank: Maximum LoRA rank
- lora_dir: Directory containing LoRA weights
- lora_target_modules: Target modules for LoRA
Speculative Decoding
Speculative decoding configuration. Supported types:
- DraftTargetDecodingConfig: Draft-target speculation with a separate draft model
- EagleDecodingConfig / Eagle3DecodingConfig: EAGLE speculation
- MedusaDecodingConfig: Medusa speculation
- LookaheadDecodingConfig: Lookahead speculation
- MTPDecodingConfig: MTP (Multi-Token Prediction) speculation
- NGramDecodingConfig: N-gram based speculation
- SADecodingConfig: Suffix Automaton speculation
- PARDDecodingConfig: PARD (Parallel Draft) speculation
- AutoDecodingConfig: Automatically select the speculation algorithm
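As a flavor of how the simplest of these works, here is a toy sketch of the n-gram speculation idea behind NGramDecodingConfig (illustrative only, not the library's implementation): look up the most recent n-gram in the generated history and, if it occurred before, propose the tokens that followed it as draft tokens for the target model to verify.

```python
# Toy n-gram draft proposal: find the last occurrence of the current n-gram
# earlier in the history and return the tokens that followed it.
def ngram_draft(history, n=2, max_draft=3):
    key = tuple(history[-n:])
    for i in range(len(history) - n - 1, -1, -1):
        if tuple(history[i:i + n]) == key:
            return history[i + n:i + n + max_draft]
    return []  # no match: nothing to speculate

# "1 2" occurred earlier, followed by 3 4 5, so those become the draft tokens.
print(ngram_draft([1, 2, 3, 4, 5, 1, 2]))  # -> [3, 4, 5]
```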
Scheduler Configuration
Scheduler configuration.
SchedulerConfig fields:
- capacity_scheduler_policy: The capacity scheduler policy (MAX_UTILIZATION, GUARANTEED_NO_EVICT, STATIC_BATCH)
- context_chunking_policy: Context chunking policy (FIRST_COME_FIRST_SERVED, EQUAL_PROGRESS)
- dynamic_batch_config: Dynamic batch configuration (TensorRT backend only)
- waiting_queue_policy: Waiting queue scheduling policy (default: FCFS)
Advanced Configuration
PEFT (Parameter-Efficient Fine-Tuning) cache configuration for LoRA adapters.
PeftCacheConfig fields:
- num_host_module_layer: Number of LoRA weight sets in the host cache
- num_device_module_layer: Number of LoRA weight sets in the device cache
- optimal_adapter_size: Optimal adapter size for page width (default: 8)
- max_adapter_size: Max supported adapter size (default: 64)
- device_cache_percent: GPU memory fraction for the device cache (default: 0.02)
- host_cache_size: Host cache size in bytes (default: 1GB)
Cache transceiver configuration for disaggregated serving.
CacheTransceiverConfig fields:
- backend: Communication backend (DEFAULT, UCX, NIXL, MOONCAKE, MPI)
- transceiver_runtime: Runtime implementation (CPP, PYTHON)
- max_tokens_in_buffer: Max tokens in the transfer buffer
- kv_transfer_timeout_ms: Timeout for KV cache transfer
Sparse attention configuration. Supported types:
- RocketSparseAttentionConfig: RocketKV sparse attention
- DeepSeekSparseAttentionConfig: DeepSeek sparse attention
- SkipSoftmaxAttentionConfig: Skip-softmax attention
Guided decoding backend.
llguidance is supported in the PyTorch backend only.
Batched logits processor for custom token generation control.
Gather generation logits.
Orchestration
The orchestrator type to use. Defaults to None, which uses MPI.
Environment variable overrides. Note: env vars cached at import time in the code won't update unless the code fetches them from os.environ on demand.
Performance and Monitoring
The maximum number of iterations for iter stats.
The maximum number of iterations for request stats.
Return performance metrics.
The maximum number of requests for perf metrics. Must also set return_perf_metrics to true.
Target URL to which OpenTelemetry traces will be sent.
Postprocessing
The number of processes used for postprocessing the generated tokens, including detokenization.
The path to the tokenizer directory for postprocessing.
The parser to separate reasoning content from output.
PyTorch Backend Arguments (TorchLlmArgs)
These arguments are specific to the PyTorch backend (backend="pytorch").
CUDA Graph Configuration
CUDA graph configuration for performance optimization.
CudaGraphConfig fields:
- batch_sizes: List of batch sizes to create CUDA graphs for
- max_batch_size: Maximum batch size for CUDA graphs (default: 0)
- enable_padding: Round batches up to the nearest CUDA graph batch size (default: false)
MoE Configuration
Mixture of Experts configuration.
MoeConfig fields:
- backend: MoE backend (AUTO, CUTLASS, CUTEDSL, WIDEEP, TRTLLM, DEEPGEMM, VANILLA, TRITON)
- max_num_tokens: Max tokens sent to the MoE at once
- load_balancer: MoE load balancing configuration
- disable_finalize_fusion: Disable FC2+finalize fusion (default: false)
- use_low_precision_moe_combine: Use low-precision combine for NVFP4 (default: false)
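Relating this back to the expert parallel size above: here is an illustrative sketch (not the TensorRT-LLM implementation) of how expert parallelism shards MoE expert weights. With EP=4 over 16 experts, each rank hosts a contiguous slice of 4 experts, and routed tokens are sent to the rank owning the selected expert.

```python
# Illustrative mapping from an expert id to the rank that owns it when
# experts are sharded contiguously across expert-parallel ranks.
def expert_to_rank(expert_id, num_experts, ep_size):
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank

print([expert_to_rank(e, num_experts=16, ep_size=4) for e in (0, 5, 11, 15)])
# -> [0, 1, 2, 3]
```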
Quantization Configuration
NVFP4 GEMM backend configuration.
Nvfp4GemmConfig fields:
- allowed_backends: List of backends for auto-selection (default: ['cutlass', 'cublaslt', 'cuda_core'])
Attention Configuration
Attention backend to use.
Supported values:
- "TRTLLM": TensorRT-LLM attention kernels
- Other backend-specific options
Optimized load balancing for the DP attention scheduler.
AttentionDpConfig fields:
- enable_balance: Whether to enable balancing (default: false)
- timeout_iters: Number of iterations before timeout (default: 50)
- batching_wait_iters: Number of iterations to wait for batching (default: 10)
Disable the overlap scheduler.
Sampling Configuration
The type of sampler to use.
Options:
- "TRTLLMSampler": TensorRT-LLM sampler
- "TorchSampler": PyTorch native sampler
- "auto": Automatically select (uses TorchSampler unless beam search is requested)
Force usage of the async worker in the sampler for D2H copies.
Performance Tuning
Threshold for Python garbage collection of generation 0 objects. Lower values trigger more frequent GC.
If greater than 0, the request queue might wait up to this many milliseconds to receive max_batch_size requests.
Maximum number of iterations the scheduler will wait to accumulate new requests.
Token accumulation threshold ratio (0 to 1) for batch scheduling optimization.
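The interplay of these three knobs can be sketched as a simple dispatch decision (hypothetical logic for illustration; the real scheduler lives inside TensorRT-LLM): dispatch once the batch is full or the token threshold is reached, but never wait past the iteration budget.

```python
# Hypothetical accumulation heuristic: decide whether to dispatch the
# currently queued requests or keep waiting for more.
def should_dispatch(num_requests, max_batch_size,
                    queued_tokens, max_num_tokens,
                    waited_iters, max_wait_iters, token_ratio):
    if num_requests >= max_batch_size:
        return True                        # batch is full
    if queued_tokens >= token_ratio * max_num_tokens:
        return True                        # token threshold reached
    return waited_iters >= max_wait_iters  # stop waiting eventually

# 600 queued tokens exceed 50% of a 1024-token budget, so dispatch.
print(should_dispatch(3, 8, 600, 1024, 2, 10, token_ratio=0.5))  # -> True
```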
Enable autotuner for all tunable ops. Performance may degrade if set to false.
Allreduce strategy to use.
Options: AUTO, NCCL, UB, MINLATENCY, ONESHOT, TWOSHOT, LOWPRECISION, MNNVL, NCCL_SYMMETRIC
Torch Compile
Torch compile configuration.
TorchCompileConfig fields:
- enable_fullgraph: Enable full graph compilation (default: true)
- enable_inductor: Enable the inductor backend (default: false)
- enable_piecewise_cuda_graph: Enable piecewise CUDA graph (default: false)
- capture_num_tokens: List of token counts to capture CUDA graphs for
- enable_userbuffers: Enable userbuffers (default: true)
- max_num_streams: Max CUDA streams (default: 1)
Model Loading
How to load the model weights.
Options:
- "AUTO": Detect the weight type from the model checkpoint
- "DUMMY": Initialize all weights randomly
- "VISION_ONLY": Only load multimodal encoder weights
The format of the provided checkpoint. Can be a custom format registered with register_checkpoint_loader.
Custom checkpoint loader instance. If both checkpoint_format and checkpoint_loader are provided, checkpoint_loader is ignored.
Advanced PyTorch Features
The iteration interval to create responses under streaming mode. Set higher for large batches.
Enable iteration performance statistics.
Enable per-request stats per iteration. Must also set enable_iter_perf_stats to true.
Print iteration logs.
Enable layerwise NVTX markers for profiling.
Enable min-latency mode. Currently only used for Llama4.
Force dynamic quantization.
The config for the KV cache connector.
KvCacheConnectorConfig fields:
- connector_module: Import path to the connector module
- connector_scheduler_class: Scheduler class name
- connector_worker_class: Worker class name
Only load/execute the vision encoder part of the model.
Full worker extension class name for extending RayGPUWorker functionality.
Placement config for RayGPUWorker. Only used with AsyncLLM and orchestrator_type='ray'.
Enable the LLM sleep feature. Requires extra setup that may slow down model loading.
The max number of performance statistic entries.
Configuration for layer-wise benchmark calibration.
LayerwiseBenchmarksConfig fields:
- calibration_mode: NONE, MARK, or COLLECT
- calibration_file_path: File path for calibration data
- calibration_layer_indices: Layer indices to filter
TensorRT Backend Arguments (TrtLlmArgs)
These arguments are specific to the TensorRT backend.
Build Configuration
The workspace directory for the model.
Enable tqdm for progress bar during engine build.
Enable fast build mode.
Build configuration for TensorRT engine.BuildConfig fields: See Build Configuration for details.
Enable build cache to reuse compiled engine components.
Quantization
Quantization configuration.
QuantConfig fields:
- quant_algo: Quantization algorithm (W8A16, W4A16, FP8, NVFP4, W4A16_AWQ, etc.)
- kv_cache_quant_algo: KV cache quantization algorithm
- group_size: Group size for group-wise quantization (default: 128)
- smoothquant_val: Smoothing parameter alpha (default: 0.5)
- clamp_val: Clamp values for FP8 rowwise quantization
- has_zero_point: Whether to use a zero point
- exclude_modules: Module name patterns to skip during quantization
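To clarify what group_size controls, here is a toy sketch of group-wise weight quantization (a simplified symmetric scheme for illustration, not the actual TensorRT-LLM kernels): each group of weights shares a single scale, so larger groups cost less metadata but quantize less precisely.

```python
# Toy symmetric group-wise quantization: every group of `group_size` weights
# shares one scale derived from the group's largest magnitude.
def quantize_groupwise(weights, group_size=4, num_bits=4):
    qmax = 2 ** (num_bits - 1) - 1              # 7 for 4-bit
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid div-by-zero
        out.append(([round(w / scale) for w in group], scale))
    return out

groups = quantize_groupwise([0.7, -0.35, 0.1, 0.0], group_size=4)
print(groups[0][0])  # quantized integers sharing one per-group scale
```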
Calibration configuration for quantization.
CalibConfig fields:
- device: Device for calibration (cuda or cpu)
- calib_dataset: Calibration dataset name or path (default: cnn_dailymail)
- calib_batches: Number of calibration batches (default: 512)
- calib_batch_size: Calibration batch size (default: 1)
- calib_max_seq_length: Max sequence length for calibration (default: 512)
- random_seed: Random seed (default: 1234)
Embedding Configuration
The embedding parallel mode.
Options:
- NONE: No parallelism
- SHARDING_ALONG_VOCAB: Shard along the vocabulary dimension
- SHARDING_ALONG_HIDDEN: Shard along the hidden dimension
Prompt Adapter
Enable prompt adapter support.
The maximum number of prompt adapter tokens.
Runtime Configuration
Batching type.
Options:
- STATIC: Static batching
- INFLIGHT: In-flight batching (continuous batching)
Normalize log probabilities.
Extended runtime performance knob configuration.
ExtendedRuntimePerfKnobConfig fields:
- multi_block_mode: Whether to use multi-block mode (default: true)
- enable_context_fmha_fp32_acc: Enable FP32 accumulation in context FMHA (default: false)
- cuda_graph_mode: Whether to use CUDA graph mode (default: false)
- cuda_graph_cache_size: Number of CUDA graphs to cache (default: 0)
Fail fast when attention window is too large to fit even a single sequence in the KV cache.
KV Cache Configuration
The KvCacheConfig class controls KV cache memory management.
Memory Configuration
The fraction of GPU memory to allocate for the KV cache (0.0 to 1.0).
Example: 0.85 = 85% of free GPU memory
The maximum number of tokens to store in the KV cache. If both max_tokens and free_gpu_memory_fraction are specified, the minimum is used.
The maximum size in bytes of GPU memory for the KV cache. If both this and free_gpu_memory_fraction are specified, the minimum is allocated.
Size of the host (CPU) cache in bytes.
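A back-of-the-envelope sketch of how these settings resolve (illustrative math, not TensorRT-LLM code; the per-token KV size is model dependent): the fraction-derived budget is computed from free memory, and max_tokens caps it when both are set.

```python
# Illustrative KV cache budget resolution: the smaller of the fraction-derived
# token budget and max_tokens wins.
def kv_cache_token_budget(free_gpu_mem_bytes, free_gpu_memory_fraction,
                          bytes_per_token, max_tokens=None):
    budget = int(free_gpu_mem_bytes * free_gpu_memory_fraction // bytes_per_token)
    if max_tokens is not None:
        budget = min(budget, max_tokens)
    return budget

# 40 GiB free, 85% fraction, ~160 KiB of KV per token (assumed, model dependent):
print(kv_cache_token_budget(40 * 2**30, 0.85, 160 * 2**10, max_tokens=200_000))
# -> 200000 (max_tokens caps the fraction-derived budget)
```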
Cache Reuse
Controls whether KV cache blocks can be reused across different requests.
Whether blocks that are only partially matched can be reused.
Whether partially matched blocks that are in use can be reused after copying them.
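The mechanism behind block reuse can be sketched conceptually (illustrative code, not the real implementation): a block's identity is a hash over all tokens up to and including that block, so two requests sharing a prompt prefix map to the same cached blocks.

```python
import hashlib

# Conceptual prefix-block hashing: block i's hash covers tokens [0, (i+1)*B),
# so identical prompt prefixes produce identical block hashes.
def block_hashes(tokens, tokens_per_block):
    hashes = []
    for end in range(tokens_per_block, len(tokens) + 1, tokens_per_block):
        prefix = tokens[:end]
        hashes.append(hashlib.sha256(str(prefix).encode()).hexdigest())
    return hashes

a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8], tokens_per_block=4)
b = block_hashes([1, 2, 3, 4, 9, 9, 9, 9], tokens_per_block=4)
print(a[0] == b[0], a[1] == b[1])  # -> True False (only the shared prefix block matches)
```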
Attention Window
Size of the attention window for each sequence. Only the last N tokens are stored.
Example: [2048] keeps only the last 2048 tokens per sequence
Number of sink tokens (tokens always kept in the attention window).
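Which token positions survive under a sliding window plus sink tokens can be sketched as follows (illustrative; real eviction is block-granular):

```python
# Illustrative retention under a sliding attention window: keep the first
# `num_sink_tokens` positions plus the most recent `attention_window` positions.
def kept_positions(seq_len, attention_window, num_sink_tokens=0):
    sink = list(range(min(num_sink_tokens, seq_len)))
    recent = list(range(max(seq_len - attention_window, 0), seq_len))
    return sorted(set(sink + recent))

# 10-token sequence, window of 4, 2 sink tokens:
print(kept_positions(10, attention_window=4, num_sink_tokens=2))
# -> [0, 1, 6, 7, 8, 9]
```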
Advanced Options
The number of tokens per block in paged attention.
Controls whether offloaded blocks are onboarded back into GPU memory before reuse.
Whether to use UVM (Unified Virtual Memory) for the KV cache.
The data type to use for the KV cache.
Options: auto, fp8, nvfp4, or valid torch dtype strings
Note: This is a PyTorch-backend-only field.
The fraction of KV cache memory reserved for cross attention (encoder-decoder models). Default: 50%.
Maximum size of the event buffer. If set to 0, the event buffer is not used.
Whether to use the KV cache manager v2 (experimental).
The maximum utilization of the KV cache for resume (0.0 to 1.0). Only used with KV cache manager v2.