LLM Arguments Configuration

This page documents all configuration options available through LlmArgs (PyTorch backend) and TrtLlmArgs (TensorRT backend) classes.

Overview

LlmArgs is the main configuration class for TensorRT-LLM. It controls model loading, parallelism, quantization, KV caching, speculative decoding, and runtime behavior.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import LlmArgs, KvCacheConfig

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=2,
    dtype="bfloat16",
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.85
    )
)

llm = LLM(args)

Base Arguments

These arguments are common to both PyTorch and TensorRT backends.

Model and Tokenizer

model
Union[str, Path]
required
The path to the model checkpoint or the model name from the Hugging Face Hub.
Examples:
  • "meta-llama/Llama-2-7b-hf" (HuggingFace Hub)
  • "/path/to/local/model" (Local directory)
tokenizer
Optional[Union[str, Path, TokenizerBase, PreTrainedTokenizerBase]]
The path to the tokenizer checkpoint or the tokenizer name from the Hugging Face Hub. If not specified, uses the model path.
tokenizer_mode
Literal['auto', 'slow']
default:"auto"
The mode to initialize the tokenizer.
  • auto: Use fast tokenizer if available
  • slow: Force slow tokenizer
custom_tokenizer
Optional[str]
Specify a custom tokenizer implementation. Accepts either:
  • A built-in alias (e.g., 'deepseek_v32')
  • A Python import path (e.g., 'tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer')
The tokenizer class must implement from_pretrained(path, **kwargs) and the TokenizerBase interface.
skip_tokenizer_init
bool
default:false
Whether to skip the tokenizer initialization.
trust_remote_code
bool
default:false
Whether to trust remote code when loading models from Hugging Face Hub.
dtype
str
default:"auto"
The data type to use for the model.
Supported values:
  • "auto": Automatically select based on GPU capability
  • "float16": FP16
  • "bfloat16": BF16
  • "float32": FP32
revision
Optional[str]
The revision to use for the model when loading from Hugging Face Hub.
tokenizer_revision
Optional[str]
The revision to use for the tokenizer when loading from Hugging Face Hub.
model_kwargs
Optional[Dict[str, Any]]
Optional parameters overriding model config defaults.
Precedence: (1) model_kwargs, (2) model config file, (3) model config class defaults. Unknown keys are ignored.

Parallelism Configuration

tensor_parallel_size
int
default:1
The tensor parallel size. Splits each layer's weights across GPUs.
Example: for a 70B model on 4 GPUs: tensor_parallel_size=4
pipeline_parallel_size
int
default:1
The pipeline parallel size. Splits model layers into stages.
Example: for a 70B model on 8 GPUs with TP=2, PP=4: tensor_parallel_size=2, pipeline_parallel_size=4
context_parallel_size
int
default:1
The context parallel size. Splits attention computation across GPUs for long sequences.
gpus_per_node
Optional[int]
default:"auto"
The number of GPUs per node. Defaults to torch.cuda.device_count().
moe_cluster_parallel_size
Optional[int]
The cluster parallel size for MoE models' expert weights.
moe_tensor_parallel_size
Optional[int]
The tensor parallel size for MoE models' expert weights.
moe_expert_parallel_size
Optional[int]
The expert parallel size for MoE models' expert weights.
enable_attention_dp
bool
default:false
Enable attention data parallelism.
enable_lm_head_tp_in_adp
bool
default:false
Enable LM-head tensor parallelism when attention data parallelism is enabled.
pp_partition
Optional[List[int]]
Pipeline parallel partition: a list specifying the number of layers assigned to each rank.
cp_config
Optional[CpConfig]
Context parallel config.
CpConfig fields:
  • cp_type: Context parallel type (default: ULYSSES)
  • tokens_per_block: Number of tokens per block (used in HELIX)
  • use_nccl_for_alltoall: Whether to use NCCL for alltoall (default: True)
  • fifo_version: FIFO version for alltoall (default: 2)
  • cp_anchor_size: Anchor size for STAR attention
  • block_size: Block size for STAR attention
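Combining the parallelism knobs above can be sketched as follows; a minimal example assuming a single 8-GPU node (the model ID is illustrative):

```python
# Sketch: a 70B model on 8 GPUs, combining tensor parallelism (TP=2)
# with pipeline parallelism (PP=4); world size = TP x PP x CP = 8.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import LlmArgs

args = LlmArgs(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,
    pipeline_parallel_size=4,
    dtype="bfloat16",
)

llm = LLM(args)
```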

Runtime Limits

max_batch_size
Optional[int]
default:2048
The maximum batch size for inference.
max_input_len
Optional[int]
default:1024
The maximum input length (in tokens).
max_seq_len
Optional[int]
The maximum sequence length (input + output). If not specified, computed from other constraints.
max_beam_width
Optional[int]
default:1
The maximum beam width for beam search.
max_num_tokens
Optional[int]
default:8192
The maximum number of tokens to process in a single batch.

KV Cache Configuration

kv_cache_config
KvCacheConfig
KV cache configuration. See KV Cache Configuration section below.
enable_chunked_prefill
bool
default:false
Enable chunked prefill. Splits long prompts into chunks for better GPU utilization.
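For long prompts, chunked prefill pairs naturally with the token budget from the Runtime Limits section; a minimal sketch using only fields documented on this page:

```python
from tensorrt_llm.llmapi import LlmArgs

# Prefill for long prompts is split into chunks bounded by max_num_tokens,
# letting generation requests interleave with prefill work.
args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    enable_chunked_prefill=True,
    max_input_len=32768,
    max_num_tokens=2048,  # per-iteration token budget, i.e. the chunk bound
)
```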

LoRA Configuration

enable_lora
bool
default:false
Enable LoRA (Low-Rank Adaptation) support.
lora_config
Optional[LoraConfig]
LoRA configuration for the model.
LoraConfig fields:
  • max_lora_rank: Maximum LoRA rank
  • lora_dir: Directory containing LoRA weights
  • lora_target_modules: Target modules for LoRA
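A minimal LoRA setup might look like the following; the LoraConfig import path and the exact type of lora_dir (string vs. list) are assumptions, not confirmed by this page:

```python
from tensorrt_llm.llmapi import LlmArgs, LoraConfig  # import path assumed

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,
    lora_config=LoraConfig(
        max_lora_rank=64,
        lora_dir=["/path/to/lora/adapter"],  # illustrative path; may be a plain string in some versions
    ),
)
```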

Speculative Decoding

speculative_config
Optional[SpeculativeConfig]
Speculative decoding configuration. Supports multiple speculation algorithms.
Supported types:
  • DraftTargetDecodingConfig: Draft-target speculation with separate draft model
  • EagleDecodingConfig / Eagle3DecodingConfig: EAGLE speculation
  • MedusaDecodingConfig: Medusa speculation
  • LookaheadDecodingConfig: Lookahead speculation
  • MTPDecodingConfig: MTP (Multi-Token Prediction) speculation
  • NGramDecodingConfig: N-gram based speculation
  • SADecodingConfig: Suffix Automaton speculation
  • PARDDecodingConfig: PARD (Parallel Draft) speculation
  • AutoDecodingConfig: Automatically select speculation algorithm
See Speculative Decoding for details.

Scheduler Configuration

scheduler_config
SchedulerConfig
Scheduler configuration.
SchedulerConfig fields:
  • capacity_scheduler_policy: The capacity scheduler policy (MAX_UTILIZATION, GUARANTEED_NO_EVICT, STATIC_BATCH)
  • context_chunking_policy: Context chunking policy (FIRST_COME_FIRST_SERVED, EQUAL_PROGRESS)
  • dynamic_batch_config: Dynamic batch configuration (TensorRT backend only)
  • waiting_queue_policy: Waiting queue scheduling policy (default: FCFS)
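A scheduler override can be sketched as below; the SchedulerConfig and CapacitySchedulerPolicy import paths are assumptions:

```python
from tensorrt_llm.llmapi import LlmArgs, SchedulerConfig, CapacitySchedulerPolicy  # import paths assumed

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    scheduler_config=SchedulerConfig(
        # Never evict running requests, at the cost of lower utilization.
        capacity_scheduler_policy=CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
    ),
)
```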

Advanced Configuration

peft_cache_config
Optional[PeftCacheConfig]
PEFT (Parameter-Efficient Fine-Tuning) cache configuration for LoRA adapters.
PeftCacheConfig fields:
  • num_host_module_layer: Number of LoRA weights sets in host cache
  • num_device_module_layer: Number of LoRA weights sets in device cache
  • optimal_adapter_size: Optimal adapter size for page width (default: 8)
  • max_adapter_size: Max supported adapter size (default: 64)
  • device_cache_percent: GPU memory fraction for device cache (default: 0.02)
  • host_cache_size: Host cache size in bytes (default: 1GB)
cache_transceiver_config
Optional[CacheTransceiverConfig]
Cache transceiver configuration for disaggregated serving.
CacheTransceiverConfig fields:
  • backend: Communication backend (DEFAULT, UCX, NIXL, MOONCAKE, MPI)
  • transceiver_runtime: Runtime implementation (CPP, PYTHON)
  • max_tokens_in_buffer: Max tokens in transfer buffer
  • kv_transfer_timeout_ms: Timeout for KV cache transfer
sparse_attention_config
Optional[SparseAttentionConfig]
Sparse attention configuration.
Supported types:
  • RocketSparseAttentionConfig: RocketKV sparse attention
  • DeepSeekSparseAttentionConfig: DeepSeek sparse attention
  • SkipSoftmaxAttentionConfig: Skip softmax attention
guided_decoding_backend
Optional[Literal['xgrammar', 'llguidance']]
Guided decoding backend. llguidance is supported by the PyTorch backend only.
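Enabling a guided decoding backend is a one-line switch; the per-request grammar or JSON schema is then supplied at generate time (a sketch):

```python
from tensorrt_llm.llmapi import LlmArgs

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    guided_decoding_backend="xgrammar",  # or "llguidance" (PyTorch backend only)
)
```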
batched_logits_processor
Optional[BatchedLogitsProcessor]
Batched logits processor for custom token generation control.
gather_generation_logits
bool
default:false
Gather generation logits.

Orchestration

orchestrator_type
Optional[Literal['rpc', 'ray']]
The orchestrator type to use. Defaults to None, which uses MPI.
env_overrides
Optional[Dict[str, str]]
Environment variable overrides.
Note: import-time-cached env vars in the code won't update unless the code fetches them from os.environ on demand.

Performance and Monitoring

iter_stats_max_iterations
Optional[int]
The maximum number of iterations for iter stats.
request_stats_max_iterations
Optional[int]
The maximum number of iterations for request stats.
return_perf_metrics
bool
default:false
Return performance metrics.
perf_metrics_max_requests
int
default:0
The maximum number of requests for perf metrics. Must also set return_perf_metrics to true.
otlp_traces_endpoint
Optional[str]
Target URL to which OpenTelemetry traces will be sent.

Postprocessing

num_postprocess_workers
int
default:0
The number of processes used for postprocessing the generated tokens, including detokenization.
postprocess_tokenizer_dir
Optional[str]
The path to the tokenizer directory for postprocessing.
reasoning_parser
Optional[str]
The parser to separate reasoning content from output.

PyTorch Backend Arguments (TorchLlmArgs)

These arguments are specific to the PyTorch backend (backend="pytorch").

CUDA Graph Configuration

cuda_graph_config
Optional[CudaGraphConfig]
CUDA graph configuration for performance optimization.
CudaGraphConfig fields:
  • batch_sizes: List of batch sizes to create CUDA graphs for
  • max_batch_size: Maximum batch size for CUDA graphs (default: 0)
  • enable_padding: Round batches up to nearest cuda_graph_batch_size (default: false)
Example:
CudaGraphConfig(
    max_batch_size=128,
    enable_padding=True
)

MoE Configuration

moe_config
MoeConfig
Mixture of Experts configuration.
MoeConfig fields:
  • backend: MoE backend (AUTO, CUTLASS, CUTEDSL, WIDEEP, TRTLLM, DEEPGEMM, VANILLA, TRITON)
  • max_num_tokens: Max tokens sent to MoE at once
  • load_balancer: MoE load balancing configuration
  • disable_finalize_fusion: Disable FC2+finalize fusion (default: false)
  • use_low_precision_moe_combine: Use low precision combine for NVFP4 (default: false)
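A MoE backend override might be sketched as follows; the MoeConfig import path, the illustrative model ID, and whether backend takes a string or an enum are assumptions:

```python
from tensorrt_llm.llmapi import LlmArgs, MoeConfig  # import path assumed

args = LlmArgs(
    model="mistralai/Mixtral-8x7B-v0.1",  # illustrative MoE model
    moe_config=MoeConfig(
        backend="CUTLASS",    # may be an enum in some versions
        max_num_tokens=8192,  # cap on tokens sent to the MoE at once
    ),
)
```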

Quantization Configuration

nvfp4_gemm_config
Nvfp4GemmConfig
NVFP4 GEMM backend configuration.
Nvfp4GemmConfig fields:
  • allowed_backends: List of backends for auto-selection (default: ['cutlass', 'cublaslt', 'cuda_core'])

Attention Configuration

attn_backend
str
default:"TRTLLM"
Attention backend to use.
Supported values:
  • "TRTLLM": TensorRT-LLM attention kernels
  • Other backend-specific options
attention_dp_config
Optional[AttentionDpConfig]
Optimized load balancing for the DP attention scheduler.
AttentionDpConfig fields:
  • enable_balance: Whether to enable balance (default: false)
  • timeout_iters: Number of iterations to timeout (default: 50)
  • batching_wait_iters: Number of iterations to wait for batching (default: 10)
disable_overlap_scheduler
bool
default:false
Disable the overlap scheduler.

Sampling Configuration

sampler_type
Union[str, SamplerType]
default:"auto"
The type of sampler to use.
Options:
  • "TRTLLMSampler": TensorRT-LLM sampler
  • "TorchSampler": PyTorch native sampler
  • "auto": Automatically select (uses TorchSampler unless BeamSearch is requested)
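Forcing a specific sampler is a single field; a minimal sketch using the option names listed above:

```python
from tensorrt_llm.llmapi import LlmArgs

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    sampler_type="TorchSampler",  # "auto" would pick this unless beam search is requested
)
```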
sampler_force_async_worker
bool
default:false
Force usage of the async worker in the sampler for D2H copies.

Performance Tuning

garbage_collection_gen0_threshold
int
default:20000
Threshold for Python garbage collection of generation 0 objects. Lower values trigger more frequent GC.
batch_wait_timeout_ms
float
default:0
If greater than 0, the request queue might wait up to this many milliseconds to receive max_batch_size requests.
batch_wait_timeout_iters
int
default:0
Maximum number of iterations the scheduler will wait to accumulate new requests.
batch_wait_max_tokens_ratio
float
default:0
Token accumulation threshold ratio (0 to 1) for batch scheduling optimization.
enable_autotuner
bool
default:true
Enable autotuner for all tunable ops. Performance may degrade if set to false.
allreduce_strategy
Optional[str]
default:"AUTO"
Allreduce strategy to use.
Options: AUTO, NCCL, UB, MINLATENCY, ONESHOT, TWOSHOT, LOWPRECISION, MNNVL, NCCL_SYMMETRIC

Torch Compile

torch_compile_config
Optional[TorchCompileConfig]
Torch compile configuration.
TorchCompileConfig fields:
  • enable_fullgraph: Enable full graph compilation (default: true)
  • enable_inductor: Enable inductor backend (default: false)
  • enable_piecewise_cuda_graph: Enable piecewise CUDA graph (default: false)
  • capture_num_tokens: List of num of tokens to capture CUDA graph for
  • enable_userbuffers: Enable userbuffers (default: true)
  • max_num_streams: Max CUDA streams (default: 1)
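A torch.compile setup might look like this; the TorchCompileConfig import path is an assumption:

```python
from tensorrt_llm.llmapi import LlmArgs, TorchCompileConfig  # import path assumed

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    torch_compile_config=TorchCompileConfig(
        enable_fullgraph=True,
        enable_inductor=False,  # skip the inductor backend
    ),
)
```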

Model Loading

load_format
Union[str, LoadFormat]
default:"AUTO"
How to load the model weights.
Options:
  • "AUTO": Detect weight type from model checkpoint
  • "DUMMY": Initialize all weights randomly
  • "VISION_ONLY": Only load multimodal encoder weights
checkpoint_format
Optional[str]
The format of the provided checkpoint. Can be a custom format registered with register_checkpoint_loader.
checkpoint_loader
Optional[BaseCheckpointLoader]
Custom checkpoint loader instance. If both checkpoint_format and checkpoint_loader are provided, checkpoint_loader is ignored.

Advanced PyTorch Features

stream_interval
int
default:1
The iteration interval to create responses under streaming mode. Set higher for large batches.
enable_iter_perf_stats
bool
default:false
Enable iteration performance statistics.
enable_iter_req_stats
bool
default:false
Enable per-request stats per iteration. Must also set enable_iter_perf_stats to true.
print_iter_log
bool
default:false
Print iteration logs.
enable_layerwise_nvtx_marker
bool
default:false
Enable layerwise NVTX markers for profiling.
enable_min_latency
bool
default:false
Enable min-latency mode. Currently only used for Llama4.
force_dynamic_quantization
bool
default:false
Force dynamic quantization.
kv_connector_config
Optional[KvCacheConnectorConfig]
The config for the KV cache connector.
KvCacheConnectorConfig fields:
  • connector_module: Import path to connector module
  • connector_scheduler_class: Scheduler class name
  • connector_worker_class: Worker class name
mm_encoder_only
bool
default:false
Only load/execute the vision encoder part of the model.
ray_worker_extension_cls
Optional[str]
Full worker extension class name for extending RayGPUWorker functionality.
ray_placement_config
Optional[RayPlacementConfig]
Placement config for RayGPUWorker. Only used with AsyncLLM and orchestrator_type='ray'.
enable_sleep
bool
default:false
Enable LLM sleep feature. Requires extra setup that may slow down model loading.
max_stats_len
int
default:1000
The max number of performance statistic entries.
layer_wise_benchmarks_config
LayerwiseBenchmarksConfig
Configuration for layer-wise benchmarks calibration.
LayerwiseBenchmarksConfig fields:
  • calibration_mode: NONE, MARK, or COLLECT
  • calibration_file_path: File path for calibration data
  • calibration_layer_indices: Layer indices to filter

TensorRT Backend Arguments (TrtLlmArgs)

These arguments are specific to the TensorRT backend.

Build Configuration

workspace
Optional[str]
The workspace directory for the model.
enable_tqdm
bool
default:false
Enable tqdm for progress bar during engine build.
fast_build
bool
default:false
Enable fast build mode.
build_config
Optional[BuildConfig]
Build configuration for the TensorRT engine.
BuildConfig fields: see Build Configuration for details.
enable_build_cache
Union[BuildCacheConfig, bool]
default:false
Enable build cache to reuse compiled engine components.

Quantization

quant_config
QuantConfig
Quantization configuration.
QuantConfig fields:
  • quant_algo: Quantization algorithm (W8A16, W4A16, FP8, NVFP4, W4A16_AWQ, etc.)
  • kv_cache_quant_algo: KV cache quantization algorithm
  • group_size: Group size for group-wise quantization (default: 128)
  • smoothquant_val: Smoothing parameter alpha (default: 0.5)
  • clamp_val: Clamp values for FP8 rowwise quantization
  • has_zero_point: Whether to use zero point
  • exclude_modules: Module name patterns to skip in quantization
See Quantization for details.
calib_config
CalibConfig
Calibration configuration for quantization.
CalibConfig fields:
  • device: Device for calibration (cuda or cpu)
  • calib_dataset: Calibration dataset name or path (default: cnn_dailymail)
  • calib_batches: Number of calibration batches (default: 512)
  • calib_batch_size: Calibration batch size (default: 1)
  • calib_max_seq_length: Max sequence length for calibration (default: 512)
  • random_seed: Random seed (default: 1234)
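Quantization with a custom calibration run can be sketched as below; the CalibConfig import path is an assumption:

```python
from tensorrt_llm.llmapi import LlmArgs, QuantConfig, QuantAlgo, CalibConfig  # CalibConfig path assumed

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    quant_config=QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ),
    calib_config=CalibConfig(
        calib_batches=256,        # fewer batches than the default 512
        calib_max_seq_length=512,
    ),
)
```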

Embedding Configuration

embedding_parallel_mode
str
default:"SHARDING_ALONG_VOCAB"
The embedding parallel mode.
Options:
  • NONE: No parallelism
  • SHARDING_ALONG_VOCAB: Shard along vocabulary dimension
  • SHARDING_ALONG_HIDDEN: Shard along hidden dimension

Prompt Adapter

enable_prompt_adapter
bool
default:false
Enable prompt adapter support.
max_prompt_adapter_token
int
default:0
The maximum number of prompt adapter tokens.

Runtime Configuration

batching_type
Optional[BatchingType]
Batching type.
Options:
  • STATIC: Static batching
  • INFLIGHT: In-flight batching (continuous batching)
normalize_log_probs
bool
default:false
Normalize log probabilities.
extended_runtime_perf_knob_config
Optional[ExtendedRuntimePerfKnobConfig]
Extended runtime performance knob configuration.
ExtendedRuntimePerfKnobConfig fields:
  • multi_block_mode: Whether to use multi-block mode (default: true)
  • enable_context_fmha_fp32_acc: Enable FP32 accumulation in context FMHA (default: false)
  • cuda_graph_mode: Whether to use CUDA graph mode (default: false)
  • cuda_graph_cache_size: Number of CUDA graphs to cache (default: 0)
fail_fast_on_attention_window_too_large
bool
default:false
Fail fast when attention window is too large to fit even a single sequence in the KV cache.

KV Cache Configuration

The KvCacheConfig class controls KV cache memory management.

Memory Configuration

kv_cache_config.free_gpu_memory_fraction
float
The fraction of GPU memory to allocate for the KV cache (0.0 to 1.0).
Example: 0.85 allocates 85% of free GPU memory.
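To build intuition for how this fraction translates into token capacity, a back-of-envelope estimate (plain Python, not a TensorRT-LLM API; the Llama-2-7B shape numbers and free-memory figure are assumptions):

```python
# Per-token KV bytes = 2 (K and V) * num_layers * num_kv_heads * head_dim * dtype_bytes.
# Assumed Llama-2-7B shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
num_layers, num_kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(bytes_per_token)  # 524288 -> 512 KiB of KV cache per token

free_gpu_bytes = 60e9     # assume ~60 GB free after weights on an 80 GB GPU
fraction = 0.85           # free_gpu_memory_fraction
capacity_tokens = int(free_gpu_bytes * fraction / bytes_per_token)
print(capacity_tokens)    # 97274 tokens fit in the KV cache
```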
kv_cache_config.max_tokens
Optional[int]
The maximum number of tokens to store in the KV cache. If both max_tokens and free_gpu_memory_fraction are specified, the minimum is used.
kv_cache_config.max_gpu_total_bytes
int
default:0
The maximum size in bytes of GPU memory for the KV cache. If both this and free_gpu_memory_fraction are specified, the minimum is allocated.
kv_cache_config.host_cache_size
Optional[int]
Size of the host (CPU) cache in bytes.

Cache Reuse

kv_cache_config.enable_block_reuse
bool
default:true
Controls if KV cache blocks can be reused for different requests.
kv_cache_config.enable_partial_reuse
bool
default:true
Whether blocks that are only partially matched can be reused.
kv_cache_config.copy_on_partial_reuse
bool
default:true
Whether partially matched blocks that are in use can be reused after copying them.

Attention Window

kv_cache_config.max_attention_window
Optional[List[int]]
Size of the attention window for each sequence. Only the last N tokens are stored.
Example: [2048] keeps only the last 2048 tokens per sequence.
kv_cache_config.sink_token_length
Optional[int]
Number of sink tokens (tokens to always keep in attention window).

Advanced Options

kv_cache_config.tokens_per_block
int
default:32
The number of tokens per block in paged attention.
kv_cache_config.onboard_blocks
bool
default:true
Controls if blocks are onboarded.
kv_cache_config.use_uvm
bool
default:false
Whether to use UVM (Unified Virtual Memory) for the KV cache.
kv_cache_config.dtype
str
default:"auto"
The data type to use for the KV cache.
Options: auto, fp8, nvfp4, or valid torch dtype strings.
Note: this field applies to the PyTorch backend only.
kv_cache_config.cross_kv_cache_fraction
Optional[float]
The fraction of KV Cache memory reserved for cross attention (encoder-decoder models). Default is 50%.
kv_cache_config.event_buffer_max_size
int
default:0
Maximum size of the event buffer. If set to 0, the event buffer is not used.
kv_cache_config.use_kv_cache_manager_v2
bool
default:false
Whether to use the KV cache manager v2 (experimental).
kv_cache_config.max_util_for_resume
float
The maximum utilization of the KV cache for resume (0.0 to 1.0). Only used with KV cache manager v2.

Example Configurations

Basic LLM Setup

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import LlmArgs

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    dtype="bfloat16",
    max_batch_size=128,
    max_input_len=2048,
    max_num_tokens=4096
)

llm = LLM(args)

Tensor Parallel Setup

args = LlmArgs(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
    dtype="bfloat16"
)

With Speculative Decoding (EAGLE)

from tensorrt_llm.llmapi import Eagle3DecodingConfig

args = LlmArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config=Eagle3DecodingConfig(
        max_draft_len=4,
        speculative_model="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"
    )
)

With Quantization

from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    quant_config=QuantConfig(
        quant_algo=QuantAlgo.FP8
    )
)

Custom KV Cache Configuration

from tensorrt_llm.llmapi import KvCacheConfig

args = LlmArgs(
    model="meta-llama/Llama-2-7b-hf",
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.8,
        enable_block_reuse=True,
        max_attention_window=[4096]
    )
)
