LLM Arguments Configuration
This page documents all configuration options available through the LlmArgs classes: TorchLlmArgs (PyTorch backend) and TrtLlmArgs (TensorRT backend).
Overview
LlmArgs is the main configuration class for TensorRT-LLM. It controls model loading, parallelism, quantization, KV caching, speculative decoding, and runtime behavior.
Base Arguments
These arguments are common to both the PyTorch and TensorRT backends.
Model and Tokenizer
The path to the model checkpoint or the model name from the Hugging Face Hub.
Examples:
- "meta-llama/Llama-2-7b-hf" (Hugging Face Hub)
- "/path/to/local/model" (local directory)
The path to the tokenizer checkpoint or the tokenizer name from the Hugging Face Hub. If not specified, uses the model path.
The mode to initialize the tokenizer.
- auto: Use the fast tokenizer if available
- slow: Force the slow tokenizer
Specify a custom tokenizer implementation. Accepts either:
- A built-in alias (e.g., 'deepseek_v32')
- A Python import path (e.g., 'tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer')
The custom tokenizer must implement from_pretrained(path, **kwargs) and the TokenizerBase interface.
Whether to skip the tokenizer initialization.
Whether to trust remote code when loading models from Hugging Face Hub.
The data type to use for the model.
Supported values:
- "auto": Automatically select based on GPU capability
- "float16": FP16
- "bfloat16": BF16
- "float32": FP32
The revision to use for the model when loading from Hugging Face Hub.
The revision to use for the tokenizer when loading from Hugging Face Hub.
Optional parameters overriding model config defaults.
Precedence: (1) model_kwargs, (2) model config file, (3) model config class defaults. Unknown keys are ignored.
Parallelism Configuration
The tensor parallel size. Splits each layer's weights across GPUs.
Example: For a 70B model on 4 GPUs: tensor_parallel_size=4
The pipeline parallel size. Splits the model's layers into sequential stages.
Example: For a 70B model on 8 GPUs with TP=2, PP=4: tensor_parallel_size=2, pipeline_parallel_size=4
The context parallel size. Splits attention computation across GPUs for long sequences.
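To make the arithmetic concrete, here is a minimal sketch (a hypothetical helper, not part of TensorRT-LLM) showing how the parallel sizes multiply into the number of GPUs a deployment needs:

```python
# Hypothetical helper: world size is the product of the parallel dimensions.
def required_gpus(tensor_parallel_size=1, pipeline_parallel_size=1,
                  context_parallel_size=1):
    return tensor_parallel_size * pipeline_parallel_size * context_parallel_size

# A 70B model with TP=2, PP=4 occupies 8 GPUs in total.
print(required_gpus(tensor_parallel_size=2, pipeline_parallel_size=4))  # -> 8
```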
The number of GPUs per node. Defaults to torch.cuda.device_count().
The cluster parallel size for MoE models' expert weights.
The tensor parallel size for MoE models' expert weights.
The expert parallel size for MoE models' expert weights.
Enable attention data parallelism.
Enable LM head tensor parallelism when attention data parallelism is enabled.
Pipeline parallel partition: a list giving the number of layers assigned to each rank.
Context parallel configuration.
CpConfig fields:
- cp_type: Context parallel type (default: ULYSSES)
- tokens_per_block: Number of tokens per block (used in HELIX)
- use_nccl_for_alltoall: Whether to use NCCL for alltoall (default: True)
- fifo_version: FIFO version for alltoall (default: 2)
- cp_anchor_size: Anchor size for STAR attention
- block_size: Block size for STAR attention
Runtime Limits
The maximum batch size for inference.
The maximum input length (in tokens).
The maximum sequence length (input + output). If not specified, computed from other constraints.
The maximum beam width for beam search.
The maximum number of tokens to process in a single batch.
KV Cache Configuration
KV cache configuration. See KV Cache Configuration section below.
Enable chunked prefill. Splits long prompts into chunks for better GPU utilization.
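The idea behind chunked prefill can be sketched in a few lines (illustrative code, not the TensorRT-LLM implementation): a long prompt is processed in fixed-size chunks so each scheduler iteration stays within the token budget.

```python
# Illustrative sketch of chunked prefill: split a prompt's tokens into
# fixed-size chunks; each chunk is prefilled in a separate iteration.
def split_into_chunks(prompt_tokens, chunk_size):
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

tokens = list(range(10))          # a 10-token "prompt"
chunks = split_into_chunks(tokens, 4)
print([len(c) for c in chunks])   # -> [4, 4, 2]
```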
LoRA Configuration
Enable LoRA (Low-Rank Adaptation) support.
LoRA configuration for the model.
LoraConfig fields:
- max_lora_rank: Maximum LoRA rank
- lora_dir: Directory containing LoRA weights
- lora_target_modules: Target modules for LoRA
Speculative Decoding
Speculative decoding configuration. Supported types:
- DraftTargetDecodingConfig: Draft-target speculation with a separate draft model
- EagleDecodingConfig / Eagle3DecodingConfig: EAGLE speculation
- MedusaDecodingConfig: Medusa speculation
- LookaheadDecodingConfig: Lookahead speculation
- MTPDecodingConfig: MTP (Multi-Token Prediction) speculation
- NGramDecodingConfig: N-gram based speculation
- SADecodingConfig: Suffix Automaton speculation
- PARDDecodingConfig: PARD (Parallel Draft) speculation
- AutoDecodingConfig: Automatically select the speculation algorithm
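As a flavor of how the simplest of these works, here is a toy sketch of the n-gram speculation idea behind NGramDecodingConfig (illustrative only, not the library's implementation): look up the most recent n-gram in the generated history and, if it occurred before, propose the tokens that followed it as draft tokens for the target model to verify.

```python
# Toy n-gram draft proposal: find the last occurrence of the current n-gram
# earlier in the history and return the tokens that followed it.
def ngram_draft(history, n=2, max_draft=3):
    key = tuple(history[-n:])
    for i in range(len(history) - n - 1, -1, -1):
        if tuple(history[i:i + n]) == key:
            return history[i + n:i + n + max_draft]
    return []  # no match: nothing to speculate

# "1 2" occurred earlier, followed by 3 4 5, so those become the draft tokens.
print(ngram_draft([1, 2, 3, 4, 5, 1, 2]))  # -> [3, 4, 5]
```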
Scheduler Configuration
Scheduler configuration.
SchedulerConfig fields:
- capacity_scheduler_policy: The capacity scheduler policy (MAX_UTILIZATION, GUARANTEED_NO_EVICT, STATIC_BATCH)
- context_chunking_policy: Context chunking policy (FIRST_COME_FIRST_SERVED, EQUAL_PROGRESS)
- dynamic_batch_config: Dynamic batch configuration (TensorRT backend only)
- waiting_queue_policy: Waiting queue scheduling policy (default: FCFS)
Advanced Configuration
PEFT (Parameter-Efficient Fine-Tuning) cache configuration for LoRA adapters.
PeftCacheConfig fields:
- num_host_module_layer: Number of LoRA weight sets in the host cache
- num_device_module_layer: Number of LoRA weight sets in the device cache
- optimal_adapter_size: Optimal adapter size for page width (default: 8)
- max_adapter_size: Max supported adapter size (default: 64)
- device_cache_percent: GPU memory fraction for the device cache (default: 0.02)
- host_cache_size: Host cache size in bytes (default: 1GB)
Cache transceiver configuration for disaggregated serving.
CacheTransceiverConfig fields:
- backend: Communication backend (DEFAULT, UCX, NIXL, MOONCAKE, MPI)
- transceiver_runtime: Runtime implementation (CPP, PYTHON)
- max_tokens_in_buffer: Max tokens in the transfer buffer
- kv_transfer_timeout_ms: Timeout for KV cache transfer
Sparse attention configuration. Supported types:
- RocketSparseAttentionConfig: RocketKV sparse attention
- DeepSeekSparseAttentionConfig: DeepSeek sparse attention
- SkipSoftmaxAttentionConfig: Skip-softmax attention
Guided decoding backend.
llguidance is supported in the PyTorch backend only.
Batched logits processor for custom token generation control.
Gather generation logits.
Orchestration
The orchestrator type to use. Defaults to None, which uses MPI.
Environment variable overrides. Note: env vars cached at import time in the code won't update unless the code fetches them from os.environ on demand.
Performance and Monitoring
The maximum number of iterations for iter stats.
The maximum number of iterations for request stats.
Return performance metrics.
The maximum number of requests for perf metrics. Must also set return_perf_metrics to true.
Target URL to which OpenTelemetry traces will be sent.
Postprocessing
The number of processes used for postprocessing the generated tokens, including detokenization.
The path to the tokenizer directory for postprocessing.
The parser to separate reasoning content from output.
PyTorch Backend Arguments (TorchLlmArgs)
These arguments are specific to the PyTorch backend (backend="pytorch").
CUDA Graph Configuration
CUDA graph configuration for performance optimization.
CudaGraphConfig fields:
- batch_sizes: List of batch sizes to create CUDA graphs for
- max_batch_size: Maximum batch size for CUDA graphs (default: 0)
- enable_padding: Round batches up to the nearest CUDA graph batch size (default: false)
MoE Configuration
Mixture of Experts configuration.
MoeConfig fields:
- backend: MoE backend (AUTO, CUTLASS, CUTEDSL, WIDEEP, TRTLLM, DEEPGEMM, VANILLA, TRITON)
- max_num_tokens: Max tokens sent to the MoE at once
- load_balancer: MoE load balancing configuration
- disable_finalize_fusion: Disable FC2+finalize fusion (default: false)
- use_low_precision_moe_combine: Use low-precision combine for NVFP4 (default: false)
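Relating this back to the expert parallel size above: here is an illustrative sketch (not the TensorRT-LLM implementation) of how expert parallelism shards MoE expert weights. With EP=4 over 16 experts, each rank hosts a contiguous slice of 4 experts, and routed tokens are sent to the rank owning the selected expert.

```python
# Illustrative mapping from an expert id to the rank that owns it when
# experts are sharded contiguously across expert-parallel ranks.
def expert_to_rank(expert_id, num_experts, ep_size):
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank

print([expert_to_rank(e, num_experts=16, ep_size=4) for e in (0, 5, 11, 15)])
# -> [0, 1, 2, 3]
```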
Quantization Configuration
NVFP4 GEMM backend configuration.
Nvfp4GemmConfig fields:
- allowed_backends: List of backends for auto-selection (default: ['cutlass', 'cublaslt', 'cuda_core'])
Attention Configuration
Attention backend to use.
Supported values:
- "TRTLLM": TensorRT-LLM attention kernels
- Other backend-specific options
Optimized load balancing for the DP attention scheduler.
AttentionDpConfig fields:
- enable_balance: Whether to enable balancing (default: false)
- timeout_iters: Number of iterations before timeout (default: 50)
- batching_wait_iters: Number of iterations to wait for batching (default: 10)
Disable the overlap scheduler.
Sampling Configuration
The type of sampler to use.
Options:
- "TRTLLMSampler": TensorRT-LLM sampler
- "TorchSampler": PyTorch native sampler
- "auto": Automatically select (uses TorchSampler unless beam search is requested)
Force usage of the async worker in the sampler for D2H copies.
Performance Tuning
Threshold for Python garbage collection of generation 0 objects. Lower values trigger more frequent GC.
If greater than 0, the request queue might wait up to this many milliseconds to receive max_batch_size requests.
Maximum number of iterations the scheduler will wait to accumulate new requests.
Token accumulation threshold ratio (0 to 1) for batch scheduling optimization.
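The interplay of these three knobs can be sketched as a simple dispatch decision (hypothetical logic for illustration; the real scheduler lives inside TensorRT-LLM): dispatch once the batch is full or the token threshold is reached, but never wait past the iteration budget.

```python
# Hypothetical accumulation heuristic: decide whether to dispatch the
# currently queued requests or keep waiting for more.
def should_dispatch(num_requests, max_batch_size,
                    queued_tokens, max_num_tokens,
                    waited_iters, max_wait_iters, token_ratio):
    if num_requests >= max_batch_size:
        return True                        # batch is full
    if queued_tokens >= token_ratio * max_num_tokens:
        return True                        # token threshold reached
    return waited_iters >= max_wait_iters  # stop waiting eventually

# 600 queued tokens exceed 50% of a 1024-token budget, so dispatch.
print(should_dispatch(3, 8, 600, 1024, 2, 10, token_ratio=0.5))  # -> True
```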
Enable autotuner for all tunable ops. Performance may degrade if set to false.
Allreduce strategy to use.
Options: AUTO, NCCL, UB, MINLATENCY, ONESHOT, TWOSHOT, LOWPRECISION, MNNVL, NCCL_SYMMETRIC
Torch Compile
Torch compile configuration.
TorchCompileConfig fields:
- enable_fullgraph: Enable full graph compilation (default: true)
- enable_inductor: Enable the inductor backend (default: false)
- enable_piecewise_cuda_graph: Enable piecewise CUDA graph (default: false)
- capture_num_tokens: List of token counts to capture CUDA graphs for
- enable_userbuffers: Enable userbuffers (default: true)
- max_num_streams: Max CUDA streams (default: 1)
Model Loading
How to load the model weights.
Options:
- "AUTO": Detect the weight type from the model checkpoint
- "DUMMY": Initialize all weights randomly
- "VISION_ONLY": Only load multimodal encoder weights
The format of the provided checkpoint. Can be a custom format registered with register_checkpoint_loader.
Custom checkpoint loader instance. If both checkpoint_format and checkpoint_loader are provided, checkpoint_loader is ignored.
Advanced PyTorch Features
The iteration interval to create responses under streaming mode. Set higher for large batches.
Enable iteration performance statistics.
Enable per-request stats per iteration. Must also set enable_iter_perf_stats to true.
Print iteration logs.
Enable layerwise NVTX markers for profiling.
Enable min-latency mode. Currently only used for Llama4.
Force dynamic quantization.
The config for the KV cache connector.
KvCacheConnectorConfig fields:
- connector_module: Import path to the connector module
- connector_scheduler_class: Scheduler class name
- connector_worker_class: Worker class name
Only load/execute the vision encoder part of the model.
Full worker extension class name for extending RayGPUWorker functionality.
Placement config for RayGPUWorker. Only used with AsyncLLM and orchestrator_type='ray'.
Enable the LLM sleep feature. Requires extra setup that may slow down model loading.
The max number of performance statistic entries.
Configuration for layer-wise benchmark calibration.
LayerwiseBenchmarksConfig fields:
- calibration_mode: NONE, MARK, or COLLECT
- calibration_file_path: File path for calibration data
- calibration_layer_indices: Layer indices to filter
TensorRT Backend Arguments (TrtLlmArgs)
These arguments are specific to the TensorRT backend.
Build Configuration
The workspace directory for the model.
Enable tqdm for progress bar during engine build.
Enable fast build mode.
Build configuration for TensorRT engine.BuildConfig fields: See Build Configuration for details.
Enable build cache to reuse compiled engine components.
Quantization
Quantization configuration.
QuantConfig fields:
- quant_algo: Quantization algorithm (W8A16, W4A16, FP8, NVFP4, W4A16_AWQ, etc.)
- kv_cache_quant_algo: KV cache quantization algorithm
- group_size: Group size for group-wise quantization (default: 128)
- smoothquant_val: Smoothing parameter alpha (default: 0.5)
- clamp_val: Clamp values for FP8 rowwise quantization
- has_zero_point: Whether to use a zero point
- exclude_modules: Module name patterns to skip during quantization
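To clarify what group_size controls, here is a toy sketch of group-wise weight quantization (a simplified symmetric scheme for illustration, not the actual TensorRT-LLM kernels): each group of weights shares a single scale, so larger groups cost less metadata but quantize less precisely.

```python
# Toy symmetric group-wise quantization: every group of `group_size` weights
# shares one scale derived from the group's largest magnitude.
def quantize_groupwise(weights, group_size=4, num_bits=4):
    qmax = 2 ** (num_bits - 1) - 1              # 7 for 4-bit
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid div-by-zero
        out.append(([round(w / scale) for w in group], scale))
    return out

groups = quantize_groupwise([0.7, -0.35, 0.1, 0.0], group_size=4)
print(groups[0][0])  # quantized integers sharing one per-group scale
```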
Calibration configuration for quantization.
CalibConfig fields:
- device: Device for calibration (cuda or cpu)
- calib_dataset: Calibration dataset name or path (default: cnn_dailymail)
- calib_batches: Number of calibration batches (default: 512)
- calib_batch_size: Calibration batch size (default: 1)
- calib_max_seq_length: Max sequence length for calibration (default: 512)
- random_seed: Random seed (default: 1234)
Embedding Configuration
The embedding parallel mode.
Options:
- NONE: No parallelism
- SHARDING_ALONG_VOCAB: Shard along the vocabulary dimension
- SHARDING_ALONG_HIDDEN: Shard along the hidden dimension
Prompt Adapter
Enable prompt adapter support.
The maximum number of prompt adapter tokens.
Runtime Configuration
Batching type.
Options:
- STATIC: Static batching
- INFLIGHT: In-flight batching (continuous batching)
Normalize log probabilities.
Extended runtime performance knob configuration.
ExtendedRuntimePerfKnobConfig fields:
- multi_block_mode: Whether to use multi-block mode (default: true)
- enable_context_fmha_fp32_acc: Enable FP32 accumulation in context FMHA (default: false)
- cuda_graph_mode: Whether to use CUDA graph mode (default: false)
- cuda_graph_cache_size: Number of CUDA graphs to cache (default: 0)
Fail fast when attention window is too large to fit even a single sequence in the KV cache.
KV Cache Configuration
The KvCacheConfig class controls KV cache memory management.
Memory Configuration
The fraction of GPU memory to allocate for the KV cache (0.0 to 1.0).
Example: 0.85 = 85% of free GPU memory
The maximum number of tokens to store in the KV cache. If both max_tokens and free_gpu_memory_fraction are specified, the minimum is used.
The maximum size in bytes of GPU memory for the KV cache. If both this and free_gpu_memory_fraction are specified, the minimum is allocated.
Size of the host (CPU) cache in bytes.
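A back-of-the-envelope sketch of how these settings resolve (illustrative math, not TensorRT-LLM code; the per-token KV size is model dependent): the fraction-derived budget is computed from free memory, and max_tokens caps it when both are set.

```python
# Illustrative KV cache budget resolution: the smaller of the fraction-derived
# token budget and max_tokens wins.
def kv_cache_token_budget(free_gpu_mem_bytes, free_gpu_memory_fraction,
                          bytes_per_token, max_tokens=None):
    budget = int(free_gpu_mem_bytes * free_gpu_memory_fraction // bytes_per_token)
    if max_tokens is not None:
        budget = min(budget, max_tokens)
    return budget

# 40 GiB free, 85% fraction, ~160 KiB of KV per token (assumed, model dependent):
print(kv_cache_token_budget(40 * 2**30, 0.85, 160 * 2**10, max_tokens=200_000))
# -> 200000 (max_tokens caps the fraction-derived budget)
```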
Cache Reuse
Controls whether KV cache blocks can be reused across different requests.
Whether blocks that are only partially matched can be reused.
Whether partially matched blocks that are in use can be reused after copying them.
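The mechanism behind block reuse can be sketched conceptually (illustrative code, not the real implementation): a block's identity is a hash over all tokens up to and including that block, so two requests sharing a prompt prefix map to the same cached blocks.

```python
import hashlib

# Conceptual prefix-block hashing: block i's hash covers tokens [0, (i+1)*B),
# so identical prompt prefixes produce identical block hashes.
def block_hashes(tokens, tokens_per_block):
    hashes = []
    for end in range(tokens_per_block, len(tokens) + 1, tokens_per_block):
        prefix = tokens[:end]
        hashes.append(hashlib.sha256(str(prefix).encode()).hexdigest())
    return hashes

a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8], tokens_per_block=4)
b = block_hashes([1, 2, 3, 4, 9, 9, 9, 9], tokens_per_block=4)
print(a[0] == b[0], a[1] == b[1])  # -> True False (only the shared prefix block matches)
```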
Attention Window
Size of the attention window for each sequence. Only the last N tokens are stored.
Example: [2048] keeps only the last 2048 tokens per sequence
Number of sink tokens (tokens always kept in the attention window).
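Which token positions survive under a sliding window plus sink tokens can be sketched as follows (illustrative; real eviction is block-granular):

```python
# Illustrative retention under a sliding attention window: keep the first
# `num_sink_tokens` positions plus the most recent `attention_window` positions.
def kept_positions(seq_len, attention_window, num_sink_tokens=0):
    sink = list(range(min(num_sink_tokens, seq_len)))
    recent = list(range(max(seq_len - attention_window, 0), seq_len))
    return sorted(set(sink + recent))

# 10-token sequence, window of 4, 2 sink tokens:
print(kept_positions(10, attention_window=4, num_sink_tokens=2))
# -> [0, 1, 6, 7, 8, 9]
```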
Advanced Options
The number of tokens per block in paged attention.
Controls whether offloaded blocks are onboarded back into GPU memory before reuse.
Whether to use UVM (Unified Virtual Memory) for the KV cache.
The data type to use for the KV cache.
Options: auto, fp8, nvfp4, or valid torch dtype strings
Note: This is a PyTorch-backend-only field.
The fraction of KV cache memory reserved for cross attention (encoder-decoder models). Default: 50%.
Maximum size of the event buffer. If set to 0, the event buffer is not used.
Whether to use the KV cache manager v2 (experimental).
The maximum utilization of the KV cache for resume (0.0 to 1.0). Only used with KV cache manager v2.