Model Configuration
This page documents the PretrainedConfig class, which defines model architecture parameters and configuration for TensorRT-LLM models.
Overview
The PretrainedConfig class is the base configuration class for all TensorRT-LLM models. It contains architecture-specific parameters that define the model structure, such as layer counts, hidden dimensions, attention configuration, and more.
Model-specific config classes (like LlamaConfig, GPTConfig, etc.) inherit from PretrainedConfig and add model-specific parameters.
Location
Source file: tensorrt_llm/models/modeling_utils.py:346-548
Constructor Parameters
Core Architecture Parameters
architecture (str): The model architecture type. Examples:
- "LlamaForCausalLM"
- "GPTForCausalLM"
- "MixtralForCausalLM"

Set automatically when loading a model through AutoConfig.from_hugging_face().

dtype (str): The data type for model weights and activations. Supported values:
- "float16": FP16
- "bfloat16": BF16
- "float32": FP32
hidden_size (int): Dimensionality of the model's hidden layers. Example: 4096 for Llama-2-7B.

num_hidden_layers (int): Number of transformer layers in the model. Example: 32 for Llama-2-7B.

num_attention_heads (int): Number of attention heads in each layer. Example: 32 for Llama-2-7B.

vocab_size (int): Size of the vocabulary. Example: 32000 for Llama-2.

Activation and Normalization
hidden_act (str): The activation function used in the feed-forward layers. Common values:
- "gelu": Gaussian Error Linear Unit
- "silu": Sigmoid Linear Unit (used in Llama)
- "relu": Rectified Linear Unit

norm_epsilon (float): The epsilon value for layer normalization.

logits_dtype (str): The data type for output logits.
Position Embeddings
position_embedding_type (str): The type of position embedding to use. Options:
- "learned_absolute": Learned absolute position embeddings
- "rope_gpt_neox": Rotary Position Embeddings (GPT-NeoX style)
- "rope_gptj": Rotary Position Embeddings (GPT-J style)
- "alibi": Attention with Linear Biases
- "alibi_with_scale": ALiBi with learned scale
- "relative": Relative position embeddings

max_position_embeddings (int): The maximum sequence length that the model can handle. Example: 4096 for Llama-2.

Rotary embedding dimension: The dimensionality of rotary position embeddings. Default calculation: if not specified, computed as head_size * rotary_pct, where rotary_pct defaults to 1.0.

Attention Configuration
num_key_value_heads (int): Number of key-value heads for Grouped Query Attention (GQA). Default: if not specified, equals num_attention_heads (Multi-Head Attention). Examples:
- num_key_value_heads == num_attention_heads: Multi-Head Attention (MHA)
- num_key_value_heads < num_attention_heads: Grouped Query Attention (GQA)
- num_key_value_heads == 1: Multi-Query Attention (MQA)

head_size (int): The dimension of each attention head. Default calculation: if not specified, computed as hidden_size // num_attention_heads.

QK layer normalization: Whether to apply layer normalization to queries and keys in attention.
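The head-count relationships and the head-size default above can be sketched with two small helpers (illustrative only, not part of the TensorRT-LLM API):

```python
from typing import Optional


def attention_variant(num_attention_heads: int,
                      num_key_value_heads: Optional[int] = None) -> str:
    """Classify the attention scheme implied by the head counts."""
    # Mirrors the documented default: a missing num_key_value_heads
    # falls back to num_attention_heads (plain Multi-Head Attention).
    kv_heads = (num_key_value_heads
                if num_key_value_heads is not None
                else num_attention_heads)
    if kv_heads == num_attention_heads:
        return "MHA"
    if kv_heads == 1:
        return "MQA"
    return "GQA"


def default_head_size(hidden_size: int, num_attention_heads: int) -> int:
    """Documented default: hidden_size // num_attention_heads."""
    return hidden_size // num_attention_heads
```

For Llama-2-7B (32 heads, hidden size 4096) this gives MHA with a head size of 128.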
Feed-Forward Network
intermediate_size (int): The dimensionality of the feed-forward network's intermediate layer. Default calculation: if not specified, computed as hidden_size * 4. Example: 11008 for Llama-2-7B.

Parallel Configuration
mapping (Mapping): The parallel mapping configuration. Mapping fields:
- world_size: Total number of GPUs
- rank: Current GPU rank
- tp_size: Tensor parallel size
- pp_size: Pipeline parallel size
- cp_size: Context parallel size
- gpus_per_node: Number of GPUs per node
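The mapping fields are related by a simple invariant: the parallel sizes must factor the world size. A sketch using a plain dict (values are illustrative; the real object is the Mapping class):

```python
# Illustrative mapping for 8 GPUs on one node:
# 4-way tensor parallelism x 2-way pipeline parallelism.
mapping = {
    "world_size": 8,     # total number of GPUs
    "rank": 0,           # this process's GPU rank
    "tp_size": 4,        # tensor parallel size
    "pp_size": 2,        # pipeline parallel size
    "cp_size": 1,        # context parallel size
    "gpus_per_node": 8,  # GPUs per node
}

# The parallel dimensions must multiply out to the world size.
assert (mapping["tp_size"] * mapping["pp_size"] * mapping["cp_size"]
        == mapping["world_size"])
```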
Quantization Configuration
quantization (QuantConfig): Quantization configuration for the model. QuantConfig fields:
- quant_algo: Quantization algorithm
- kv_cache_quant_algo: KV cache quantization algorithm
- group_size: Group size for group-wise quantization
- smoothquant_val: Smoothing parameter
- exclude_modules: Modules to exclude from quantization
Embedding Configuration
use_parallel_embedding (bool): Whether to use parallel embedding tables (sharded across GPUs).

embedding_sharding_dim (int): The dimension along which to shard the embedding table. Options:
- 0: Shard along the vocabulary dimension
- 1: Shard along the hidden dimension
Runtime Defaults
Default runtime configuration values. RuntimeDefaults fields:
- KV cache defaults
- Scheduling defaults
- Performance knob defaults
Properties
Quantization Mode
The quantization mode derived from the quantization configuration. Accessed via config.quant_mode.

The quantization algorithm. Accessed via the quantization configuration (config.quantization.quant_algo).
KV Cache Data Type
The data type for the KV cache. Returns:
- "int8": If using INT8 KV cache quantization
- "fp8": If using FP8 KV cache quantization
- "fp4": If using FP4 KV cache quantization
- config.dtype: Otherwise (same as the model dtype)
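The resolution order above can be expressed as a short sketch. This mirrors the documented behavior, not the library's actual implementation:

```python
from typing import Optional


def resolve_kv_cache_dtype(model_dtype: str,
                           kv_cache_quant_algo: Optional[str]) -> str:
    """Return the KV-cache data type implied by the quantization settings."""
    if kv_cache_quant_algo == "INT8":
        return "int8"
    if kv_cache_quant_algo == "FP8":
        return "fp8"
    if kv_cache_quant_algo == "NVFP4":
        return "fp4"
    # No KV-cache quantization: same as the model dtype.
    return model_dtype
```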
Methods
Loading and Saving
from_dict(config)

Create a PretrainedConfig from a dictionary.
Parameters:
- config (dict): Configuration dictionary
Returns: PretrainedConfig instance

from_json_file(config_file)

Load a PretrainedConfig from a JSON file.
Parameters:
- config_file (str): Path to the config.json file
Returns: PretrainedConfig instance

from_checkpoint(ckpt_dir)

Load a PretrainedConfig from a checkpoint directory.
Parameters:
- ckpt_dir (str): Path to checkpoint directory
Returns: PretrainedConfig instance

to_dict()

Convert the config to a dictionary.
Returns: dict representation of the config

Save the config to a JSON file.
Parameters:
- config_file (str): Path to save the config
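A checkpoint's config.json is plain JSON, so saving and loading a config is equivalent to a JSON round trip of the to_dict() output. A stdlib-only sketch of that shape, using the example values from this page (real code should use PretrainedConfig.from_json_file / from_checkpoint):

```python
import json
import os
import tempfile

# Minimal config dict using example values documented on this page.
config = {
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "vocab_size": 32000,
}

# Write config.json into a (temporary) checkpoint directory,
# then parse it back -- the same file from_json_file would read.
ckpt_dir = tempfile.mkdtemp()
config_file = os.path.join(ckpt_dir, "config.json")
with open(config_file, "w") as f:
    json.dump(config, f, indent=2)

with open(config_file) as f:
    reloaded = json.load(f)

assert reloaded == config
```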
Rank Management
set_rank(rank)

Set the rank for this config instance.
Parameters:
- rank (int): The GPU rank

for_each_rank()

Iterate over all ranks, yielding a config copy for each rank.
Returns: Generator yielding config instances for each rank
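The per-rank iteration can be sketched with plain dicts. This is illustrative only; the real method yields PretrainedConfig copies:

```python
import copy


def iter_rank_configs(config: dict):
    """Yield one deep-copied config per rank, with mapping['rank'] set (sketch)."""
    for rank in range(config["mapping"]["world_size"]):
        per_rank = copy.deepcopy(config)
        per_rank["mapping"]["rank"] = rank
        yield per_rank


base = {"mapping": {"world_size": 4, "rank": 0}}
ranks = [c["mapping"]["rank"] for c in iter_rank_configs(base)]
# ranks is [0, 1, 2, 3]; the original config is left untouched.
```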
Model-Specific Configurations
Model-specific config classes extend PretrainedConfig with additional parameters:
Llama Configuration
Mixtral Configuration
GPT Configuration
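The inheritance pattern can be sketched as follows. BaseConfig and the rotary_base field are simplified stand-ins for PretrainedConfig and the actual model-specific parameters, not the real class definitions:

```python
class BaseConfig:
    """Stand-in for PretrainedConfig: shared architecture fields."""

    def __init__(self, hidden_size: int, num_hidden_layers: int,
                 num_attention_heads: int):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads


class LlamaLikeConfig(BaseConfig):
    """Stand-in for a model-specific config that adds extra parameters."""

    def __init__(self, hidden_size: int, num_hidden_layers: int,
                 num_attention_heads: int, rotary_base: float = 10000.0):
        super().__init__(hidden_size, num_hidden_layers, num_attention_heads)
        self.rotary_base = rotary_base  # hypothetical model-specific field


cfg = LlamaLikeConfig(hidden_size=4096, num_hidden_layers=32,
                      num_attention_heads=32)
```

A model-specific config is still a valid base config, so any code that consumes PretrainedConfig fields works unchanged on the subclass.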
Quantization Configuration
The quantization field accepts a QuantConfig object:
QuantConfig Fields
Quantization algorithm for weights. Options:
- QuantAlgo.W8A16: INT8 weight-only
- QuantAlgo.W4A16: INT4 weight-only
- QuantAlgo.FP8: FP8 quantization
- QuantAlgo.NVFP4: NVFP4 quantization
- QuantAlgo.W4A16_AWQ: AWQ INT4 quantization
- QuantAlgo.W8A8_SQ_PER_CHANNEL: SmoothQuant per-channel
Quantization algorithm for the KV cache. Options:
- QuantAlgo.INT8: INT8 KV cache
- QuantAlgo.FP8: FP8 KV cache
- QuantAlgo.NVFP4: NVFP4 KV cache
group_size: Group size for group-wise quantization.

smoothquant_val: Smoothing parameter alpha used in SmoothQuant.

exclude_modules: Module name patterns that are skipped during quantization.
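Putting the fields together, a quantization section for AWQ INT4 weights with an INT8 KV cache might look like the following sketch, expressed as a plain dict. The group size of 128 and the lm_head exclusion are illustrative values, not defaults confirmed by this page:

```python
quantization = {
    "quant_algo": "W4A16_AWQ",       # AWQ INT4 weight quantization
    "kv_cache_quant_algo": "INT8",   # INT8 KV cache
    "group_size": 128,               # group-wise granularity (illustrative)
    "smoothquant_val": None,         # only used by SmoothQuant algorithms
    "exclude_modules": ["lm_head"],  # skip the output head (illustrative)
}
```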
Example Configurations
Llama-2-7B Configuration
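A Llama-2-7B configuration assembled from the example values given throughout this page, shown as the dict that to_dict() / config.json would contain (the exact key set may vary by TensorRT-LLM version):

```python
llama2_7b_config = {
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 32,  # equal to num_attention_heads: MHA
    "vocab_size": 32000,
    "intermediate_size": 11008,
    "hidden_act": "silu",
    "position_embedding_type": "rope_gpt_neox",
    "max_position_embeddings": 4096,
}
```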
Mixtral-8x7B Configuration
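A Mixtral-8x7B sketch. The non-MoE values follow the published Mixtral-8x7B architecture; the MoE field names (moe_num_experts, moe_top_k) are assumptions in common TensorRT-LLM style and should be checked against your version:

```python
mixtral_8x7b_config = {
    "architecture": "MixtralForCausalLM",
    "dtype": "bfloat16",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,  # GQA: fewer KV heads than attention heads
    "vocab_size": 32000,
    "intermediate_size": 14336,
    "hidden_act": "silu",
    "position_embedding_type": "rope_gpt_neox",
    "max_position_embeddings": 32768,
    "moe_num_experts": 8,      # assumed field name
    "moe_top_k": 2,            # assumed field name
}
```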
Tensor Parallel Configuration
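Running the same model across two GPUs with tensor parallelism only changes the mapping section. A sketch (rank is set per process, e.g. via set_rank()):

```python
tp_mapping = {
    "world_size": 2,
    "rank": 0,        # overridden per process
    "tp_size": 2,     # shard weights across 2 GPUs
    "pp_size": 1,
    "cp_size": 1,
    "gpus_per_node": 2,
}

tp_config = {
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "vocab_size": 32000,
    "mapping": tp_mapping,
}
```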
With Quantization
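Adding a quantization section to a config. This sketch pairs FP8 weights with an FP8 KV cache, using the field names documented above:

```python
quantized_config = {
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "vocab_size": 32000,
    "quantization": {
        "quant_algo": "FP8",           # FP8 weight/activation quantization
        "kv_cache_quant_algo": "FP8",  # FP8 KV cache
    },
}
```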
Loading from HuggingFace
Most commonly, you'll load configurations from HuggingFace models. This:
- Downloads the model configuration
- Converts HuggingFace config to TensorRT-LLM format
- Sets appropriate defaults for the model architecture
See Also
- LLM Arguments Configuration - Runtime configuration
- Quantization - Quantization guide
- Parallelism Strategies - Parallelism guide