Configuration Overview
Every model in TensorRT-LLM uses a configuration class that inherits from PretrainedConfig. This class defines all model architecture parameters and runtime options.
PretrainedConfig Parameters
The base PretrainedConfig class provides common parameters for all models:
Core Architecture Parameters
- Model architecture name (e.g., "LlamaForCausalLM", "GPTForCausalLM")
- Model data type: "float32", "float16", "bfloat16", "float8", "int8"
- Dimension of hidden layers (e.g., 4096 for LLaMA-7B)
- Number of transformer layers (e.g., 32 for LLaMA-7B)
- Number of attention heads (e.g., 32 for LLaMA-7B)
- Size of the vocabulary (e.g., 32000 for LLaMA)
- Dimension of feedforward layers (defaults to hidden_size * 4)
- Number of KV heads for Grouped Query Attention (defaults to num_attention_heads for MHA)
- Size of each attention head (defaults to hidden_size / num_attention_heads)
Position Embeddings
- Position embedding type:
  - "learned_absolute" - Learned absolute positions
  - "rope_gpt_neox" - Rotary Position Embedding (GPT-NeoX style)
  - "rope_gptj" - RoPE (GPT-J style)
  - "alibi" - ALiBi attention bias
  - "alibi_with_scale" - Scaled ALiBi
- Maximum sequence length supported by the model
- Dimension for rotary embeddings (defaults to head_size)
Activation & Normalization
- Activation function: "gelu", "relu", "silu", "swiglu", "geglu", "fast_gelu"
- Epsilon for layer normalization
- Data type for logits output (typically "float32" for numerical stability)
QK LayerNorm
- Apply LayerNorm to query/key projections (used in some models like Gemma)
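Putting these common parameters together, the following is a minimal sketch of a LLaMA-7B-style configuration. The keyword names follow common TensorRT-LLM conventions but are assumptions to verify against your installed version.

```python
from tensorrt_llm.models import PretrainedConfig

# Sketch of a LLaMA-7B-style base configuration
# (verify field names against your TensorRT-LLM release).
config = PretrainedConfig(
    architecture="LlamaForCausalLM",
    dtype="float16",
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=32,        # equal to num_attention_heads for MHA
    vocab_size=32000,
    intermediate_size=11008,       # LLaMA uses a non-default FFN width
    hidden_act="silu",
    norm_epsilon=1e-5,
    position_embedding_type="rope_gpt_neox",
    max_position_embeddings=4096,
    logits_dtype="float32",
)
```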
Quantization Configuration
TensorRT-LLM uses the QuantConfig class for quantization settings:
QuantConfig Parameters
- Quantization algorithm:
  - QuantAlgo.FP8 - FP8 quantization
  - QuantAlgo.W8A16 - INT8 weight-only
  - QuantAlgo.W4A16 - INT4 weight-only
  - QuantAlgo.W4A16_AWQ - INT4 AWQ
  - QuantAlgo.W4A8_AWQ - INT4 weights, INT8 activations (AWQ)
  - QuantAlgo.W8A8_SQ_PER_CHANNEL - SmoothQuant per-channel
  - QuantAlgo.NVFP4 - NVIDIA FP4 format
  - QuantAlgo.MIXED_PRECISION - Layer-wise mixed precision
- KV cache quantization (typically QuantAlgo.FP8 or QuantAlgo.INT8)
- Group size for group-wise quantization (AWQ, GPTQ)
- SmoothQuant alpha parameter (0.0 to 1.0)
- Module name patterns to exclude from quantization (supports wildcards). Example: ["lm_head", "embed_tokens", "*.norm"]
- Use zero-point quantization
- Clamp values for FP8 rowwise quantization
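For example, a weight-only INT4 AWQ setup with an FP8 KV cache might look like the sketch below. The import paths and field names are assumptions based on the parameters listed above; check them against your TensorRT-LLM version.

```python
from tensorrt_llm.quantization import QuantAlgo
from tensorrt_llm.models.modeling_utils import QuantConfig

quant_config = QuantConfig(
    quant_algo=QuantAlgo.W4A16_AWQ,     # INT4 AWQ weights, FP16 activations
    kv_cache_quant_algo=QuantAlgo.FP8,  # quantize the KV cache as well
    group_size=128,                     # group-wise quantization granularity
    exclude_modules=["lm_head"],        # keep the output projection unquantized
)
```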
Layer-Wise Quantization
For mixed-precision quantization, use LayerQuantConfig to specify quantization settings on a per-layer basis.
Parallelism Configuration
Control tensor and pipeline parallelism with the Mapping class:
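For instance, an 8-GPU layout with 4-way tensor parallelism and 2-way pipeline parallelism can be sketched as follows (a minimal sketch; in practice rank is set per process):

```python
from tensorrt_llm import Mapping

# 8 GPUs total: tp_size * pp_size must equal world_size.
mapping = Mapping(world_size=8, rank=0, tp_size=4, pp_size=2)
```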
Parallel Embedding
- Enable embedding layer parallelism
- Dimension to shard embeddings (0 for vocab dimension, 1 for hidden dimension)
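A minimal sketch of enabling vocab-dimension embedding sharding, assuming the use_parallel_embedding and embedding_sharding_dim fields and continuing the config object from the earlier sketch (both names are assumptions to verify):

```python
# Assumes `config` is the PretrainedConfig from the earlier sketch.
# Shard the embedding table along the vocabulary dimension (dim 0)
# across tensor-parallel ranks.
config.use_parallel_embedding = True
config.embedding_sharding_dim = 0
```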
Runtime Configuration
Configure runtime behavior with RuntimeDefaults.
Model-Specific Configuration
Different model families have specialized configuration classes with additional parameters:
- LLaMA Configuration
- Gemma Configuration
- GPT Configuration
- DeepSeek Configuration
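As an illustration, several of these classes can be built directly from a Hugging Face checkpoint; the sketch below assumes a from_hugging_face helper on LLaMAConfig, whose exact name and arguments should be verified for your release.

```python
from tensorrt_llm.models.llama.config import LLaMAConfig

# Derive a TensorRT-LLM configuration from a Hugging Face checkpoint
# (helper name and signature may differ between releases).
llama_config = LLaMAConfig.from_hugging_face(
    "meta-llama/Llama-2-7b-hf",
    dtype="float16",
)
```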
Loading & Saving Configuration
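A minimal sketch of round-tripping a configuration through JSON, assuming the to_dict and from_json_file helpers on PretrainedConfig and the config object from the earlier sketch:

```python
import json

from tensorrt_llm.models import PretrainedConfig

# Save the configuration alongside the checkpoint weights.
with open("config.json", "w") as f:
    json.dump(config.to_dict(), f, indent=2)

# Reload it later.
restored = PretrainedConfig.from_json_file("config.json")
```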
Configuration Best Practices
Data Type Selection
FP16 vs BF16 vs FP8:
float16- Best compatibility, good performance on most GPUsbfloat16- Better numerical stability, recommended for training and large models (requires Ampere+)float8- Highest performance on Hopper GPUs (H100, H200), requires calibration
- Use
bfloat16for models greater than 30B parameters - Use
float16for smaller models or older GPUs - Use FP8 quantization on H100/H200 for maximum throughput
Quantization Strategy
Model Size Guidelines:
- < 7B: FP16/BF16 (no quantization needed)
- 7B - 13B: W8A16 or FP8 for memory reduction
- 30B - 70B: FP8 or W4A16_AWQ for GPU memory constraints
- > 70B: FP8 + tensor parallelism required
- Always enable KV cache quantization for long context (greater than 8k tokens), as shown in the sketch after this list
- Use FP8 for minimal accuracy loss
- Use INT8 for maximum memory savings
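Following the long-context guidance above, a sketch of an all-FP8 setup with the KV cache also held in FP8 (same assumed class and field names as in the earlier quantization sketch):

```python
from tensorrt_llm.quantization import QuantAlgo
from tensorrt_llm.models.modeling_utils import QuantConfig

# FP8 weights/activations plus an FP8 KV cache keeps long-context
# memory use low with minimal accuracy loss on Hopper GPUs.
fp8_quant = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)
```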
Parallelism Strategy
Tensor Parallelism (TP):
- Use powers of 2: 2, 4, 8 GPUs
- Required when model doesn’t fit on single GPU
- Communication overhead increases with TP size
Pipeline Parallelism (PP):
- Use when TP is insufficient
- Minimize pipeline stages (higher latency per stage)
- Best for offline batch inference
Sequence/Context Parallelism:
- Use for very long context windows (> 32k tokens)
- Splits attention computation across sequence dimension
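Combining the two, a minimal sketch for a 70B-class model on 16 GPUs with TP=8 inside each node and PP=2 across nodes (world_size must equal tp_size * pp_size):

```python
from tensorrt_llm import Mapping

# 16 GPUs: 8-way tensor parallelism within a node,
# 2-way pipeline parallelism across nodes.
mapping = Mapping(world_size=16, rank=0, tp_size=8, pp_size=2)
assert mapping.tp_size * mapping.pp_size == mapping.world_size
```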
Context Length Optimization
RoPE Scaling: For extending context beyond the training length, configure rotary position-embedding scaling (see the sketch after this list).
Memory Considerations:
- KV cache grows linearly with sequence length
- Enable KV cache quantization for long context
- Use paged attention for variable-length batches
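A sketch of linear RoPE scaling that stretches a model trained at 4k positions to a 16k context window. The rotary_scaling field name and its dictionary keys follow common LLaMA-style conventions and are assumptions here; it continues the config object from the earlier sketch.

```python
# factor = target context / trained context (16384 / 4096 = 4.0)
config.rotary_scaling = {"type": "linear", "factor": 4.0}
config.max_position_embeddings = 16384
```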
Configuration Examples
Production Serving
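A sketch of a throughput-oriented serving setup: BF16 weights, an FP8 KV cache, and 4-way tensor parallelism (class and field names as assumed in the earlier sketches):

```python
from tensorrt_llm import Mapping
from tensorrt_llm.quantization import QuantAlgo
from tensorrt_llm.models.modeling_utils import QuantConfig

# Production serving on 4 GPUs: BF16 compute, FP8 KV cache, TP=4.
serving_mapping = Mapping(world_size=4, rank=0, tp_size=4, pp_size=1)
serving_quant = QuantConfig(kv_cache_quant_algo=QuantAlgo.FP8)
# On the model configuration: dtype="bfloat16",
# mapping=serving_mapping, quantization=serving_quant.
```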
Memory-Constrained
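And a sketch of a memory-constrained single-GPU deployment using INT4 AWQ weights and an INT8 KV cache (same assumed names as above):

```python
from tensorrt_llm.quantization import QuantAlgo
from tensorrt_llm.models.modeling_utils import QuantConfig

# Memory-constrained: 4-bit weights, 8-bit KV cache, single GPU.
low_mem_quant = QuantConfig(
    quant_algo=QuantAlgo.W4A16_AWQ,
    kv_cache_quant_algo=QuantAlgo.INT8,
    group_size=128,
)
```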
Next Steps
- Quantization Guide - Deep dive into quantization techniques
- Custom Models - Implement custom architectures
- Deployment - Deploy configured models