Model Architecture Support
The following tables list all supported model architectures with links to example models on HuggingFace.
Decoder-Only Language Models
LLaMA Family
| Architecture | Models | HuggingFace Example |
|---|---|---|
| LlamaForCausalLM | Llama 3.1, Llama 3, Llama 2, LLaMA | meta-llama/Meta-Llama-3.1-70B |
| Llama4ForConditionalGeneration | Llama 4 | meta-llama/Llama-4-Scout-17B-16E-Instruct |
| MllamaForConditionalGeneration | Llama 3.2 (Vision) | meta-llama/Llama-3.2-11B-Vision |
- Full tensor parallelism support
- RoPE (Rotary Position Embedding)
- GQA (Grouped Query Attention)
- FP8/INT8 quantization support
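To illustrate the RoPE feature listed above: rotary embeddings rotate each query/key feature pair by a position-dependent angle, so attention scores depend only on relative position. A minimal pure-Python sketch (not the framework's fused kernel):

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply Rotary Position Embedding to one head vector (even length)."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        # One rotation frequency per feature pair, decaying with pair index.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + 1]
        out[i] = x1 * c - x2 * s
        out[i + 1] = x1 * s + x2 * c
    return out

q = [1.0, 0.0, 0.5, -0.5]   # one head's 4-dim query
q0 = rope(q, pos=0)         # position 0 is the identity rotation
q5 = rope(q, pos=5)         # rotated in place; the norm is preserved
```

Because each pair is purely rotated, the dot product of a rotated query and key depends only on the distance between their positions, which is what makes RoPE compatible with KV-cache reuse.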
GPT Family
| Architecture | Models | HuggingFace Example |
|---|---|---|
| GPTForCausalLM | GPT-2, GPT-3, GPT-J | Various GPT models |
| GPTJForCausalLM | GPT-J | EleutherAI/gpt-j-6b |
| GPTNeoXForCausalLM | GPT-NeoX | EleutherAI/gpt-neox-20b |
| GptOssForCausalLM | GPT-OSS | openai/gpt-oss-120b |
- Standard attention mechanisms
- Position embeddings
- Dense feedforward layers
Mistral & Mixtral
| Architecture | Models | HuggingFace Example |
|---|---|---|
| MistralForCausalLM | Mistral 7B, Mistral variants | mistralai/Mistral-7B-v0.1 |
| MixtralForCausalLM | Mixtral MoE models | mistralai/Mixtral-8x7B-v0.1 |
| Mistral3ForConditionalGeneration | Mistral 3 (Multimodal) | Mistral 3 vision models |
- Sliding window attention
- Sparse Mixture of Experts (MoE)
- RoPE embeddings
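The sliding window attention listed above restricts each query to a fixed number of recent positions instead of the full causal prefix. A small sketch of the resulting attention mask (illustrative only; real kernels never materialize this matrix):

```python
def sliding_window_mask(seq_len, window):
    """Causal mask where each token attends to at most `window` tokens,
    itself included, as in sliding-window attention."""
    return [
        [q - window < k <= q for k in range(seq_len)]
        for q in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=5, window=3)
# Row 4 (the last query) attends only to positions 2, 3, and 4.
```

The window keeps per-token attention cost and KV-cache size constant in sequence length, at the price of a bounded receptive field per layer.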
Qwen Family
| Architecture | Models | HuggingFace Example |
|---|---|---|
| Qwen2ForCausalLM | QwQ, Qwen2 | Qwen/Qwen2-7B-Instruct |
| Qwen3ForCausalLM | Qwen3 | Qwen/Qwen3-8B |
| Qwen3MoeForCausalLM | Qwen3 MoE | Qwen/Qwen3-30B-A3B |
| Qwen3NextForCausalLM | Qwen3-Next | Qwen/Qwen3-Next-80B-A3B-Thinking |
| Qwen3_5MoeForCausalLM | Qwen3.5-MoE | Qwen/Qwen3.5-397B-A17B |
| Qwen2VLForConditionalGeneration | Qwen2-VL (Vision) | Qwen2-VL models |
| Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL | Qwen2.5-VL models |
| Qwen3VLForConditionalGeneration | Qwen3-VL | Qwen3-VL models |
| Qwen3VLMoeForConditionalGeneration | Qwen3-VL-MoE | Qwen3-VL-MoE models |
| Qwen2ForProcessRewardModel | Process reward modeling | Qwen/Qwen2.5-Math-PRM-7B |
| Qwen2ForRewardModel | Reward modeling | Qwen/Qwen2.5-Math-RM-72B |
Gemma Family
| Architecture | Models | HuggingFace Example |
|---|---|---|
| GemmaForCausalLM | Gemma, Gemma 2 | Google Gemma models |
| Gemma3ForCausalLM | Gemma 3 | google/gemma-3-1b-it |
| Gemma3ForConditionalGeneration | Gemma 3 (Multimodal) | Gemma 3 vision models |
| RecurrentGemmaForCausalLM | RecurrentGemma | RecurrentGemma models |
- Multi-query attention
- Sliding window patterns (Gemma 3)
- RoPE with local base frequency
DeepSeek Family
| Architecture | Models | HuggingFace Example |
|---|---|---|
| DeepseekForCausalLM | DeepSeek v1 | DeepSeek v1 models |
| DeepseekV2ForCausalLM | DeepSeek v2 | DeepSeek v2 models |
| DeepseekV3ForCausalLM | DeepSeek-V3 | deepseek-ai/DeepSeek-V3 |
| DeepseekV32ForCausalLM | DeepSeek-V3.2 | deepseek-ai/DeepSeek-V3.2 |
- Multi-head Latent Attention (MLA)
- Sparse Mixture of Experts
- FP8 support with specialized kernels
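The sparse Mixture of Experts listed above activates only a few expert FFNs per token. A simplified top-k routing sketch (this is not DeepSeek's actual router, which also uses shared experts and load-balancing bias):

```python
import math

def route_topk(logits, k=2):
    """Sparse-MoE routing: select the top-k experts by router logit and
    renormalize their softmax gate weights over the selected set."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    exp = {e: math.exp(logits[e]) for e in top}
    z = sum(exp.values())
    return {e: exp[e] / z for e in top}   # expert index -> gate weight

gates = route_topk([0.1, 2.0, -1.0, 1.5], k=2)   # experts 1 and 3 win
```

Each token's output is then the gate-weighted sum of the selected experts' FFN outputs, so compute scales with k rather than with the total expert count.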
Phi Family
| Architecture | Models | HuggingFace Example |
|---|---|---|
| PhiForCausalLM | Phi-1, Phi-2 | Microsoft Phi models |
| Phi3ForCausalLM | Phi-3, Phi-4 | microsoft/Phi-4 |
| Phi4MMForCausalLM | Phi-4 Multimodal | Phi-4 vision/audio models |
Other Decoder Models
| Architecture | Models | HuggingFace Example |
|---|---|---|
| BaichuanForCausalLM | Baichuan, Baichuan2 | Baichuan models |
| BloomForCausalLM | BLOOM | bigscience/bloom |
| ChatGLMForCausalLM | ChatGLM, ChatGLM2, ChatGLM3 | THUDM ChatGLM models |
| CohereForCausalLM | Command R, Command R+ | Cohere models |
| CogVLMForCausalLM | CogVLM (Vision) | CogVLM models |
| DbrxForCausalLM | DBRX | Databricks DBRX models |
| FalconForCausalLM | Falcon | TII Falcon models |
| GrokForCausalLM | Grok | xAI Grok models |
| MambaForCausalLM | Mamba | State-space models |
| MPTForCausalLM | MPT | MosaicML MPT models |
| OPTForCausalLM | OPT | Meta OPT models |
NVIDIA Nemotron Family
| Architecture | Models | HuggingFace Example |
|---|---|---|
| NemotronForCausalLM | Nemotron-3, Nemotron-4, Minitron | nvidia/Minitron-8B-Base |
| NemotronHForCausalLM | Nemotron-3-Nano | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 |
| NemotronNASForCausalLM | NemotronNAS | nvidia/Llama-3_3-Nemotron-Super-49B-v1 |
| DeciLMForCausalLM | Nemotron (DeciLM) | nvidia/Llama-3_1-Nemotron-51B-Instruct |
| NemotronH_Nano_VL_V2 | Nemotron Nano VL (Vision) | Nemotron vision models |
GLM Family
| Architecture | Models | HuggingFace Example |
|---|---|---|
| Glm4MoeForCausalLM | GLM-4.5, GLM-4.6, GLM-4.7 | THUDM/GLM-4-100B-A10B |
| Glm4MoeLiteForCausalLM | GLM-4.7-Flash | zai-org/GLM-4.7-Flash |
Other MoE Models
| Architecture | Models | HuggingFace Example |
|---|---|---|
| MiniMaxM2ForCausalLM | MiniMax M2/M2.1 | MiniMaxAI/MiniMax-M2 |
| ExaoneMoEForCausalLM | K-EXAONE | LGAI-EXAONE/K-EXAONE-236B-A23B |
| Exaone4ForCausalLM | EXAONE 4.0 | LGAI-EXAONE/EXAONE-4.0-32B |
Encoder-Decoder Models
| Architecture | Models | HuggingFace Example |
|---|---|---|
| BertForSequenceClassification | BERT-based classifiers | textattack/bert-base-uncased-yelp-polarity |
| BertForQuestionAnswering | BERT-based QA | BERT QA models |
| RobertaForSequenceClassification | RoBERTa classifiers | RoBERTa models |
| WhisperEncoder | Whisper (Speech) | OpenAI Whisper models |
Speculative Decoding Models
| Architecture | Models | Purpose |
|---|---|---|
| MedusaForCausalLm | Medusa | Multi-head speculative decoding |
| EagleForCausalLM | EAGLE | Efficient speculative sampling |
| ReDrafterForLLaMALM | ReDrafter | Draft model for LLaMA |
| ReDrafterForQWenLM | ReDrafter | Draft model for Qwen |
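All of these schemes share the same core loop: a cheap draft proposes several tokens, and the target model verifies them in one batched pass. A simplified greedy-verification sketch (illustrative; the production implementations verify against full context and support sampling):

```python
def verify_draft(draft_tokens, target_greedy):
    """Accept draft tokens until the first position where the target model
    disagrees, then emit the target's own token there. `target_greedy`
    maps an accepted prefix to the target's next-token choice."""
    accepted = []
    for tok in draft_tokens:
        t = target_greedy(accepted)
        if t != tok:
            return accepted + [t]   # first mismatch: take the target's token
        accepted.append(tok)
    return accepted

# Toy target model that always continues 1, 2, 3, ...
target = lambda prefix: len(prefix) + 1
out = verify_draft([1, 2, 9, 9], target)   # drafts 1, 2 accepted; 9 rejected
```

Every accepted draft token saves one sequential target-model step, which is where the speedup comes from; output quality matches the target model because mismatches are always replaced by its own prediction.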
Vision & Multimodal Models
| Architecture | Modalities | Models |
|---|---|---|
| LlavaNextForConditionalGeneration | Language + Image | LLaVA-NeXT models |
| LlavaLlamaModel | Language + Image + Video | VILA models |
| CLIPVisionTransformer | Vision Encoder | CLIP models |
| HCXVisionForCausalLM | Language + Image | HCX Vision models |
Diffusion Models
| Architecture | Purpose | Models |
|---|---|---|
| DiT | Diffusion Transformer | DiT models |
| SD3Transformer2DModel | Stable Diffusion 3 | SD3 models |
| STDiT3Model | Spatiotemporal DiT | Video generation models |
| UNet | Diffusion UNet | Various UNet models |
Feature Support Matrix
Feature support varies by model architecture. The tables below show support for key production models.
Key LLM Models
| Model Architecture | Overlap Scheduler | CUDA Graph | Attention DP | Disaggregated Serving | Chunked Prefill | MTP |
|---|---|---|---|---|---|---|
| DeepseekV3ForCausalLM | ✓ | ✓ | ✓ | ✓ | ✓* | ✓ |
| DeepseekV32ForCausalLM | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Glm4MoeForCausalLM | ✓ | ✓ | ✓ | Untested | ✓ | ✓ |
| Qwen3MoeForCausalLM | ✓ | ✓ | ✓ | ✓ | ✓ | - |
| Qwen3NextForCausalLM | ✓ | ✓ | - | Untested | ✓ | - |
| Llama4ForConditionalGeneration | ✓ | ✓ | ✓ | ✓ | ✓ | - |
| GptOssForCausalLM | ✓ | ✓ | ✓ | ✓ | ✓ | - |
- Chunked Prefill for MLA (*): DeepSeek V3's Multi-head Latent Attention can enable chunked prefill only on SM100/SM103 GPUs.
- KV Cache Reuse: For MLA architectures, KV cache reuse requires SM90/SM100/SM103 GPUs and a BF16/FP8 KV cache dtype.
Multimodal Models
| Model Architecture | Modality | Overlap Scheduler | CUDA Graph | Chunked Prefill | EPD Disaggregated Serving |
|---|---|---|---|---|---|
| Gemma3ForConditionalGeneration | L + I | ✓ | ✓ | N/A | - |
| HCXVisionForCausalLM | L + I | ✓ | ✓ | - | - |
| LlavaNextForConditionalGeneration | L + I | ✓ | ✓ | ✓ | ✓ |
| Llama4ForConditionalGeneration | L + I | ✓ | ✓ | - | - |
| Mistral3ForConditionalGeneration | L + I | ✓ | ✓ | ✓ | - |
| Phi4MMForCausalLM | L + I + A | ✓ | ✓ | ✓ | - |
| Qwen2VLForConditionalGeneration | L + I + V | ✓ | ✓ | ✓ | - |
| Qwen2_5_VLForConditionalGeneration | L + I + V | ✓ | ✓ | ✓ | ✓ |
| Qwen3VLForConditionalGeneration | L + I + V | ✓ | ✓ | ✓ | ✓ |
| Qwen3VLMoeForConditionalGeneration | L + I + V | ✓ | ✓ | ✓ | ✓ |
| LlavaLlamaModel (VILA) | L + I + V | ✓ | ✓ | - | - |
| NemotronH_Nano_VL_V2 | L + I + V | ✓ | ✓ | ✓ | - |
- L: Language (text)
- I: Image
- V: Video
- A: Audio
Model Implementation Location
All model implementations are located in the TensorRT-LLM source:
- `config.py` - Model configuration class
- `model.py` - Model architecture implementation
- `convert.py` - Weight conversion utilities
Using Supported Models
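To check whether a given HuggingFace checkpoint maps to a supported architecture, you can read the `architectures` field of its `config.json` and compare it against the tables above. A hedged sketch (the supported set shown here is an abbreviated subset, not the full list):

```python
import json

# Abbreviated subset of the architectures tabled above.
SUPPORTED = {
    "LlamaForCausalLM", "MistralForCausalLM", "MixtralForCausalLM",
    "Qwen3ForCausalLM", "DeepseekV3ForCausalLM", "Gemma3ForCausalLM",
}

def is_supported(config_json: str) -> bool:
    """True if any architecture declared in the checkpoint's config.json
    appears in the supported set."""
    archs = json.loads(config_json).get("architectures", [])
    return any(a in SUPPORTED for a in archs)

cfg = '{"architectures": ["LlamaForCausalLM"], "hidden_size": 4096}'
print(is_supported(cfg))   # True
```

HuggingFace checkpoints declare their architecture class name in `config.json`, so this check matches how frameworks dispatch a checkpoint to the right model implementation.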
Next Steps
- Model Configuration: learn how to configure models with custom parameters.
- Custom Models: add your own custom model architectures.
- Quantization: optimize models with INT8, FP8, and INT4 quantization.
- Deployment Guide: deploy models to production.