TensorRT-LLM supports a wide range of model architectures for optimized LLM inference on NVIDIA GPUs. This page provides a comprehensive overview of supported models, organized by category.

Model Architecture Support

The following tables list all supported model architectures, grouped by family, with links to example models on HuggingFace:

Decoder-Only Language Models

Llama Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| LlamaForCausalLM | Llama 3.1, Llama 3, Llama 2, LLaMA | meta-llama/Meta-Llama-3.1-70B |
| Llama4ForConditionalGeneration | Llama 4 | meta-llama/Llama-4-Scout-17B-16E-Instruct |
| MllamaForConditionalGeneration | Llama 3.2 (Vision) | meta-llama/Llama-3.2-11B-Vision |

Features:
  • Full tensor parallelism support
  • RoPE (Rotary Position Embedding)
  • GQA (Grouped Query Attention)
  • FP8/INT8 quantization support
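
These features are exposed at load time through the LLM API. The sketch below serves a Llama checkpoint with tensor parallelism and FP8 quantization; the QuantConfig/QuantAlgo surface is assumed from the llmapi module, so verify the exact argument names against your installed TensorRT-LLM version.

```python
# Minimal sketch: tensor parallelism + FP8 quantization for a Llama model.
# QuantConfig/QuantAlgo are assumed from the llmapi module; verify against
# your installed TensorRT-LLM version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B",
    tensor_parallel_size=4,  # shard attention and MLP weights across 4 GPUs
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
)
```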
GPT Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| GPTForCausalLM | GPT-2, GPT-3 | Various GPT models |
| GPTJForCausalLM | GPT-J | EleutherAI/gpt-j-6b |
| GPTNeoXForCausalLM | GPT-NeoX | EleutherAI/gpt-neox-20b |
| GptOssForCausalLM | GPT-OSS | openai/gpt-oss-120b |

Features:
  • Standard attention mechanisms
  • Position embeddings
  • Dense feedforward layers
Mistral Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| MistralForCausalLM | Mistral 7B, Mistral variants | mistralai/Mistral-7B-v0.1 |
| MixtralForCausalLM | Mixtral MoE models | mistralai/Mixtral-8x7B-v0.1 |
| Mistral3ForConditionalGeneration | Mistral 3 (Multimodal) | Mistral 3 vision models |

Features:
  • Sliding window attention
  • Sparse Mixture of Experts (MoE)
  • RoPE embeddings
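
For the sparse-MoE variants, expert weights can be distributed across GPUs in addition to tensor sharding. The sketch below assumes the moe_expert_parallel_size argument of the llmapi surface in recent releases; treat the name as an assumption to check for your version.

```python
# Sketch: serving Mixtral with experts distributed across GPUs.
# moe_expert_parallel_size is assumed from the llmapi argument surface of
# recent releases; verify the name in your version.
from tensorrt_llm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    tensor_parallel_size=2,      # shard the dense attention weights
    moe_expert_parallel_size=2,  # place different experts on different GPUs
)
```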
Qwen Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| Qwen2ForCausalLM | QwQ, Qwen2 | Qwen/Qwen2-7B-Instruct |
| Qwen3ForCausalLM | Qwen3 | Qwen/Qwen3-8B |
| Qwen3MoeForCausalLM | Qwen3 MoE | Qwen/Qwen3-30B-A3B |
| Qwen3NextForCausalLM | Qwen3-Next | Qwen/Qwen3-Next-80B-A3B-Thinking |
| Qwen3_5MoeForCausalLM | Qwen3.5-MoE | Qwen/Qwen3.5-397B-A17B |
| Qwen2VLForConditionalGeneration | Qwen2-VL (Vision) | Qwen2-VL models |
| Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL | Qwen2.5-VL models |
| Qwen3VLForConditionalGeneration | Qwen3-VL | Qwen3-VL models |
| Qwen3VLMoeForConditionalGeneration | Qwen3-VL-MoE | Qwen3-VL-MoE models |

Specialized Models:
  • Qwen2ForProcessRewardModel - Process reward modeling: Qwen/Qwen2.5-Math-PRM-7B
  • Qwen2ForRewardModel - Reward modeling: Qwen/Qwen2.5-Math-RM-72B
Gemma Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| GemmaForCausalLM | Gemma, Gemma 2 | Google Gemma models |
| Gemma3ForCausalLM | Gemma 3 | google/gemma-3-1b-it |
| Gemma3ForConditionalGeneration | Gemma 3 (Multimodal) | Gemma 3 vision models |
| RecurrentGemmaForCausalLM | RecurrentGemma | RecurrentGemma models |

Features:
  • Multi-query attention
  • Sliding window patterns (Gemma 3)
  • RoPE with local base frequency
DeepSeek Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| DeepseekForCausalLM | DeepSeek v1 | DeepSeek v1 models |
| DeepseekV2ForCausalLM | DeepSeek v2 | DeepSeek v2 models |
| DeepseekV3ForCausalLM | DeepSeek-V3 | deepseek-ai/DeepSeek-V3 |
| DeepseekV32ForCausalLM | DeepSeek-V3.2 | deepseek-ai/DeepSeek-V3.2 |

Features:
  • Multi-head Latent Attention (MLA)
  • Sparse Mixture of Experts
  • FP8 support with specialized kernels
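
DeepSeek's MTP heads can also drive speculative decoding through the LLM API. The sketch below follows the shape of the llm-api MTP example; MTPDecodingConfig and its field name are assumptions to verify against your installed version.

```python
# Sketch: Multi-Token Prediction (MTP) speculative decoding for DeepSeek-V3.
# MTPDecodingConfig and num_nextn_predict_layers are assumed from the
# llm-api examples; verify against your installed version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MTPDecodingConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    speculative_config=MTPDecodingConfig(num_nextn_predict_layers=1),
)
```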
Phi Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| PhiForCausalLM | Phi-1, Phi-2 | Microsoft Phi models |
| Phi3ForCausalLM | Phi-3, Phi-4 | microsoft/Phi-4 |
| Phi4MMForCausalLM | Phi-4 Multimodal | Phi-4 vision/audio models |
Other Architectures

| Architecture | Models | HuggingFace Example |
|---|---|---|
| BaichuanForCausalLM | Baichuan, Baichuan2 | Baichuan models |
| BloomForCausalLM | BLOOM | bigscience/bloom |
| ChatGLMForCausalLM | ChatGLM, ChatGLM2, ChatGLM3 | THUDM ChatGLM models |
| CohereForCausalLM | Command R, Command R+ | Cohere models |
| CogVLMForCausalLM | CogVLM (Vision) | CogVLM models |
| DbrxForCausalLM | DBRX | Databricks DBRX models |
| FalconForCausalLM | Falcon | TII Falcon models |
| GrokForCausalLM | Grok | xAI Grok models |
| MambaForCausalLM | Mamba | State-space models |
| MPTForCausalLM | MPT | MosaicML MPT models |
| OPTForCausalLM | OPT | Meta OPT models |

NVIDIA Nemotron Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| NemotronForCausalLM | Nemotron-3, Nemotron-4, Minitron | nvidia/Minitron-8B-Base |
| NemotronHForCausalLM | Nemotron-3-Nano | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 |
| NemotronNASForCausalLM | NemotronNAS | nvidia/Llama-3_3-Nemotron-Super-49B-v1 |
| DeciLMForCausalLM | Nemotron (DeciLM) | nvidia/Llama-3_1-Nemotron-51B-Instruct |
| NemotronH_Nano_VL_V2 | Nemotron Nano VL (Vision) | Nemotron vision models |

GLM Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| Glm4MoeForCausalLM | GLM-4.5, GLM-4.6, GLM-4.7 | THUDM/GLM-4-100B-A10B |
| Glm4MoeLiteForCausalLM | GLM-4.7-Flash | zai-org/GLM-4.7-Flash |

Other MoE Models

| Architecture | Models | HuggingFace Example |
|---|---|---|
| MiniMaxM2ForCausalLM | MiniMax M2/M2.1 | MiniMaxAI/MiniMax-M2 |
| ExaoneMoEForCausalLM | K-EXAONE | LGAI-EXAONE/K-EXAONE-236B-A23B |
| Exaone4ForCausalLM | EXAONE 4.0 | LGAI-EXAONE/EXAONE-4.0-32B |

Encoder & Encoder-Decoder Models

| Architecture | Models | HuggingFace Example |
|---|---|---|
| BertForSequenceClassification | BERT-based classifiers | textattack/bert-base-uncased-yelp-polarity |
| BertForQuestionAnswering | BERT-based QA | BERT QA models |
| RobertaForSequenceClassification | RoBERTa classifiers | RoBERTa models |
| WhisperEncoder | Whisper (Speech) | OpenAI Whisper models |

Speculative Decoding Models

| Architecture | Models | Purpose |
|---|---|---|
| MedusaForCausalLm | Medusa | Multi-head speculative decoding |
| EagleForCausalLM | EAGLE | Efficient speculative sampling |
| ReDrafterForLLaMALM | ReDrafter | Draft model for LLaMA |
| ReDrafterForQWenLM | ReDrafter | Draft model for Qwen |
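
Speculative decoding is configured on the target model rather than loaded as a standalone architecture. As a hedged sketch, an EAGLE draft head can be attached roughly as follows; EagleDecodingConfig and its fields are assumptions drawn from the llm-api examples, and the draft checkpoint path is a placeholder.

```python
# Sketch: attaching an EAGLE draft head to a target model.
# EagleDecodingConfig and its fields are assumed from the llm-api examples;
# "path/to/eagle-draft" is a placeholder, not a real checkpoint.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import EagleDecodingConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B",
    speculative_config=EagleDecodingConfig(
        max_draft_len=4,  # draft tokens proposed per step
        speculative_model_dir="path/to/eagle-draft",
    ),
)
```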

Vision & Multimodal Models

| Architecture | Modalities | Models |
|---|---|---|
| LlavaNextForConditionalGeneration | Language + Image | LLaVA-NeXT models |
| LlavaLlamaModel | Language + Image + Video | VILA models |
| CLIPVisionTransformer | Vision Encoder | CLIP models |
| HCXVisionForCausalLM | Language + Image | HCX Vision models |

Diffusion Models

| Architecture | Purpose | Models |
|---|---|---|
| DiT | Diffusion Transformer | DiT models |
| SD3Transformer2DModel | Stable Diffusion 3 | SD3 models |
| STDiT3Model | Spatiotemporal DiT | Video generation models |
| UNet | Diffusion UNet | Various UNet models |

Feature Support Matrix

Support for features varies by model architecture. The tables below show support for key production models.

Key LLM Models

| Model Architecture | Overlap Scheduler | CUDA Graph | Attention DP | Disaggregated Serving | Chunked Prefill | MTP |
|---|---|---|---|---|---|---|
| DeepseekV3ForCausalLM | | | | | ✓* | |
| DeepseekV32ForCausalLM | | | | | | |
| Glm4MoeForCausalLM | | | | Untested | | |
| Qwen3MoeForCausalLM | | | | | | - |
| Qwen3NextForCausalLM | | | - | Untested | | - |
| Llama4ForConditionalGeneration | | | | | | - |
| GptOssForCausalLM | | | | | | - |
  • Chunked Prefill for MLA (✓* above): DeepSeek V3’s Multi-head Latent Attention can only enable chunked prefill on SM100/SM103 GPUs.
  • KV Cache Reuse: for MLA architectures, KV cache reuse requires SM90/SM100/SM103 GPUs and a BF16/FP8 KV cache dtype.
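
Most of the columns above map to load-time options on the LLM API. A minimal sketch, assuming the enable_chunked_prefill flag and the CudaGraphConfig class from recent llmapi releases:

```python
# Sketch: enabling matrix features at load time. The flag and config class
# names are assumed from recent llmapi releases; verify for your version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    enable_chunked_prefill=True,          # for MLA models: SM100/SM103 only
    cuda_graph_config=CudaGraphConfig(),  # capture decode steps as CUDA graphs
)
```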

Multimodal Models

| Model Architecture | Modality | Overlap Scheduler | CUDA Graph | Chunked Prefill | EPD Disaggregated Serving |
|---|---|---|---|---|---|
| Gemma3ForConditionalGeneration | L + I | | | N/A | - |
| HCXVisionForCausalLM | L + I | | | - | - |
| LlavaNextForConditionalGeneration | L + I | | | | |
| Llama4ForConditionalGeneration | L + I | | | - | - |
| Mistral3ForConditionalGeneration | L + I | | | | - |
| Phi4MMForCausalLM | L + I + A | | | | - |
| Qwen2VLForConditionalGeneration | L + I + V | | | | - |
| Qwen2_5_VLForConditionalGeneration | L + I + V | | | | |
| Qwen3VLForConditionalGeneration | L + I + V | | | | |
| Qwen3VLMoeForConditionalGeneration | L + I + V | | | | |
| LlavaLlamaModel (VILA) | L + I + V | | | - | - |
| NemotronH_Nano_VL_V2 | L + I + V | | | | - |
Modality Legend:
  • L: Language (text)
  • I: Image
  • V: Video
  • A: Audio

Model Implementation Location

All model implementations are located in the TensorRT-LLM source:
```
tensorrt_llm/models/
├── llama/          # LLaMA family models
├── gpt/            # GPT family models
├── gemma/          # Gemma family models
├── qwen/           # Qwen family models
├── deepseek_v1/    # DeepSeek v1
├── deepseek_v2/    # DeepSeek v2/v3
├── falcon/         # Falcon models
├── phi/            # Phi models
├── phi3/           # Phi-3/4 models
└── ...
```
Each model directory contains:
  • config.py - Model configuration class
  • model.py - Model architecture implementation
  • convert.py - Weight conversion utilities
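
For the classic engine-building flow, these per-model modules are reached through classes exported from tensorrt_llm.models. A minimal sketch, assuming the from_hugging_face/save_checkpoint interface on the Llama implementation:

```python
# Sketch: converting HuggingFace weights via the per-model implementation.
# The from_hugging_face/save_checkpoint method names are assumptions to
# verify against your installed version.
from tensorrt_llm.models import LLaMAForCausalLM

model = LLaMAForCausalLM.from_hugging_face(
    "meta-llama/Meta-Llama-3.1-70B", dtype="bfloat16"
)
model.save_checkpoint("./trtllm_ckpt")  # checkpoint consumed by trtllm-build
```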

Using Supported Models

```python
from tensorrt_llm import LLM

# Load any supported model from HuggingFace
llm = LLM(model="meta-llama/Meta-Llama-3.1-70B")

# generate() takes a list of prompts and returns one result per prompt
outputs = llm.generate(["Hello, how are you?"])
for output in outputs:
    print(output.outputs[0].text)
```
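
Generation behavior is controlled per request with SamplingParams, which tensorrt_llm exports alongside LLM:

```python
# Controlling generation with SamplingParams.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-70B")
params = SamplingParams(max_tokens=64, temperature=0.8, top_p=0.95)

for output in llm.generate(["Hello, how are you?"], params):
    print(output.outputs[0].text)
```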

Next Steps

  • Model Configuration: Learn how to configure models with custom parameters
  • Custom Models: Add your own custom model architectures
  • Quantization: Optimize models with INT8, FP8, and INT4 quantization
  • Deployment Guide: Deploy models to production
