TensorRT-LLM supports a wide range of model architectures for optimized LLM inference on NVIDIA GPUs. This page provides a comprehensive overview of supported models, organized by category.

Model Architecture Support

The following tables list all supported model architectures, grouped by family, with links to example models on HuggingFace:

Decoder-Only Language Models

Llama Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| LlamaForCausalLM | Llama 3.1, Llama 3, Llama 2, LLaMA | meta-llama/Meta-Llama-3.1-70B |
| Llama4ForConditionalGeneration | Llama 4 | meta-llama/Llama-4-Scout-17B-16E-Instruct |
| MllamaForConditionalGeneration | Llama 3.2 (Vision) | meta-llama/Llama-3.2-11B-Vision |

Features:
  • Full tensor parallelism support
  • RoPE (Rotary Position Embedding)
  • GQA (Grouped Query Attention)
  • FP8/INT8 quantization support
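
These features are exposed at load time through the LLM API. The sketch below serves a Llama checkpoint with tensor parallelism and FP8 quantization; the QuantConfig/QuantAlgo surface is assumed from the llmapi module, so verify the exact argument names against your installed TensorRT-LLM version.

```python
# Minimal sketch: tensor parallelism + FP8 quantization for a Llama model.
# QuantConfig/QuantAlgo are assumed from the llmapi module; verify against
# your installed TensorRT-LLM version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B",
    tensor_parallel_size=4,  # shard attention and MLP weights across 4 GPUs
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
)
```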
GPT Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| GPTForCausalLM | GPT-2, GPT-3 | Various GPT models |
| GPTJForCausalLM | GPT-J | EleutherAI/gpt-j-6b |
| GPTNeoXForCausalLM | GPT-NeoX | EleutherAI/gpt-neox-20b |
| GptOssForCausalLM | GPT-OSS | openai/gpt-oss-120b |

Features:
  • Standard attention mechanisms
  • Position embeddings
  • Dense feedforward layers
Mistral Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| MistralForCausalLM | Mistral 7B, Mistral variants | mistralai/Mistral-7B-v0.1 |
| MixtralForCausalLM | Mixtral MoE models | mistralai/Mixtral-8x7B-v0.1 |
| Mistral3ForConditionalGeneration | Mistral 3 (Multimodal) | Mistral 3 vision models |

Features:
  • Sliding window attention
  • Sparse Mixture of Experts (MoE)
  • RoPE embeddings
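
For the sparse-MoE variants, expert weights can be distributed across GPUs in addition to tensor sharding. The sketch below assumes the moe_expert_parallel_size argument of the llmapi surface in recent releases; treat the name as an assumption to check for your version.

```python
# Sketch: serving Mixtral with experts distributed across GPUs.
# moe_expert_parallel_size is assumed from the llmapi argument surface of
# recent releases; verify the name in your version.
from tensorrt_llm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    tensor_parallel_size=2,      # shard the dense attention weights
    moe_expert_parallel_size=2,  # place different experts on different GPUs
)
```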
Qwen Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| Qwen2ForCausalLM | QwQ, Qwen2 | Qwen/Qwen2-7B-Instruct |
| Qwen3ForCausalLM | Qwen3 | Qwen/Qwen3-8B |
| Qwen3MoeForCausalLM | Qwen3 MoE | Qwen/Qwen3-30B-A3B |
| Qwen3NextForCausalLM | Qwen3-Next | Qwen/Qwen3-Next-80B-A3B-Thinking |
| Qwen3_5MoeForCausalLM | Qwen3.5-MoE | Qwen/Qwen3.5-397B-A17B |
| Qwen2VLForConditionalGeneration | Qwen2-VL (Vision) | Qwen2-VL models |
| Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL | Qwen2.5-VL models |
| Qwen3VLForConditionalGeneration | Qwen3-VL | Qwen3-VL models |
| Qwen3VLMoeForConditionalGeneration | Qwen3-VL-MoE | Qwen3-VL-MoE models |

Specialized Models:
  • Qwen2ForProcessRewardModel - Process reward modeling: Qwen/Qwen2.5-Math-PRM-7B
  • Qwen2ForRewardModel - Reward modeling: Qwen/Qwen2.5-Math-RM-72B
Gemma Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| GemmaForCausalLM | Gemma, Gemma 2 | Google Gemma models |
| Gemma3ForCausalLM | Gemma 3 | google/gemma-3-1b-it |
| Gemma3ForConditionalGeneration | Gemma 3 (Multimodal) | Gemma 3 vision models |
| RecurrentGemmaForCausalLM | RecurrentGemma | RecurrentGemma models |

Features:
  • Multi-query attention
  • Sliding window patterns (Gemma 3)
  • RoPE with local base frequency
DeepSeek Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| DeepseekForCausalLM | DeepSeek v1 | DeepSeek v1 models |
| DeepseekV2ForCausalLM | DeepSeek v2 | DeepSeek v2 models |
| DeepseekV3ForCausalLM | DeepSeek-V3 | deepseek-ai/DeepSeek-V3 |
| DeepseekV32ForCausalLM | DeepSeek-V3.2 | deepseek-ai/DeepSeek-V3.2 |

Features:
  • Multi-head Latent Attention (MLA)
  • Sparse Mixture of Experts
  • FP8 support with specialized kernels
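
DeepSeek's MTP heads can also drive speculative decoding through the LLM API. The sketch below follows the shape of the llm-api MTP example; MTPDecodingConfig and its field name are assumptions to verify against your installed version.

```python
# Sketch: Multi-Token Prediction (MTP) speculative decoding for DeepSeek-V3.
# MTPDecodingConfig and num_nextn_predict_layers are assumed from the
# llm-api examples; verify against your installed version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MTPDecodingConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    speculative_config=MTPDecodingConfig(num_nextn_predict_layers=1),
)
```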
Phi Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| PhiForCausalLM | Phi-1, Phi-2 | Microsoft Phi models |
| Phi3ForCausalLM | Phi-3, Phi-4 | microsoft/Phi-4 |
| Phi4MMForCausalLM | Phi-4 Multimodal | Phi-4 vision/audio models |
Other Architectures

| Architecture | Models | HuggingFace Example |
|---|---|---|
| BaichuanForCausalLM | Baichuan, Baichuan2 | Baichuan models |
| BloomForCausalLM | BLOOM | bigscience/bloom |
| ChatGLMForCausalLM | ChatGLM, ChatGLM2, ChatGLM3 | THUDM ChatGLM models |
| CohereForCausalLM | Command R, Command R+ | Cohere models |
| CogVLMForCausalLM | CogVLM (Vision) | CogVLM models |
| DbrxForCausalLM | DBRX | Databricks DBRX models |
| FalconForCausalLM | Falcon | TII Falcon models |
| GrokForCausalLM | Grok | xAI Grok models |
| MambaForCausalLM | Mamba | State-space models |
| MPTForCausalLM | MPT | MosaicML MPT models |
| OPTForCausalLM | OPT | Meta OPT models |

NVIDIA Nemotron Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| NemotronForCausalLM | Nemotron-3, Nemotron-4, Minitron | nvidia/Minitron-8B-Base |
| NemotronHForCausalLM | Nemotron-3-Nano | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 |
| NemotronNASForCausalLM | NemotronNAS | nvidia/Llama-3_3-Nemotron-Super-49B-v1 |
| DeciLMForCausalLM | Nemotron (DeciLM) | nvidia/Llama-3_1-Nemotron-51B-Instruct |
| NemotronH_Nano_VL_V2 | Nemotron Nano VL (Vision) | Nemotron vision models |

GLM Family

| Architecture | Models | HuggingFace Example |
|---|---|---|
| Glm4MoeForCausalLM | GLM-4.5, GLM-4.6, GLM-4.7 | THUDM/GLM-4-100B-A10B |
| Glm4MoeLiteForCausalLM | GLM-4.7-Flash | zai-org/GLM-4.7-Flash |

Other MoE Models

| Architecture | Models | HuggingFace Example |
|---|---|---|
| MiniMaxM2ForCausalLM | MiniMax M2/M2.1 | MiniMaxAI/MiniMax-M2 |
| ExaoneMoEForCausalLM | K-EXAONE | LGAI-EXAONE/K-EXAONE-236B-A23B |
| Exaone4ForCausalLM | EXAONE 4.0 | LGAI-EXAONE/EXAONE-4.0-32B |

Encoder & Encoder-Decoder Models

| Architecture | Models | HuggingFace Example |
|---|---|---|
| BertForSequenceClassification | BERT-based classifiers | textattack/bert-base-uncased-yelp-polarity |
| BertForQuestionAnswering | BERT-based QA | BERT QA models |
| RobertaForSequenceClassification | RoBERTa classifiers | RoBERTa models |
| WhisperEncoder | Whisper (Speech) | OpenAI Whisper models |

Speculative Decoding Models

| Architecture | Models | Purpose |
|---|---|---|
| MedusaForCausalLm | Medusa | Multi-head speculative decoding |
| EagleForCausalLM | EAGLE | Efficient speculative sampling |
| ReDrafterForLLaMALM | ReDrafter | Draft model for LLaMA |
| ReDrafterForQWenLM | ReDrafter | Draft model for Qwen |
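
Speculative decoding is configured on the target model rather than loaded as a standalone architecture. As a hedged sketch, an EAGLE draft head can be attached roughly as follows; EagleDecodingConfig and its fields are assumptions drawn from the llm-api examples, and the draft checkpoint path is a placeholder.

```python
# Sketch: attaching an EAGLE draft head to a target model.
# EagleDecodingConfig and its fields are assumed from the llm-api examples;
# "path/to/eagle-draft" is a placeholder, not a real checkpoint.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import EagleDecodingConfig

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B",
    speculative_config=EagleDecodingConfig(
        max_draft_len=4,  # draft tokens proposed per step
        speculative_model_dir="path/to/eagle-draft",
    ),
)
```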

Vision & Multimodal Models

| Architecture | Modalities | Models |
|---|---|---|
| LlavaNextForConditionalGeneration | Language + Image | LLaVA-NeXT models |
| LlavaLlamaModel | Language + Image + Video | VILA models |
| CLIPVisionTransformer | Vision Encoder | CLIP models |
| HCXVisionForCausalLM | Language + Image | HCX Vision models |

Diffusion Models

| Architecture | Purpose | Models |
|---|---|---|
| DiT | Diffusion Transformer | DiT models |
| SD3Transformer2DModel | Stable Diffusion 3 | SD3 models |
| STDiT3Model | Spatiotemporal DiT | Video generation models |
| UNet | Diffusion UNet | Various UNet models |

Feature Support Matrix

Support for features varies by model architecture. The tables below show support for key production models.

Key LLM Models

| Model Architecture | Overlap Scheduler | CUDA Graph | Attention DP | Disaggregated Serving | Chunked Prefill | MTP |
|---|---|---|---|---|---|---|
| DeepseekV3ForCausalLM | | | | | ✓* | |
| DeepseekV32ForCausalLM | | | | | | |
| Glm4MoeForCausalLM | | | | Untested | | |
| Qwen3MoeForCausalLM | | | | | | - |
| Qwen3NextForCausalLM | | | - | Untested | | - |
| Llama4ForConditionalGeneration | | | | | | - |
| GptOssForCausalLM | | | | | | - |
  • Chunked Prefill for MLA (✓* above): DeepSeek V3’s Multi-head Latent Attention can only enable chunked prefill on SM100/SM103 GPUs.
  • KV Cache Reuse: for MLA architectures, KV cache reuse requires SM90/SM100/SM103 GPUs and a BF16/FP8 KV cache dtype.
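
Most of the columns above map to load-time options on the LLM API. A minimal sketch, assuming the enable_chunked_prefill flag and the CudaGraphConfig class from recent llmapi releases:

```python
# Sketch: enabling matrix features at load time. The flag and config class
# names are assumed from recent llmapi releases; verify for your version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    enable_chunked_prefill=True,          # for MLA models: SM100/SM103 only
    cuda_graph_config=CudaGraphConfig(),  # capture decode steps as CUDA graphs
)
```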

Multimodal Models

| Model Architecture | Modality | Overlap Scheduler | CUDA Graph | Chunked Prefill | EPD Disaggregated Serving |
|---|---|---|---|---|---|
| Gemma3ForConditionalGeneration | L + I | | | N/A | - |
| HCXVisionForCausalLM | L + I | | | - | - |
| LlavaNextForConditionalGeneration | L + I | | | | |
| Llama4ForConditionalGeneration | L + I | | | - | - |
| Mistral3ForConditionalGeneration | L + I | | | | - |
| Phi4MMForCausalLM | L + I + A | | | | - |
| Qwen2VLForConditionalGeneration | L + I + V | | | | - |
| Qwen2_5_VLForConditionalGeneration | L + I + V | | | | |
| Qwen3VLForConditionalGeneration | L + I + V | | | | |
| Qwen3VLMoeForConditionalGeneration | L + I + V | | | | |
| LlavaLlamaModel (VILA) | L + I + V | | | - | - |
| NemotronH_Nano_VL_V2 | L + I + V | | | | - |
Modality Legend:
  • L: Language (text)
  • I: Image
  • V: Video
  • A: Audio

Model Implementation Location

All model implementations are located in the TensorRT-LLM source:
```
tensorrt_llm/models/
├── llama/          # LLaMA family models
├── gpt/            # GPT family models
├── gemma/          # Gemma family models
├── qwen/           # Qwen family models
├── deepseek_v1/    # DeepSeek v1
├── deepseek_v2/    # DeepSeek v2/v3
├── falcon/         # Falcon models
├── phi/            # Phi models
├── phi3/           # Phi-3/4 models
└── ...
```
Each model directory contains:
  • config.py - Model configuration class
  • model.py - Model architecture implementation
  • convert.py - Weight conversion utilities
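
For the classic engine-building flow, these per-model modules are reached through classes exported from tensorrt_llm.models. A minimal sketch, assuming the from_hugging_face/save_checkpoint interface on the Llama implementation:

```python
# Sketch: converting HuggingFace weights via the per-model implementation.
# The from_hugging_face/save_checkpoint method names are assumptions to
# verify against your installed version.
from tensorrt_llm.models import LLaMAForCausalLM

model = LLaMAForCausalLM.from_hugging_face(
    "meta-llama/Meta-Llama-3.1-70B", dtype="bfloat16"
)
model.save_checkpoint("./trtllm_ckpt")  # checkpoint consumed by trtllm-build
```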

Using Supported Models

```python
from tensorrt_llm import LLM

# Load any supported model from HuggingFace
llm = LLM(model="meta-llama/Meta-Llama-3.1-70B")

# generate() takes a list of prompts and returns one result per prompt
outputs = llm.generate(["Hello, how are you?"])
for output in outputs:
    print(output.outputs[0].text)
```
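
Generation behavior is controlled per request with SamplingParams, which tensorrt_llm exports alongside LLM:

```python
# Controlling generation with SamplingParams.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-70B")
params = SamplingParams(max_tokens=64, temperature=0.8, top_p=0.95)

for output in llm.generate(["Hello, how are you?"], params):
    print(output.outputs[0].text)
```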

Next Steps

  • Model Configuration: Learn how to configure models with custom parameters
  • Custom Models: Add your own custom model architectures
  • Quantization: Optimize models with INT8, FP8, and INT4 quantization
  • Deployment Guide: Deploy models to production
