# Supported Models

> Complete list of model architectures supported by TensorRT-LLM

TensorRT-LLM supports a wide range of model architectures for optimized LLM inference on NVIDIA GPUs. This page provides a comprehensive overview of supported models, organized by category.

## Model Architecture Support

The following tables list supported model architectures, with links to example models on HuggingFace where available:

### Decoder-Only Language Models

<AccordionGroup>
  <Accordion title="LLaMA Family" icon="meta">
    | Architecture                     | Models                             | HuggingFace Example                         |
    | -------------------------------- | ---------------------------------- | ------------------------------------------- |
    | `LlamaForCausalLM`               | Llama 3.1, Llama 3, Llama 2, LLaMA | `meta-llama/Meta-Llama-3.1-70B`             |
    | `Llama4ForConditionalGeneration` | Llama 4                            | `meta-llama/Llama-4-Scout-17B-16E-Instruct` |
    | `MllamaForConditionalGeneration` | Llama 3.2 (Vision)                 | `meta-llama/Llama-3.2-11B-Vision`           |

    **Features:**

    * Full tensor parallelism support
    * RoPE (Rotary Position Embedding)
    * GQA (Grouped Query Attention)
    * FP8/INT8 quantization support
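
    GQA shares each key/value head across a group of query heads, which shrinks the KV cache by the group factor. A back-of-the-envelope sketch of that arithmetic, using Llama-3.1-70B-like shapes (treat the exact numbers as illustrative):

    ```python
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
        """Per-sequence KV-cache size: one K and one V tensor per layer."""
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

    # Llama-3.1-70B-like shapes: 80 layers, 64 query heads, 8 KV heads, head_dim 128
    n_q_heads, n_kv_heads = 64, 8
    group_size = n_q_heads // n_kv_heads  # each KV head serves 8 query heads

    mha = kv_cache_bytes(80, n_q_heads, 128, seq_len=4096)   # ungrouped baseline
    gqa = kv_cache_bytes(80, n_kv_heads, 128, seq_len=4096)  # grouped-query
    print(group_size, mha // gqa)  # cache shrinks by exactly the group factor
    ```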
  </Accordion>

  <Accordion title="GPT Family" icon="openai">
    | Architecture         | Models              | HuggingFace Example       |
    | -------------------- | ------------------- | ------------------------- |
    | `GPTForCausalLM`     | GPT-2, GPT-3, GPT-J | Various GPT models        |
    | `GPTJForCausalLM`    | GPT-J               | `EleutherAI/gpt-j-6b`     |
    | `GPTNeoXForCausalLM` | GPT-NeoX            | `EleutherAI/gpt-neox-20b` |
    | `GptOssForCausalLM`  | GPT-OSS             | `openai/gpt-oss-120b`     |

    **Features:**

    * Standard attention mechanisms
    * Position embeddings
    * Dense feedforward layers
  </Accordion>

  <Accordion title="Mistral & Mixtral" icon="stars">
    | Architecture                       | Models                       | HuggingFace Example           |
    | ---------------------------------- | ---------------------------- | ----------------------------- |
    | `MistralForCausalLM`               | Mistral 7B, Mistral variants | `mistralai/Mistral-7B-v0.1`   |
    | `MixtralForCausalLM`               | Mixtral MoE models           | `mistralai/Mixtral-8x7B-v0.1` |
    | `Mistral3ForConditionalGeneration` | Mistral 3 (Multimodal)       | Mistral 3 vision models       |

    **Features:**

    * Sliding window attention
    * Sparse Mixture of Experts (MoE)
    * RoPE embeddings
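
    Mixtral's sparse MoE layers route each token to the top-2 of 8 experts and mix their outputs with renormalized router weights. A toy sketch of that routing step (made-up logits, no real model weights):

    ```python
    import math

    def top2_route(logits):
        """Pick the 2 highest-scoring experts and renormalize their softmax weights."""
        top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
        exp = [math.exp(logits[i]) for i in top2]
        total = sum(exp)
        return [(i, w / total) for i, w in zip(top2, exp)]

    # 8 experts (Mixtral-style); one token's router logits (illustrative numbers)
    logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.2]
    routes = top2_route(logits)
    print(routes)  # experts 1 and 4 carry this token; their weights sum to 1
    ```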
  </Accordion>

  <Accordion title="Qwen Family" icon="robot">
    | Architecture                         | Models            | HuggingFace Example                |
    | ------------------------------------ | ----------------- | ---------------------------------- |
    | `Qwen2ForCausalLM`                   | QwQ, Qwen2        | `Qwen/Qwen2-7B-Instruct`           |
    | `Qwen3ForCausalLM`                   | Qwen3             | `Qwen/Qwen3-8B`                    |
    | `Qwen3MoeForCausalLM`                | Qwen3 MoE         | `Qwen/Qwen3-30B-A3B`               |
    | `Qwen3NextForCausalLM`               | Qwen3-Next        | `Qwen/Qwen3-Next-80B-A3B-Thinking` |
    | `Qwen3_5MoeForCausalLM`              | Qwen3.5-MoE       | `Qwen/Qwen3.5-397B-A17B`           |
    | `Qwen2VLForConditionalGeneration`    | Qwen2-VL (Vision) | Qwen2-VL models                    |
    | `Qwen2_5_VLForConditionalGeneration` | Qwen2.5-VL        | Qwen2.5-VL models                  |
    | `Qwen3VLForConditionalGeneration`    | Qwen3-VL          | Qwen3-VL models                    |
    | `Qwen3VLMoeForConditionalGeneration` | Qwen3-VL-MoE      | Qwen3-VL-MoE models                |

    **Specialized Models:**

    * `Qwen2ForProcessRewardModel` - Process reward modeling: `Qwen/Qwen2.5-Math-PRM-7B`
    * `Qwen2ForRewardModel` - Reward modeling: `Qwen/Qwen2.5-Math-RM-72B`
  </Accordion>

  <Accordion title="Gemma Family" icon="sparkles">
    | Architecture                     | Models               | HuggingFace Example    |
    | -------------------------------- | -------------------- | ---------------------- |
    | `GemmaForCausalLM`               | Gemma, Gemma 2       | Google Gemma models    |
    | `Gemma3ForCausalLM`              | Gemma 3              | `google/gemma-3-1b-it` |
    | `Gemma3ForConditionalGeneration` | Gemma 3 (Multimodal) | Gemma 3 vision models  |
    | `RecurrentGemmaForCausalLM`      | RecurrentGemma       | RecurrentGemma models  |

    **Features:**

    * Multi-query attention
    * Sliding window patterns (Gemma 3)
    * RoPE with local base frequency
  </Accordion>

  <Accordion title="DeepSeek Family" icon="brain">
    | Architecture             | Models        | HuggingFace Example         |
    | ------------------------ | ------------- | --------------------------- |
    | `DeepseekForCausalLM`    | DeepSeek v1   | DeepSeek v1 models          |
    | `DeepseekV2ForCausalLM`  | DeepSeek v2   | DeepSeek v2 models          |
    | `DeepseekV3ForCausalLM`  | DeepSeek-V3   | `deepseek-ai/DeepSeek-V3`   |
    | `DeepseekV32ForCausalLM` | DeepSeek-V3.2 | `deepseek-ai/DeepSeek-V3.2` |

    **Features:**

    * Multi-head Latent Attention (MLA)
    * Sparse Mixture of Experts
    * FP8 support with specialized kernels
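
    MLA's saving comes from caching one shared low-rank latent per token instead of per-head K/V tensors. A rough per-token, per-layer comparison using DeepSeek-V3-like dimensions (the numbers are illustrative):

    ```python
    # Per-token, per-layer KV-cache elements (DeepSeek-V3-like dims, illustrative)
    n_heads, head_dim = 128, 128      # a standard MHA cache stores K and V per head
    kv_lora_rank, rope_dim = 512, 64  # MLA stores one shared latent + a RoPE part

    mha_elems = 2 * n_heads * head_dim   # separate K and V across all heads
    mla_elems = kv_lora_rank + rope_dim  # single compressed latent per token
    print(mha_elems // mla_elems)  # order-of-magnitude cache reduction
    ```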
  </Accordion>

  <Accordion title="Phi Family" icon="microsoft">
    | Architecture        | Models           | HuggingFace Example       |
    | ------------------- | ---------------- | ------------------------- |
    | `PhiForCausalLM`    | Phi-1, Phi-2     | Microsoft Phi models      |
    | `Phi3ForCausalLM`   | Phi-3, Phi-4     | `microsoft/Phi-4`         |
    | `Phi4MMForCausalLM` | Phi-4 Multimodal | Phi-4 vision/audio models |
  </Accordion>

  <Accordion title="Other Decoder Models" icon="grid">
    | Architecture          | Models                      | HuggingFace Example    |
    | --------------------- | --------------------------- | ---------------------- |
    | `BaichuanForCausalLM` | Baichuan, Baichuan2         | Baichuan models        |
    | `BloomForCausalLM`    | BLOOM                       | `bigscience/bloom`     |
    | `ChatGLMForCausalLM`  | ChatGLM, ChatGLM2, ChatGLM3 | THUDM ChatGLM models   |
    | `CohereForCausalLM`   | Command R, Command R+       | Cohere models          |
    | `CogVLMForCausalLM`   | CogVLM (Vision)             | CogVLM models          |
    | `DbrxForCausalLM`     | DBRX                        | Databricks DBRX models |
    | `FalconForCausalLM`   | Falcon                      | TII Falcon models      |
    | `GrokForCausalLM`     | Grok                        | xAI Grok models        |
    | `MambaForCausalLM`    | Mamba                       | State-space models     |
    | `MPTForCausalLM`      | MPT                         | MosaicML MPT models    |
    | `OPTForCausalLM`      | OPT                         | Meta OPT models        |
  </Accordion>
</AccordionGroup>

### NVIDIA Nemotron Family

| Architecture             | Models                           | HuggingFace Example                         |
| ------------------------ | -------------------------------- | ------------------------------------------- |
| `NemotronForCausalLM`    | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base`                   |
| `NemotronHForCausalLM`   | Nemotron-3-Nano                  | `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8` |
| `NemotronNASForCausalLM` | NemotronNAS                      | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`    |
| `DeciLMForCausalLM`      | Nemotron (DeciLM)                | `nvidia/Llama-3_1-Nemotron-51B-Instruct`    |
| `NemotronH_Nano_VL_V2`   | Nemotron Nano VL (Vision)        | Nemotron vision models                      |

### GLM Family

| Architecture             | Models                    | HuggingFace Example     |
| ------------------------ | ------------------------- | ----------------------- |
| `Glm4MoeForCausalLM`     | GLM-4.5, GLM-4.6, GLM-4.7 | `THUDM/GLM-4-100B-A10B` |
| `Glm4MoeLiteForCausalLM` | GLM-4.7-Flash             | `zai-org/GLM-4.7-Flash` |

### Other MoE Models

| Architecture           | Models          | HuggingFace Example              |
| ---------------------- | --------------- | -------------------------------- |
| `MiniMaxM2ForCausalLM` | MiniMax M2/M2.1 | `MiniMaxAI/MiniMax-M2`           |
| `ExaoneMoEForCausalLM` | K-EXAONE        | `LGAI-EXAONE/K-EXAONE-236B-A23B` |
| `Exaone4ForCausalLM`   | EXAONE 4.0      | `LGAI-EXAONE/EXAONE-4.0-32B`     |

### Encoder-Decoder Models

| Architecture                       | Models                 | HuggingFace Example                          |
| ---------------------------------- | ---------------------- | -------------------------------------------- |
| `BertForSequenceClassification`    | BERT-based classifiers | `textattack/bert-base-uncased-yelp-polarity` |
| `BertForQuestionAnswering`         | BERT-based QA          | BERT QA models                               |
| `RobertaForSequenceClassification` | RoBERTa classifiers    | RoBERTa models                               |
| `WhisperEncoder`                   | Whisper (Speech)       | OpenAI Whisper models                        |

### Speculative Decoding Models

| Architecture          | Models    | Purpose                         |
| --------------------- | --------- | ------------------------------- |
| `MedusaForCausalLm`   | Medusa    | Multi-head speculative decoding |
| `EagleForCausalLM`    | EAGLE     | Efficient speculative sampling  |
| `ReDrafterForLLaMALM` | ReDrafter | Draft model for LLaMA           |
| `ReDrafterForQWenLM`  | ReDrafter | Draft model for Qwen            |
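
These approaches share the same draft-then-verify loop: a cheap drafter proposes a few tokens, and the target model accepts the longest prefix it agrees with, so several tokens can be emitted per target step. A toy greedy sketch with stand-in model functions (both hypothetical):

```python
def speculative_step(target_next, draft_next, prompt, k=4):
    """Draft k tokens greedily, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(prompt)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prompt)
    for tok in draft:
        expect = target_next(ctx)  # in practice: one batched target forward pass
        if expect != tok:
            accepted.append(expect)  # take the target's token and stop
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Stand-in "models": the target counts by 1; the drafter goes wrong after two tokens.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if len(ctx) < 4 else 99
print(speculative_step(target, draft, [0, 1]))  # three tokens from one target step
```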

### Vision & Multimodal Models

| Architecture                        | Modalities               | Models            |
| ----------------------------------- | ------------------------ | ----------------- |
| `LlavaNextForConditionalGeneration` | Language + Image         | LLaVA-NeXT models |
| `LlavaLlamaModel`                   | Language + Image + Video | VILA models       |
| `CLIPVisionTransformer`             | Vision Encoder           | CLIP models       |
| `HCXVisionForCausalLM`              | Language + Image         | HCX Vision models |

### Diffusion Models

| Architecture            | Purpose               | Models                  |
| ----------------------- | --------------------- | ----------------------- |
| `DiT`                   | Diffusion Transformer | DiT models              |
| `SD3Transformer2DModel` | Stable Diffusion 3    | SD3 models              |
| `STDiT3Model`           | Spatiotemporal DiT    | Video generation models |
| `UNet`                  | Diffusion UNet        | Various UNet models     |

## Feature Support Matrix

<Note>
  Feature support varies by model architecture. The tables below show support for key production models.
</Note>

### Key LLM Models

| Model Architecture               | Overlap Scheduler | CUDA Graph | Attention DP | Disaggregated Serving | Chunked Prefill | MTP |
| -------------------------------- | ----------------- | ---------- | ------------ | --------------------- | --------------- | --- |
| `DeepseekV3ForCausalLM`          | ✓                 | ✓          | ✓            | ✓                     | ✓\*             | ✓   |
| `DeepseekV32ForCausalLM`         | ✓                 | ✓          | ✓            | ✓                     | ✓               | ✓   |
| `Glm4MoeForCausalLM`             | ✓                 | ✓          | ✓            | Untested              | ✓               | ✓   |
| `Qwen3MoeForCausalLM`            | ✓                 | ✓          | ✓            | ✓                     | ✓               | -   |
| `Qwen3NextForCausalLM`           | ✓                 | ✓          | -            | Untested              | ✓               | -   |
| `Llama4ForConditionalGeneration` | ✓                 | ✓          | ✓            | ✓                     | ✓               | -   |
| `GptOssForCausalLM`              | ✓                 | ✓          | ✓            | ✓                     | ✓               | -   |

<Info>
  **Chunked Prefill for MLA**: For DeepSeek V3's Multi-head Latent Attention, chunked prefill is supported only on SM100/SM103 GPUs.

  **KV Cache Reuse**: For MLA architectures, KV cache reuse requires SM90/SM100/SM103 GPUs and BF16/FP8 KV cache dtype.
</Info>
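
Chunked prefill bounds per-iteration work by processing a long prompt in fixed-size token slices, which lets the scheduler interleave prefill with other requests' decode steps. The slicing itself is simple to picture:

```python
def chunk_prompt(tokens, chunk_size):
    """Split prompt tokens into fixed-size slices; the last slice may be shorter."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# A 10-token prompt prefills in three scheduler iterations at chunk size 4.
sizes = [len(c) for c in chunk_prompt(list(range(10)), 4)]
print(sizes)  # [4, 4, 2]
```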

### Multimodal Models

| Model Architecture                   | Modality  | Overlap Scheduler | CUDA Graph | Chunked Prefill | EPD Disaggregated Serving |
| ------------------------------------ | --------- | ----------------- | ---------- | --------------- | ------------------------- |
| `Gemma3ForConditionalGeneration`     | L + I     | ✓                 | ✓          | N/A             | -                         |
| `HCXVisionForCausalLM`               | L + I     | ✓                 | ✓          | -               | -                         |
| `LlavaNextForConditionalGeneration`  | L + I     | ✓                 | ✓          | ✓               | ✓                         |
| `Llama4ForConditionalGeneration`     | L + I     | ✓                 | ✓          | -               | -                         |
| `Mistral3ForConditionalGeneration`   | L + I     | ✓                 | ✓          | ✓               | -                         |
| `Phi4MMForCausalLM`                  | L + I + A | ✓                 | ✓          | ✓               | -                         |
| `Qwen2VLForConditionalGeneration`    | L + I + V | ✓                 | ✓          | ✓               | -                         |
| `Qwen2_5_VLForConditionalGeneration` | L + I + V | ✓                 | ✓          | ✓               | ✓                         |
| `Qwen3VLForConditionalGeneration`    | L + I + V | ✓                 | ✓          | ✓               | ✓                         |
| `Qwen3VLMoeForConditionalGeneration` | L + I + V | ✓                 | ✓          | ✓               | ✓                         |
| `LlavaLlamaModel` (VILA)             | L + I + V | ✓                 | ✓          | -               | -                         |
| `NemotronH_Nano_VL_V2`               | L + I + V | ✓                 | ✓          | ✓               | -                         |

**Modality Legend:**

* L: Language (text)
* I: Image
* V: Video
* A: Audio

## Model Implementation Location

All model implementations are located in the TensorRT-LLM source:

```
tensorrt_llm/models/
├── llama/          # LLaMA family models
├── gpt/            # GPT family models
├── gemma/          # Gemma family models
├── qwen/           # Qwen family models
├── deepseek_v1/    # DeepSeek v1
├── deepseek_v2/    # DeepSeek v2/v3
├── falcon/         # Falcon models
├── phi/            # Phi models
├── phi3/           # Phi-3/4 models
└── ...
```

Each model directory contains:

* `config.py` - Model configuration class
* `model.py` - Model architecture implementation
* `convert.py` - Weight conversion utilities
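
As a rough illustration of that split, a config class plus a weight-name mapping might look like this (the class and field names are hypothetical, not TensorRT-LLM's actual code):

```python
from dataclasses import dataclass

@dataclass
class ToyModelConfig:
    """Hypothetical sketch of what a model's config.py typically holds."""
    vocab_size: int
    hidden_size: int
    num_layers: int
    num_heads: int
    dtype: str = "bfloat16"

    @classmethod
    def from_hf_config(cls, hf: dict) -> "ToyModelConfig":
        # convert.py-style mapping from HuggingFace config field names
        return cls(
            vocab_size=hf["vocab_size"],
            hidden_size=hf["hidden_size"],
            num_layers=hf["num_hidden_layers"],
            num_heads=hf["num_attention_heads"],
        )

cfg = ToyModelConfig.from_hf_config({
    "vocab_size": 128256, "hidden_size": 8192,
    "num_hidden_layers": 80, "num_attention_heads": 64,
})
print(cfg.num_layers, cfg.dtype)
```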

## Using Supported Models

<CodeGroup>
  ```python Basic Usage
  from tensorrt_llm import LLM

  # Load any supported model from HuggingFace
  llm = LLM(model="meta-llama/Meta-Llama-3.1-70B")

  # generate() returns one RequestOutput per prompt
  outputs = llm.generate(["Hello, how are you?"])
  print(outputs[0].outputs[0].text)
  ```

  ```python With Quantization
  from tensorrt_llm import LLM
  from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

  # Import paths and config fields may vary across releases;
  # check the quantization docs for your installed version.
  llm = LLM(
      model="meta-llama/Meta-Llama-3.1-70B",
      tensor_parallel_size=4,
      quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
  )

  outputs = llm.generate(["Explain quantum computing"])
  print(outputs[0].outputs[0].text)
  ```

  ```python Multimodal
  from tensorrt_llm import LLM

  llm = LLM(model="meta-llama/Llama-3.2-11B-Vision")

  # The exact multimodal input schema may vary by TensorRT-LLM version;
  # see the multimodal examples for your release.
  outputs = llm.generate([{
      "prompt": "What's in this image?",
      "multi_modal_data": {"image": ["path/to/image.jpg"]},
  }])
  print(outputs[0].outputs[0].text)
  ```
</CodeGroup>

## Next Steps

<CardGroup cols={2}>
  <Card title="Model Configuration" icon="sliders" href="/models/model-configuration">
    Learn how to configure models with custom parameters
  </Card>

  <Card title="Custom Models" icon="code" href="/models/custom-models">
    Add your own custom model architectures
  </Card>

  <Card title="Quantization" icon="compress" href="/features/quantization">
    Optimize models with INT8, FP8, and INT4 quantization
  </Card>

  <Card title="Deployment Guide" icon="rocket" href="/deployment-guide/overview">
    Deploy models to production
  </Card>
</CardGroup>

