vLLM supports generative and pooling models across various tasks. For each task, we list the model architectures that have been implemented in vLLM alongside popular models that use them.

Model categories

vLLM organizes models into three main categories:

Generative models

Text generation, chat, and multimodal language models

Multimodal models

Vision-language, audio, and video models

Pooling models

Embedding, classification, and reward models

Model implementation

Native vLLM support

If vLLM natively supports a model, its implementation can be found in vllm/model_executor/models/. These models are listed in the supported model tables and typically offer the best performance.

Transformers modeling backend

vLLM also supports model implementations available in Transformers. You should expect the performance of a Transformers model implementation used in vLLM to be within 5% of a dedicated vLLM implementation.
The Transformers modeling backend works for:
  • Modalities: embedding models, language models, and vision-language models
  • Architectures: encoder-only, decoder-only, and mixture-of-experts
  • Attention types: full attention and sliding-window attention
To check if the modeling backend is Transformers:
from vllm import LLM
llm = LLM(model="...")  # Name or path of your model
llm.apply_model(lambda model: print(type(model)))
If the printed type starts with Transformers..., it’s using the Transformers implementation. To force the Transformers backend:
# Offline inference
llm = LLM(model="your-model", model_impl="transformers")

# Online serving
vllm serve your-model --model-impl transformers

Plugin support

Some model architectures are supported via vLLM plugins:
Architecture                        Models      Plugin Repository
BartForConditionalGeneration        BART        bart-plugin
Florence2ForConditionalGeneration   Florence-2  bart-plugin
For other encoder-decoder models not natively supported, we recommend implementing support through the plugin system.
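A plugin is an ordinary Python package that registers its model classes with vLLM at startup through the vllm.general_plugins entry-point group. As a rough sketch (the package and function names here are hypothetical, not from an actual plugin):

```toml
# pyproject.toml of a hypothetical plugin package.
# vLLM discovers installed plugins via the "vllm.general_plugins"
# entry-point group and calls each referenced function at startup.
[project]
name = "my-model-plugin"  # hypothetical package name

[project.entry-points."vllm.general_plugins"]
register_my_model = "my_model_plugin:register"
```

The referenced register function would then call vLLM's ModelRegistry to map the architecture name from config.json to the plugin's model class, making it loadable like any natively supported model.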

Loading models

From Hugging Face Hub

By default, vLLM loads models from Hugging Face Hub. To change the download path, set the HF_HOME environment variable. To check if a model is natively supported, look at the architectures field in the model’s config.json. If it matches an architecture in the supported model tables, it should work natively.
The easiest way to check if your model works at runtime:
from vllm import LLM

# For generative models
llm = LLM(model="...", runner="generate")
output = llm.generate("Hello, my name is")
print(output)

# For pooling models
llm = LLM(model="...", runner="pooling")
output = llm.encode("Hello, my name is")
print(output)
If vLLM successfully returns output, your model is supported.
Use the Hugging Face CLI to download models:
# Download a model
hf download HuggingFaceH4/zephyr-7b-beta

# Specify custom cache directory
hf download HuggingFaceH4/zephyr-7b-beta --cache-dir ./path/to/cache

# Download specific file
hf download HuggingFaceH4/zephyr-7b-beta eval_results.json
List downloaded models:
hf scan-cache
hf scan-cache -v  # verbose output
Delete cached models:
hf delete-cache

From ModelScope

To use models from ModelScope, set the environment variable before launching:
export VLLM_USE_MODELSCOPE=True
Then, in Python:
from vllm import LLM

llm = LLM(model="...", trust_remote_code=True)
output = llm.generate("Hello, my name is")
print(output)

Feature status legend

Throughout the model tables:
  • ✅ Feature is fully supported
  • 🚧 Feature is planned but not yet supported
  • ⚠️ Feature is available but may have known issues or limitations
  • (blank) Feature status is unknown or not applicable

Next steps

Generative models

Explore text and multimodal generation models

Pooling models

Learn about embedding and classification models

Quantization

Reduce model size with quantization

Add a model

Contribute support for new models
