## Model categories

vLLM organizes models into three main categories:

- **Generative models**: text generation, chat, and multimodal language models
- **Multimodal models**: vision-language, audio, and video models
- **Pooling models**: embedding, classification, and reward models
## Model implementation

### Native vLLM support

If vLLM natively supports a model, its implementation can be found in `vllm/model_executor/models/`. These models are listed in the supported model tables and typically offer the best performance.
### Transformers modeling backend

vLLM also supports model implementations available in Transformers. You should expect the performance of a Transformers model implementation used in vLLM to be within 5% of a dedicated vLLM implementation.

The Transformers modeling backend works for:
- Modalities: Embedding models, language models, and vision-language models
- Architectures: Encoder-only, decoder-only, mixture-of-experts
- Attention types: Full attention and/or sliding attention
If vLLM's startup logs indicate that the Transformers backend was selected, it's using the Transformers implementation.
To force the Transformers backend:
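A minimal sketch of forcing the fallback, assuming the `--model-impl` engine argument available in recent vLLM versions; the model name is only an example:

```shell
# Serve a model using the Transformers modeling backend instead of a
# native vLLM implementation (model name is only an example).
vllm serve meta-llama/Llama-3.2-1B-Instruct --model-impl transformers
```

The same option can be passed to offline inference as `LLM(..., model_impl="transformers")`.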
### Plugin support

Some model architectures are supported via vLLM plugins:

| Architecture | Models | Plugin Repository |
|---|---|---|
| BartForConditionalGeneration | BART | bart-plugin |
| Florence2ForConditionalGeneration | Florence-2 | bart-plugin |
## Loading models

### From Hugging Face Hub

By default, vLLM loads models from Hugging Face Hub. To change the download path, set the `HF_HOME` environment variable.
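For example (the cache path here is arbitrary):

```shell
# Point the Hugging Face cache (and therefore vLLM's model downloads)
# at a custom directory; the path is only an example.
export HF_HOME=/data/hf-cache
```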
To check if a model is natively supported, look at the `architectures` field in the model's `config.json`. If it matches an architecture in the supported model tables, it should work natively.
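As a sketch, the field can be read with Python's standard library; the config contents below are illustrative, not taken from a real checkpoint:

```python
import json

# Illustrative config.json contents for a hypothetical Llama checkpoint
config_text = '{"model_type": "llama", "architectures": ["LlamaForCausalLM"]}'
config = json.loads(config_text)

# The architecture name is what gets matched against the supported model tables
print(config["architectures"][0])  # LlamaForCausalLM
```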
#### Test if your model is supported

The easiest way to check whether your model works is to run it. If vLLM successfully returns output, your model is supported.
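A minimal runtime check, assuming vLLM is installed and a GPU is available; the model name is only an example:

```python
from vllm import LLM

# Loading the model fails here if the architecture is unsupported
# (model name is only an example).
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")

# If generation returns output, the model is supported.
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```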
#### Download models manually

Use the Hugging Face CLI to download models, list downloaded models, and delete cached models:
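A sketch using `huggingface-cli`; the repository name is only an example:

```shell
# Download a model's files into the local Hugging Face cache
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct

# List models already present in the cache
huggingface-cli scan-cache

# Interactively delete cached models
huggingface-cli delete-cache
```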
### From ModelScope

To use models from ModelScope, set the `VLLM_USE_MODELSCOPE` environment variable to `True` before starting vLLM.

## Feature status legend

Throughout the model tables:

- ✅ Feature is fully supported
- 🚧 Feature is planned but not yet supported
- ⚠️ Feature is available but may have known issues or limitations
- (blank) Feature status is unknown or not applicable
## Next steps

- **Generative models**: explore text and multimodal generation models
- **Pooling models**: learn about embedding and classification models
- **Quantization**: reduce model size with quantization
- **Add a model**: contribute support for new models