How generative models work
In vLLM, generative models implement the `VllmModelForTextGeneration` interface. Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, which are then passed through a `Sampler` to obtain the final text.
Configuration
Model runner
Run a model in generation mode via `--runner generate`.
In most cases, you don't need to set this option, as vLLM automatically detects the model runner via `--runner auto`.
Offline inference
The `LLM` class provides various methods for offline inference.
Basic generation
The `generate()` method is available to all generative models. It is similar to HuggingFace Transformers' `generate()`, except that tokenization and detokenization are performed automatically.
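A minimal sketch of basic generation (the model name is illustrative; any supported HuggingFace causal LM works, and running this requires a GPU and the model weights):

```python
from vllm import LLM

# Load a small example model; tokenization is handled automatically.
llm = LLM(model="facebook/opt-125m")

# generate() accepts plain text prompts and returns RequestOutput objects.
outputs = llm.generate(["Hello, my name is"])
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```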
Controlling generation
You can control language generation by passing `SamplingParams`. For example, you can use greedy sampling by setting `temperature=0`:
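A sketch of greedy sampling with `SamplingParams` (model name and token budget are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model

# temperature=0 disables random sampling, making generation greedy
# and deterministic; max_tokens caps the output length.
params = SamplingParams(temperature=0, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
</imports>
```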
Beam search
The `beam_search()` method implements beam search on top of `generate()`:
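A sketch of beam search, assuming `BeamSearchParams` lives in `vllm.sampling_params` and that `beam_search()` accepts text prompts (check your vLLM version's signature, as it has changed across releases):

```python
from vllm import LLM
from vllm.sampling_params import BeamSearchParams

llm = LLM(model="facebook/opt-125m")  # example model

# beam_width controls how many hypotheses are kept at each step;
# max_tokens caps the length of each hypothesis.
params = BeamSearchParams(beam_width=4, max_tokens=32)

outputs = llm.beam_search([{"prompt": "The future of AI is"}], params)
```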
Chat interface
The `chat()` method implements chat functionality on top of `generate()`. It accepts input similar to the OpenAI Chat Completions API and automatically applies the model's chat template.
Only instruction-tuned models have a chat template. Base models may perform poorly as they are not trained to respond to chat conversations.
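A sketch of the chat interface, using an illustrative instruction-tuned model; the messages follow the OpenAI Chat Completions format:

```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example instruct model

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me a joke."},
]

# chat() applies the model's chat template before generating.
outputs = llm.chat(conversation)
print(outputs[0].outputs[0].text)
```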
Custom chat templates
If the model doesn’t have a chat template or you want to specify another one:
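A sketch of supplying your own chat template via the `chat_template` argument of `chat()`; the Jinja2 template below is a deliberately simple, made-up example:

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # example base model without a chat template

# A minimal hypothetical Jinja2 chat template that prefixes each
# message with its role and ends with an assistant turn.
custom_template = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)

outputs = llm.chat(
    [{"role": "user", "content": "Hello!"}],
    chat_template=custom_template,
)
```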
Online serving
Our OpenAI-Compatible Server provides endpoints that correspond to the offline APIs:

- Completions API - Similar to `LLM.generate()`, but only accepts text
- Chat API - Similar to `LLM.chat()`, accepting both text and multi-modal inputs for models with a chat template
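A sketch of querying a running vLLM server with the official OpenAI Python client; the base URL and model name are illustrative and must match how you started the server (e.g. `vllm serve <model>`):

```python
from openai import OpenAI

# vLLM's server does not require a real API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```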
Supported architectures
The following table lists text-only generative model architectures natively supported by vLLM. For the complete and up-to-date list, including example HuggingFace models, LoRA support, and pipeline parallelism status, see the supported models page.
Popular architectures
| Architecture | Models | Example |
|---|---|---|
| `LlamaForCausalLM` | Llama 3.1, Llama 3, Llama 2, Yi | meta-llama/Meta-Llama-3.1-405B-Instruct |
| `Qwen2ForCausalLM` | QwQ, Qwen2 | Qwen/Qwen2-7B-Instruct |
| `MistralForCausalLM` | Ministral-3, Mistral, Mistral-Instruct | mistralai/Mistral-7B-v0.1 |
| `Gemma2ForCausalLM` | Gemma 2 | google/gemma-2-9b |
| `DeepseekV3ForCausalLM` | DeepSeek-V3 | deepseek-ai/DeepSeek-V3 |
| `MixtralForCausalLM` | Mixtral-8x7B | mistralai/Mixtral-8x7B-v0.1 |
| `GPT2LMHeadModel` | GPT-2 | openai-community/gpt2 |
| `BloomForCausalLM` | BLOOM, BLOOMZ | bigscience/bloom |
| `OPTForCausalLM` | OPT, OPT-IML | facebook/opt-66b |
| `Phi3ForCausalLM` | Phi-4, Phi-3 | microsoft/Phi-4-mini-instruct |
Mixture-of-experts models
vLLM supports various MoE architectures:

- DeepSeek: DeepSeek-V2, DeepSeek-V3, DeepSeek-R1
- Mixtral: Mixtral-8x7B, Mixtral-8x22B
- Qwen: Qwen2MoE, Qwen3MoE
- DBRX: DBRX Base and Instruct
- Arctic: Snowflake Arctic
- Granite: Granite MoE variants
- GLM: GLM-4.5, GLM-4.6, GLM-4.7
State space models
vLLM supports efficient sequence modeling architectures:

- Mamba: `MambaForCausalLM` (Mamba, Mamba2)
- Jamba: `JambaForCausalLM` (AI21 Jamba)
- Falcon Mamba: `FalconMambaForCausalLM`
Specialized architectures
Code models
- StarCoder: `GPTBigCodeForCausalLM`
- Starcoder2: `Starcoder2ForCausalLM`
- WizardCoder: Based on StarCoder
- IQuest Coder: `IQuestCoderForCausalLM`
Chinese/multilingual models
- ChatGLM: `ChatGLMModel`
- Baichuan: `BaiChuanForCausalLM`
- InternLM: `InternLMForCausalLM`, `InternLM2ForCausalLM`, `InternLM3ForCausalLM`
- Qwen: Multiple generations supported
- Ernie: `Ernie4_5ForCausalLM`, `Ernie4_5_MoeForCausalLM`
Long-context models
- LongCat-Flash: `LongcatFlashForCausalLM`
- Kimi-Linear: `KimiLinearForCausalLM` (48B parameters)
Lightweight models
- MiniCPM: `MiniCPMForCausalLM`, `MiniCPM3ForCausalLM`
- SmolLM3: `SmolLM3ForCausalLM`
- Gemma 3: `Gemma3ForCausalLM` (1B-27B)
Performance features
vLLM generative models support:

- PagedAttention - Efficient memory management for KV cache
- Continuous batching - Dynamic batching of requests
- Quantization - GPTQ, AWQ, INT4/8, FP8 (see Quantization)
- Tensor parallelism - Distribute model across GPUs
- Pipeline parallelism - Split model across pipeline stages
- LoRA adapters - Multi-LoRA support for fine-tuned models
- Speculative decoding - Faster generation with draft models
- Prefix caching - Cache common prompt prefixes
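Several of these features can be enabled directly when constructing `LLM`. A sketch, with an illustrative model and GPU count:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # example model
    tensor_parallel_size=4,       # shard the model weights across 4 GPUs
    enable_prefix_caching=True,   # reuse KV cache for shared prompt prefixes
    gpu_memory_utilization=0.9,   # fraction of GPU memory for weights + KV cache
)
```

Tensor parallelism and prefix caching are configured at construction time because they determine how the weights are sharded and how the KV cache is managed for the lifetime of the engine.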
Next steps
- Multimodal models - Vision, audio, and video language models
- Quantization - Reduce model memory with quantization
- LoRA adapters - Serve multiple fine-tuned variants
- OpenAI API - Serve models with an OpenAI-compatible API