Many decoder language models can now be automatically loaded using the Transformers modeling backend without having to implement them in vLLM. Try running `vllm serve <model>` first to see if it works!

## Overview
vLLM models are specialized PyTorch models that take advantage of various vLLM features to optimize their performance. The complexity of integrating a model into vLLM depends heavily on the model's architecture:

- Simple: the model shares a similar architecture with an existing vLLM model
- Moderate: the model has standard components but a unique architecture
- Complex: the model includes new operators (e.g., a new attention mechanism)
## Step 1: Bring your model code
First, clone the PyTorch model code from the source repository. For instance, vLLM's OPT model was adapted from HuggingFace's modeling_opt.py file.

## Step 2: Make your code compatible with vLLM

### Initialization code
All vLLM modules within the model must include a `prefix` argument in their constructor. The prefix is typically the full name of the module in the model's state dictionary and is crucial for:

- Runtime support: vLLM's attention operators are registered in a model's state by their full names
- Non-uniform quantization support: quantized checkpoints can selectively quantize certain layers. By providing the `prefix` during initialization, vLLM can match the current layer's prefix with the quantization configuration
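As an illustration, a prefix is usually threaded through nested constructors so that each submodule knows its full name in the state dict. This is a minimal pure-Python sketch with hypothetical class names, not vLLM's actual modules:

```python
# Hypothetical classes sketching how a `prefix` argument propagates through
# a model's constructors; the naming scheme shown is illustrative.
class DecoderLayer:
    def __init__(self, prefix: str):
        # Each submodule derives its full state-dict name from the parent's prefix.
        self.attn_prefix = f"{prefix}.self_attn"
        self.mlp_prefix = f"{prefix}.mlp"


class Model:
    def __init__(self, num_layers: int, prefix: str = "model"):
        self.layers = [
            DecoderLayer(prefix=f"{prefix}.layers.{i}") for i in range(num_layers)
        ]


m = Model(num_layers=2)
print(m.layers[1].attn_prefix)  # model.layers.1.self_attn
```

With full names available at construction time, a quantization config that targets, say, only `model.layers.1.self_attn` can be matched against each layer's prefix.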
### Computation code

Add an `embed_input_ids` method inside your model module that returns the text embeddings given `input_ids`. This provides a unified interface in case your model is used within a composite multimodal model.

Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings. If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
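The `embed_input_ids` contract can be sketched in plain PyTorch; the module name and sizes here are illustrative, not a vLLM requirement:

```python
import torch
import torch.nn as nn


class MyModel(nn.Module):
    """Illustrative module exposing the embed_input_ids interface."""

    def __init__(self, vocab_size: int = 32000, hidden_size: int = 64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

    def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Return the text embeddings for the given token ids.
        return self.embed_tokens(input_ids)


ids = torch.tensor([[1, 2, 3]])
emb = MyModel().embed_input_ids(ids)
print(emb.shape)  # torch.Size([1, 3, 64])
```

A composite multimodal model can then call this method to obtain text embeddings before merging them with embeddings from other modalities.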
## Step 3: Implement tensor parallelism and quantization support

If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.

### Parallel layers
Replace your model's linear and embedding layers with their tensor-parallel versions:

| Layer | Purpose | Usage |
|---|---|---|
| `VocabParallelEmbedding` | Embedding layer | Input embeddings |
| `ParallelLMHead` | Output layer | LM head |
| `ReplicatedLinear` | Replicated linear | No memory saving; inputs and weights replicated |
| `RowParallelLinear` | Row-parallel linear | Second FFN layer, attention output |
| `ColumnParallelLinear` | Column-parallel linear | First FFN layer, QKV projection |
| `MergedColumnParallelLinear` | Merged column-parallel linear | First FFN layer with weighted activation |
| `QKVParallelLinear` | QKV projection | Multi-head and grouped-query attention |
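The column- and row-parallel schemes in the table can be verified with plain tensor algebra. This sketch shows only the math; the real layers shard the weights across ranks and use an all-reduce or all-gather for the combination step:

```python
import torch

tp = 2                          # simulated tensor-parallel degree
x = torch.randn(3, 8)           # (batch, in_features)
W = torch.randn(16, 8)          # (out_features, in_features)

# Column-parallel: split the output dimension. Each "rank" computes a slice
# of the output; the slices are concatenated.
col_shards = W.chunk(tp, dim=0)
y_col = torch.cat([x @ w.t() for w in col_shards], dim=-1)

# Row-parallel: split the input dimension. Each "rank" computes a partial
# product from its slice of the input; the partials are summed (an
# all-reduce in a real distributed implementation).
row_shards = W.chunk(tp, dim=1)
x_shards = x.chunk(tp, dim=1)
y_row = sum(xs @ w.t() for xs, w in zip(x_shards, row_shards))

# Both schemes reproduce the unsharded matmul.
assert torch.allclose(y_col, x @ W.t(), atol=1e-5)
assert torch.allclose(y_row, x @ W.t(), atol=1e-5)
```

This is why the first FFN layer (whose output feeds directly into the second) is typically column-parallel while the second is row-parallel: the intermediate activations stay sharded and only one all-reduce is needed per FFN block.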
### Linear method for quantization

All of the linear layers above take `linear_method` as an input. vLLM sets this parameter according to the quantization scheme in use to support weight quantization.
## Step 4: Implement weight loading logic

Implement the `load_weights` method in your `*ForCausalLM` class. This method should:

- Load weights from the HuggingFace checkpoint file
- Assign them to the corresponding layers in your model
- Handle merged layers (`MergedColumnParallelLinear`, `QKVParallelLinear`) by loading the separated weight matrices into the correct shards
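A common pattern is a mapping from checkpoint weight names to merged-parameter shards. The toy objects below sketch that pattern only; the class names and the `weight_loader` hook are illustrative stand-ins, not vLLM's exact API:

```python
class FakeParam:
    """Toy stand-in for a merged parameter that accepts per-shard loads."""

    def __init__(self):
        self.shards = {}

    def weight_loader(self, param, loaded_weight, shard_id):
        param.shards[shard_id] = loaded_weight


class FakeModel:
    # (merged param name, checkpoint weight name, shard id) -- the separate
    # q/k/v checkpoint tensors all land in one merged qkv parameter.
    stacked_params_mapping = [
        ("qkv_proj", "q_proj", "q"),
        ("qkv_proj", "k_proj", "k"),
        ("qkv_proj", "v_proj", "v"),
    ]

    def __init__(self):
        self.params = {"layers.0.qkv_proj.weight": FakeParam()}

    def load_weights(self, weights):
        for name, loaded_weight in weights:
            for merged_name, ckpt_name, shard_id in self.stacked_params_mapping:
                if ckpt_name in name:
                    param = self.params[name.replace(ckpt_name, merged_name)]
                    param.weight_loader(param, loaded_weight, shard_id)
                    break


m = FakeModel()
m.load_weights([
    ("layers.0.q_proj.weight", "Wq"),
    ("layers.0.k_proj.weight", "Wk"),
    ("layers.0.v_proj.weight", "Wv"),
])
print(m.params["layers.0.qkv_proj.weight"].shards)
# {'q': 'Wq', 'k': 'Wk', 'v': 'Wv'}
```

The same name-rewriting trick handles `MergedColumnParallelLinear` (e.g. gate and up projections merged into one parameter).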
## Step 5: Special model architectures

### Models with interleaving sliding windows

To support a model with interleaving sliding windows:

- Make sure the model's `config.json` contains `layer_types`
- In the modeling code, parse the correct sliding window value for every layer and pass it to the attention layer's `per_layer_sliding_window` argument
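Parsing a per-layer window from `layer_types` might look like the following sketch. The `"sliding_attention"`/`"full_attention"` values follow a common HuggingFace convention, but the exact strings can differ per model, so treat them as assumptions:

```python
# Illustrative config.json fragment with interleaved layer types.
config = {
    "layer_types": ["sliding_attention", "full_attention"] * 2,
    "sliding_window": 4096,
}


def sliding_window_for_layer(config: dict, layer_idx: int):
    """Return the window size for a layer, or None for full attention."""
    if config["layer_types"][layer_idx] == "sliding_attention":
        return config["sliding_window"]
    return None


windows = [sliding_window_for_layer(config, i) for i in range(4)]
print(windows)  # [4096, None, 4096, None]
```

Each value would then be passed as the attention layer's `per_layer_sliding_window` argument when constructing that layer.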
### Models that use Mamba

vLLM supports three different scenarios:

- Mamba-only models
- Hybrid Mamba + Attention models
- Custom Mamba-like layers

For models that use Mamba layers (Mamba-1 or Mamba-2) but do not use attention layers:

- Inherit the `IsAttentionFree` protocol
- Implement the class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config`
- Use the `MambaMixer` (Mamba-1) or `MambaMixer2` (Mamba-2) classes
- Add the model to `MODELS_CONFIG_MAP` in `vllm/model_executor/models/config.py`

See `MambaForCausalLM` (Mamba-1) or `Mamba2ForCausalLM` (Mamba-2) for reference implementations.

## Next steps
After implementing your model:

- Model registration: register your model with vLLM
- Testing guide: write tests for your model
- Multimodal support: add multimodal capabilities