How pooling models work
In vLLM, pooling models implement the `VllmModelForPooling` interface. These models use a `Pooler` to extract the final hidden states from the input before returning them.
Pooling model support is currently focused on convenience rather than performance; optimizations are planned. Please comment on #21796 if you have suggestions.
Configuration
Model runner
Run a model in pooling mode via `--runner pooling`.
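For example (model name illustrative; any supported pooling model works):

```shell
# --runner auto usually detects pooling from the model architecture;
# set it explicitly to force pooling mode.
vllm serve BAAI/bge-base-en-v1.5 --runner pooling
```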
In most cases, you don’t need to set this, as vLLM automatically detects the appropriate model runner via `--runner auto`.

Model conversion
vLLM can adapt models for various pooling tasks via `--convert <type>`:
| Architecture | Convert Type | Supported Tasks |
|---|---|---|
| `*Model`, `*EmbeddingModel` | `embed` | `token_embed`, `embed` |
| `*RewardModel` | `embed` | `token_embed`, `embed` |
| `*ClassificationModel` | `classify` | `token_classify`, `classify`, `score` |
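For example, applying a conversion explicitly at serve time (model name illustrative; `--convert auto` normally infers the type from the architecture suffix):

```shell
# Qwen2ForSequenceClassification matches *ClassificationModel, so the
# classify adapter applies and enables token_classify/classify/score.
vllm serve mixedbread-ai/mxbai-rerank-base-v2 --convert classify
```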
Pooling tasks
Each pooling model supports one or more of the following tasks:

| Task | APIs |
|---|---|
| `embed` | `LLM.embed(...)`, `LLM.encode(..., pooling_task="embed")` |
| `classify` | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")` |
| `score` | `LLM.score(...)` |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")` |
| `token_embed` | `LLM.encode(..., pooling_task="token_embed")` |
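As a sketch, the generic `LLM.encode(...)` entry point can serve several of these tasks from one model (model name illustrative; constructor arguments such as `runner` may vary across vLLM versions):

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-base-en-v1.5", runner="pooling")

# pooling_task selects which pooler output comes back.
sentence = llm.encode(["vLLM is fast."], pooling_task="embed")
tokens = llm.encode(["vLLM is fast."], pooling_task="token_embed")

print(sentence[0].outputs.data.shape)  # one pooled vector: (hidden_size,)
print(tokens[0].outputs.data.shape)    # per-token: (num_tokens, hidden_size)
```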
Embedding models
Embedding models convert text into dense vector representations, useful for semantic search, clustering, and retrieval.

Basic usage
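A minimal offline sketch (model name taken from the table below; exact constructor arguments may differ by vLLM version):

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-base-en-v1.5", runner="pooling")

outputs = llm.embed(["Hello, my name is", "The capital of France is"])
for out in outputs:
    embedding = out.outputs.embedding  # list[float]
    print(len(embedding))              # the model's embedding size
```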
Supported architectures
| Architecture | Models | Example |
|---|---|---|
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5` |
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2` |
| `LlamaModel` | Llama-based | `intfloat/e5-mistral-7b-instruct` |
| `Qwen2Model` | Qwen2-based | `Alibaba-NLP/gte-Qwen2-7B-instruct` |
| `Qwen3Model` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B` |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1` |
| `ModernBertModel` | ModernBERT | `Alibaba-NLP/gte-modernbert-base` |
| `GteNewModel` | mGTE-TRM | `Alibaba-NLP/gte-multilingual-base` |
Matryoshka embeddings
Matryoshka Representation Learning (MRL) allows trading off performance for cost by using smaller embedding dimensions.

Online serving
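A hedged online example: the OpenAI-compatible embeddings endpoint accepts a `dimensions` field, which only MRL-capable models honor (model name illustrative):

```shell
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jinaai/jina-embeddings-v3",
    "input": "Follow the white rabbit.",
    "dimensions": 128
  }'
```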
Classification models
Classification models output probability distributions over predefined classes.

Basic usage
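A minimal offline sketch with `LLM.classify` (model name from the table below; it is a Polish sentiment model, hence the Polish input):

```python
from vllm import LLM

llm = LLM(model="nie3e/sentiment-polish-gpt2-small", runner="pooling")

outputs = llm.classify(["Ten film był świetny!"])  # "This film was great!"
probs = outputs[0].outputs.probs  # probability per predefined class
print(probs)
```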
Supported architectures
| Architecture | Example |
|---|---|
| `JambaForSequenceClassification` | `ai21labs/Jamba-tiny-reward-dev` |
| `GPT2ForSequenceClassification` | `nie3e/sentiment-polish-gpt2-small` |
| `BertForSequenceClassification` | BERT classification models |
| `RobertaForSequenceClassification` | RoBERTa classification models |
Cross-encoder / Reranker models
Cross-encoders score the relevance between query-document pairs, used for reranking in RAG systems.

Basic usage
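A minimal offline sketch with `LLM.score` (model name from the table below):

```python
from vllm import LLM

llm = LLM(model="cross-encoder/ms-marco-MiniLM-L-6-v2", runner="pooling")

query = "What is the capital of France?"
docs = ["Paris is the capital of France.",
        "The Great Wall is in China."]

outputs = llm.score(query, docs)
for doc, out in zip(docs, outputs):
    print(f"{out.outputs.score:.3f}  {doc}")  # higher = more relevant
```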
Supported architectures
| Architecture | Models | Example |
|---|---|---|
| `BertForSequenceClassification` | BERT cross-encoders | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa | `BAAI/bge-reranker-v2-m3` |
| `Qwen2ForSequenceClassification` | Qwen2 reranker | `mixedbread-ai/mxbai-rerank-base-v2` |
| `Qwen3ForSequenceClassification` | Qwen3 reranker | `Qwen/Qwen3-Reranker-0.6B` |
| `GemmaForSequenceClassification` | Gemma reranker | `BAAI/bge-reranker-v2-gemma` |
| `LlamaBidirectionalForSequenceClassification` | Llama bidirectional | `nvidia/llama-nemotron-rerank-1b-v2` |
Some reranker models require specific prompt formats. Check the model’s documentation and use score templates when needed. Example templates are available in `examples/pooling/score/template/`.

Online serving
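A hedged online example (model name from the table above; the `/score` payload shape follows vLLM's OpenAI-compatible server and may differ between versions):

```shell
vllm serve BAAI/bge-reranker-v2-m3

curl -X POST http://localhost:8000/score \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "text_1": "What is the capital of France?",
    "text_2": ["Paris is the capital of France.", "Berlin is in Germany."]
  }'
```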
Reward models
Reward models assign scalar scores to text, used in reinforcement learning from human feedback (RLHF).

Basic usage
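A minimal offline sketch with the `LLM.reward` entry point listed under pooling tasks above (model name from the table below; InternLM2 checkpoints need `trust_remote_code`):

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-7b-reward",
          runner="pooling",
          trust_remote_code=True)

outputs = llm.reward(["vLLM makes serving pooling models easy."])
rewards = outputs[0].outputs.data  # per-token scores; reduce per the model card
print(rewards.shape)
```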
Supported architectures
| Architecture | Models | Example |
|---|---|---|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-7b-reward` |
| `LlamaForCausalLM` | Llama-based PRM | `peiyi9979/math-shepherd-mistral-7b-prm` |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B` |
| `Qwen2ForProcessRewardModel` | Qwen2 PRM | `Qwen/Qwen2.5-Math-PRM-7B` |
ColBERT late interaction models
ColBERT uses per-token embeddings and MaxSim scoring for document ranking, providing better accuracy than single-vector embeddings while being more efficient than cross-encoders.

Supported architectures
| Architecture | Backbone | Example |
|---|---|---|
| `HF_ColBERT` | BERT | `answerdotai/answerai-colbert-small-v1` |
| `ColBERTModernBertModel` | ModernBERT | `lightonai/GTE-ModernColBERT-v1` |
| `ColBERTJinaRobertaModel` | Jina XLM-RoBERTa | `jinaai/jina-colbert-v2` |
Usage
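The MaxSim score itself is simple enough to sketch in plain Python with toy 2-d vectors (hypothetical numbers; real per-token embeddings come from the `token_embed` pooling task):

```python
def maxsim(query_embs, doc_embs):
    """ColBERT late-interaction score: for each query token embedding,
    take the maximum dot product over all document token embeddings,
    then sum over the query tokens."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

query = [[1.0, 0.0], [0.0, 1.0]]  # two query token embeddings
doc_a = [[1.0, 0.0], [0.5, 0.5]]  # aligns with both query tokens
doc_b = [[0.0, 1.0]]              # aligns with the second token only

print(maxsim(query, doc_a))  # 1.5
print(maxsim(query, doc_b))  # 1.0
```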
BERT-based models work out of the box.

Token classification models
Models for Named Entity Recognition (NER) and other token-level tasks.

Supported architectures
| Architecture | Example |
|---|---|
| `BertForTokenClassification` | `boltuix/NeuroBERT-NER` |
| `ModernBertForTokenClassification` | `disham993/electrical-ner-ModernBERT-base` |
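A sketch using the generic `token_classify` task (model name from the table above; mapping label ids back to NER tags via the checkpoint's `id2label` config is assumed and not shown):

```python
from vllm import LLM

llm = LLM(model="boltuix/NeuroBERT-NER", runner="pooling")

outputs = llm.encode(["Barack Obama visited Paris."],
                     pooling_task="token_classify")
logits = outputs[0].outputs.data   # (num_tokens, num_labels)
predicted = logits.argmax(dim=-1)  # one label id per input token
print(predicted)
```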
Best practices
Choosing the right model type
- Embedding models: Semantic search, clustering, general similarity
- Cross-encoders: High-accuracy reranking (slower but more accurate)
- ColBERT: Balance between accuracy and efficiency for ranking
- Reward models: RLHF, preference learning
- Classification: Sentiment analysis, topic classification
Performance optimization
- Batch requests for higher throughput
- Use Matryoshka embeddings with smaller dimensions when appropriate
- Consider model size vs accuracy tradeoffs
- Enable quantization for large models (FP8, INT8)
Integration patterns
- Use embeddings with vector databases (Pinecone, Weaviate, Milvus)
- Implement two-stage retrieval: embedding for recall, cross-encoder for reranking
- Cache embeddings for frequently accessed content
- Use reward models for RLHF training pipelines
Next steps
- Generative models: text generation and chat models
- Multimodal models: vision and audio models
- OpenAI API: serve pooling models via the API
- RAG applications: build retrieval-augmented generation