vLLM supports pooling models including embedding models, classification models, reward models, and cross-encoders. These models extract representations from text rather than generating new text.

How pooling models work

In vLLM, pooling models implement the VllmModelForPooling interface. These models use a Pooler to aggregate the final hidden states of the input before returning them.
Pooling model support is currently focused on convenience rather than performance; optimizations are planned. Please comment on #21796 if you have suggestions.

Configuration

Model runner

Run a model in pooling mode via --runner pooling.
In most cases you don't need to set this, as vLLM detects the appropriate model runner automatically (--runner auto, the default).

Model conversion

vLLM can adapt models for various pooling tasks via --convert <type>:
| Architecture | Convert Type | Supported Tasks |
|---|---|---|
| `*Model`, `*EmbeddingModel` | `embed` | `token_embed`, `embed` |
| `*RewardModel` | `embed` | `token_embed`, `embed` |
| `*ClassificationModel` | `classify` | `token_classify`, `classify`, `score` |

Pooling tasks

Each pooling model supports one or more tasks:
| Task | APIs |
|---|---|
| `embed` | `LLM.embed(...)`, `LLM.encode(..., pooling_task="embed")` |
| `classify` | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")` |
| `score` | `LLM.score(...)` |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")` |
| `token_embed` | `LLM.encode(..., pooling_task="token_embed")` |

Embedding models

Embedding models convert text into dense vector representations, useful for semantic search, clustering, and retrieval.

Basic usage

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling")
outputs = llm.embed("Hello, my name is")

embeds = outputs[0].outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
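Downstream, embeddings are usually compared with cosine similarity. A minimal sketch over hypothetical vectors (the `embedding` field above is a plain list of floats, so no extra libraries are needed):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, e.g. outputs[i].outputs.embedding from llm.embed(...)
query_vec = [0.1, 0.3, -0.2]
doc_vec = [0.2, 0.1, -0.1]
print(f"similarity = {cosine_similarity(query_vec, doc_vec):.3f}")
```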

Supported architectures

| Architecture | Models | Example |
|---|---|---|
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5` |
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2` |
| `LlamaModel` | Llama-based | `intfloat/e5-mistral-7b-instruct` |
| `Qwen2Model` | Qwen2-based | `Alibaba-NLP/gte-Qwen2-7B-instruct` |
| `Qwen3Model` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B` |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1` |
| `ModernBertModel` | ModernBERT | `Alibaba-NLP/gte-modernbert-base` |
| `GteNewModel` | mGTE-TRM | `Alibaba-NLP/gte-multilingual-base` |

Matryoshka embeddings

Matryoshka Representation Learning (MRL) allows trading off performance for cost by using smaller embedding dimensions.
```python
from vllm import LLM, PoolingParams

llm = LLM(
    model="jinaai/jina-embeddings-v3",
    runner="pooling",
    trust_remote_code=True,
)
outputs = llm.embed(
    ["Follow the white rabbit."],
    pooling_params=PoolingParams(dimensions=32),
)
print(outputs[0].outputs)
```
Not all embedding models support Matryoshka; vLLM returns an error if you request custom `dimensions` on an unsupported model. To enable Matryoshka manually for a compatible model:

```bash
vllm serve model-name --hf-overrides '{"is_matryoshka": true}'
```
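Conceptually, MRL keeps only the first k dimensions of the full embedding and re-normalizes the result (vLLM's `dimensions` parameter does this server-side). A sketch with a hypothetical full-size vector:

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` dimensions and L2-normalize the result."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.4, 0.2, -0.1, 0.3, 0.5, -0.2]  # hypothetical full embedding
small = truncate_embedding(full, 4)
print(len(small))  # 4 dimensions, unit norm
```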

Online serving

```bash
vllm serve intfloat/e5-small
```

```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, my name is",
    "model": "intfloat/e5-small"
  }'
```

Classification models

Classification models output probability distributions over predefined classes.

Basic usage

```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
outputs = llm.classify("Hello, my name is")

probs = outputs[0].outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```
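The `probs` field is a list of per-class probabilities, so mapping it to a label is a simple argmax. A sketch with hypothetical probabilities and a hypothetical label mapping (real label names come from the model's config):

```python
# Hypothetical class probabilities, e.g. outputs[0].outputs.probs
probs = [0.1, 0.9]
id2label = {0: "negative", 1: "positive"}  # hypothetical label mapping

# Pick the class with the highest probability.
best = max(range(len(probs)), key=lambda i: probs[i])
print(f"label={id2label[best]} p={probs[best]:.2f}")
```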

Supported architectures

| Architecture | Example |
|---|---|
| `JambaForSequenceClassification` | `ai21labs/Jamba-tiny-reward-dev` |
| `GPT2ForSequenceClassification` | `nie3e/sentiment-polish-gpt2-small` |
| `BertForSequenceClassification` | BERT classification models |
| `RobertaForSequenceClassification` | RoBERTa classification models |

Cross-encoder / Reranker models

Cross-encoders score the relevance between query-document pairs, used for reranking in RAG systems.

Basic usage

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
outputs = llm.score(
    "What is the capital of France?",
    "The capital of Brazil is Brasilia.",
)

score = outputs[0].outputs.score
print(f"Score: {score}")
```
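`LLM.score` also accepts a list of documents against a single query, and the resulting scores can be sorted to rank them. A sketch assuming hypothetical scores were returned for three documents:

```python
docs = [
    "The capital of France is Paris.",
    "The capital of Brazil is Brasilia.",
    "Paris is famous for the Eiffel Tower.",
]
# Hypothetical relevance scores, e.g. [o.outputs.score for o in llm.score(query, docs)]
scores = [0.98, 0.01, 0.35]

# Sort documents by descending relevance.
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.2f}  {doc}")
```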

Supported architectures

| Architecture | Models | Example |
|---|---|---|
| `BertForSequenceClassification` | BERT cross-encoders | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa | `BAAI/bge-reranker-v2-m3` |
| `Qwen2ForSequenceClassification` | Qwen2 reranker | `mixedbread-ai/mxbai-rerank-base-v2` |
| `Qwen3ForSequenceClassification` | Qwen3 reranker | `Qwen/Qwen3-Reranker-0.6B` |
| `GemmaForSequenceClassification` | Gemma reranker | `BAAI/bge-reranker-v2-gemma` |
| `LlamaBidirectionalForSequenceClassification` | Llama bidirectional | `nvidia/llama-nemotron-rerank-1b-v2` |
Some reranker models require specific prompt formats. Check the model’s documentation and use score templates when needed. Example templates are available in examples/pooling/score/template/.

Online serving

```bash
vllm serve BAAI/bge-reranker-v2-m3
```

```bash
curl http://localhost:8000/score \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "text_1": "What is machine learning?",
    "text_2": ["ML is a subset of AI.", "The weather is sunny."]
  }'
```

Reward models

Reward models assign scalar scores to text, used in reinforcement learning from human feedback (RLHF).

Basic usage

```python
from vllm import LLM

llm = LLM(
    model="internlm/internlm2-1_8b-reward",
    runner="pooling",
    trust_remote_code=True,
)
outputs = llm.reward("Hello, my name is")

data = outputs[0].outputs.data
print(f"Data: {data!r}")
```

Supported architectures

| Architecture | Models | Example |
|---|---|---|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-7b-reward` |
| `LlamaForCausalLM` | Llama-based PRM | `peiyi9979/math-shepherd-mistral-7b-prm` |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B` |
| `Qwen2ForProcessRewardModel` | Qwen2 PRM | `Qwen/Qwen2.5-Math-PRM-7B` |
For process-supervised reward models, set the pooling config explicitly. In the Python API, pass a `PoolerConfig` object rather than a JSON string (the JSON form is for the `--pooler-config` CLI flag); the token IDs below are placeholders and are model-specific:

```python
from vllm import LLM
from vllm.config import PoolerConfig

llm = LLM(
    model="peiyi9979/math-shepherd-mistral-7b-prm",
    runner="pooling",
    pooler_config=PoolerConfig(
        pooling_type="STEP",
        step_tag_id=123,
        returned_token_ids=[456, 789],
    ),
)
```

ColBERT late interaction models

ColBERT uses per-token embeddings and MaxSim scoring for document ranking, providing better accuracy than single-vector embeddings while being more efficient than cross-encoders.
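MaxSim scoring takes, for each query token embedding, its maximum similarity against all document token embeddings, then sums over query tokens. A minimal pure-Python sketch with hypothetical (already normalized) per-token vectors:

```python
def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """ColBERT-style MaxSim: for each query token, take its best dot product
    against all document tokens, then sum over query tokens."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Hypothetical per-token embeddings for a 2-token query and a 2-token document
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.9, 0.1], [0.2, 0.8]]
print(maxsim(query, doc))  # 0.9 + 0.8 = 1.7
```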

Supported architectures

| Architecture | Backbone | Example |
|---|---|---|
| `HF_ColBERT` | BERT | `answerdotai/answerai-colbert-small-v1` |
| `ColBERTModernBertModel` | ModernBERT | `lightonai/GTE-ModernColBERT-v1` |
| `ColBERTJinaRobertaModel` | Jina XLM-RoBERTa | `jinaai/jina-colbert-v2` |

Usage

BERT-based models work out of the box:

```bash
vllm serve answerdotai/answerai-colbert-small-v1
```

For non-BERT backbones, specify the architecture:

```bash
vllm serve lightonai/GTE-ModernColBERT-v1 \
  --hf-overrides '{"architectures": ["ColBERTModernBertModel"]}'
```

Rerank documents:

```bash
curl http://localhost:8000/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "answerdotai/answerai-colbert-small-v1",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of AI.",
      "Python is a programming language."
    ]
  }'
```

Token classification models

Models for Named Entity Recognition (NER) and other token-level tasks.

Supported architectures

| Architecture | Example |
|---|---|
| `BertForTokenClassification` | `boltuix/NeuroBERT-NER` |
| `ModernBertForTokenClassification` | `disham993/electrical-ner-ModernBERT-base` |
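Token-classification outputs are per-token label predictions; turning predicted BIO tags into entity spans is a small post-processing step. A sketch over hypothetical tokens and predicted tags (the tag names here are illustrative, not from a specific model):

```python
def decode_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group BIO tags into (entity_text, entity_type) spans."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:  # "O" or a mismatched "I-" tag closes any open span
            if current:
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:
        spans.append((" ".join(current), ctype))
    return spans

tokens = ["The", "resistor", "limits", "current"]
tags = ["O", "B-COMPONENT", "O", "O"]  # hypothetical model predictions
print(decode_bio(tokens, tags))  # [('resistor', 'COMPONENT')]
```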

Best practices

Choosing a model:

  • Embedding models: semantic search, clustering, general similarity
  • Cross-encoders: high-accuracy reranking (slower but more accurate)
  • ColBERT: a balance between accuracy and efficiency for ranking
  • Reward models: RLHF and preference learning
  • Classification models: sentiment analysis, topic classification

Performance and deployment:

  • Batch requests for higher throughput
  • Use Matryoshka embeddings with smaller dimensions when appropriate
  • Consider model size vs. accuracy tradeoffs
  • Enable quantization (FP8, INT8) for large models
  • Store embeddings in a vector database (Pinecone, Weaviate, Milvus)
  • Implement two-stage retrieval: embeddings for recall, a cross-encoder for reranking
  • Cache embeddings for frequently accessed content
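The two-stage retrieval pattern above can be sketched with stub scorers standing in for embedding similarity and `LLM.score` (the stubs below are token-overlap placeholders, not real models):

```python
def embed_score(query: str, doc: str) -> float:
    """Stub for stage 1: cheap similarity (e.g. cosine over embeddings)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def cross_encoder_score(query: str, doc: str) -> float:
    """Stub for stage 2: slower, more accurate relevance (e.g. LLM.score)."""
    return embed_score(query, doc)  # placeholder; a real reranker goes here

def two_stage_search(query: str, corpus: list[str],
                     recall_k: int = 10, top_k: int = 3) -> list[str]:
    # Stage 1: recall candidates with the cheap scorer.
    candidates = sorted(corpus, key=lambda d: embed_score(query, d),
                        reverse=True)[:recall_k]
    # Stage 2: rerank only the candidates with the expensive scorer.
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_k]

corpus = ["ml is fun", "the sky is blue", "ml models learn from data"]
print(two_stage_search("ml data", corpus, recall_k=2, top_k=1))
```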

Next steps

  • Generative models: text generation and chat models
  • Multimodal models: vision and audio models
  • OpenAI API: serve pooling models via the API
  • RAG applications: build retrieval-augmented generation
