How pooling models work
In vLLM, pooling models implement the `VllmModelForPooling` interface. These models use a `Pooler` to extract the final hidden states from the input before returning them.
Pooling model support is currently focused on convenience rather than performance; optimizations are planned. Please comment on #21796 if you have suggestions.
Configuration
Model runner
Run a model in pooling mode via `--runner pooling`.
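For example (model name illustrative; any supported pooling model works):

```shell
# --runner auto usually detects pooling from the model architecture;
# set it explicitly to force pooling mode.
vllm serve BAAI/bge-base-en-v1.5 --runner pooling
```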
In most cases, you don’t need to set this, as vLLM automatically detects the appropriate model runner via `--runner auto`.

Model conversion
vLLM can adapt models for various pooling tasks via `--convert <type>`:
| Architecture | Convert Type | Supported Tasks |
|---|---|---|
| `*Model`, `*EmbeddingModel` | `embed` | `token_embed`, `embed` |
| `*RewardModel` | `embed` | `token_embed`, `embed` |
| `*ClassificationModel` | `classify` | `token_classify`, `classify`, `score` |
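For example, applying a conversion explicitly at serve time (model name illustrative; `--convert auto` normally infers the type from the architecture suffix):

```shell
# Qwen2ForSequenceClassification matches *ClassificationModel, so the
# classify adapter applies and enables token_classify/classify/score.
vllm serve mixedbread-ai/mxbai-rerank-base-v2 --convert classify
```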
Pooling tasks
Each pooling model supports one or more of the following tasks:

| Task | APIs |
|---|---|
| `embed` | `LLM.embed(...)`, `LLM.encode(..., pooling_task="embed")` |
| `classify` | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")` |
| `score` | `LLM.score(...)` |
| `token_classify` | `LLM.reward(...)`, `LLM.encode(..., pooling_task="token_classify")` |
| `token_embed` | `LLM.encode(..., pooling_task="token_embed")` |
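As a sketch, the generic `LLM.encode(...)` entry point can serve several of these tasks from one model (model name illustrative; constructor arguments such as `runner` may vary across vLLM versions):

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-base-en-v1.5", runner="pooling")

# pooling_task selects which pooler output comes back.
sentence = llm.encode(["vLLM is fast."], pooling_task="embed")
tokens = llm.encode(["vLLM is fast."], pooling_task="token_embed")

print(sentence[0].outputs.data.shape)  # one pooled vector: (hidden_size,)
print(tokens[0].outputs.data.shape)    # per-token: (num_tokens, hidden_size)
```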
Embedding models
Embedding models convert text into dense vector representations, useful for semantic search, clustering, and retrieval.

Basic usage
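A minimal offline sketch (model name taken from the table below; exact constructor arguments may differ by vLLM version):

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-base-en-v1.5", runner="pooling")

outputs = llm.embed(["Hello, my name is", "The capital of France is"])
for out in outputs:
    embedding = out.outputs.embedding  # list[float]
    print(len(embedding))              # the model's embedding size
```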
Supported architectures
| Architecture | Models | Example |
|---|---|---|
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5` |
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2` |
| `LlamaModel` | Llama-based | `intfloat/e5-mistral-7b-instruct` |
| `Qwen2Model` | Qwen2-based | `Alibaba-NLP/gte-Qwen2-7B-instruct` |
| `Qwen3Model` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B` |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1` |
| `ModernBertModel` | ModernBERT | `Alibaba-NLP/gte-modernbert-base` |
| `GteNewModel` | mGTE-TRM | `Alibaba-NLP/gte-multilingual-base` |
Matryoshka embeddings
Matryoshka Representation Learning (MRL) allows trading off performance for cost by using smaller embedding dimensions.

Online serving
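A hedged online example: the OpenAI-compatible embeddings endpoint accepts a `dimensions` field, which only MRL-capable models honor (model name illustrative):

```shell
vllm serve jinaai/jina-embeddings-v3 --trust-remote-code

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jinaai/jina-embeddings-v3",
    "input": "Follow the white rabbit.",
    "dimensions": 128
  }'
```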
Classification models
Classification models output probability distributions over predefined classes.

Basic usage
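A minimal offline sketch with `LLM.classify` (model name from the table below; it is a Polish sentiment model, hence the Polish input):

```python
from vllm import LLM

llm = LLM(model="nie3e/sentiment-polish-gpt2-small", runner="pooling")

outputs = llm.classify(["Ten film był świetny!"])  # "This film was great!"
probs = outputs[0].outputs.probs  # probability per predefined class
print(probs)
```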
Supported architectures
| Architecture | Example |
|---|---|
| `JambaForSequenceClassification` | `ai21labs/Jamba-tiny-reward-dev` |
| `GPT2ForSequenceClassification` | `nie3e/sentiment-polish-gpt2-small` |
| `BertForSequenceClassification` | BERT classification models |
| `RobertaForSequenceClassification` | RoBERTa classification models |
Cross-encoder / Reranker models
Cross-encoders score the relevance between query-document pairs, used for reranking in RAG systems.

Basic usage
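A minimal offline sketch with `LLM.score` (model name from the table below):

```python
from vllm import LLM

llm = LLM(model="cross-encoder/ms-marco-MiniLM-L-6-v2", runner="pooling")

query = "What is the capital of France?"
docs = ["Paris is the capital of France.",
        "The Great Wall is in China."]

outputs = llm.score(query, docs)
for doc, out in zip(docs, outputs):
    print(f"{out.outputs.score:.3f}  {doc}")  # higher = more relevant
```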
Supported architectures
| Architecture | Models | Example |
|---|---|---|
| `BertForSequenceClassification` | BERT cross-encoders | `cross-encoder/ms-marco-MiniLM-L-6-v2` |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa | `BAAI/bge-reranker-v2-m3` |
| `Qwen2ForSequenceClassification` | Qwen2 reranker | `mixedbread-ai/mxbai-rerank-base-v2` |
| `Qwen3ForSequenceClassification` | Qwen3 reranker | `Qwen/Qwen3-Reranker-0.6B` |
| `GemmaForSequenceClassification` | Gemma reranker | `BAAI/bge-reranker-v2-gemma` |
| `LlamaBidirectionalForSequenceClassification` | Llama bidirectional | `nvidia/llama-nemotron-rerank-1b-v2` |
Some reranker models require specific prompt formats. Check the model’s documentation and use score templates when needed. Example templates are available in `examples/pooling/score/template/`.

Online serving
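A hedged online example (model name from the table above; the `/score` payload shape follows vLLM's OpenAI-compatible server and may differ between versions):

```shell
vllm serve BAAI/bge-reranker-v2-m3

curl -X POST http://localhost:8000/score \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "text_1": "What is the capital of France?",
    "text_2": ["Paris is the capital of France.", "Berlin is in Germany."]
  }'
```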
Reward models
Reward models assign scalar scores to text, used in reinforcement learning from human feedback (RLHF).

Basic usage
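A minimal offline sketch with the `LLM.reward` entry point listed under pooling tasks above (model name from the table below; InternLM2 checkpoints need `trust_remote_code`):

```python
from vllm import LLM

llm = LLM(model="internlm/internlm2-7b-reward",
          runner="pooling",
          trust_remote_code=True)

outputs = llm.reward(["vLLM makes serving pooling models easy."])
rewards = outputs[0].outputs.data  # per-token scores; reduce per the model card
print(rewards.shape)
```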
Supported architectures
| Architecture | Models | Example |
|---|---|---|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-7b-reward` |
| `LlamaForCausalLM` | Llama-based PRM | `peiyi9979/math-shepherd-mistral-7b-prm` |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B` |
| `Qwen2ForProcessRewardModel` | Qwen2 PRM | `Qwen/Qwen2.5-Math-PRM-7B` |
ColBERT late interaction models
ColBERT uses per-token embeddings and MaxSim scoring for document ranking, providing better accuracy than single-vector embeddings while being more efficient than cross-encoders.

Supported architectures
| Architecture | Backbone | Example |
|---|---|---|
| `HF_ColBERT` | BERT | `answerdotai/answerai-colbert-small-v1` |
| `ColBERTModernBertModel` | ModernBERT | `lightonai/GTE-ModernColBERT-v1` |
| `ColBERTJinaRobertaModel` | Jina XLM-RoBERTa | `jinaai/jina-colbert-v2` |
Usage
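The MaxSim score itself is simple enough to sketch in plain Python with toy 2-d vectors (hypothetical numbers; real per-token embeddings come from the `token_embed` pooling task):

```python
def maxsim(query_embs, doc_embs):
    """ColBERT late-interaction score: for each query token embedding,
    take the maximum dot product over all document token embeddings,
    then sum over the query tokens."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

query = [[1.0, 0.0], [0.0, 1.0]]  # two query token embeddings
doc_a = [[1.0, 0.0], [0.5, 0.5]]  # aligns with both query tokens
doc_b = [[0.0, 1.0]]              # aligns with the second token only

print(maxsim(query, doc_a))  # 1.5
print(maxsim(query, doc_b))  # 1.0
```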
BERT-based models work out of the box.

Token classification models
Models for Named Entity Recognition (NER) and other token-level tasks.

Supported architectures
| Architecture | Example |
|---|---|
| `BertForTokenClassification` | `boltuix/NeuroBERT-NER` |
| `ModernBertForTokenClassification` | `disham993/electrical-ner-ModernBERT-base` |
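A sketch using the generic `token_classify` task (model name from the table above; mapping label ids back to NER tags via the checkpoint's `id2label` config is assumed and not shown):

```python
from vllm import LLM

llm = LLM(model="boltuix/NeuroBERT-NER", runner="pooling")

outputs = llm.encode(["Barack Obama visited Paris."],
                     pooling_task="token_classify")
logits = outputs[0].outputs.data   # (num_tokens, num_labels)
predicted = logits.argmax(dim=-1)  # one label id per input token
print(predicted)
```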
Best practices
Choosing the right model type
- Embedding models: Semantic search, clustering, general similarity
- Cross-encoders: High-accuracy reranking (slower but more accurate)
- ColBERT: Balance between accuracy and efficiency for ranking
- Reward models: RLHF, preference learning
- Classification: Sentiment analysis, topic classification
Performance optimization
- Batch requests for higher throughput
- Use Matryoshka embeddings with smaller dimensions when appropriate
- Consider model size vs accuracy tradeoffs
- Enable quantization for large models (FP8, INT8)
Integration patterns
- Use embeddings with vector databases (Pinecone, Weaviate, Milvus)
- Implement two-stage retrieval: embedding for recall, cross-encoder for reranking
- Cache embeddings for frequently accessed content
- Use reward models for RLHF training pipelines
Next steps
- Generative models: text generation and chat models
- Multimodal models: vision and audio models
- OpenAI API: serve pooling models via the API
- RAG applications: build retrieval-augmented generation