vLLM returns different output classes depending on the task: text generation, embeddings, classification, or scoring.
RequestOutput
The output of a text completion request.
class RequestOutput:
request_id: str
prompt: str | None
prompt_token_ids: list[int] | None
prompt_logprobs: PromptLogprobs | None
outputs: list[CompletionOutput]
finished: bool
metrics: RequestStateStats | None
num_cached_tokens: int | None
Attributes
request_id: Unique identifier for the request.
prompt: The prompt string of the request, if available.
prompt_token_ids: Token IDs of the prompt, if available.
prompt_logprobs: Log probabilities of the prompt tokens, if requested.
outputs: List of completion outputs. Length equals n from SamplingParams.
finished: Whether the request has finished generating.
metrics: Performance metrics for the request, if collected.
num_cached_tokens: Number of prompt tokens that hit the prefix cache.
Example
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(["Hello, my name is"], sampling_params)

for request_output in outputs:
    print(f"Request ID: {request_output.request_id}")
    print(f"Prompt: {request_output.prompt}")
    print(f"Finished: {request_output.finished}")
    for completion_output in request_output.outputs:
        print(f"Generated text: {completion_output.text}")
        print(f"Tokens: {completion_output.token_ids}")
CompletionOutput
A single completion output within a RequestOutput.
class CompletionOutput:
index: int
text: str
token_ids: list[int]
cumulative_logprob: float | None
logprobs: SampleLogprobs | None
finish_reason: str | None
stop_reason: int | str | None
Attributes
index: Index of this output in the request (0 to n-1).
text: The generated text.
token_ids: Token IDs of the generated text.
cumulative_logprob: Cumulative log probability of the generated sequence, if computed.
logprobs: Per-token log probabilities, if requested.
finish_reason: Reason why generation stopped: "stop", "length", or "abort"; None if generation is still in progress.
stop_reason: The specific stop token ID or stop string that ended generation, or None if it stopped for another reason (e.g., reaching max_tokens).
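When a request asks for several completions (n > 1 in SamplingParams), cumulative_logprob gives a simple way to rank them. The sketch below uses plain (text, cumulative_logprob) tuples as stand-ins for CompletionOutput objects, so it runs without a model; with real vLLM outputs you would read the same fields from request_output.outputs.

```python
# Stand-ins for CompletionOutput objects: (text, cumulative_logprob).
# The texts and logprob values here are illustrative, not model output.
completions = [
    ("a plumber from Brooklyn", -12.7),
    ("an engineer", -4.2),
    ("a large language model", -6.9),
]

# A higher (less negative) cumulative logprob means the model assigned
# the sequence as a whole more probability.
best_text, best_logprob = max(completions, key=lambda c: c[1])
print(best_text)  # "an engineer"
```

Note that cumulative_logprob is length-sensitive: longer sequences accumulate more negative terms, so length-normalizing (dividing by token count) is common when comparing completions of different lengths.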
PoolingRequestOutput
Base class for pooling model outputs.
class PoolingRequestOutput:
request_id: str
outputs: PoolingOutput
prompt_token_ids: list[int]
num_cached_tokens: int
finished: bool
Attributes
request_id: Unique identifier for the request.
outputs: The pooling output (type varies by task).
prompt_token_ids: Token IDs of the input prompt.
num_cached_tokens: Number of prompt tokens that hit the prefix cache.
finished: Whether pooling is complete.
EmbeddingRequestOutput
Output for embedding generation tasks.
class EmbeddingRequestOutput(PoolingRequestOutput):
outputs: EmbeddingOutput
EmbeddingOutput
class EmbeddingOutput:
embedding: list[float]
The embedding vector, as a list of floats. Its length equals the model's hidden dimension.
Example
from vllm import LLM

llm = LLM(model="sentence-transformers/all-MiniLM-L6-v2", runner="pooling")
outputs = llm.embed(["Hello world", "How are you?"])

for output in outputs:
    embedding = output.outputs.embedding
    print(f"Embedding dimension: {len(embedding)}")
    print(f"Embedding vector: {embedding[:5]}...")  # First 5 values
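Embeddings are typically compared with cosine similarity. The sketch below uses short toy vectors in place of the output.outputs.embedding lists returned by llm.embed (whose real length is the model's hidden size), so it runs without a model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embedding vectors; real ones have hundreds of dimensions.
emb_a = [0.1, 0.3, 0.5]
emb_b = [0.2, 0.1, 0.4]
print(f"similarity: {cosine_similarity(emb_a, emb_b):.3f}")
```

Values close to 1.0 indicate semantically similar inputs; note that some embedding models already L2-normalize their outputs, in which case a plain dot product gives the same ranking.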
ClassificationRequestOutput
Output for classification tasks.
class ClassificationRequestOutput(PoolingRequestOutput):
outputs: ClassificationOutput
ClassificationOutput
class ClassificationOutput:
probs: list[float]
Probability distribution over classes. Length equals number of classes.
Example
from vllm import LLM

llm = LLM(model="your-classifier", runner="pooling")
outputs = llm.classify(["This is a great product!"])

for output in outputs:
    probs = output.outputs.probs
    predicted_class = probs.index(max(probs))
    print(f"Predicted class: {predicted_class}")
    print(f"Probabilities: {probs}")
ScoringRequestOutput
Output for scoring/ranking tasks.
class ScoringRequestOutput(PoolingRequestOutput):
outputs: ScoringOutput
ScoringOutput
class ScoringOutput:
score: float
The similarity or relevance score.
Example
from vllm import LLM

llm = LLM(model="your-reranker", runner="pooling")
query = "machine learning"
documents = [
    "ML is a subset of AI",
    "Python is a language",
]

# score pairs the query with each document internally; one output per document.
outputs = llm.score(query, documents)

for i, output in enumerate(outputs):
    print(f"Document {i} score: {output.outputs.score}")
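For reranking, the scores are usually used to sort the candidate documents. The sketch below uses hard-coded scores as stand-ins for the output.outputs.score values a reranker would return, so it runs without a model.

```python
# Stand-ins for llm.score results: one relevance score per document.
# The score values here are illustrative, not model output.
documents = ["ML is a subset of AI", "Python is a language"]
scores = [0.94, 0.12]

# Pair each document with its score and sort, most relevant first.
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
for rank, (doc, score) in enumerate(ranked, start=1):
    print(f"{rank}. {doc} (score={score:.2f})")
```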