The `LLM` class is the main interface for offline inference in vLLM. It provides methods for text generation, chat, embeddings, and other tasks.
## Constructor

### Parameters
- `model` - The name or path of a HuggingFace Transformers model.
- `tokenizer` - The name or path of a HuggingFace Transformers tokenizer. If `None`, uses the model path.
- `tokenizer_mode` - The tokenizer mode. `"auto"` will use the fast tokenizer if available, and `"slow"` will always use the slow tokenizer.
- `skip_tokenizer_init` - If `True`, skip initialization of the tokenizer and detokenizer. Inputs are then expected to provide valid `prompt_token_ids`, with `None` for the prompt.
- `trust_remote_code` - Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer.
- `tensor_parallel_size` - The number of GPUs to use for distributed execution with tensor parallelism.
- `dtype` - The data type for the model weights and activations. Currently supports `float32`, `float16`, and `bfloat16`. If `"auto"`, uses the dtype attribute from the model config.
- `quantization` - The method used to quantize the model weights. Supports `"awq"`, `"gptq"`, and `"fp8"` (experimental).
- `revision` - The specific model version to use. Can be a branch name, tag name, or commit id.
- `seed` - The seed to initialize the random number generator for sampling.
- `gpu_memory_utilization` - The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values increase the KV cache size and improve throughput.
- `swap_space` - The size (GiB) of CPU memory per GPU to use as swap space for temporarily storing request states.
- `enforce_eager` - Whether to enforce eager execution. If `True`, disables CUDA graphs and always executes the model in eager mode.
- `max_model_len` - Maximum sequence length supported by the model. If `None`, uses the model's config value.
## Methods

### generate

- `prompts` - The prompts to the LLM. Can be a string, a list of strings, token IDs, or a list of token-ID sequences.
- `sampling_params` - The sampling parameters for text generation. If `None`, uses default parameters. Can be a single value applied to all prompts or a list matching the prompts.
- `use_tqdm` - If `True`, shows a tqdm progress bar during generation.
- `lora_request` - LoRA request to use for generation, if any.

Returns a list of `RequestOutput` objects containing the generated completions, in the same order as the input prompts.
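A minimal sketch of `generate()`; the model name and sampling values below are illustrative choices, not defaults, and running it requires a GPU and a model download:

```python
from vllm import LLM, SamplingParams

# Load a small model for offline inference (model name is illustrative).
llm = LLM(model="facebook/opt-125m")

# A single SamplingParams object is applied to every prompt.
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, params)

for output in outputs:
    # Each RequestOutput keeps the original prompt and its completions,
    # in the same order as the input prompts.
    print(output.prompt, "->", output.outputs[0].text)
```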
### chat

- `messages` - A conversation or a list of conversations. Each conversation is a list of messages with `role` and `content` keys.
- `sampling_params` - The sampling parameters for text generation.
- `chat_template` - The template to use for structuring the chat. If not provided, uses the model's default chat template.
- `add_generation_prompt` - If `True`, adds a generation template to each message.

Returns a list of `RequestOutput` objects containing the generated responses.
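A sketch of `chat()` with a single conversation; the chat-tuned model name and sampling values are illustrative, and a GPU plus model weights are required to run it:

```python
from vllm import LLM, SamplingParams

# A chat-tuned model (name is illustrative).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# A conversation is a list of messages with "role" and "content" keys.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what vLLM does in one sentence."},
]

# chat() applies the model's chat template to the messages before generating.
outputs = llm.chat(conversation, SamplingParams(temperature=0.5, max_tokens=128))
print(outputs[0].outputs[0].text)
```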
### encode

- `prompts` - The prompts to encode.
- `pooling_params` - The pooling parameters. If `None`, uses default pooling parameters.
- `pooling_task` - The pooling task to perform. Must be one of: `"embed"`, `"classify"`, `"score"`, `"token_embed"`, or `"token_classify"`.

Returns a list of `PoolingRequestOutput` objects containing the pooled hidden states.
### embed

Equivalent to calling `encode()` with `pooling_task="embed"`.

Returns a list of `EmbeddingRequestOutput` objects containing the embedding vectors.
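A sketch of `embed()`; the embedding model name is illustrative, and depending on the vLLM version the model may need to be loaded explicitly for pooling (an assumption here) rather than for generation:

```python
from vllm import LLM

# An embedding model (name is illustrative); some vLLM versions require
# selecting the pooling runner explicitly, e.g. via a task argument.
llm = LLM(model="intfloat/e5-small-v2", task="embed")

outputs = llm.embed(["The quick brown fox", "jumps over the lazy dog"])
for out in outputs:
    # Each EmbeddingRequestOutput carries one embedding vector per prompt.
    print(len(out.outputs.embedding))
```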
## Example usage
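A minimal end-to-end sketch putting the constructor and `generate()` together; the model name and every parameter value shown are illustrative, and running it requires a GPU and a model download:

```python
from vllm import LLM, SamplingParams

# Construct the engine; all values here are illustrative, not defaults.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=1,       # number of GPUs for tensor parallelism
    gpu_memory_utilization=0.9,   # fraction of GPU memory to reserve
    seed=0,                       # seed for the sampling RNG
)

# Greedy decoding (temperature=0.0) capped at 32 new tokens.
outputs = llm.generate(
    ["vLLM is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```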
## Related
- SamplingParams - Configure text generation parameters
- PoolingParams - Configure pooling parameters
- RequestOutput - Output format for completions