The `LLM` class is the main interface for offline inference in vLLM. It provides methods for text generation, chat, embeddings, and other tasks.
## Constructor

### Parameters
- `model` - The name or path of a HuggingFace Transformers model.
- `tokenizer` - The name or path of a HuggingFace Transformers tokenizer. If `None`, uses the model path.
- `tokenizer_mode` - The tokenizer mode. `"auto"` will use the fast tokenizer if available, and `"slow"` will always use the slow tokenizer.
- `skip_tokenizer_init` - If `True`, skip initialization of the tokenizer and detokenizer. Inputs are then expected to provide valid `prompt_token_ids`, with `None` for the prompt.
- `trust_remote_code` - Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer.
- `tensor_parallel_size` - The number of GPUs to use for distributed execution with tensor parallelism.
- `dtype` - The data type for the model weights and activations. Currently supports `float32`, `float16`, and `bfloat16`. If `"auto"`, uses the dtype attribute from the model config.
- `quantization` - The method used to quantize the model weights. Supports `"awq"`, `"gptq"`, and `"fp8"` (experimental).
- `revision` - The specific model version to use. Can be a branch name, tag name, or commit id.
- `seed` - The seed to initialize the random number generator for sampling.
- `gpu_memory_utilization` - The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values increase the KV cache size and improve throughput.
- `swap_space` - The size (GiB) of CPU memory per GPU to use as swap space for temporarily storing request states.
- `enforce_eager` - Whether to enforce eager execution. If `True`, disables CUDA graphs and always executes the model in eager mode.
- `max_model_len` - Maximum sequence length supported by the model. If `None`, uses the model's config value.
## Methods

### generate

- `prompts` - The prompts to the LLM. Can be a string, a list of strings, token IDs, or a list of token-ID sequences.
- `sampling_params` - The sampling parameters for text generation. If `None`, uses default parameters. Can be a single value applied to all prompts or a list matching the prompts.
- `use_tqdm` - If `True`, shows a tqdm progress bar during generation.
- `lora_request` - LoRA request to use for generation, if any.

Returns a list of `RequestOutput` objects containing the generated completions, in the same order as the input prompts.
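A minimal sketch of `generate()`; the model name and sampling values below are illustrative choices, not defaults, and running it requires a GPU and a model download:

```python
from vllm import LLM, SamplingParams

# Load a small model for offline inference (model name is illustrative).
llm = LLM(model="facebook/opt-125m")

# A single SamplingParams object is applied to every prompt.
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, params)

for output in outputs:
    # Each RequestOutput keeps the original prompt and its completions,
    # in the same order as the input prompts.
    print(output.prompt, "->", output.outputs[0].text)
```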
### chat

- `messages` - A conversation or a list of conversations. Each conversation is a list of messages with `role` and `content` keys.
- `sampling_params` - The sampling parameters for text generation.
- `chat_template` - The template to use for structuring the chat. If not provided, uses the model's default chat template.
- `add_generation_prompt` - If `True`, adds a generation template to each message.

Returns a list of `RequestOutput` objects containing the generated responses.
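A sketch of `chat()` with a single conversation; the chat-tuned model name and sampling values are illustrative, and a GPU plus model weights are required to run it:

```python
from vllm import LLM, SamplingParams

# A chat-tuned model (name is illustrative).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# A conversation is a list of messages with "role" and "content" keys.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what vLLM does in one sentence."},
]

# chat() applies the model's chat template to the messages before generating.
outputs = llm.chat(conversation, SamplingParams(temperature=0.5, max_tokens=128))
print(outputs[0].outputs[0].text)
```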
### encode

- `prompts` - The prompts to encode.
- `pooling_params` - The pooling parameters. If `None`, uses default pooling parameters.
- `pooling_task` - The pooling task to perform. Must be one of: `"embed"`, `"classify"`, `"score"`, `"token_embed"`, or `"token_classify"`.

Returns a list of `PoolingRequestOutput` objects containing the pooled hidden states.
### embed

Equivalent to calling `encode()` with `pooling_task="embed"`.

Returns a list of `EmbeddingRequestOutput` objects containing the embedding vectors.
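A sketch of `embed()`; the embedding model name is illustrative, and depending on the vLLM version the model may need to be loaded explicitly for pooling (an assumption here) rather than for generation:

```python
from vllm import LLM

# An embedding model (name is illustrative); some vLLM versions require
# selecting the pooling runner explicitly, e.g. via a task argument.
llm = LLM(model="intfloat/e5-small-v2", task="embed")

outputs = llm.embed(["The quick brown fox", "jumps over the lazy dog"])
for out in outputs:
    # Each EmbeddingRequestOutput carries one embedding vector per prompt.
    print(len(out.outputs.embedding))
```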
## Example usage
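A minimal end-to-end sketch putting the constructor and `generate()` together; the model name and every parameter value shown are illustrative, and running it requires a GPU and a model download:

```python
from vllm import LLM, SamplingParams

# Construct the engine; all values here are illustrative, not defaults.
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=1,       # number of GPUs for tensor parallelism
    gpu_memory_utilization=0.9,   # fraction of GPU memory to reserve
    seed=0,                       # seed for the sampling RNG
)

# Greedy decoding (temperature=0.0) capped at 32 new tokens.
outputs = llm.generate(
    ["vLLM is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```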
## Related
- SamplingParams - Configure text generation parameters
- PoolingParams - Configure pooling parameters
- RequestOutput - Output format for completions