LLM class. This is ideal for scenarios where you need to process a large number of prompts efficiently without the overhead of an HTTP server.
## Quick start
Initialize the vLLM engine with a model and generate completions. The `LLM` class automatically downloads the model from HuggingFace and initializes it with optimized configurations for your hardware.
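A minimal sketch (the model name is only an example; any generative HuggingFace checkpoint works):

```python
from vllm import LLM, SamplingParams

# Example model; substitute any generative checkpoint from HuggingFace.
llm = LLM(model="facebook/opt-125m")

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["The capital of France is"], sampling)

for output in outputs:
    # Each RequestOutput holds the prompt and one or more completions.
    print(output.prompt, output.outputs[0].text)
```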
## Model types and APIs
The available APIs depend on the model type:

- Generative models - Output logprobs which are sampled to obtain the final text. Use the `llm.generate()` or `llm.chat()` methods.
- Pooling models - Output hidden states directly. Use for embeddings, classification, and scoring tasks.
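For instance, a pooling model can be loaded for embeddings. A hedged sketch (the model and the `task` argument are illustrative; the exact argument name varies across vLLM versions):

```python
from vllm import LLM

# Load a pooling model for embeddings; "task" naming varies by vLLM version.
llm = LLM(model="intfloat/e5-small-v2", task="embed")

# llm.embed returns one EmbeddingRequestOutput per input prompt.
(output,) = llm.embed(["What is the capital of France?"])
print(len(output.outputs.embedding))  # dimensionality of the embedding vector
```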
## Generation with CLI arguments
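Scripts that build their engine from `EngineArgs` accept the engine's configuration flags on the command line. A hypothetical invocation (the script name and flag values are illustrative):

```shell
# Hypothetical invocation: flags mirror vLLM's EngineArgs fields.
python basic.py --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 4096
```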
You can use engine arguments directly from the command line.

## Chat interface
For chat models with a chat template, use the `llm.chat()` method.
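For example (the model name is illustrative; any model that ships a chat template works):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # any chat-tuned model

conversation = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Name the largest planet."},
]

# llm.chat applies the model's chat template before generating.
outputs = llm.chat(conversation, SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```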
## Custom chat templates
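A chat template is a Jinja string applied to the message list. A hedged sketch of overriding the default via the `chat_template` argument of `llm.chat()` (the template and model below are purely illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")

# Minimal illustrative Jinja template: "role: content" per message.
template = (
    "{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}"
    "assistant:"
)

outputs = llm.chat(
    [{"role": "user", "content": "Say hello."}],
    SamplingParams(max_tokens=16),
    chat_template=template,
)
print(outputs[0].outputs[0].text)
```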
You can optionally provide a custom chat template.

## Async streaming inference
For streaming token-by-token output, use the `AsyncLLM` engine.
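A sketch using the long-standing `AsyncLLMEngine` interface (the newer `AsyncLLM` class exposes a similar async `generate()` iterator; check the API for your vLLM version):

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m")
    )
    params = SamplingParams(max_tokens=64)

    # generate() is an async iterator that yields the request's output
    # each time new tokens become available.
    async for output in engine.generate(
        "Once upon a time", params, request_id="req-0"
    ):
        print(output.outputs[0].text)


asyncio.run(main())
```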
## Data parallel batch inference
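Each engine replica handles an interleaved shard of the prompts. A pure-Python sketch of just the sharding step (the helper below is hypothetical; the full example wires each shard to its own `LLM` instance on a separate GPU):

```python
def shard_prompts(prompts: list[str], rank: int, world_size: int) -> list[str]:
    """Return the interleaved slice of prompts owned by this rank."""
    return prompts[rank::world_size]


prompts = [f"prompt {i}" for i in range(10)]
shards = [shard_prompts(prompts, rank, 4) for rank in range(4)]

# Every prompt is assigned to exactly one rank.
assert sorted(p for shard in shards for p in shard) == sorted(prompts)
```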
For large-scale batch processing across multiple GPUs, use data parallelism.

## Ray Data LLM API
Ray Data provides advanced capabilities for large-scale batch inference:

- Streaming execution for datasets exceeding cluster memory
- Automatic sharding, load balancing, and autoscaling
- Continuous batching for maximum GPU utilization
- Support for tensor and pipeline parallelism
- Reading/writing popular file formats and cloud storage
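A sketch with the `ray.data.llm` processor API (class and parameter names have shifted across Ray releases, so treat this as an outline and verify against the docs for your Ray version):

```python
import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# Engine config: "concurrency" replicas of a vLLM engine, fed in batches.
config = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",  # named "model" in older Ray releases
    concurrency=2,
    batch_size=64,
)

processor = build_llm_processor(
    config,
    # preprocess maps each dataset row to a chat request.
    preprocess=lambda row: {
        "messages": [{"role": "user", "content": row["question"]}],
        "sampling_params": {"max_tokens": 32},
    },
    # postprocess extracts the generated text back into the dataset row.
    postprocess=lambda row: {"answer": row["generated_text"]},
)

ds = ray.data.from_items([{"question": "Name a prime number."}])
processor(ds).show()
```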
## Common configurations
- Memory optimization (`gpu_memory_utilization`, `swap_space`)
- Quantization (`quantization`)
- Tensor parallelism (`tensor_parallel_size`)
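These map onto `LLM` constructor arguments. A combined sketch (the checkpoint name is a placeholder; `quantization="awq"` requires pre-quantized AWQ weights):

```python
from vllm import LLM

llm = LLM(
    model="your-org/your-model-awq",  # placeholder: an AWQ-quantized checkpoint
    quantization="awq",               # quantization scheme of the weights
    gpu_memory_utilization=0.85,      # fraction of GPU memory vLLM may claim
    tensor_parallel_size=4,           # shard the model across 4 GPUs
)
```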
The `LLM` class is optimized for throughput over latency. For low-latency serving with concurrent requests, use the OpenAI-compatible server.

## Examples

Explore full examples in the vLLM repository:

- Basic offline inference - vllm/entrypoints/offline_inference/basic/basic.py:19
- Chat interface - vllm/entrypoints/offline_inference/basic/chat.py:34
- Async streaming - vllm/entrypoints/offline_inference/async_llm_streaming.py:45
- Data parallel inference - vllm/entrypoints/offline_inference/data_parallel.py:136