Prerequisites
Operating system
Linux (including WSL on Windows)
Python version
Python 3.10, 3.11, 3.12, or 3.13
Installation
Choose your hardware platform for installation instructions:
- NVIDIA CUDA
- AMD ROCm
- Google TPU
If you are using NVIDIA GPUs, you can install vLLM using pip directly. It's recommended to use uv, a very fast Python environment manager: install uv and create a new environment. If you prefer conda, you can also use it to manage environments. You can also run vLLM commands without creating a permanent environment.
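A typical setup might look like the following (a sketch; the install script URL and Python version are illustrative, and the default `pip install vllm` wheels assume CUDA on Linux):

```shell
# Install uv (see the uv docs for alternative install methods)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate an environment with a supported Python version
uv venv --python 3.12
source .venv/bin/activate

# Install vLLM into the environment
uv pip install vllm
```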
Offline batched inference
With vLLM installed, you can start generating text for a list of input prompts (offline batch inference).

Basic usage
The core classes you'll use are:
- LLM - the main class for running offline inference with the vLLM engine
- SamplingParams - parameters for the sampling process
Define prompts and sampling parameters
Create a list of input prompts and configure sampling parameters. By default, vLLM uses the sampling parameters from the model's generation_config.json on HuggingFace if it exists, which gives the results recommended by the model creator. To use vLLM's default sampling parameters instead, set generation_config="vllm" when creating the LLM instance.

Initialize the LLM engine
Create an LLM instance with your chosen model.

Generate outputs
Generate text for all prompts with high throughput.

Complete example
See the full working example at examples/offline_inference/basic/basic.py.
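Putting the three steps together, a minimal end-to-end sketch looks like this (assumes the vllm package is installed and a supported GPU is available; the model name and sampling values are illustrative, not defaults):

```python
from vllm import LLM, SamplingParams

# Define prompts and sampling parameters
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Initialize the LLM engine (downloads the model on first use)
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

# Generate outputs for all prompts in a single batch
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```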
OpenAI-compatible server
vLLM can be deployed as a server implementing the OpenAI API protocol, allowing it to be a drop-in replacement for applications that use the OpenAI API.

Start the server
Launch the vLLM server with a model. The server:
- starts at http://localhost:8000
- hosts one model at a time
- implements OpenAI-compatible endpoints
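For example (a sketch; the model name is illustrative):

```shell
# Starts an OpenAI-compatible server on http://localhost:8000 by default
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```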
By default, the server applies generation_config.json from the HuggingFace model repository. To disable this and use vLLM's defaults instead, pass --generation-config vllm when starting the server.

API authentication
Enable API key checking by starting the server with the --api-key option (or the VLLM_API_KEY environment variable); clients must then send the key in the Authorization header.

Query the server
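Once the server is up, you can exercise its endpoints with curl; for example (a sketch assuming the default host and port):

```shell
# List the model(s) hosted by the server
# If an API key is configured, add: -H "Authorization: Bearer <key>"
curl http://localhost:8000/v1/models
```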
List available models by querying the /v1/models endpoint.

Completions API
Generate completions using the /v1/completions endpoint:
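A request might look like the following (a sketch; the model name, prompt, and sampling values are illustrative, and the server must already be running with that model):

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```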
Chat completions API
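A chat request sends a list of role-tagged messages to the /v1/chat/completions endpoint (a sketch; the model name and messages are illustrative):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
```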
The chat interface enables dynamic, back-and-forth exchanges.

Attention backends
vLLM automatically selects the most performant attention backend for your system. You can manually specify a backend for:
- Online serving
- Offline inference
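One way to pin a backend is the VLLM_ATTENTION_BACKEND environment variable (a sketch; the backend, model, and script names are illustrative):

```shell
# Online serving: pin the attention backend for the server process
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Qwen/Qwen2.5-1.5B-Instruct

# Offline inference: set it before launching your Python script
VLLM_ATTENTION_BACKEND=FLASH_ATTN python basic.py
```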
Available backends
NVIDIA CUDA
FLASH_ATTN or FLASHINFER

AMD ROCm
TRITON_ATTN, ROCM_ATTN, ROCM_AITER_FA, ROCM_AITER_UNIFIED_ATTN, TRITON_MLA, ROCM_AITER_MLA, or ROCM_AITER_TRITON_MLA

Next steps
Installation guide
Detailed installation for all platforms
Supported models
Browse all compatible models
API reference
Explore complete API documentation
Configuration
Learn about advanced configuration options