Supported platforms
NVIDIA CUDA
Install on NVIDIA GPUs with CUDA support
AMD ROCm
Install on AMD GPUs with ROCm
Google TPU
Install on Google Cloud TPUs
Intel XPU
Install on Intel GPUs
CPU
Install for CPU-only inference
Hardware plugins
Third-party hardware accelerators
Requirements
System requirements
Operating system
Linux (including WSL on Windows)
Python version
Python 3.10, 3.11, 3.12, or 3.13
vLLM does not support Windows natively. To run vLLM on Windows, use the Windows Subsystem for Linux (WSL) with a compatible Linux distribution, or use community-maintained forks like vllm-windows.
NVIDIA CUDA
CUDA requirements
- NVIDIA GPU with compute capability 7.0 or higher
- CUDA 12.x or 13.x
- Driver version compatible with CUDA version
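You can check the installed driver and each GPU's compute capability with nvidia-smi (the `compute_cap` query field is available on recent drivers):

```shell
# Compute capability must be 7.0 or higher for vLLM's CUDA backend
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv
```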
Install with pip (recommended)
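A typical installation into a fresh virtual environment looks like this (a sketch; any environment manager works):

```shell
# A dedicated virtual environment keeps vLLM's dependencies isolated
python -m venv .venv
source .venv/bin/activate

# Installs the CUDA build by default on Linux
pip install vllm
```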
Using conda
If you prefer conda for environment management, create and activate a conda environment, then install vLLM with pip inside it.

Build from source
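A sketch of a source build (this compiles the CUDA kernels, so it requires a local CUDA toolkit and can take a while):

```shell
# Clone the repository and install in editable mode
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```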
For the latest features or custom builds, compile vLLM from source.

Docker images
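For example, the official OpenAI-compatible server image can be started like this (the model name is only an illustration):

```shell
# Serve a model from the official CUDA image; mount the HuggingFace
# cache so downloaded weights persist across container runs
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-0.5B-Instruct
```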
Pre-built Docker images are available for NVIDIA CUDA.

AMD ROCm
ROCm requirements
- AMD GPU with ROCm support
- ROCm 7.0
- glibc >= 2.35
- Python 3.12
Install with pip
Docker images
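A sketch of running a ROCm container (the image name and tag are assumptions; check AMD's listings on Docker Hub for the current ones):

```shell
# ROCm containers need access to the kernel driver and render devices
docker pull rocm/vllm
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  rocm/vllm
```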
vLLM provides ROCm-compatible Docker images.

Google TPU
vLLM supports Google Cloud TPUs through the vllm-tpu package.
Install vLLM for TPU
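Assuming the package is published under the name given above, installation is a single pip command:

```shell
# Install the TPU build of vLLM
pip install vllm-tpu
```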
For comprehensive TPU installation instructions, including Docker images, building from source, and troubleshooting, refer to the vLLM on TPU documentation.
Intel XPU
vLLM supports Intel Data Center GPUs (formerly Xe).

XPU requirements
- Intel Data Center GPU
- Intel Extension for PyTorch
Install with pip
CPU platforms
vLLM can run on various CPU architectures for inference without GPU acceleration.

Supported CPU architectures
Intel/AMD x86_64
Standard x86-64 processors with AVX2 support
ARM AArch64
ARM 64-bit processors
Apple Silicon
M1, M2, M3 chips via CPU mode
IBM Z (S390X)
IBM mainframe processors
Install for CPU
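CPU support is typically compiled from source, with the target backend selected through the VLLM_TARGET_DEVICE environment variable (a sketch; extra build dependencies may be needed depending on version):

```shell
# Build the CPU backend from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_TARGET_DEVICE=cpu pip install -e .
```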
Hardware plugins
vLLM supports third-party hardware through a plugin system. These plugins live outside the main vLLM repository.

Available hardware plugins
- Intel Gaudi - Intel’s AI accelerator chips
- IBM Spyre - IBM’s AI acceleration platform
- Huawei Ascend - Huawei’s NPU platform
- And more…
For a complete list of supported hardware, visit the vLLM website. To add new hardware support, contact the team on Slack or via email.
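Plugins install alongside vLLM as separate packages; as one example, the Ascend plugin is distributed as vllm-ascend (other plugin package names differ — consult each plugin's own documentation):

```shell
# Hardware plugins are separate pip packages layered on top of vLLM
pip install vllm-ascend
```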
Dependencies
vLLM requires several core dependencies that are automatically installed.

Core dependencies
CUDA-specific dependencies
Optional dependencies
vLLM provides optional extras for specific use cases.

Environment variables
Common environment variables for customizing vLLM behavior:

| Variable | Description | Default |
|---|---|---|
| VLLM_USE_MODELSCOPE | Use ModelScope instead of HuggingFace | False |
| VLLM_ATTENTION_BACKEND | Set attention backend | Auto-detected |
| CUDA_HOME | Path to CUDA installation | /usr/local/cuda |
| HF_TOKEN | HuggingFace API token | None |
| VLLM_API_KEY | API key for server authentication | None |
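These variables must be set in the environment before vLLM starts; for example (the values shown are purely illustrative):

```shell
# Configure vLLM through the environment before launching it
export VLLM_USE_MODELSCOPE=True        # pull models from ModelScope
export VLLM_ATTENTION_BACKEND=FLASH_ATTN  # pin a specific attention backend
export HF_TOKEN=hf_xxx                 # placeholder; needed for gated models
```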
Verify installation
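A quick first check is to confirm that the package imports and report its version (a sketch; a fuller check would run a small generation):

```python
import importlib.util


def vllm_install_status() -> str:
    """Report whether vLLM is importable, and its version if so."""
    if importlib.util.find_spec("vllm") is None:
        return "vLLM is not installed"
    import vllm
    return f"vLLM {vllm.__version__} is installed"


print(vllm_install_status())
```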
After installation, verify vLLM works correctly.

Troubleshooting
Common issues
CUDA out of memory
Reduce the model size or batch size, or shard the model across GPUs with tensor parallelism.
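For example (the model name and values are illustrative):

```shell
# Split the model across 2 GPUs and cap the context length
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```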
Module not found errors
Ensure you have activated the correct virtual environment and that all dependencies are installed.
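One way to check (the environment path is an example):

```shell
# Confirm which Python and vLLM install are active, then reinstall if needed
source .venv/bin/activate      # adjust to your environment's path
which python && pip show vllm  # both should point inside the environment
pip install -U vllm
```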
Slow model loading
Mount the HuggingFace cache directory or pre-download models.
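For example (the model name is only an illustration):

```shell
# Pre-download weights into the local HuggingFace cache
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct

# With Docker, mount the cache so weights persist across container runs
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest --model Qwen/Qwen2.5-0.5B-Instruct
```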
Flash Attention not found
Flash Attention is automatically installed with CUDA builds. FlashInfer must be installed separately.
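A sketch of the separate install (the PyPI package name is an assumption based on recent releases; check the FlashInfer documentation for the current name and supported CUDA versions):

```shell
# FlashInfer ships as its own package on top of a CUDA build of vLLM
pip install flashinfer-python
```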
Next steps
Quickstart guide
Start running inference with vLLM
Supported models
Explore compatible models
Configuration
Learn about configuration options
Deployment
Deploy vLLM in production