vLLM supports a wide range of hardware platforms for LLM inference and serving. Choose your platform below for specific installation instructions.

Supported platforms

  • NVIDIA CUDA - Install on NVIDIA GPUs with CUDA support
  • AMD ROCm - Install on AMD GPUs with ROCm
  • Google TPU - Install on Google Cloud TPUs
  • Intel XPU - Install on Intel GPUs
  • CPU - Install for CPU-only inference
  • Hardware plugins - Third-party hardware accelerators

Requirements

System requirements

  • Operating system: Linux (including WSL on Windows)
  • Python: 3.10, 3.11, 3.12, or 3.13
vLLM does not support Windows natively. To run vLLM on Windows, use the Windows Subsystem for Linux (WSL) with a compatible Linux distribution, or use community-maintained forks like vllm-windows.
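A minimal sketch to confirm the interpreter falls inside the supported range (the `python_supported` helper is illustrative, not part of vLLM):

```python
import sys

# vLLM supports Python 3.10 through 3.13 (see the requirement above).
SUPPORTED = [(3, 10), (3, 11), (3, 12), (3, 13)]

def python_supported(version_info=sys.version_info):
    """Return True if the running interpreter is a version vLLM supports."""
    return (version_info[0], version_info[1]) in SUPPORTED

print(python_supported())
```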

NVIDIA CUDA

CUDA requirements

  • NVIDIA GPU with compute capability 7.0 or higher
  • CUDA 12.x or 13.x
  • Driver version compatible with CUDA version
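To check the first requirement, a quick sketch using PyTorch's CUDA introspection (the `meets_min_compute_capability` helper is illustrative; compute capability 7.0 corresponds to Volta or newer):

```python
# Illustrative helper: check whether a GPU's compute capability meets
# vLLM's minimum of 7.0.
def meets_min_compute_capability(major, minor, minimum=(7, 0)):
    return (major, minor) >= minimum

try:
    import torch
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        ok = meets_min_compute_capability(major, minor)
        print(f"compute capability {major}.{minor}: {'OK' if ok else 'too old'}")
    else:
        print("No CUDA device visible")
except ImportError:
    print("PyTorch not installed")
```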
1. Set up Python environment

Create a new environment using uv (recommended):
uv venv --python 3.12 --seed
source .venv/bin/activate
Install uv following the official documentation. It’s significantly faster than pip and conda.
2. Install vLLM

Install vLLM with automatic CUDA backend detection:
uv pip install vllm --torch-backend=auto
Or specify a specific CUDA version:
# For CUDA 12.6
uv pip install vllm --torch-backend=cu126

# For CUDA 13.0
uv pip install vllm --torch-backend=cu130
3. Verify installation

python -c "import vllm; print(vllm.__version__)"

Using conda

If you prefer conda for environment management:
conda create -n vllm-env python=3.12 -y
conda activate vllm-env
pip install --upgrade uv
uv pip install vllm --torch-backend=auto

Build from source

For the latest features or custom builds:
1. Install build dependencies

pip install "cmake>=3.26.1" ninja packaging "setuptools-scm>=8.0"
2. Clone the repository

git clone https://github.com/vllm-project/vllm.git
cd vllm
3. Build and install

pip install -e .
Building from source requires the CUDA toolkit to be installed. Set the CUDA_HOME environment variable if needed:
export CUDA_HOME=/usr/local/cuda
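A small sketch that locates the nvcc compiler the build will use, checking CUDA_HOME first, then PATH, then the conventional symlink (the `find_nvcc` helper is illustrative, not part of vLLM's build system):

```python
import os
import shutil

def find_nvcc():
    """Return the path to nvcc, or None if no CUDA toolkit is found."""
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.exists(candidate):
            return candidate
    on_path = shutil.which("nvcc")
    if on_path:
        return on_path
    fallback = "/usr/local/cuda/bin/nvcc"
    return fallback if os.path.exists(fallback) else None

print(find_nvcc())
```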

Docker images

Pre-built Docker images are available for NVIDIA CUDA:
# Pull the latest image
docker pull vllm/vllm-openai:latest

# Run with GPU support
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your_hf_token>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-1.5B-Instruct
Mount the HuggingFace cache directory to avoid re-downloading models on container restart.
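Once the container is up, it serves the OpenAI-compatible API on port 8000. A sketch of the request shape, using the model from the `docker run` example above (the `build_chat_request` helper is hypothetical, for illustration only):

```python
import json

def build_chat_request(model, prompt, host="http://localhost:8000"):
    """Build the URL and JSON body for an OpenAI-style chat completion."""
    url = f"{host}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }
    return url, json.dumps(payload)

# POST this body to the URL (e.g. with curl or any HTTP client).
url, body = build_chat_request("Qwen/Qwen2.5-1.5B-Instruct", "Hello!")
print(url)
```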

AMD ROCm

ROCm requirements

  • AMD GPU with ROCm support
  • ROCm 7.0
  • glibc >= 2.35
  • Python 3.12

Install with pip

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
The uv package manager gives the extra index higher priority than the default index, which is important for ROCm wheel installation.

Docker images

vLLM provides ROCm-compatible Docker images:
# Pull ROCm image
docker pull vllm/vllm-openai:latest-rocm

# Run with ROCm support
docker run --device /dev/kfd --device /dev/dri \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest-rocm \
    --model Qwen/Qwen2.5-1.5B-Instruct

Google TPU

vLLM supports Google Cloud TPUs through the vllm-tpu package.

Install vLLM for TPU

uv pip install vllm-tpu
For comprehensive TPU installation instructions, including Docker images, building from source, and troubleshooting, refer to the vLLM on TPU documentation.

Intel XPU

vLLM supports Intel Data Center GPUs as the XPU backend.

XPU requirements

  • Intel Data Center GPU
  • Intel Extension for PyTorch

Install with pip

uv pip install vllm --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

CPU platforms

vLLM can run on various CPU architectures for inference without GPU acceleration.

Supported CPU architectures

  • Intel/AMD x86_64 - standard x86-64 processors with AVX2 support
  • ARM AArch64 - ARM 64-bit processors
  • Apple Silicon - M1, M2, M3 chips via CPU mode
  • IBM Z (S390X) - IBM mainframe processors

Install for CPU

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm-cpu
CPU inference is significantly slower than GPU inference. It’s recommended for development, testing, or scenarios where GPUs are not available.

Hardware plugins

vLLM supports third-party hardware through a plugin system. These plugins live outside the main vLLM repository.

Available hardware plugins

  • Intel Gaudi - Intel’s AI accelerator chips
  • IBM Spyre - IBM’s AI acceleration platform
  • Huawei Ascend - Huawei’s NPU platform
  • And more…
For a complete list of supported hardware, visit the vLLM website. To add new hardware support, contact the team on Slack or via email.

Dependencies

vLLM requires several core dependencies that are automatically installed:

Core dependencies

torch==2.10.0
transformers >= 4.56.0, < 5
tokenizers >= 0.21.1
fastapi[standard] >= 0.115.0
aiohttp >= 3.13.3
openai >= 1.99.1
pydantic >= 2.12.0

CUDA-specific dependencies

ray[cgraph]>=2.48.0
flashinfer-python==0.6.4
nvidia-cutlass-dsl>=4.4.0.dev1
quack-kernels>=0.2.7
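To confirm which of these dependencies actually landed in your environment, and at what versions, a stdlib-only check works (the `installed_version` helper is illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Spot-check a few of the core dependencies listed above.
for pkg in ("torch", "transformers", "fastapi", "flashinfer-python"):
    print(pkg, installed_version(pkg))
```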

Optional dependencies

vLLM provides optional extras for specific use cases:
pip install "vllm[bench]"

Environment variables

Common environment variables for customizing vLLM behavior:
Variable                Description                             Default
VLLM_USE_MODELSCOPE     Use ModelScope instead of HuggingFace   False
VLLM_ATTENTION_BACKEND  Set the attention backend               Auto-detected
CUDA_HOME               Path to the CUDA installation           /usr/local/cuda
HF_TOKEN                HuggingFace API token                   None
VLLM_API_KEY            API key for server authentication       None
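Boolean-style variables such as VLLM_USE_MODELSCOPE are set via the environment. A sketch of how such flags are typically interpreted (the `env_flag` helper is illustrative, not vLLM's actual parsing code):

```python
import os

def env_flag(name, default=False):
    """Interpret an environment variable as a boolean flag."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

os.environ["VLLM_USE_MODELSCOPE"] = "True"
print(env_flag("VLLM_USE_MODELSCOPE"))  # True
```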

Verify installation

After installation, verify vLLM works correctly:
1. Check version

python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
2. Run quick test

from vllm import LLM, SamplingParams

# Create a simple LLM instance
llm = LLM(model="facebook/opt-125m")

# Generate text; generate() returns a list of RequestOutput objects
outputs = llm.generate("Hello, my name is", SamplingParams(max_tokens=10))
print(outputs[0].outputs[0].text)
3. Test server mode

vllm serve facebook/opt-125m --port 8000
In another terminal:
curl http://localhost:8000/v1/models
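The endpoint returns an OpenAI-style model list. A sketch of parsing such a response (the sample JSON below is hypothetical; your server's exact fields may differ):

```python
import json

# Hypothetical /v1/models response body, following the OpenAI API shape.
sample = '{"object": "list", "data": [{"id": "facebook/opt-125m", "object": "model"}]}'

# Extract the served model IDs.
models = [entry["id"] for entry in json.loads(sample)["data"]]
print(models)  # ['facebook/opt-125m']
```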

Troubleshooting

Common issues

Out-of-memory errors: reduce the model or batch size, lower the context length, or use tensor parallelism:
vllm serve <model> --tensor-parallel-size 2 --max-model-len 2048
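A back-of-envelope sketch of why lowering the context length helps: KV cache memory grows linearly with it. The model dimensions below are illustrative, not taken from any specific checkpoint, and the `kv_cache_bytes` helper is hypothetical:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_model_len,
                   batch_size=1, dtype_bytes=2):
    """Estimate KV cache size: 2x (keys and values) per layer, per token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * max_model_len * batch_size * dtype_bytes)

# Illustrative 32-layer model with 8 KV heads of dim 128, fp16 cache.
gib = kv_cache_bytes(32, 8, 128, max_model_len=8192) / 2**30
print(f"{gib:.2f} GiB per sequence")  # 1.00 GiB per sequence
```

Halving `--max-model-len` halves this figure, which is often enough to get a model serving on a smaller GPU.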
Import errors: ensure you have activated the correct virtual environment and that all dependencies are installed:
pip install --upgrade vllm
Models re-downloading on restart: mount the HuggingFace cache directory or pre-download models:
huggingface-cli download <model_name>
Missing attention backends: Flash Attention is automatically installed with CUDA builds. To install FlashInfer:
pip install flashinfer-python

Next steps

  • Quickstart guide - Start running inference with vLLM
  • Supported models - Explore compatible models
  • Configuration - Learn about configuration options
  • Deployment - Deploy vLLM in production
