vLLM supports a wide range of hardware platforms for LLM inference and serving. Choose your platform below for specific installation instructions.

Supported platforms

  • NVIDIA CUDA - Install on NVIDIA GPUs with CUDA support
  • AMD ROCm - Install on AMD GPUs with ROCm
  • Google TPU - Install on Google Cloud TPUs
  • Intel XPU - Install on Intel GPUs
  • CPU - Install for CPU-only inference
  • Hardware plugins - Third-party hardware accelerators

Requirements

System requirements

  • Operating system: Linux (including WSL on Windows)
  • Python: 3.10, 3.11, 3.12, or 3.13
vLLM does not support Windows natively. To run vLLM on Windows, use the Windows Subsystem for Linux (WSL) with a compatible Linux distribution, or use community-maintained forks like vllm-windows.
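A minimal sketch to confirm the interpreter falls inside the supported range (the `python_supported` helper is illustrative, not part of vLLM):

```python
import sys

# vLLM supports Python 3.10 through 3.13 (see the requirement above).
SUPPORTED = [(3, 10), (3, 11), (3, 12), (3, 13)]

def python_supported(version_info=sys.version_info):
    """Return True if the running interpreter is a version vLLM supports."""
    return (version_info[0], version_info[1]) in SUPPORTED

print(python_supported())
```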

NVIDIA CUDA

CUDA requirements

  • NVIDIA GPU with compute capability 7.0 or higher
  • CUDA 12.x or 13.x
  • Driver version compatible with CUDA version
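To check the first requirement, a quick sketch using PyTorch's CUDA introspection (the `meets_min_compute_capability` helper is illustrative; compute capability 7.0 corresponds to Volta or newer):

```python
# Illustrative helper: check whether a GPU's compute capability meets
# vLLM's minimum of 7.0.
def meets_min_compute_capability(major, minor, minimum=(7, 0)):
    return (major, minor) >= minimum

try:
    import torch
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        ok = meets_min_compute_capability(major, minor)
        print(f"compute capability {major}.{minor}: {'OK' if ok else 'too old'}")
    else:
        print("No CUDA device visible")
except ImportError:
    print("PyTorch not installed")
```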
1. Set up Python environment

Create a new environment using uv (recommended):
uv venv --python 3.12 --seed
source .venv/bin/activate
Install uv following the official documentation. It’s significantly faster than pip and conda.
2. Install vLLM

Install vLLM with automatic CUDA backend detection:
uv pip install vllm --torch-backend=auto
Or specify a specific CUDA version:
# For CUDA 12.6
uv pip install vllm --torch-backend=cu126

# For CUDA 13.0
uv pip install vllm --torch-backend=cu130
3. Verify installation

python -c "import vllm; print(vllm.__version__)"

Using conda

If you prefer conda for environment management:
conda create -n vllm-env python=3.12 -y
conda activate vllm-env
pip install --upgrade uv
uv pip install vllm --torch-backend=auto

Build from source

For the latest features or custom builds:
1. Install build dependencies

pip install "cmake>=3.26.1" ninja packaging "setuptools-scm>=8.0"
2. Clone the repository

git clone https://github.com/vllm-project/vllm.git
cd vllm
3. Build and install

pip install -e .
Building from source requires the CUDA toolkit to be installed. Set the CUDA_HOME environment variable if needed:
export CUDA_HOME=/usr/local/cuda
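A small sketch that locates the nvcc compiler the build will use, checking CUDA_HOME first, then PATH, then the conventional symlink (the `find_nvcc` helper is illustrative, not part of vLLM's build system):

```python
import os
import shutil

def find_nvcc():
    """Return the path to nvcc, or None if no CUDA toolkit is found."""
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.exists(candidate):
            return candidate
    on_path = shutil.which("nvcc")
    if on_path:
        return on_path
    fallback = "/usr/local/cuda/bin/nvcc"
    return fallback if os.path.exists(fallback) else None

print(find_nvcc())
```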

Docker images

Pre-built Docker images are available for NVIDIA CUDA:
# Pull the latest image
docker pull vllm/vllm-openai:latest

# Run with GPU support
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<your_hf_token>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-1.5B-Instruct
Mount the HuggingFace cache directory to avoid re-downloading models on container restart.
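Once the container is up, it serves the OpenAI-compatible API on port 8000. A sketch of the request shape, using the model from the `docker run` example above (the `build_chat_request` helper is hypothetical, for illustration only):

```python
import json

def build_chat_request(model, prompt, host="http://localhost:8000"):
    """Build the URL and JSON body for an OpenAI-style chat completion."""
    url = f"{host}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }
    return url, json.dumps(payload)

# POST this body to the URL (e.g. with curl or any HTTP client).
url, body = build_chat_request("Qwen/Qwen2.5-1.5B-Instruct", "Hello!")
print(url)
```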

AMD ROCm

ROCm requirements

  • AMD GPU with ROCm support
  • ROCm 7.0
  • glibc >= 2.35
  • Python 3.12

Install with pip

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
The uv package manager gives the extra index higher priority than the default index, which is important for ROCm wheel installation.

Docker images

vLLM provides ROCm-compatible Docker images:
# Pull ROCm image
docker pull vllm/vllm-openai:latest-rocm

# Run with ROCm support
docker run --device /dev/kfd --device /dev/dri \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest-rocm \
    --model Qwen/Qwen2.5-1.5B-Instruct

Google TPU

vLLM supports Google Cloud TPUs through the vllm-tpu package.

Install vLLM for TPU

uv pip install vllm-tpu
For comprehensive TPU installation instructions, including Docker images, building from source, and troubleshooting, refer to the vLLM on TPU documentation.

Intel XPU

vLLM supports Intel Data Center GPUs as the XPU backend.

XPU requirements

  • Intel Data Center GPU
  • Intel Extension for PyTorch

Install with pip

uv pip install vllm --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

CPU platforms

vLLM can run on various CPU architectures for inference without GPU acceleration.

Supported CPU architectures

  • Intel/AMD x86_64 - standard x86-64 processors with AVX2 support
  • ARM AArch64 - ARM 64-bit processors
  • Apple Silicon - M1, M2, M3 chips via CPU mode
  • IBM Z (S390X) - IBM mainframe processors

Install for CPU

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm-cpu
CPU inference is significantly slower than GPU inference. It’s recommended for development, testing, or scenarios where GPUs are not available.

Hardware plugins

vLLM supports third-party hardware through a plugin system. These plugins live outside the main vLLM repository.

Available hardware plugins

  • Intel Gaudi - Intel’s AI accelerator chips
  • IBM Spyre - IBM’s AI acceleration platform
  • Huawei Ascend - Huawei’s NPU platform
  • And more…
For a complete list of supported hardware, visit the vLLM website. To add new hardware support, contact the team on Slack or via email.

Dependencies

vLLM requires several core dependencies that are automatically installed:

Core dependencies

torch==2.10.0
transformers >= 4.56.0, < 5
tokenizers >= 0.21.1
fastapi[standard] >= 0.115.0
aiohttp >= 3.13.3
openai >= 1.99.1
pydantic >= 2.12.0

CUDA-specific dependencies

ray[cgraph]>=2.48.0
flashinfer-python==0.6.4
nvidia-cutlass-dsl>=4.4.0.dev1
quack-kernels>=0.2.7
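To confirm which of these dependencies actually landed in your environment, and at what versions, a stdlib-only check works (the `installed_version` helper is illustrative):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Spot-check a few of the core dependencies listed above.
for pkg in ("torch", "transformers", "fastapi", "flashinfer-python"):
    print(pkg, installed_version(pkg))
```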

Optional dependencies

vLLM provides optional extras for specific use cases:
pip install "vllm[bench]"

Environment variables

Common environment variables for customizing vLLM behavior:
Variable                Description                             Default
VLLM_USE_MODELSCOPE     Use ModelScope instead of HuggingFace   False
VLLM_ATTENTION_BACKEND  Set the attention backend               Auto-detected
CUDA_HOME               Path to the CUDA installation           /usr/local/cuda
HF_TOKEN                HuggingFace API token                   None
VLLM_API_KEY            API key for server authentication       None
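Boolean-style variables such as VLLM_USE_MODELSCOPE are set via the environment. A sketch of how such flags are typically interpreted (the `env_flag` helper is illustrative, not vLLM's actual parsing code):

```python
import os

def env_flag(name, default=False):
    """Interpret an environment variable as a boolean flag."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

os.environ["VLLM_USE_MODELSCOPE"] = "True"
print(env_flag("VLLM_USE_MODELSCOPE"))  # True
```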

Verify installation

After installation, verify vLLM works correctly:
1. Check version

python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
2. Run quick test

from vllm import LLM, SamplingParams

# Create a simple LLM instance
llm = LLM(model="facebook/opt-125m")

# Generate text; generate() returns a list of RequestOutput objects
outputs = llm.generate("Hello, my name is", SamplingParams(max_tokens=10))
print(outputs[0].outputs[0].text)
3. Test server mode

vllm serve facebook/opt-125m --port 8000
In another terminal:
curl http://localhost:8000/v1/models
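The endpoint returns an OpenAI-style model list. A sketch of parsing such a response (the sample JSON below is hypothetical; your server's exact fields may differ):

```python
import json

# Hypothetical /v1/models response body, following the OpenAI API shape.
sample = '{"object": "list", "data": [{"id": "facebook/opt-125m", "object": "model"}]}'

# Extract the served model IDs.
models = [entry["id"] for entry in json.loads(sample)["data"]]
print(models)  # ['facebook/opt-125m']
```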

Troubleshooting

Common issues

Out-of-memory errors: reduce the model or batch size, lower the context length, or use tensor parallelism:
vllm serve <model> --tensor-parallel-size 2 --max-model-len 2048
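A back-of-envelope sketch of why lowering the context length helps: KV cache memory grows linearly with it. The model dimensions below are illustrative, not taken from any specific checkpoint, and the `kv_cache_bytes` helper is hypothetical:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_model_len,
                   batch_size=1, dtype_bytes=2):
    """Estimate KV cache size: 2x (keys and values) per layer, per token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * max_model_len * batch_size * dtype_bytes)

# Illustrative 32-layer model with 8 KV heads of dim 128, fp16 cache.
gib = kv_cache_bytes(32, 8, 128, max_model_len=8192) / 2**30
print(f"{gib:.2f} GiB per sequence")  # 1.00 GiB per sequence
```

Halving `--max-model-len` halves this figure, which is often enough to get a model serving on a smaller GPU.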
Import errors: ensure you have activated the correct virtual environment and that all dependencies are installed:
pip install --upgrade vllm
Models re-downloading on restart: mount the HuggingFace cache directory or pre-download models:
huggingface-cli download <model_name>
Missing attention backends: Flash Attention is automatically installed with CUDA builds. To install FlashInfer:
pip install flashinfer-python

Next steps

  • Quickstart guide - Start running inference with vLLM
  • Supported models - Explore compatible models
  • Configuration - Learn about configuration options
  • Deployment - Deploy vLLM in production
