## Overview

vLLM provides official Docker images for both GPU and CPU deployments. Docker containers ensure consistent environments and simplify deployment across different platforms.
## Pre-built images

vLLM publishes pre-built Docker images to Docker Hub and public ECR registries:
### GPU images

```bash
# Latest stable release
docker pull vllm/vllm-openai:latest

# Specific version
docker pull vllm/vllm-openai:v0.6.4
```
### CPU images

```bash
# x86_64 CPU
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest

# ARM64 CPU
docker pull public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest
```
### Specialized images

vLLM provides images for other hardware backends:

- ROCm (AMD GPUs): `vllm/vllm-openai:latest-rocm`
- TPU: images available via Google Cloud Artifact Registry
- XPU (Intel GPUs): custom builds available
## Running vLLM with Docker

### Basic GPU deployment

Run vLLM with a single GPU:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct
```

The `--ipc=host` flag is required so that worker processes can access shared memory, which tensor parallel inference relies on.
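Once the container is up, the server speaks the OpenAI-compatible API on port 8000. A minimal request sketch using only the Python standard library (the payload shape follows the OpenAI chat completions format; the prompt text is illustrative):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request for the served model.
payload = {
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Uncomment once the container is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```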
### Configure shared memory

For larger models, increase shared memory:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --shm-size=10.24gb \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-2-7b-chat-hf \
  --tensor-parallel-size 2
```
### Use environment variables

Pass your Hugging Face token and other configuration:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=your_token_here \
  -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct
```
### CPU deployment

For CPU-only deployments:

```bash
docker run -d \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=your_token_here \
  -e VLLM_CPU_KVCACHE_SPACE=40 \
  -e VLLM_CPU_OMP_THREADS_BIND=0-63 \
  -p 8000:8000 \
  public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest \
  --model meta-llama/Llama-3.2-1B-Instruct
```

CPU throughput is significantly lower than GPU throughput. Use CPU deployment only for testing or when GPUs are unavailable.
## Building from source

### Clone the repository

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
```
### Build GPU image

Build the default CUDA image:

```bash
docker build -f docker/Dockerfile . --tag vllm-custom:latest
```

The build process uses Docker BuildKit for layer caching and parallel builds.
### Build with custom CUDA version

```bash
docker build -f docker/Dockerfile . \
  --build-arg CUDA_VERSION=12.9.1 \
  --build-arg PYTHON_VERSION=3.12 \
  --tag vllm-custom:cuda12.9
```
### Build CPU image

```bash
docker build -f docker/Dockerfile.cpu . \
  --platform=linux/amd64 \
  --tag vllm-cpu:latest
```
### Build ROCm image for AMD GPUs

```bash
docker build -f docker/Dockerfile.rocm . \
  --tag vllm-rocm:latest
```
### Build arguments

Common build arguments for customization:

| Argument | Default | Description |
|---|---|---|
| `CUDA_VERSION` | 12.9.1 | CUDA toolkit version |
| `PYTHON_VERSION` | 3.12 | Python version |
| `PYTORCH_NIGHTLY` | 0 | Use PyTorch nightly builds |
| `MAX_JOBS` | 2 | Parallel build jobs |
| `TORCH_CUDA_ARCH_LIST` | 7.0 7.5 8.0 8.9 9.0 10.0 12.0 | Target GPU architectures |
| `INSTALL_KV_CONNECTORS` | false | Install KV connector dependencies |
## Advanced configurations

### Multi-GPU deployment

```bash
docker run --runtime nvidia --gpus '"device=0,1,2,3"' \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --shm-size=16g \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4
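Choosing `--tensor-parallel-size` is largely a memory question: the model weights (plus KV cache and activation overhead) must fit across the selected GPUs. A rough, weights-only sizing sketch — the 0.7 headroom factor is an assumption to leave room for KV cache, not a vLLM constant:

```python
def min_tensor_parallel_size(
    num_params: float,
    bytes_per_param: int,
    gpu_mem_gib: float,
    headroom: float = 0.7,  # assumed fraction of GPU memory usable for weights
    max_gpus: int = 8,
) -> int:
    """Smallest power-of-two GPU count whose combined memory fits the weights."""
    weights_gib = num_params * bytes_per_param / 2**30
    tp = 1
    while tp <= max_gpus:
        if weights_gib / tp <= headroom * gpu_mem_gib:
            return tp
        tp *= 2
    raise ValueError("model does not fit in max_gpus GPUs")

# Llama-3-70B in bf16 (~130 GiB of weights) on 80 GiB GPUs:
print(min_tensor_parallel_size(70e9, 2, 80))  # -> 4, matching the example above
```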
### With proxy settings

```bash
docker build -f docker/Dockerfile . \
  --build-arg http_proxy=$http_proxy \
  --build-arg https_proxy=$https_proxy \
  --tag vllm-custom:latest
```
### Docker Compose example

```yaml
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    shm_size: 10gb
    ipc: host
    command: >
      --model meta-llama/Llama-3.2-1B-Instruct
      --trust-remote-code
```
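The vLLM server exposes a `/health` endpoint, so the Compose service above can be extended with a healthcheck. The fragment below uses the image's Python interpreter rather than `curl` (which may not be installed); the timing values are illustrative, with a generous `start_period` because downloading and loading a model can take minutes:

```yaml
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 120s
```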
### Load balancing with Nginx

For multiple vLLM instances behind a load balancer, see the production guide.
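As a minimal sketch of the idea (instance names and ports are placeholders, and the timeout value is an assumption; the production guide covers the details):

```nginx
upstream vllm_backends {
    least_conn;                   # route to the instance with fewest active requests
    server vllm-0:8000;
    server vllm-1:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_backends;
        proxy_read_timeout 300s;  # generation can take a while
    }
}
```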
## Troubleshooting

### CUDA compatibility issues

Enable CUDA forward compatibility for older drivers:

```bash
docker run --runtime nvidia --gpus all \
  -e VLLM_ENABLE_CUDA_COMPATIBILITY=1 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct
```
### Out of memory errors

- Increase shared memory with `--shm-size`
- Reduce the `--max-model-len` parameter
- Enable quantization (INT8, FP8)
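To see why quantization helps, compare weight footprints: weight memory scales linearly with bits per parameter, so 8-bit weights roughly halve a 16-bit model's footprint. A back-of-the-envelope calculation (weights only, ignoring KV cache and activations):

```python
def weight_mem_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return num_params * bits_per_param / 8 / 2**30

params = 7e9  # e.g. a 7B-parameter model
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_mem_gib(params, bits):.1f} GiB")
```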
### Permission errors

Run with user namespace mapping:

```bash
docker run --runtime nvidia --gpus all \
  --user $(id -u):$(id -g) \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-1B-Instruct
```
## Next steps