Overview
vLLM can be deployed on all major cloud providers using their managed Kubernetes services, VM instances, or specialized AI platforms. This guide covers deployment options for AWS, GCP, and Azure.
AWS deployment
Amazon EKS (Elastic Kubernetes Service)
Create EKS cluster
Create a GPU-enabled EKS cluster:

eksctl create cluster \
--name vllm-cluster \
--region us-west-2 \
--nodegroup-name gpu-nodes \
--node-type g5.xlarge \
--nodes 2 \
--nodes-min 1 \
--nodes-max 4
Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
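With the device plugin installed, pods can request GPUs through the `nvidia.com/gpu` resource. A minimal Deployment sketch (names and replica counts are illustrative, not part of the official chart):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server            # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model", "meta-llama/Meta-Llama-3-8B-Instruct"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1  # the resource advertised by the device plugin
```

Expose it with a Service or LoadBalancer as you would any other Deployment.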
Amazon EC2 instances
Deploy vLLM directly on EC2 GPU instances:
Launch GPU instance
Choose an appropriate GPU instance type:

| Instance Type | GPUs | VRAM | Use Case |
|---|---|---|---|
| g5.xlarge | 1x A10G | 24GB | Small models (7B) |
| g5.12xlarge | 4x A10G | 96GB | Medium models (13B-70B) |
| p4d.24xlarge | 8x A100 | 320GB | Large models (70B+) |
| p5.48xlarge | 8x H100 | 640GB | Largest models |
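The sizing in this table follows from simple arithmetic: FP16/BF16 weights take about 2 bytes per parameter, plus headroom for the KV cache and activations. A rough back-of-the-envelope check (the 1.2x headroom factor is an illustrative assumption, not a vLLM constant; real usage depends on batch size and context length):

```python
def min_vram_gb(params_billion: float, bytes_per_param: int = 2,
                overhead: float = 1.2) -> float:
    """Rough VRAM floor for serving a model in FP16/BF16."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return weights_gb * overhead

# A 7B model is ~14 GB of weights, ~17 GB with headroom:
# it fits on a single 24 GB A10G (g5.xlarge).
print(round(min_vram_gb(7), 1))
# A 70B model needs ~168 GB and must be sharded across GPUs.
print(round(min_vram_gb(70), 1))
```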
Install dependencies
# Install CUDA drivers
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Install NVIDIA Container Toolkit
# The legacy nvidia-docker repository and apt-key are deprecated;
# use the libnvidia-container repository with a signed keyring instead
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Run vLLM with Docker
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--tensor-parallel-size $(nvidia-smi -L | wc -l)
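The container exposes an OpenAI-compatible API on port 8000. A minimal client sketch that builds the request body (the server URL and model name must match your deployment; `build_chat_request` is an illustrative helper, not part of vLLM):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Meta-Llama-3-8B-Instruct", "Hello!")
# POST this as JSON to http://<instance-ip>:8000/v1/chat/completions,
# e.g. requests.post(url, json=payload) or curl --json.
print(json.dumps(payload))
```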
Amazon SageMaker
vLLM provides a SageMaker-compatible image:
import boto3
from sagemaker import get_execution_role
from sagemaker.model import Model
role = get_execution_role()
model = Model(
    image_uri="vllm/vllm-openai:latest",
    role=role,
    env={
        "MODEL_NAME": "meta-llama/Meta-Llama-3-8B-Instruct",
        "HF_TOKEN": "your-hf-token"
    }
)

predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1
)
The vLLM SageMaker image includes the sagemaker-entrypoint.sh script for automatic model loading.
GCP deployment
Google Kubernetes Engine (GKE)
Create GKE cluster with GPUs
gcloud container clusters create vllm-cluster \
--zone us-central1-a \
--machine-type n1-standard-4 \
--accelerator type=nvidia-tesla-t4,count=1 \
--num-nodes 2 \
--enable-autoscaling \
--min-nodes 1 \
--max-nodes 4
Install NVIDIA drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Google Compute Engine
Deploy on GCE VM instances:
# Create GPU instance
gcloud compute instances create vllm-instance \
--zone=us-central1-a \
--machine-type=n1-standard-8 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--maintenance-policy=TERMINATE \
--image-family=ubuntu-2004-lts \
--image-project=ubuntu-os-cloud \
--boot-disk-size=200GB
# SSH and install
gcloud compute ssh vllm-instance --zone=us-central1-a
# Follow EC2 installation steps above
Vertex AI
Deploy using Vertex AI custom containers:
from google.cloud import aiplatform
aiplatform.init(project="your-project-id", location="us-central1")
model = aiplatform.Model.upload(
    display_name="vllm-llama3",
    serving_container_image_uri="vllm/vllm-openai:latest",
    serving_container_environment_variables={
        "MODEL_NAME": "meta-llama/Meta-Llama-3-8B-Instruct"
    }
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1
)
Azure deployment
Azure Kubernetes Service (AKS)
Create AKS cluster
az aks create \
--resource-group vllm-rg \
--name vllm-cluster \
--node-count 2 \
--node-vm-size Standard_NC6s_v3 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 4 \
--generate-ssh-keys
Get credentials
az aks get-credentials \
--resource-group vllm-rg \
--name vllm-cluster
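As on EKS, AKS GPU node pools need the NVIDIA device plugin before pods can request `nvidia.com/gpu` (the node images ship GPU drivers, but the plugin DaemonSet must be installed separately unless you use the NVIDIA GPU Operator):

```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
```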
Azure Virtual Machines
Deploy on Azure GPU VMs:
az vm create \
--resource-group vllm-rg \
--name vllm-vm \
--image Ubuntu2204 \
--size Standard_NC6s_v3 \
--admin-username azureuser \
--generate-ssh-keys
az vm run-command invoke \
--resource-group vllm-rg \
--name vllm-vm \
--command-id RunShellScript \
--scripts @install-vllm.sh
Multi-cloud deployment with SkyPilot
SkyPilot enables deployment across AWS, GCP, and Azure with a single configuration:
Install SkyPilot
pip install skypilot-nightly
sky check
Create deployment YAML
resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  use_spot: True
  disk_size: 512
  disk_tier: best
  ports: 8081

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-token>

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm
  pip install vllm

run: |
  conda activate vllm
  vllm serve $MODEL_NAME \
    --port 8081 \
    --trust-remote-code \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE
Launch on any cloud
# Launch on cheapest available cloud
HF_TOKEN="your-token" sky launch serving.yaml --env HF_TOKEN
# Force specific cloud
HF_TOKEN="your-token" sky launch serving.yaml --cloud aws --env HF_TOKEN
HF_TOKEN="your-token" sky launch serving.yaml --cloud gcp --env HF_TOKEN
HF_TOKEN="your-token" sky launch serving.yaml --cloud azure --env HF_TOKEN
Scale with autoscaling
service:
  replica_policy:
    min_replicas: 2
    max_replicas: 10
    target_qps_per_replica: 2
Deploy with autoscaling:

sky serve up -n vllm serving.yaml --env HF_TOKEN
SkyPilot automatically selects the cheapest cloud and GPU type based on availability and pricing.
Cloud-specific optimizations
AWS optimizations
- Use EFA (Elastic Fabric Adapter) for multi-node deployments on p4d/p5 instances
- Enable S3 model caching to reduce startup time
- Use EC2 Spot Instances for cost savings (up to 90% cheaper)
GCP optimizations
- Use Compact Placement Policies for multi-GPU communication
- Enable GCS FUSE for model caching
- Use Preemptible VMs for cost savings
Azure optimizations
- Use InfiniBand on NDv4 series for multi-node communication
- Enable Azure Blob Storage for model caching
- Use Spot VMs for cost savings
Managed AI platforms
Several cloud-native AI platforms support vLLM out of the box:
Hugging Face Inference Endpoints
Deploy models with one click:
- Go to Hugging Face Inference Endpoints
- Select your model
- Choose vLLM as the container type
- Select cloud provider (AWS, Azure, GCP)
- Choose GPU type (T4, L4, A10G, A100)
- Deploy
Hugging Face manages the infrastructure, scaling, and monitoring for you.
Anyscale
Anyscale provides managed Ray clusters with vLLM:
- Automatic scaling across AWS, GCP, Azure
- Built-in observability and monitoring
- Multi-model serving
- Production-ready deployments
Modal
Deploy vLLM on Modal with serverless infrastructure:
import modal
app = modal.App("vllm-app")
image = modal.Image.debian_slim().pip_install("vllm")
@app.function(
    gpu="A100",
    image=image,
    timeout=600
)
def generate(prompt: str) -> str:
    # Load the model inside the function and return generated text;
    # the LLM object itself cannot be returned across the Modal boundary
    from vllm import LLM
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    output = llm.generate([prompt])[0]
    return output.outputs[0].text
Cost optimization strategies
Use spot/preemptible instances
Save 60-90% on compute costs with spot instances. SkyPilot automatically handles spot instance interruptions.
Enable autoscaling
Scale down during off-peak hours using Kubernetes HPA or cloud-native autoscaling; scaling all the way to zero requires an event-driven autoscaler such as KEDA or a serverless platform.
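A sketch of the Kubernetes HPA route, assuming a vLLM Deployment named `vllm-server` (the name and thresholds are illustrative; CPU utilization is a crude proxy, and a custom metric such as request queue depth is usually a better scaling signal for inference):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server        # illustrative Deployment name
  minReplicas: 1             # plain HPA cannot scale below one replica
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```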
Right-size GPU selection
- 7B models: T4, L4 (entry-level data-center GPUs)
- 13B-70B models: A10G, A100 40GB
- 70B+ models: A100 80GB, H100
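These tiers can be encoded in a small helper; the thresholds mirror the list above and are rules of thumb, not hard limits (quantization, context length, and batch size all shift the boundaries):

```python
def pick_gpu_tier(params_billion: float) -> str:
    """Map model size to a GPU class, per the rule-of-thumb tiers above."""
    if params_billion <= 7:
        return "T4 / L4"
    if params_billion <= 70:
        return "A10G / A100 40GB"
    return "A100 80GB / H100"

print(pick_gpu_tier(7))
print(pick_gpu_tier(13))
print(pick_gpu_tier(180))
```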
Use model caching
Cache models in cloud storage to reduce download time and costs.
Enable prefix caching
Reduce compute costs by caching common prefixes with vLLM’s automatic prefix caching.
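In the CLI deployments above this is a server flag; whether it is on by default depends on the vLLM version, so setting it explicitly is the safe choice:

```shell
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --port 8000
```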
Next steps