Overview

vLLM can be deployed on all major cloud providers using their managed Kubernetes services, VM instances, or specialized AI platforms. This guide covers deployment options for AWS, GCP, and Azure.

AWS deployment

Amazon EKS (Elastic Kubernetes Service)

1. Create EKS cluster

Create a GPU-enabled EKS cluster:
eksctl create cluster \
  --name vllm-cluster \
  --region us-west-2 \
  --nodegroup-name gpu-nodes \
  --node-type g5.xlarge \
  --nodes 2 \
  --nodes-min 1 \
  --nodes-max 4
2. Install NVIDIA device plugin

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
3. Deploy vLLM

Follow the Kubernetes deployment guide to deploy vLLM to your EKS cluster.

Amazon EC2 instances

Deploy vLLM directly on EC2 GPU instances:
1. Launch GPU instance

Choose an appropriate GPU instance type:
Instance Type   GPUs      VRAM    Use Case
g5.xlarge       1x A10G   24GB    Small models (7B)
g5.12xlarge     4x A10G   96GB    Medium models (13B-70B)
p4d.24xlarge    8x A100   320GB   Large models (70B+)
p5.48xlarge     8x H100   640GB   Largest models
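As a rough rule of thumb, the tiers in the table above can be encoded as a small helper. This is an illustrative sketch: the thresholds and the `pick_instance` name are ours, not AWS or vLLM APIs.

```python
# Illustrative mapping from the instance table above to model-size tiers.
# Thresholds are rough rules of thumb, not AWS guidance.
INSTANCE_BY_MAX_PARAMS_B = [
    (7, "g5.xlarge"),       # 1x A10G, 24GB  - small models
    (70, "g5.12xlarge"),    # 4x A10G, 96GB  - medium models
    (180, "p4d.24xlarge"),  # 8x A100, 320GB - large models
]

def pick_instance(params_b: float) -> str:
    """Return the smallest tier that fits a model of params_b billion parameters."""
    for max_params, instance in INSTANCE_BY_MAX_PARAMS_B:
        if params_b <= max_params:
            return instance
    return "p5.48xlarge"    # 8x H100, 640GB - largest models

print(pick_instance(7))    # g5.xlarge
print(pick_instance(70))   # g5.12xlarge
```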
2. Install dependencies

# Install CUDA drivers
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Install NVIDIA Container Toolkit
# (the old nvidia-docker apt repository and apt-key are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
3. Run vLLM with Docker

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size $(nvidia-smi -L | wc -l)
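Once the container is up, the server exposes the OpenAI-compatible API on port 8000. A minimal smoke test builds a chat-completions payload like the one below and POSTs it to http://localhost:8000/v1/chat/completions (sketch; the actual HTTP call is left as a comment since it requires the running server):

```python
import json

# Chat-completions payload for the OpenAI-compatible server started above.
# POST this to http://localhost:8000/v1/chat/completions, e.g. with curl
# or the `openai` client pointed at base_url="http://localhost:8000/v1".
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}

body = json.dumps(payload)
print(body)
```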

Amazon SageMaker

vLLM provides a SageMaker-compatible image. SageMaker pulls serving images from Amazon ECR, so push the vLLM image to an ECR repository in your account first, then deploy with the SageMaker Python SDK:
import boto3
from sagemaker import get_execution_role
from sagemaker.model import Model

role = get_execution_role()

model = Model(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/vllm-openai:latest",  # pushed from vllm/vllm-openai
    role=role,
    env={
        "MODEL_NAME": "meta-llama/Meta-Llama-3-8B-Instruct",
        "HF_TOKEN": "your-hf-token"
    }
)

predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1
)
The vLLM SageMaker image includes the sagemaker-entrypoint.sh script for automatic model loading.
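Once the endpoint is InService, requests go through the SageMaker runtime. The sketch below only constructs the `invoke_endpoint` arguments; the endpoint name is hypothetical, and the body schema assumes the OpenAI-style chat format the vLLM container exposes (check your image version's /invocations contract):

```python
import json

# Arguments for sagemaker-runtime invoke_endpoint once the endpoint above
# is InService. Endpoint name and body schema are illustrative assumptions.
invoke_args = {
    "EndpointName": "vllm-llama3",   # hypothetical endpoint name
    "ContentType": "application/json",
    "Body": json.dumps({
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32,
    }),
}

# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**invoke_args)
print(invoke_args["ContentType"])
```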

GCP deployment

Google Kubernetes Engine (GKE)

1. Create GKE cluster with GPUs

gcloud container clusters create vllm-cluster \
  --zone us-central1-a \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 2 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 4
2. Install NVIDIA drivers

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
3. Deploy vLLM

Use the Kubernetes deployment guide with GKE-specific configurations.
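The deployment itself is a standard Kubernetes manifest that requests GPUs. Sketched here as a Python dict you can dump to JSON and `kubectl apply` (image, model, and resource counts are illustrative and should match your node pool):

```python
import json

# Minimal vLLM Deployment requesting one GPU (illustrative values).
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "vllm"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "vllm"}},
        "template": {
            "metadata": {"labels": {"app": "vllm"}},
            "spec": {
                "containers": [{
                    "name": "vllm",
                    "image": "vllm/vllm-openai:latest",
                    "args": ["--model", "meta-llama/Meta-Llama-3-8B-Instruct"],
                    "ports": [{"containerPort": 8000}],
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }],
            },
        },
    },
}

# Write to a file and run: kubectl apply -f vllm-deployment.json
with open("vllm-deployment.json", "w") as f:
    json.dump(deployment, f, indent=2)
```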

Google Compute Engine

Deploy on GCE VM instances:
# Create GPU instance
gcloud compute instances create vllm-instance \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=200GB

# SSH and install
gcloud compute ssh vllm-instance --zone=us-central1-a

# Follow EC2 installation steps above

Vertex AI

Deploy using Vertex AI custom containers:
from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

model = aiplatform.Model.upload(
    display_name="vllm-llama3",
    serving_container_image_uri="vllm/vllm-openai:latest",
    serving_container_environment_variables={
        "MODEL_NAME": "meta-llama/Meta-Llama-3-8B-Instruct"
    }
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1
)
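After `model.deploy` returns, predictions go through `endpoint.predict`. The instance schema below is an assumption for illustration; it must match whatever request format the serving container expects at its prediction route:

```python
import json

# Example instances payload for endpoint.predict (schema is an assumption;
# verify against the serving container's prediction route).
instances = [{
    "prompt": "Write a haiku about clouds.",
    "max_tokens": 64,
}]

# response = endpoint.predict(instances=instances)
print(json.dumps(instances))
```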

Azure deployment

Azure Kubernetes Service (AKS)

1. Create AKS cluster

az aks create \
  --resource-group vllm-rg \
  --name vllm-cluster \
  --node-count 2 \
  --node-vm-size Standard_NC6s_v3 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 4 \
  --generate-ssh-keys
2. Get credentials

az aks get-credentials \
  --resource-group vllm-rg \
  --name vllm-cluster
3. Deploy vLLM

Follow the Kubernetes deployment guide to deploy vLLM to your AKS cluster.

Azure Virtual Machines

Deploy on Azure GPU VMs. The install-vllm.sh script referenced below should contain the same driver, Docker, and container-toolkit setup steps as the EC2 section above:
az vm create \
  --resource-group vllm-rg \
  --name vllm-vm \
  --image Ubuntu2204 \
  --size Standard_NC6s_v3 \
  --admin-username azureuser \
  --generate-ssh-keys

az vm run-command invoke \
  --resource-group vllm-rg \
  --name vllm-vm \
  --command-id RunShellScript \
  --scripts @install-vllm.sh

Multi-cloud deployment with SkyPilot

SkyPilot enables deployment across AWS, GCP, and Azure with a single configuration:
1. Install SkyPilot

pip install skypilot-nightly
sky check
2. Create deployment YAML

resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  use_spot: True
  disk_size: 512
  disk_tier: best
  ports: 8081

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-token>

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm
  pip install vllm

run: |
  conda activate vllm
  vllm serve $MODEL_NAME \
    --port 8081 \
    --trust-remote-code \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE
3. Launch on any cloud

# Launch on cheapest available cloud
HF_TOKEN="your-token" sky launch serving.yaml --env HF_TOKEN

# Force specific cloud
HF_TOKEN="your-token" sky launch serving.yaml --cloud aws --env HF_TOKEN
HF_TOKEN="your-token" sky launch serving.yaml --cloud gcp --env HF_TOKEN
HF_TOKEN="your-token" sky launch serving.yaml --cloud azure --env HF_TOKEN
4. Scale with autoscaling

Add a service section to serving.yaml:

service:
  replica_policy:
    min_replicas: 2
    max_replicas: 10
    target_qps_per_replica: 2
Deploy with autoscaling:
sky serve up -n vllm serving.yaml --env HF_TOKEN
SkyPilot automatically selects the cheapest cloud and GPU type based on availability and pricing.
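The replica_policy above scales on queries per second. The arithmetic behind it is roughly the following (a sketch of the policy's intent, not SkyPilot's exact algorithm):

```python
import math

def target_replicas(qps: float, target_qps_per_replica: float = 2,
                    min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Replicas needed so each handles at most target_qps_per_replica,
    clamped to the configured min/max."""
    needed = math.ceil(qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))

print(target_replicas(1))    # 2  (clamped to min_replicas)
print(target_replicas(7))    # 4
print(target_replicas(50))   # 10 (clamped to max_replicas)
```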

Cloud-specific optimizations

AWS optimizations

  • Use EFA (Elastic Fabric Adapter) for multi-node deployments on p4d/p5 instances
  • Enable S3 model caching to reduce startup time
  • Use EC2 Spot Instances for cost savings (up to 90% cheaper)

GCP optimizations

  • Use Compact Placement Policies for multi-GPU communication
  • Enable GCS FUSE for model caching
  • Use Preemptible VMs for cost savings

Azure optimizations

  • Use InfiniBand on NDv4 series for multi-node communication
  • Enable Azure Blob Storage for model caching
  • Use Spot VMs for cost savings

Managed platforms

Cloud-native AI platforms with vLLM support:

Hugging Face Inference Endpoints

Deploy models with one click:
  1. Go to Hugging Face Inference Endpoints
  2. Select your model
  3. Choose vLLM as the container type
  4. Select cloud provider (AWS, Azure, GCP)
  5. Choose GPU type (T4, L4, A10G, A100)
  6. Deploy
Hugging Face manages the infrastructure, scaling, and monitoring for you.

Anyscale

Anyscale provides managed Ray clusters with vLLM:
  • Automatic scaling across AWS, GCP, Azure
  • Built-in observability and monitoring
  • Multi-model serving
  • Production-ready deployments

Modal

Deploy vLLM on Modal with serverless infrastructure:
import modal

app = modal.App("vllm-app")
image = modal.Image.debian_slim().pip_install("vllm")

@app.function(
    gpu="A100",
    image=image,
    timeout=600
)
def generate(prompt: str) -> str:
    # Load the model inside the remote GPU container and return generated
    # text, rather than the (non-serializable) LLM object itself.
    from vllm import LLM
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    outputs = llm.generate([prompt])
    return outputs[0].outputs[0].text

Cost optimization strategies

1. Use spot/preemptible instances

Save 60-90% on compute costs with spot instances. SkyPilot automatically handles spot instance interruptions.
2. Enable autoscaling

Scale down to zero during off-peak hours. Use Kubernetes HPA or cloud-native autoscaling.
3. Right-size GPU selection

  • 7B models: T4, L4 (entry-level data-center GPUs)
  • 13B-70B models: A10G, A100 40GB
  • 70B+ models: A100 80GB, H100
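A quick way to sanity-check these tiers: fp16/bf16 weights take roughly 2 bytes per parameter, before adding working memory for the KV cache and activations (a rule-of-thumb sketch, not vLLM's actual allocator):

```python
def fp16_weight_gb(params_billion: float) -> float:
    """Approximate GPU memory for fp16/bf16 weights alone: 2 bytes/param."""
    return params_billion * 1e9 * 2 / 1e9  # simplifies to 2 * params_billion

print(fp16_weight_gb(7))   # 14.0 -> tight on a 16GB T4; comfortable on 24GB L4/A10G
print(fp16_weight_gb(70))  # 140.0 -> needs multiple 80GB A100s or H100s
```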
4. Use model caching

Cache models in cloud storage to reduce download time and costs.
5. Enable prefix caching

Reduce compute costs by caching common prefixes with vLLM’s automatic prefix caching.
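For example, with a 500-token system prompt shared across many requests, the prefill work skipped by prefix caching is easy to estimate (illustrative arithmetic only; actual savings depend on cache hits and eviction):

```python
def prefill_tokens_saved(prefix_len: int, n_requests: int) -> int:
    """Prefill tokens skipped when n_requests share a cached prefix.
    The first request computes the prefix; later ones reuse it."""
    return prefix_len * max(0, n_requests - 1)

print(prefill_tokens_saved(500, 1000))  # 499500 prefill tokens avoided
```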
