Overview
vLLM can be deployed on all major cloud providers using their managed Kubernetes services, VM instances, or specialized AI platforms. This guide covers deployment options for AWS, GCP, and Azure.
AWS deployment
Amazon EKS (Elastic Kubernetes Service)
Create EKS cluster
Create a GPU-enabled EKS cluster:

eksctl create cluster \
--name vllm-cluster \
--region us-west-2 \
--nodegroup-name gpu-nodes \
--node-type g5.xlarge \
--nodes 2 \
--nodes-min 1 \
--nodes-max 4
Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
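With the device plugin installed, pods can request GPUs through the `nvidia.com/gpu` resource. A minimal Deployment sketch (names and replica counts are illustrative, not part of the official chart):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server            # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model", "meta-llama/Meta-Llama-3-8B-Instruct"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1  # the resource advertised by the device plugin
```

Expose it with a Service or LoadBalancer as you would any other Deployment.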
Amazon EC2 instances
Deploy vLLM directly on EC2 GPU instances:
Launch GPU instance
Choose an appropriate GPU instance type:

| Instance Type | GPUs | VRAM | Use Case |
|---|---|---|---|
| g5.xlarge | 1x A10G | 24GB | Small models (7B) |
| g5.12xlarge | 4x A10G | 96GB | Medium models (13B-70B) |
| p4d.24xlarge | 8x A100 | 320GB | Large models (70B+) |
| p5.48xlarge | 8x H100 | 640GB | Largest models |
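The sizing in this table follows from simple arithmetic: FP16/BF16 weights take about 2 bytes per parameter, plus headroom for the KV cache and activations. A rough back-of-the-envelope check (the 1.2x headroom factor is an illustrative assumption, not a vLLM constant; real usage depends on batch size and context length):

```python
def min_vram_gb(params_billion: float, bytes_per_param: int = 2,
                overhead: float = 1.2) -> float:
    """Rough VRAM floor for serving a model in FP16/BF16."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return weights_gb * overhead

# A 7B model is ~14 GB of weights, ~17 GB with headroom:
# it fits on a single 24 GB A10G (g5.xlarge).
print(round(min_vram_gb(7), 1))
# A 70B model needs ~168 GB and must be sharded across GPUs.
print(round(min_vram_gb(70), 1))
```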
Install dependencies
# Install CUDA drivers
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Install NVIDIA Container Toolkit
# The legacy nvidia-docker repository and apt-key are deprecated;
# use the libnvidia-container repository with a signed keyring instead
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Run vLLM with Docker
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--tensor-parallel-size $(nvidia-smi -L | wc -l)
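The container exposes an OpenAI-compatible API on port 8000. A minimal client sketch that builds the request body (the server URL and model name must match your deployment; `build_chat_request` is an illustrative helper, not part of vLLM):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("meta-llama/Meta-Llama-3-8B-Instruct", "Hello!")
# POST this as JSON to http://<instance-ip>:8000/v1/chat/completions,
# e.g. requests.post(url, json=payload) or curl --json.
print(json.dumps(payload))
```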
Amazon SageMaker
vLLM provides a SageMaker-compatible image:
import boto3
from sagemaker import get_execution_role
from sagemaker.model import Model
role = get_execution_role()
model = Model(
    image_uri="vllm/vllm-openai:latest",
    role=role,
    env={
        "MODEL_NAME": "meta-llama/Meta-Llama-3-8B-Instruct",
        "HF_TOKEN": "your-hf-token"
    }
)

predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1
)
The vLLM SageMaker image includes the sagemaker-entrypoint.sh script for automatic model loading.
GCP deployment
Google Kubernetes Engine (GKE)
Create GKE cluster with GPUs
gcloud container clusters create vllm-cluster \
--zone us-central1-a \
--machine-type n1-standard-4 \
--accelerator type=nvidia-tesla-t4,count=1 \
--num-nodes 2 \
--enable-autoscaling \
--min-nodes 1 \
--max-nodes 4
Install NVIDIA drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
Google Compute Engine
Deploy on GCE VM instances:
# Create GPU instance
gcloud compute instances create vllm-instance \
--zone=us-central1-a \
--machine-type=n1-standard-8 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--maintenance-policy=TERMINATE \
--image-family=ubuntu-2004-lts \
--image-project=ubuntu-os-cloud \
--boot-disk-size=200GB
# SSH and install
gcloud compute ssh vllm-instance --zone=us-central1-a
# Follow EC2 installation steps above
Vertex AI
Deploy using Vertex AI custom containers:
from google.cloud import aiplatform
aiplatform.init(project="your-project-id", location="us-central1")
model = aiplatform.Model.upload(
    display_name="vllm-llama3",
    serving_container_image_uri="vllm/vllm-openai:latest",
    serving_container_environment_variables={
        "MODEL_NAME": "meta-llama/Meta-Llama-3-8B-Instruct"
    }
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1
)
Azure deployment
Azure Kubernetes Service (AKS)
Create AKS cluster
az aks create \
--resource-group vllm-rg \
--name vllm-cluster \
--node-count 2 \
--node-vm-size Standard_NC6s_v3 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 4 \
--generate-ssh-keys
Get credentials
az aks get-credentials \
--resource-group vllm-rg \
--name vllm-cluster
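As on EKS, AKS GPU node pools need the NVIDIA device plugin before pods can request `nvidia.com/gpu` (the node images ship GPU drivers, but the plugin DaemonSet must be installed separately unless you use the NVIDIA GPU Operator):

```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
```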
Azure Virtual Machines
Deploy on Azure GPU VMs:
az vm create \
--resource-group vllm-rg \
--name vllm-vm \
--image Ubuntu2204 \
--size Standard_NC6s_v3 \
--admin-username azureuser \
--generate-ssh-keys
az vm run-command invoke \
--resource-group vllm-rg \
--name vllm-vm \
--command-id RunShellScript \
--scripts @install-vllm.sh
Multi-cloud deployment with SkyPilot
SkyPilot enables deployment across AWS, GCP, and Azure with a single configuration:
Install SkyPilot
pip install skypilot-nightly
sky check
Create deployment YAML
resources:
  accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}
  use_spot: True
  disk_size: 512
  disk_tier: best
  ports: 8081

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <your-token>

setup: |
  conda create -n vllm python=3.10 -y
  conda activate vllm
  pip install vllm

run: |
  conda activate vllm
  vllm serve $MODEL_NAME \
    --port 8081 \
    --trust-remote-code \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE
Launch on any cloud
# Launch on cheapest available cloud
HF_TOKEN="your-token" sky launch serving.yaml --env HF_TOKEN
# Force specific cloud
HF_TOKEN="your-token" sky launch serving.yaml --cloud aws --env HF_TOKEN
HF_TOKEN="your-token" sky launch serving.yaml --cloud gcp --env HF_TOKEN
HF_TOKEN="your-token" sky launch serving.yaml --cloud azure --env HF_TOKEN
Scale with autoscaling
service:
  replica_policy:
    min_replicas: 2
    max_replicas: 10
    target_qps_per_replica: 2
Deploy with autoscaling:

sky serve up -n vllm serving.yaml --env HF_TOKEN
SkyPilot automatically selects the cheapest cloud and GPU type based on availability and pricing.
Cloud-specific optimizations
AWS optimizations
- Use EFA (Elastic Fabric Adapter) for multi-node deployments on p4d/p5 instances
- Enable S3 model caching to reduce startup time
- Use EC2 Spot Instances for cost savings (up to 90% cheaper)
GCP optimizations
- Use Compact Placement Policies for multi-GPU communication
- Enable GCS FUSE for model caching
- Use Preemptible VMs for cost savings
Azure optimizations
- Use InfiniBand on NDv4 series for multi-node communication
- Enable Azure Blob Storage for model caching
- Use Spot VMs for cost savings
Managed AI platforms
Several cloud-native AI platforms support vLLM out of the box:
Hugging Face Inference Endpoints
Deploy models with one click:
- Go to Hugging Face Inference Endpoints
- Select your model
- Choose vLLM as the container type
- Select cloud provider (AWS, Azure, GCP)
- Choose GPU type (T4, L4, A10G, A100)
- Deploy
Hugging Face manages the infrastructure, scaling, and monitoring for you.
Anyscale
Anyscale provides managed Ray clusters with vLLM:
- Automatic scaling across AWS, GCP, Azure
- Built-in observability and monitoring
- Multi-model serving
- Production-ready deployments
Modal
Deploy vLLM on Modal with serverless infrastructure:
import modal
app = modal.App("vllm-app")
image = modal.Image.debian_slim().pip_install("vllm")
@app.function(
    gpu="A100",
    image=image,
    timeout=600
)
def generate(prompt: str) -> str:
    # Load the model inside the function and return generated text;
    # the LLM object itself cannot be returned across the Modal boundary
    from vllm import LLM
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    output = llm.generate([prompt])[0]
    return output.outputs[0].text
Cost optimization strategies
Use spot/preemptible instances
Save 60-90% on compute costs with spot instances. SkyPilot automatically handles spot instance interruptions.
Enable autoscaling
Scale down during off-peak hours using Kubernetes HPA or cloud-native autoscaling; scaling all the way to zero requires an event-driven autoscaler such as KEDA or a serverless platform.
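A sketch of the Kubernetes HPA route, assuming a vLLM Deployment named `vllm-server` (the name and thresholds are illustrative; CPU utilization is a crude proxy, and a custom metric such as request queue depth is usually a better scaling signal for inference):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server        # illustrative Deployment name
  minReplicas: 1             # plain HPA cannot scale below one replica
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```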
Right-size GPU selection
- 7B models: T4, L4 (entry-level data-center GPUs)
- 13B-70B models: A10G, A100 40GB
- 70B+ models: A100 80GB, H100
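These tiers can be encoded in a small helper; the thresholds mirror the list above and are rules of thumb, not hard limits (quantization, context length, and batch size all shift the boundaries):

```python
def pick_gpu_tier(params_billion: float) -> str:
    """Map model size to a GPU class, per the rule-of-thumb tiers above."""
    if params_billion <= 7:
        return "T4 / L4"
    if params_billion <= 70:
        return "A10G / A100 40GB"
    return "A100 80GB / H100"

print(pick_gpu_tier(7))
print(pick_gpu_tier(13))
print(pick_gpu_tier(180))
```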
Use model caching
Cache models in cloud storage to reduce download time and costs.
Enable prefix caching
Reduce compute costs by caching common prefixes with vLLM’s automatic prefix caching.
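In the CLI deployments above this is a server flag; whether it is on by default depends on the vLLM version, so setting it explicitly is the safe choice:

```shell
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --port 8000
```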
Next steps