The GCP Vertex AI integration provides orchestration, step execution, and experiment tracking using Google Cloud’s Vertex AI platform.
This page covers Vertex AI-specific details. For general GCP setup, see the GCP Integration page.

Installation

pip install "zenml[gcp]"
This installs:
  • google-cloud-aiplatform>=1.34.0 - Vertex AI SDK
  • kfp>=2.6.0 - Kubeflow Pipelines SDK (used by Vertex)
  • google-cloud-pipeline-components>=2.19.0 - Pre-built components
  • kubernetes - Kubernetes Python client
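To sanity-check an environment before running pipelines, a small stdlib helper (illustrative, not a ZenML API) can report which of these dependencies are missing:

```python
# Illustrative helper (not a ZenML API): report which of the integration's
# dependencies from the list above are not importable in this environment.
from importlib.metadata import version, PackageNotFoundError

REQUIRED = [
    "google-cloud-aiplatform",
    "kfp",
    "google-cloud-pipeline-components",
    "kubernetes",
]

def missing_packages(required=REQUIRED):
    """Return the subset of `required` distributions that are not installed."""
    missing = []
    for name in required:
        try:
            version(name)  # raises if the distribution is absent
        except PackageNotFoundError:
            missing.append(name)
    return missing

if missing := missing_packages():
    print("Missing:", ", ".join(missing))
```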

Components

Vertex AI Orchestrator

Execute complete pipelines as Vertex AI Pipelines

Vertex AI Step Operator

Run individual steps as Vertex AI custom jobs

Vertex Experiment Tracker

Track experiments in Vertex AI Experiments

Vertex AI Orchestrator

Runs your complete pipeline as a Vertex AI Pipeline using Kubeflow Pipelines v2.

Configuration

zenml orchestrator register vertex-orch \
    --flavor=vertex \
    --project=my-gcp-project \
    --location=us-central1 \
    --pipeline_root=gs://my-vertex-bucket/pipelines
Required:
  • project - GCP project ID
  • location - GCP region (e.g., us-central1, europe-west1)
Optional:
  • pipeline_root - GCS URI for pipeline artifacts
  • workload_service_account - Service account for execution
  • network - VPC network for private connectivity
  • encryption_spec_key_name - Cloud KMS encryption key
  • private_service_connect - Private Service Connect endpoint
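The orchestrator only runs as part of a stack; Vertex AI also needs a remote artifact store and a container registry. A sketch of wiring them together (the artifact store and registry names here are placeholders for components you have already registered):

```shell
# Register a stack using the orchestrator above plus a pre-registered
# GCS artifact store and container registry, then activate it.
zenml stack register vertex-stack \
    -o vertex-orch \
    -a gcs-store \
    -c gcp-registry \
    --set
```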

Step Settings

Customize steps with VertexOrchestratorSettings and KubernetesPodSettings:
import pandas as pd

from zenml import step, pipeline
from zenml.integrations.gcp.flavors.vertex_orchestrator_flavor import (
    VertexOrchestratorSettings,
)
from zenml.integrations.kubernetes.pod_settings import KubernetesPodSettings

@step(
    settings={
        "orchestrator": VertexOrchestratorSettings(
            pod_settings=KubernetesPodSettings(
                # GPU configuration
                node_selectors={
                    "cloud.google.com/gke-accelerator": "NVIDIA_TESLA_T4"
                },
                resources={
                    "requests": {
                        "memory": "16Gi",
                        "cpu": "4",
                    },
                    "limits": {
                        "memory": "16Gi",
                        "cpu": "4",
                        "nvidia.com/gpu": "1",
                    },
                },
                tolerations=[
                    {
                        "key": "nvidia.com/gpu",
                        "operator": "Exists",
                        "effect": "NoSchedule",
                    }
                ],
                # Volume mounts
                volumes=[
                    {
                        "name": "gcs-fuse",
                        "emptyDir": {},
                    }
                ],
                volume_mounts=[
                    {
                        "name": "gcs-fuse",
                        "mountPath": "/gcs",
                    }
                ],
            ),
            # Pipeline-level settings
            labels={
                "team": "ml-ops",
                "project": "recommendation",
                "environment": "production",
            },
            synchronous=True,  # Wait for completion
        )
    }
)
def train_on_gpu(data: pd.DataFrame) -> Model:
    # Training code
    ...
Available Settings:
  • pod_settings (KubernetesPodSettings) - Kubernetes Pod configuration
  • labels (dict) - GCP labels for the pipeline job
  • synchronous (bool) - Wait for pipeline completion
  • node_selector_constraint (tuple) - Deprecated; use pod_settings.node_selectors instead
  • custom_job_parameters (VertexCustomJobParameters) - Advanced custom job settings

Machine Types

Vertex AI uses standard GCP machine types.
Standard:
  • n1-standard-4 - 4 vCPU, 15 GB RAM
  • n1-standard-8 - 8 vCPU, 30 GB RAM
  • n1-standard-16 - 16 vCPU, 60 GB RAM
High-Memory:
  • n1-highmem-4 - 4 vCPU, 26 GB RAM
  • n1-highmem-8 - 8 vCPU, 52 GB RAM
  • n1-highmem-16 - 16 vCPU, 104 GB RAM
High-CPU:
  • n1-highcpu-8 - 8 vCPU, 7.2 GB RAM
  • n1-highcpu-16 - 16 vCPU, 14.4 GB RAM
Specify via resource requests:
KubernetesPodSettings(
    resources={
        "requests": {
            "cpu": "8",      # n1-standard-8
            "memory": "30Gi",
        }
    }
)
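Vertex provisions a machine type that covers the requested resources. As a rough illustration (not ZenML code), the smallest fitting type from the list above can be computed like this:

```python
# Illustrative helper: pick the smallest n1 machine type that satisfies a
# CPU/memory request. The table mirrors the machine types listed above.
N1_MACHINE_TYPES = [
    # (name, vCPU, memory_gb)
    ("n1-highcpu-8", 8, 7.2),
    ("n1-highcpu-16", 16, 14.4),
    ("n1-standard-4", 4, 15),
    ("n1-standard-8", 8, 30),
    ("n1-standard-16", 16, 60),
    ("n1-highmem-4", 4, 26),
    ("n1-highmem-8", 8, 52),
    ("n1-highmem-16", 16, 104),
]

def pick_machine_type(cpu: int, memory_gb: float) -> str:
    """Return the cheapest fit: fewest vCPUs first, then least RAM."""
    candidates = [
        (vcpu, mem, name)
        for name, vcpu, mem in N1_MACHINE_TYPES
        if vcpu >= cpu and mem >= memory_gb
    ]
    if not candidates:
        raise ValueError("No listed n1 machine type fits the request")
    return min(candidates)[2]

print(pick_machine_type(8, 30))  # n1-standard-8
```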

GPU Accelerators

Available GPUs:
  • NVIDIA_TESLA_K80 - Legacy, low cost
  • NVIDIA_TESLA_P4 - Inference optimized
  • NVIDIA_TESLA_T4 - Good price/performance
  • NVIDIA_TESLA_V100 - High performance training
  • NVIDIA_TESLA_P100 - High performance
  • NVIDIA_TESLA_A100 - Latest, 40GB or 80GB
GPU Configuration:
KubernetesPodSettings(
    node_selectors={
        "cloud.google.com/gke-accelerator": "NVIDIA_TESLA_T4",
    },
    resources={
        "limits": {
            "nvidia.com/gpu": "2",  # Request 2 GPUs
        }
    },
    tolerations=[
        {
            "key": "nvidia.com/gpu",
            "operator": "Exists",
            "effect": "NoSchedule",
        }
    ],
)
GPU availability varies by GCP region; check the Vertex AI locations documentation before selecting an accelerator type.

Custom Job Parameters

Advanced configuration for Vertex AI custom jobs:
from zenml.integrations.gcp.vertex_custom_job_parameters import (
    VertexCustomJobParameters,
)

VertexOrchestratorSettings(
    custom_job_parameters=VertexCustomJobParameters(
        worker_pool_specs=[
            {
                "machine_spec": {
                    "machine_type": "n1-standard-8",
                    "accelerator_type": "NVIDIA_TESLA_T4",
                    "accelerator_count": 1,
                },
                "replica_count": 1,
                "container_spec": {
                    "image_uri": "gcr.io/my-project/my-image:latest",
                },
            }
        ],
        scheduling={
            "timeout": "3600s",
            "restart_job_on_worker_restart": False,
        },
        service_account="[email protected]",
        network="projects/my-project/global/networks/my-vpc",
        enable_web_access=True,  # Interactive shell access to the job
    ),
)

Vertex AI Step Operator

Runs individual steps as Vertex AI custom jobs.

Configuration

zenml step-operator register vertex-step-op \
    --flavor=vertex \
    --project=my-gcp-project \
    --location=us-central1 \
    --service_account=vertex-sa@my-gcp-project.iam.gserviceaccount.com

Usage

import pandas as pd

from zenml import step, pipeline

@step(step_operator="vertex-step-op")
def train_on_vertex(data: pd.DataFrame) -> Model:
    # Runs on Vertex AI
    ...

@step
def preprocess_locally(raw_data: pd.DataFrame) -> pd.DataFrame:
    # Runs locally
    ...

@pipeline
def hybrid_pipeline():
    data = preprocess_locally(...)
    model = train_on_vertex(data)

Vertex AI Experiments

Track experiments with Vertex AI Experiments.

Configuration

zenml experiment-tracker register vertex-experiments \
    --flavor=vertex \
    --project=my-gcp-project \
    --location=us-central1

Usage

import pandas as pd

from zenml import step
from zenml.client import Client

experiment_tracker = Client().active_stack.experiment_tracker

@step(experiment_tracker="vertex-experiments")
def train_model(data: pd.DataFrame) -> Model:
    # Log parameters
    experiment_tracker.log_params({
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 10,
    })
    
    # Training loop
    for epoch in range(10):
        loss = train_epoch(model, data)
        accuracy = evaluate(model, val_data)
        
        # Log metrics
        experiment_tracker.log_metrics(
            {"loss": loss, "accuracy": accuracy},
            step=epoch,
        )
    
    return model

Viewing Experiments

View experiments in the Vertex AI Console:
  1. Go to Vertex AI > Experiments
  2. Select your experiment
  3. Compare runs and metrics
  4. Visualize training curves

Service Account Setup

Create a service account with required permissions:
# Create service account
gcloud iam service-accounts create vertex-sa \
    --display-name="Vertex AI ZenML Service Account"

# Grant Vertex AI User role
gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:[email protected]" \
    --role="roles/aiplatform.user"

# Grant Storage Admin role (for GCS artifacts)
gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:[email protected]" \
    --role="roles/storage.objectAdmin"

# Grant Artifact Registry Reader role
gcloud projects add-iam-policy-binding my-gcp-project \
    --member="serviceAccount:[email protected]" \
    --role="roles/artifactregistry.reader"

# Create and download key
gcloud iam service-accounts keys create vertex-sa-key.json \
    [email protected]
Required IAM Roles:
  • roles/aiplatform.user - Create and manage Vertex AI resources
  • roles/storage.objectAdmin - Read/write GCS artifacts
  • roles/artifactregistry.reader - Pull container images

Complete Example

from zenml import step, pipeline
from zenml.integrations.gcp.flavors.vertex_orchestrator_flavor import (
    VertexOrchestratorSettings,
)
from zenml.integrations.kubernetes.pod_settings import KubernetesPodSettings
import pandas as pd

@step
def load_data() -> pd.DataFrame:
    return pd.read_csv("gs://my-bucket/data.csv")

@step(
    settings={
        "orchestrator": VertexOrchestratorSettings(
            pod_settings=KubernetesPodSettings(
                resources={
                    "requests": {"cpu": "2", "memory": "8Gi"},
                }
            ),
            labels={"stage": "preprocessing"},
        )
    }
)
def preprocess_data(data: pd.DataFrame) -> pd.DataFrame:
    return data.dropna()

@step(
    experiment_tracker="vertex-experiments",
    settings={
        "orchestrator": VertexOrchestratorSettings(
            pod_settings=KubernetesPodSettings(
                node_selectors={
                    "cloud.google.com/gke-accelerator": "NVIDIA_TESLA_T4",
                },
                resources={
                    "requests": {"cpu": "4", "memory": "16Gi"},
                    "limits": {"nvidia.com/gpu": "1"},
                },
            ),
            labels={"stage": "training", "model": "v2"},
        )
    }
)
def train_model(data: pd.DataFrame) -> Model:
    from zenml.client import Client
    
    tracker = Client().active_stack.experiment_tracker
    tracker.log_params({"learning_rate": 0.001})
    
    # Training code
    model = train(...)
    
    tracker.log_metrics({"accuracy": 0.95})
    return model

@pipeline
def training_pipeline():
    data = load_data()
    processed = preprocess_data(data)
    model = train_model(processed)

Best Practices

When running from GKE, use Workload Identity instead of key files:
gcloud iam service-accounts add-iam-policy-binding \
    [email protected] \
    --role=roles/iam.workloadIdentityUser \
    --member="serviceAccount:my-gcp-project.svc.id.goog[default/zenml]"
Use private networking for security:
zenml orchestrator register vertex-orch \
    --network=projects/my-gcp-project/global/networks/my-vpc \
    --private_service_connect=projects/my-gcp-project/regions/us-central1/networkAttachments/my-psc
Encrypt data at rest with CMEK:
zenml orchestrator register vertex-orch \
    --encryption_spec_key_name=projects/my-gcp-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/my-key
Use labels for billing analysis:
VertexOrchestratorSettings(
    labels={
        "project": "recommendation",
        "team": "ml-ops",
        "environment": "production",
        "cost-center": "engineering",
    }
)

Monitoring

View Pipeline Runs:
  1. Go to Vertex AI Console > Pipelines
  2. Select your pipeline
  3. View execution DAG and logs
  4. Click steps to see details
Cloud Logging:
# View logs for a specific run
gcloud logging read \
    "resource.type=aiplatform.googleapis.com/PipelineJob" \
    --limit 50 \
    --format json
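To narrow the query to a single run, the job's resource labels can be added to the filter. A sketch (the `pipeline_job_id` label name is an assumption; verify it against the log entries in your project):

```python
# Illustrative: build a Cloud Logging filter string for one Vertex AI
# pipeline run. The label name is an assumption; check your actual log
# entries before relying on it.
def pipeline_job_filter(job_id: str) -> str:
    return (
        'resource.type="aiplatform.googleapis.com/PipelineJob" AND '
        f'resource.labels.pipeline_job_id="{job_id}"'
    )

print(pipeline_job_filter("training-pipeline-20240101123456"))
```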

Next Steps

GCP Integration

General GCP integration guide

Kubeflow Integration

Compare with Kubeflow Pipelines

Experiment Tracking

Learn about experiment tracking

Vertex AI Docs

Official Vertex AI documentation