The Kubeflow integration enables running ZenML pipelines using Kubeflow Pipelines (KFP) v2, providing advanced orchestration features on Kubernetes clusters.

Installation

pip install "zenml[kubeflow]"
This installs:
  • kfp>=2.6.0 - Kubeflow Pipelines SDK v2
  • kfp-kubernetes>=1.1.0 - Kubernetes-specific KFP extensions
Alternatively, zenml integration install kubeflow installs the same requirements at versions tested against your ZenML release.

Available Components

The Kubeflow integration provides:

Kubeflow Orchestrator

Execute complete pipelines using Kubeflow Pipelines on Kubernetes

Kubeflow Orchestrator

The Kubeflow orchestrator compiles ZenML pipelines into KFP format and executes them on a Kubeflow Pipelines deployment.

Prerequisites

Before using the Kubeflow orchestrator:
  1. Kubernetes cluster with Kubeflow Pipelines installed
  2. kubectl access configured
  3. Container registry accessible from the cluster
  4. Artifact store accessible from the cluster (S3, GCS, etc.)

Installing Kubeflow Pipelines

Standalone KFP (Recommended):
# Install KFP standalone (without full Kubeflow)
export PIPELINE_VERSION=2.0.5
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=$PIPELINE_VERSION"
Full Kubeflow: Follow the official Kubeflow installation guide.

Configuration

zenml orchestrator register kubeflow-orch \
    --flavor=kubeflow \
    --kubernetes_context=my-k8s-context \
    --kubernetes_namespace=kubeflow \
    --kubeflow_hostname=https://kubeflow.example.com
Required Parameters:
  • None (uses defaults if running from within the cluster)
Optional Parameters:
  • kubernetes_context - kubectl context name (defaults to current context)
  • kubernetes_namespace - Namespace for KFP (default: kubeflow)
  • kubeflow_hostname - KFP API endpoint URL
  • synchronous - Wait for pipeline completion (default: True)
  • skip_local_validations - Skip kubectl checks (default: False)
  • skip_ui_daemon_provisioning - Don’t start local UI proxy (default: False)

Access Patterns

Local Access (Port Forwarding):
# Port forward to KFP API
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80

# Configure orchestrator
zenml orchestrator register kubeflow-local \
    --flavor=kubeflow \
    --kubeflow_hostname=http://localhost:8080
Direct Access (LoadBalancer/Ingress):
zenml orchestrator register kubeflow-prod \
    --flavor=kubeflow \
    --kubeflow_hostname=https://kubeflow.mycompany.com
In-Cluster Access: When running ZenML from within the same Kubernetes cluster:
zenml orchestrator register kubeflow-in-cluster \
    --flavor=kubeflow \
    --kubernetes_namespace=kubeflow
# No hostname needed - uses cluster service DNS
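For intuition on what "cluster service DNS" means here, this sketch derives a typical in-cluster endpoint. The service name (ml-pipeline) and port (8888) match the standalone KFP manifests, but verify them against your own deployment:

```python
# Illustrative only: how an in-cluster endpoint for the KFP API server is
# typically derived. Service name and port are assumptions from the
# standalone KFP manifests; check them in your cluster.
def in_cluster_kfp_endpoint(namespace: str = "kubeflow",
                            service: str = "ml-pipeline",
                            port: int = 8888) -> str:
    """Build the cluster-internal DNS name for the KFP API server."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"

endpoint = in_cluster_kfp_endpoint()
# http://ml-pipeline.kubeflow.svc.cluster.local:8888
```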

Step-Level Pod Configuration

Customize Kubernetes Pods for individual steps using KubernetesPodSettings:
from typing import Any

import pandas as pd
from zenml import step, pipeline
from zenml.integrations.kubernetes.pod_settings import KubernetesPodSettings

@step(
    settings={
        "orchestrator.kubernetes": KubernetesPodSettings(
            node_selectors={"accelerator": "nvidia-tesla-v100"},
            resources={
                "requests": {"memory": "16Gi", "cpu": "4"},
                "limits": {"memory": "16Gi", "cpu": "4", "nvidia.com/gpu": "1"},
            },
            tolerations=[
                {
                    "key": "nvidia.com/gpu",
                    "operator": "Exists",
                    "effect": "NoSchedule",
                }
            ],
            labels={"team": "ml-ops", "project": "recommendation"},
            annotations={"sidecar.istio.io/inject": "false"},
        )
    }
)
def train_model(data: pd.DataFrame) -> Any:  # replace Any with your model type
    # Training with GPU
    ...

@step(
    settings={
        "orchestrator.kubernetes": KubernetesPodSettings(
            resources={
                "requests": {"memory": "4Gi", "cpu": "2"},
            },
        )
    }
)
def preprocess_data() -> pd.DataFrame:
    # Lightweight preprocessing
    ...

@pipeline
def training_pipeline():
    data = preprocess_data()
    train_model(data)
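Conceptually, these settings end up as fields on each step's pod manifest. A simplified, hypothetical sketch of that mapping (not ZenML's actual compilation logic):

```python
# Hypothetical sketch: how KubernetesPodSettings fields land on a pod
# manifest. ZenML and KFP perform the real merging during compilation.
def apply_pod_settings(pod: dict, *, node_selectors=None, tolerations=None,
                       labels=None, annotations=None) -> dict:
    pod = {"metadata": dict(pod.get("metadata", {})),
           "spec": dict(pod.get("spec", {}))}
    if node_selectors:
        pod["spec"]["nodeSelector"] = dict(node_selectors)  # pin to matching nodes
    if tolerations:
        pod["spec"]["tolerations"] = list(tolerations)      # allow tainted (e.g. GPU) nodes
    if labels:
        pod["metadata"]["labels"] = dict(labels)
    if annotations:
        pod["metadata"]["annotations"] = dict(annotations)
    return pod

manifest = apply_pod_settings(
    {"spec": {"containers": []}},
    node_selectors={"accelerator": "nvidia-tesla-v100"},
    labels={"team": "ml-ops"},
)
```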

Pipeline Caching

Kubeflow Pipelines supports execution caching:
from zenml import pipeline

@pipeline(enable_cache=True)  # Default behavior
def my_pipeline():
    # Steps with identical inputs will be cached
    ...

@pipeline(enable_cache=False)
def no_cache_pipeline():
    # Always re-execute all steps
    ...
Caching behavior:
  • Caches at the step level based on inputs and code
  • Cached results are reused across pipeline runs
  • Cache is stored in the KFP backend
  • Disable caching for non-deterministic steps
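The caching model is essentially memoization keyed on a step's code and inputs. A self-contained sketch of the idea (not KFP's actual implementation, which stores fingerprints in its backend):

```python
import hashlib
import json

# Conceptual sketch of step-level caching: a fingerprint of the step's code
# and inputs keys a store of previous results; identical invocations reuse
# the stored output instead of re-executing.
_cache: dict = {}
calls: list = []

def run_step(func, *args):
    fingerprint = hashlib.sha256(json.dumps(
        {"code": func.__code__.co_code.hex(), "args": args},
        sort_keys=True, default=str,
    ).encode()).hexdigest()
    if fingerprint not in _cache:   # cache miss: actually execute
        _cache[fingerprint] = func(*args)
    return _cache[fingerprint]

def preprocess(n):
    calls.append(n)   # side effect so real executions are observable
    return n * 2

run_step(preprocess, 3)  # executes
run_step(preprocess, 3)  # identical inputs and code: cache hit, skipped
```

This is also why non-deterministic steps need caching disabled: their fingerprint stays the same even though a fresh run would produce different output.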

Resource Management

CPU and Memory:
KubernetesPodSettings(
    resources={
        "requests": {
            "cpu": "2",      # 2 cores guaranteed
            "memory": "8Gi",  # 8GB guaranteed
        },
        "limits": {
            "cpu": "4",      # Max 4 cores
            "memory": "16Gi", # Max 16GB (OOMKilled if exceeded)
        },
    }
)
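Requests are what the scheduler guarantees; limits are the hard ceiling. To keep the units straight, here is an illustrative parser for the binary memory quantities used above (Kubernetes also accepts decimal suffixes and plain byte counts, which this sketch omits):

```python
# Illustrative parser for Kubernetes binary memory quantities ("Ki", "Mi",
# "Gi", "Ti"); not part of any Kubernetes or ZenML API.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def memory_to_bytes(quantity: str) -> int:
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count

# A 16Gi limit means the container is OOMKilled once usage passes this:
limit_bytes = memory_to_bytes("16Gi")  # 17179869184
```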
GPUs:
KubernetesPodSettings(
    resources={
        "limits": {
            "nvidia.com/gpu": "2",  # Request 2 GPUs
        }
    },
    node_selectors={
        "cloud.google.com/gke-accelerator": "nvidia-tesla-v100",
    },
    tolerations=[
        {
            "key": "nvidia.com/gpu",
            "operator": "Exists",
            "effect": "NoSchedule",
        }
    ],
)

Volume Mounts

Persistent Volumes:
KubernetesPodSettings(
    volumes=[
        {
            "name": "data-volume",
            "persistentVolumeClaim": {
                "claimName": "ml-training-data",
            },
        }
    ],
    volume_mounts=[
        {
            "name": "data-volume",
            "mountPath": "/mnt/data",
        }
    ],
)
Secrets:
KubernetesPodSettings(
    volumes=[
        {
            "name": "credentials",
            "secret": {"secretName": "aws-credentials"},
        }
    ],
    volume_mounts=[
        {
            "name": "credentials",
            "mountPath": "/etc/secrets",
            "readOnly": True,
        }
    ],
)
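Inside the step container, each key of the mounted secret shows up as a read-only file named after the key under the mount path. A small sketch of reading one at runtime (the access_key key name is hypothetical):

```python
from pathlib import Path

# Each key in a mounted Kubernetes secret becomes a file under the mount
# path; the key name used below is hypothetical.
def read_secret(key: str, mount_path: str = "/etc/secrets") -> str:
    return Path(mount_path, key).read_text().strip()

# Inside a step: aws_access_key = read_secret("access_key")
```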

Complete Stack Example

# Register components
zenml container-registry register gcr-registry \
    --flavor=gcp \
    --uri=gcr.io/my-project

zenml artifact-store register gcs-store \
    --flavor=gcp \
    --path=gs://my-zenml-artifacts

zenml orchestrator register kubeflow-prod \
    --flavor=kubeflow \
    --kubernetes_context=prod-cluster \
    --kubernetes_namespace=kubeflow \
    --kubeflow_hostname=https://kubeflow.mycompany.com

# Create stack
zenml stack register kubeflow-stack \
    -o kubeflow-prod \
    -a gcs-store \
    -c gcr-registry

# Activate
zenml stack set kubeflow-stack

Authentication

Service Account Setup

Create a Kubernetes service account for pipelines:
# zenml-kfp-sa.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: zenml-pipeline-sa
  namespace: kubeflow
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: zenml-pipeline-role
  namespace: kubeflow
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "secrets", "configmaps"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: ["argoproj.io"]
    resources: ["workflows"]
    verbs: ["get", "list", "watch", "create", "delete", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: zenml-pipeline-binding
  namespace: kubeflow
subjects:
  - kind: ServiceAccount
    name: zenml-pipeline-sa
    namespace: kubeflow
roleRef:
  kind: Role
  name: zenml-pipeline-role
  apiGroup: rbac.authorization.k8s.io
Apply the manifest, then reference the service account from your pod settings:
kubectl apply -f zenml-kfp-sa.yaml

KubernetesPodSettings(service_account_name="zenml-pipeline-sa")

UI Access

Access the Kubeflow Pipelines UI to monitor runs: Port Forwarding:
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
# Visit http://localhost:8080
ZenML Dashboard Integration: Pipeline runs automatically link to the KFP UI in the ZenML dashboard.

Best Practices

Reduce pull times with slim images:
# Use slim base images
FROM python:3.10-slim

# Install only required dependencies
RUN pip install --no-cache-dir "zenml[kubeflow]"
Always set resource limits to prevent resource exhaustion:
KubernetesPodSettings(
    resources={
        "requests": {"cpu": "1", "memory": "2Gi"},
        "limits": {"cpu": "2", "memory": "4Gi"},
    }
)
Ensure GPU jobs land on GPU nodes:
KubernetesPodSettings(
    affinity={
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "accelerator",
                                "operator": "In",
                                "values": ["gpu"],
                            }
                        ]
                    }
                ]
            }
        }
    }
)
Disable caching for non-deterministic steps:
@step(enable_cache=False)
def download_latest_data() -> pd.DataFrame:
    # Always fetch fresh data
    ...

Troubleshooting

If pipeline compilation errors occur:
  1. Check KFP version compatibility (kfp>=2.6.0)
  2. Verify all steps have proper type hints
  3. Ensure materializers exist for custom types
  4. Check ZenML version matches integration version
If pods don’t start:
  1. Check node resources: kubectl describe nodes
  2. View pod events: kubectl describe pod -n kubeflow
  3. Verify image pull secrets are configured
  4. Check resource requests vs. available capacity
If orchestrator can’t reach KFP:
  1. Verify port forwarding is active
  2. Check kubeflow_hostname URL is correct
  3. Ensure firewall rules allow access
  4. Test with curl $KUBEFLOW_HOSTNAME
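Several of these connectivity failures come down to a malformed kubeflow_hostname value. An illustrative pre-flight check (the no-trailing-slash rule is a convention of this sketch, not a documented KFP requirement):

```python
from urllib.parse import urlparse

# Illustrative sanity check for a kubeflow_hostname value: it should be a
# bare http(s) base URL with a host.
def hostname_problems(url: str) -> list:
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append("missing http:// or https:// scheme")
    if not parsed.netloc:
        problems.append("missing hostname")
    if url.endswith("/"):
        problems.append("drop the trailing slash")
    return problems

issues = hostname_problems("https://kubeflow.example.com")  # []
```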
If steps can’t load artifacts:
  1. Ensure artifact store is accessible from cluster
  2. Check service account has storage permissions
  3. Verify network policies allow egress
  4. For cloud storage, check credentials are mounted

Differences from Kubernetes Orchestrator

Kubeflow vs. native Kubernetes orchestrator:
| Feature | Kubeflow | Kubernetes |
| --- | --- | --- |
| UI | KFP dashboard | None (kubectl only) |
| Pipeline DAG visualization | Yes | No |
| Caching | Built-in | Manual |
| Execution engine | Argo Workflows | Direct Jobs |
| Scheduling | Advanced | Basic |
| Monitoring | Extensive | Basic |
| Setup complexity | Higher | Lower |

Next Steps

Kubernetes Integration

Compare with native Kubernetes orchestrator

Container Registries

Configure image registries

Remote Execution

Production deployment patterns

Kubeflow Docs

Official Kubeflow Pipelines documentation
