Introduction

The Kubernetes integration allows Metaflow to execute individual steps of your flows as Kubernetes Jobs on any Kubernetes cluster. This provides a flexible, cloud-agnostic way to scale your workflows using containerized workloads.

Key Features

Container-Native

Run your workflows in Docker containers with full control over the runtime environment

Resource Management

Request specific CPU, memory, GPU, and disk resources for each step

Cloud Agnostic

Deploy on any Kubernetes cluster: AWS EKS, Azure AKS, GCP GKE, or on-premises

Multi-Node Support

Execute distributed workloads with gang-scheduled multi-node jobs using JobSets

How It Works

When you decorate a step with @kubernetes, Metaflow:
  1. Packages your code and uploads it to your configured datastore (S3, Azure Blob, or GCS)
  2. Creates a Kubernetes Job specification with your resource requirements
  3. Submits the job to your Kubernetes cluster
  4. Monitors the job execution and streams logs back to you
  5. Retrieves results from the datastore once the job completes
For example:
from metaflow import FlowSpec, step, kubernetes

class MyFlow(FlowSpec):
    
    @step
    def start(self):
        print("Running locally")
        self.next(self.train)
    
    @kubernetes(cpu=4, memory=16000, gpu=1)
    @step
    def train(self):
        # This step runs as a Kubernetes Job
        print("Training model on Kubernetes with GPU")
        self.model = train_model()  # train_model: your own training function
        self.next(self.end)
    
    @step
    def end(self):
        print("Back to local execution")

if __name__ == '__main__':
    MyFlow()

Architecture

[Diagram: Metaflow on Kubernetes architecture]

Components

  • Metaflow Client: Submits jobs and monitors execution from your local machine or CI/CD system.
  • Kubernetes Control Plane: Schedules and manages pod lifecycle based on job specifications.
  • Worker Pods: Execute your Metaflow tasks in isolated containers with the specified resources.
  • Datastore: Central storage (S3/Azure/GCS) for code packages, artifacts, and metadata.
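As a sketch of what the client submits (step 2 of "How It Works"), the job specification is an ordinary Kubernetes Job manifest. The field values below (name, image, labels) are illustrative placeholders, not the exact fields Metaflow emits:

```python
# Illustrative sketch of a Kubernetes Job manifest resembling what the
# Metaflow client submits. Names and values here are hypothetical examples.
def build_job_manifest(name, image, cpu, memory_mb, command):
    """Return a minimal Kubernetes Job spec as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # retries are handled by Metaflow, not Kubernetes
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",
                        "image": image,
                        "command": command,
                        "resources": {
                            "requests": {"cpu": str(cpu), "memory": f"{memory_mb}M"},
                        },
                    }],
                }
            },
        },
    }

manifest = build_job_manifest(
    "myflow-train", "myregistry/metaflow:latest", cpu=4, memory_mb=16000,
    command=["python", "myflow.py", "step", "train"],
)
```

The actual manifest Metaflow generates carries additional metadata (labels, annotations, environment variables pointing at the datastore), but the structure is the same.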

Execution Modes

Single-Node Execution

The standard execution mode where each step runs in a single Kubernetes pod:
@kubernetes(cpu=8, memory=32000)
@step
def process(self):
    # Runs in a single pod with 8 CPUs and 32GB RAM
    result = process_data(self.data)
    self.next(self.end)

Multi-Node Execution with @parallel

For distributed workloads, combine @kubernetes with @parallel to create gang-scheduled multi-node jobs using Kubernetes JobSets:
from metaflow import kubernetes, parallel

# The number of nodes is set in the preceding step's transition, e.g.
# self.next(self.distributed_training, num_parallel=4)
@kubernetes(cpu=4, memory=16000)
@parallel
@step
def distributed_training(self):
    from metaflow import current
    
    # Access node information
    print(f"Node {current.parallel.node_index} of {current.parallel.num_nodes}")
    print(f"Main IP: {current.parallel.main_ip}")
    
    # Your distributed training code here
    if current.parallel.node_index == 0:
        # Control node logic
        coordinate_training()
    else:
        # Worker node logic
        run_worker()
    
    self.next(self.end)
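The node information exposed by current.parallel maps naturally onto the environment variables that PyTorch-style distributed launchers expect. A hypothetical helper (distributed_env and the port number are illustrative, not part of Metaflow's API):

```python
# Hypothetical helper: translate Metaflow's current.parallel attributes into
# the environment variables that PyTorch-style launchers conventionally read.
def distributed_env(node_index, num_nodes, main_ip, port=29500):
    return {
        "MASTER_ADDR": main_ip,        # the control node (node_index == 0)
        "MASTER_PORT": str(port),      # illustrative rendezvous port
        "RANK": str(node_index),       # this node's rank within the gang
        "WORLD_SIZE": str(num_nodes),  # total gang-scheduled nodes
    }

# Inside the step you might call:
#   env = distributed_env(current.parallel.node_index,
#                         current.parallel.num_nodes,
#                         current.parallel.main_ip)
env = distributed_env(node_index=2, num_nodes=4, main_ip="10.0.0.5")
```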

Supported Kubernetes Distributions

Metaflow works with any standard Kubernetes cluster:
  • AWS: Amazon Elastic Kubernetes Service (EKS)
  • Azure: Azure Kubernetes Service (AKS)
  • Google Cloud: Google Kubernetes Engine (GKE)
  • On-Premises: Self-managed Kubernetes clusters
  • Other: DigitalOcean Kubernetes, Linode Kubernetes Engine, etc.

Prerequisites

  1. Kubernetes Cluster: Access to a Kubernetes cluster (1.21+) with appropriate RBAC permissions
  2. Cloud Storage: A configured datastore: S3, Azure Blob Storage, or Google Cloud Storage
  3. Docker Registry: A container registry for your Docker images (optional if using public images)
  4. Python Library: The Kubernetes Python client, installed with:
pip install kubernetes

Configuration

Configure Metaflow to use your Kubernetes cluster by setting environment variables:
# Basic configuration
export METAFLOW_KUBERNETES_NAMESPACE=metaflow
export METAFLOW_KUBERNETES_SERVICE_ACCOUNT=metaflow-service-account

# Container configuration
export METAFLOW_KUBERNETES_CONTAINER_IMAGE=myregistry/metaflow:latest
export METAFLOW_KUBERNETES_CONTAINER_REGISTRY=myregistry.io

# Datastore configuration (choose one)
export METAFLOW_DATASTORE_SYSROOT_S3=s3://my-metaflow-bucket
# or
export METAFLOW_DATASTORE_SYSROOT_AZURE=wasbs://container@account.blob.core.windows.net/
# or
export METAFLOW_DATASTORE_SYSROOT_GS=gs://my-metaflow-bucket
For detailed configuration options, see the Configuration Guide.
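Equivalently, the same settings can live in Metaflow's JSON configuration file (by default ~/.metaflowconfig/config.json), where keys match the environment variable names. A sketch assuming the S3 datastore:

```json
{
  "METAFLOW_KUBERNETES_NAMESPACE": "metaflow",
  "METAFLOW_KUBERNETES_SERVICE_ACCOUNT": "metaflow-service-account",
  "METAFLOW_KUBERNETES_CONTAINER_IMAGE": "myregistry/metaflow:latest",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://my-metaflow-bucket"
}
```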

Resource Specification

Specify compute resources for each step:
@kubernetes(
    cpu=4,              # Number of CPU cores
    memory=16000,       # Memory in MB
    disk=50000,         # Ephemeral disk in MB
    gpu=1,              # Number of GPUs
    gpu_vendor='nvidia' # GPU vendor: 'nvidia' or 'amd'
)
@step
def compute_intensive_step(self):
    # Your code here
    pass
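These integer arguments (CPU cores, MB of memory and disk) translate into standard Kubernetes resource quantities on the pod spec. A sketch of that mapping (illustrative, not Metaflow's internal code; GPU resource names assume the usual vendor device plugins):

```python
# Illustrative: convert @kubernetes resource arguments (CPU cores, MB values)
# into Kubernetes resource-quantity strings. Not Metaflow's actual internals.
def to_k8s_resources(cpu, memory, disk=None, gpu=None, gpu_vendor="nvidia"):
    requests = {"cpu": str(cpu), "memory": f"{memory}M"}
    if disk is not None:
        requests["ephemeral-storage"] = f"{disk}M"
    if gpu:
        # GPUs are exposed by vendor device plugins, e.g. nvidia.com/gpu
        requests[f"{gpu_vendor}.com/gpu"] = str(gpu)
    return requests

res = to_k8s_resources(cpu=4, memory=16000, disk=50000, gpu=1)
```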

Advanced Features

Custom Node Selection

Target specific nodes using node selectors:
@kubernetes(
    node_selector={
        'node.kubernetes.io/instance-type': 'g4dn.xlarge',
        'topology.kubernetes.io/zone': 'us-east-1a'
    }
)
@step
def targeted_step(self):
    pass

Tolerations

Schedule on tainted nodes:
@kubernetes(
    tolerations=[
        {
            'key': 'dedicated',
            'operator': 'Equal',
            'value': 'ml-workloads',
            'effect': 'NoSchedule'
        }
    ]
)
@step
def tolerated_step(self):
    pass

Persistent Volumes

Mount persistent volumes for shared storage:
@kubernetes(
    persistent_volume_claims={
        'data-pvc': '/mnt/data',
        'models-pvc': '/mnt/models'
    }
)
@step
def storage_step(self):
    # Access mounted volumes
    with open('/mnt/data/input.txt') as f:
        data = f.read()

Secrets Management

Access Kubernetes secrets:
@kubernetes(
    secrets=['my-db-credentials', 'api-keys']
)
@step
def secure_step(self):
    # Secrets are surfaced as environment variables
    import os
    api_key = os.environ.get('API_KEY')
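Because secret keys surface as environment variables, it helps to fail fast when one is missing rather than propagate a None deep into the step. A small hypothetical helper (require_env is not part of Metaflow):

```python
import os

def require_env(name):
    """Return an environment variable's value or fail loudly.

    Useful inside steps that depend on keys injected from Kubernetes
    secrets, so a misconfigured secret fails immediately and clearly.
    """
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Expected environment variable {name!r} "
                           "(is the Kubernetes secret attached?)")
    return value

os.environ["API_KEY"] = "demo-value"  # simulate an injected secret key
api_key = require_env("API_KEY")
```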

Monitoring and Debugging

View Running Jobs

# List all Metaflow jobs in your namespace
kubectl get jobs -n metaflow -l app.kubernetes.io/part-of=metaflow

# View job details
kubectl describe job <job-name> -n metaflow

Access Logs

Metaflow automatically streams logs during execution. You can also access them directly:
# View pod logs
kubectl logs -n metaflow <pod-name>

# Follow logs in real-time
kubectl logs -n metaflow <pod-name> -f

Debug Failed Jobs

# Get pod status
kubectl get pods -n metaflow -l metaflow/flow_name=MyFlow

# Inspect failed pod
kubectl describe pod <pod-name> -n metaflow

# View events
kubectl get events -n metaflow --sort-by='.lastTimestamp'

Best Practices

Set realistic resource requests to avoid over-provisioning:
  • Start with conservative estimates
  • Monitor actual usage with kubectl top pods
  • Adjust based on observed requirements
  • Use QoS classes effectively (Guaranteed vs Burstable)
Optimize your Docker images:
  • Use multi-stage builds to reduce image size
  • Cache dependencies in image layers
  • Pin specific versions for reproducibility
  • Use image pull secrets for private registries
Use Kubernetes namespaces effectively:
  • Separate production and development workloads
  • Apply resource quotas per namespace
  • Use NetworkPolicies for isolation
  • Implement RBAC for access control
Reduce cloud costs:
  • Use spot/preemptible instances for fault-tolerant workloads
  • Set appropriate timeouts with @timeout decorator
  • Clean up completed jobs regularly
  • Use cluster autoscaling

Comparison with AWS Batch

Feature              | Kubernetes   | AWS Batch
Cloud Agnostic       | ✅ Yes       | ❌ AWS Only
Multi-Cloud          | ✅ Supported | ❌ No
Setup Complexity     | Medium       | Low
Container Management | Full Control | Managed
Cost Control         | Direct       | Through AWS
GPU Support          | ✅ Yes       | ✅ Yes
Multi-Node Jobs      | ✅ JobSets   | ✅ Array Jobs
Spot Instances       | ✅ Yes       | ✅ Yes

Next Steps

Argo Workflows

Deploy production workflows with Argo Workflows orchestrator

Configuration

Explore all configuration options and environment variables

Best Practices

Learn best practices for production deployments

Debugging

Debug and troubleshoot Kubernetes execution issues
