Introduction

The Kubernetes integration allows Metaflow to execute individual steps of your flows as Kubernetes Jobs on any Kubernetes cluster. This provides a flexible, cloud-agnostic way to scale your workflows using containerized workloads.

Key Features

Container-Native

Run your workflows in Docker containers with full control over the runtime environment

Resource Management

Request specific CPU, memory, GPU, and disk resources for each step

Cloud Agnostic

Deploy on any Kubernetes cluster: AWS EKS, Azure AKS, GCP GKE, or on-premises

Multi-Node Support

Execute distributed workloads with gang-scheduled multi-node jobs using JobSets

How It Works

When you decorate a step with @kubernetes, Metaflow:
  1. Packages your code and uploads it to your configured datastore (S3, Azure Blob, or GCS)
  2. Creates a Kubernetes Job specification with your resource requirements
  3. Submits the job to your Kubernetes cluster
  4. Monitors the job execution and streams logs back to you
  5. Retrieves results from the datastore once the job completes
For example:
from metaflow import FlowSpec, step, kubernetes

class MyFlow(FlowSpec):
    
    @step
    def start(self):
        print("Running locally")
        self.next(self.train)
    
    @kubernetes(cpu=4, memory=16000, gpu=1)
    @step
    def train(self):
        # This step runs as a Kubernetes Job
        print("Training model on Kubernetes with GPU")
        self.model = train_model()  # train_model: your own training function
        self.next(self.end)
    
    @step
    def end(self):
        print("Back to local execution")

if __name__ == '__main__':
    MyFlow()

Architecture

[Diagram: Metaflow on Kubernetes architecture]

Components

  • Metaflow Client: Submits jobs and monitors execution from your local machine or CI/CD system.
  • Kubernetes Control Plane: Schedules and manages pod lifecycle based on job specifications.
  • Worker Pods: Execute your Metaflow tasks in isolated containers with the specified resources.
  • Datastore: Central storage (S3/Azure/GCS) for code packages, artifacts, and metadata.
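As a sketch of what the client submits (step 2 of "How It Works"), the job specification is an ordinary Kubernetes Job manifest. The field values below (name, image, labels) are illustrative placeholders, not the exact fields Metaflow emits:

```python
# Illustrative sketch of a Kubernetes Job manifest resembling what the
# Metaflow client submits. Names and values here are hypothetical examples.
def build_job_manifest(name, image, cpu, memory_mb, command):
    """Return a minimal Kubernetes Job spec as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # retries are handled by Metaflow, not Kubernetes
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "main",
                        "image": image,
                        "command": command,
                        "resources": {
                            "requests": {"cpu": str(cpu), "memory": f"{memory_mb}M"},
                        },
                    }],
                }
            },
        },
    }

manifest = build_job_manifest(
    "myflow-train", "myregistry/metaflow:latest", cpu=4, memory_mb=16000,
    command=["python", "myflow.py", "step", "train"],
)
```

The actual manifest Metaflow generates carries additional metadata (labels, annotations, environment variables pointing at the datastore), but the structure is the same.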

Execution Modes

Single-Node Execution

The standard execution mode where each step runs in a single Kubernetes pod:
@kubernetes(cpu=8, memory=32000)
@step
def process(self):
    # Runs in a single pod with 8 CPUs and 32GB RAM
    result = process_data(self.data)
    self.next(self.end)

Multi-Node Execution with @parallel

For distributed workloads, combine @kubernetes with @parallel to create gang-scheduled multi-node jobs using Kubernetes JobSets:
from metaflow import kubernetes, parallel

# The number of nodes is set in the preceding step's transition, e.g.
# self.next(self.distributed_training, num_parallel=4)
@kubernetes(cpu=4, memory=16000)
@parallel
@step
def distributed_training(self):
    from metaflow import current
    
    # Access node information
    print(f"Node {current.parallel.node_index} of {current.parallel.num_nodes}")
    print(f"Main IP: {current.parallel.main_ip}")
    
    # Your distributed training code here
    if current.parallel.node_index == 0:
        # Control node logic
        coordinate_training()
    else:
        # Worker node logic
        run_worker()
    
    self.next(self.end)
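The node information exposed by current.parallel maps naturally onto the environment variables that PyTorch-style distributed launchers expect. A hypothetical helper (distributed_env and the port number are illustrative, not part of Metaflow's API):

```python
# Hypothetical helper: translate Metaflow's current.parallel attributes into
# the environment variables that PyTorch-style launchers conventionally read.
def distributed_env(node_index, num_nodes, main_ip, port=29500):
    return {
        "MASTER_ADDR": main_ip,        # the control node (node_index == 0)
        "MASTER_PORT": str(port),      # illustrative rendezvous port
        "RANK": str(node_index),       # this node's rank within the gang
        "WORLD_SIZE": str(num_nodes),  # total gang-scheduled nodes
    }

# Inside the step you might call:
#   env = distributed_env(current.parallel.node_index,
#                         current.parallel.num_nodes,
#                         current.parallel.main_ip)
env = distributed_env(node_index=2, num_nodes=4, main_ip="10.0.0.5")
```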

Supported Kubernetes Distributions

Metaflow works with any standard Kubernetes cluster:
  • AWS: Amazon Elastic Kubernetes Service (EKS)
  • Azure: Azure Kubernetes Service (AKS)
  • Google Cloud: Google Kubernetes Engine (GKE)
  • On-Premises: Self-managed Kubernetes clusters
  • Other: DigitalOcean Kubernetes, Linode Kubernetes Engine, etc.

Prerequisites

  1. Kubernetes Cluster: Access to a Kubernetes cluster (1.21+) with appropriate RBAC permissions
  2. Cloud Storage: A configured datastore: S3, Azure Blob Storage, or Google Cloud Storage
  3. Docker Registry: A container registry for your Docker images (optional if using public images)
  4. Python Library: The Kubernetes Python client, installed with:
pip install kubernetes

Configuration

Configure Metaflow to use your Kubernetes cluster by setting environment variables:
# Basic configuration
export METAFLOW_KUBERNETES_NAMESPACE=metaflow
export METAFLOW_KUBERNETES_SERVICE_ACCOUNT=metaflow-service-account

# Container configuration
export METAFLOW_KUBERNETES_CONTAINER_IMAGE=myregistry/metaflow:latest
export METAFLOW_KUBERNETES_CONTAINER_REGISTRY=myregistry.io

# Datastore configuration (choose one)
export METAFLOW_DATASTORE_SYSROOT_S3=s3://my-metaflow-bucket
# or
export METAFLOW_DATASTORE_SYSROOT_AZURE=wasbs://container@account.blob.core.windows.net/
# or
export METAFLOW_DATASTORE_SYSROOT_GS=gs://my-metaflow-bucket
For detailed configuration options, see the Configuration Guide.
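Equivalently, the same settings can live in Metaflow's JSON configuration file (by default ~/.metaflowconfig/config.json), where keys match the environment variable names. A sketch assuming the S3 datastore:

```json
{
  "METAFLOW_KUBERNETES_NAMESPACE": "metaflow",
  "METAFLOW_KUBERNETES_SERVICE_ACCOUNT": "metaflow-service-account",
  "METAFLOW_KUBERNETES_CONTAINER_IMAGE": "myregistry/metaflow:latest",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://my-metaflow-bucket"
}
```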

Resource Specification

Specify compute resources for each step:
@kubernetes(
    cpu=4,              # Number of CPU cores
    memory=16000,       # Memory in MB
    disk=50000,         # Ephemeral disk in MB
    gpu=1,              # Number of GPUs
    gpu_vendor='nvidia' # GPU vendor: 'nvidia' or 'amd'
)
@step
def compute_intensive_step(self):
    # Your code here
    pass
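These integer arguments (CPU cores, MB of memory and disk) translate into standard Kubernetes resource quantities on the pod spec. A sketch of that mapping (illustrative, not Metaflow's internal code; GPU resource names assume the usual vendor device plugins):

```python
# Illustrative: convert @kubernetes resource arguments (CPU cores, MB values)
# into Kubernetes resource-quantity strings. Not Metaflow's actual internals.
def to_k8s_resources(cpu, memory, disk=None, gpu=None, gpu_vendor="nvidia"):
    requests = {"cpu": str(cpu), "memory": f"{memory}M"}
    if disk is not None:
        requests["ephemeral-storage"] = f"{disk}M"
    if gpu:
        # GPUs are exposed by vendor device plugins, e.g. nvidia.com/gpu
        requests[f"{gpu_vendor}.com/gpu"] = str(gpu)
    return requests

res = to_k8s_resources(cpu=4, memory=16000, disk=50000, gpu=1)
```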

Advanced Features

Custom Node Selection

Target specific nodes using node selectors:
@kubernetes(
    node_selector={
        'node.kubernetes.io/instance-type': 'g4dn.xlarge',
        'topology.kubernetes.io/zone': 'us-east-1a'
    }
)
@step
def targeted_step(self):
    pass

Tolerations

Schedule on tainted nodes:
@kubernetes(
    tolerations=[
        {
            'key': 'dedicated',
            'operator': 'Equal',
            'value': 'ml-workloads',
            'effect': 'NoSchedule'
        }
    ]
)
@step
def tolerated_step(self):
    pass

Persistent Volumes

Mount persistent volumes for shared storage:
@kubernetes(
    persistent_volume_claims={
        'data-pvc': '/mnt/data',
        'models-pvc': '/mnt/models'
    }
)
@step
def storage_step(self):
    # Access mounted volumes
    with open('/mnt/data/input.txt') as f:
        data = f.read()

Secrets Management

Access Kubernetes secrets:
@kubernetes(
    secrets=['my-db-credentials', 'api-keys']
)
@step
def secure_step(self):
    # Secrets are surfaced as environment variables
    import os
    api_key = os.environ.get('API_KEY')
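Because secret keys surface as environment variables, it helps to fail fast when one is missing rather than propagate a None deep into the step. A small hypothetical helper (require_env is not part of Metaflow):

```python
import os

def require_env(name):
    """Return an environment variable's value or fail loudly.

    Useful inside steps that depend on keys injected from Kubernetes
    secrets, so a misconfigured secret fails immediately and clearly.
    """
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Expected environment variable {name!r} "
                           "(is the Kubernetes secret attached?)")
    return value

os.environ["API_KEY"] = "demo-value"  # simulate an injected secret key
api_key = require_env("API_KEY")
```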

Monitoring and Debugging

View Running Jobs

# List all Metaflow jobs in your namespace
kubectl get jobs -n metaflow -l app.kubernetes.io/part-of=metaflow

# View job details
kubectl describe job <job-name> -n metaflow

Access Logs

Metaflow automatically streams logs during execution. You can also access them directly:
# View pod logs
kubectl logs -n metaflow <pod-name>

# Follow logs in real-time
kubectl logs -n metaflow <pod-name> -f

Debug Failed Jobs

# Get pod status
kubectl get pods -n metaflow -l metaflow/flow_name=MyFlow

# Inspect failed pod
kubectl describe pod <pod-name> -n metaflow

# View events
kubectl get events -n metaflow --sort-by='.lastTimestamp'

Best Practices

Set realistic resource requests to avoid over-provisioning:
  • Start with conservative estimates
  • Monitor actual usage with kubectl top pods
  • Adjust based on observed requirements
  • Use QoS classes effectively (Guaranteed vs Burstable)
Optimize your Docker images:
  • Use multi-stage builds to reduce image size
  • Cache dependencies in image layers
  • Pin specific versions for reproducibility
  • Use image pull secrets for private registries
Use Kubernetes namespaces effectively:
  • Separate production and development workloads
  • Apply resource quotas per namespace
  • Use NetworkPolicies for isolation
  • Implement RBAC for access control
Reduce cloud costs:
  • Use spot/preemptible instances for fault-tolerant workloads
  • Set appropriate timeouts with @timeout decorator
  • Clean up completed jobs regularly
  • Use cluster autoscaling

Comparison with AWS Batch

Feature              | Kubernetes   | AWS Batch
Cloud Agnostic       | ✅ Yes       | ❌ AWS Only
Multi-Cloud          | ✅ Supported | ❌ No
Setup Complexity     | Medium       | Low
Container Management | Full Control | Managed
Cost Control         | Direct       | Through AWS
GPU Support          | ✅ Yes       | ✅ Yes
Multi-Node Jobs      | ✅ JobSets   | ✅ Array Jobs
Spot Instances       | ✅ Yes       | ✅ Yes

Next Steps

Argo Workflows

Deploy production workflows with Argo Workflows orchestrator

Configuration

Explore all configuration options and environment variables

Best Practices

Learn best practices for production deployments

Debugging

Debug and troubleshoot Kubernetes execution issues
