
Configuration Overview

Metaflow’s Kubernetes integration is configured through environment variables and decorator parameters. This page provides a comprehensive reference for all available configuration options.

Environment Variables

Basic Configuration

METAFLOW_KUBERNETES_NAMESPACE
string
default:"default"
Kubernetes namespace for running Metaflow jobs.
export METAFLOW_KUBERNETES_NAMESPACE=metaflow-prod
METAFLOW_KUBERNETES_SERVICE_ACCOUNT
string
default:"default"
Kubernetes ServiceAccount to use for pod execution. This account needs appropriate RBAC permissions.
export METAFLOW_KUBERNETES_SERVICE_ACCOUNT=metaflow-runner

Container Configuration

METAFLOW_KUBERNETES_CONTAINER_IMAGE
string
Default Docker image for Metaflow tasks. If not specified, a vanilla Python image matching your local Python version is used.
export METAFLOW_KUBERNETES_CONTAINER_IMAGE=myregistry.io/metaflow:latest
METAFLOW_KUBERNETES_CONTAINER_REGISTRY
string
Container registry URL prepended to image names that don’t include a registry.
export METAFLOW_KUBERNETES_CONTAINER_REGISTRY=myregistry.io
METAFLOW_KUBERNETES_IMAGE_PULL_POLICY
string
default:"IfNotPresent"
Image pull policy for containers. Options: Always, IfNotPresent, Never.
export METAFLOW_KUBERNETES_IMAGE_PULL_POLICY=Always
METAFLOW_KUBERNETES_IMAGE_PULL_SECRETS
json
JSON list of image pull secret names for accessing private registries.
export METAFLOW_KUBERNETES_IMAGE_PULL_SECRETS='["regcred", "docker-registry-secret"]'

Resource Defaults

METAFLOW_KUBERNETES_CPU
string
default:"1"
Default CPU request for Kubernetes pods (in cores).
export METAFLOW_KUBERNETES_CPU=2
METAFLOW_KUBERNETES_MEMORY
string
default:"4096"
Default memory request for Kubernetes pods (in MB).
export METAFLOW_KUBERNETES_MEMORY=8192
METAFLOW_KUBERNETES_DISK
string
default:"10240"
Default ephemeral disk request for Kubernetes pods (in MB).
export METAFLOW_KUBERNETES_DISK=20480
METAFLOW_KUBERNETES_GPU_VENDOR
string
default:"nvidia"
Default GPU vendor. Options: nvidia, amd.
export METAFLOW_KUBERNETES_GPU_VENDOR=nvidia

Storage Configuration

METAFLOW_DATASTORE_SYSROOT_S3
string
S3 bucket URL for storing Metaflow artifacts and code packages.
export METAFLOW_DATASTORE_SYSROOT_S3=s3://my-metaflow-bucket/metaflow
METAFLOW_DATASTORE_SYSROOT_AZURE
string
Azure Blob Storage URL for storing Metaflow artifacts.
export METAFLOW_DATASTORE_SYSROOT_AZURE=wasbs://container@account.blob.core.windows.net/metaflow
METAFLOW_DATASTORE_SYSROOT_GS
string
Google Cloud Storage bucket URL for storing Metaflow artifacts.
export METAFLOW_DATASTORE_SYSROOT_GS=gs://my-metaflow-bucket/metaflow

Scheduling and Node Selection

METAFLOW_KUBERNETES_NODE_SELECTOR
string
Comma-separated list of node selector key-value pairs.
export METAFLOW_KUBERNETES_NODE_SELECTOR="node.kubernetes.io/instance-type=m5.xlarge,topology.kubernetes.io/zone=us-east-1a"
METAFLOW_KUBERNETES_TOLERATIONS
json
JSON list of tolerations for scheduling pods on tainted nodes.
export METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"dedicated","operator":"Equal","value":"ml","effect":"NoSchedule"}]'
METAFLOW_KUBERNETES_QOS
string
default:"Burstable"
Default Quality of Service class for pods. Options: Guaranteed, Burstable.
export METAFLOW_KUBERNETES_QOS=Guaranteed

Storage and Volumes

METAFLOW_KUBERNETES_PERSISTENT_VOLUME_CLAIMS
json
JSON object mapping PVC names to mount paths.
export METAFLOW_KUBERNETES_PERSISTENT_VOLUME_CLAIMS='{"data-pvc":"/mnt/data","models-pvc":"/mnt/models"}'
METAFLOW_KUBERNETES_SHARED_MEMORY
integer
Shared memory size in MB for /dev/shm.
export METAFLOW_KUBERNETES_SHARED_MEMORY=8192
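Because the PVC setting is passed as a JSON object, a malformed value fails at parse time. A quick sanity check with Python's json module (an illustrative snippet, not part of Metaflow) confirms the expected shape:

```python
import json

# The PVC setting is a JSON object mapping claim names to mount paths.
raw = '{"data-pvc":"/mnt/data","models-pvc":"/mnt/models"}'
claims = json.loads(raw)

# Each key is an existing PersistentVolumeClaim; each value is an
# absolute path where it will be mounted inside the task container.
for name, path in claims.items():
    assert path.startswith("/"), f"mount path for {name} must be absolute"

print(claims["data-pvc"])  # → /mnt/data
```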

Security

METAFLOW_KUBERNETES_SECRETS
string
Comma-separated list of Kubernetes secret names to mount as environment variables.
export METAFLOW_KUBERNETES_SECRETS="api-keys,database-credentials"

Labels and Annotations

METAFLOW_KUBERNETES_LABELS
string
Comma-separated list of labels in key=value format.
export METAFLOW_KUBERNETES_LABELS="team=ml,environment=production,cost-center=engineering"
METAFLOW_KUBERNETES_ANNOTATIONS
string
Comma-separated list of annotations in key=value format.
export METAFLOW_KUBERNETES_ANNOTATIONS="prometheus.io/scrape=true,prometheus.io/port=8080"
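Both variables use the same comma-separated key=value format. The helper below (illustrative only; Metaflow performs its own parsing internally) shows how such a string decomposes:

```python
def parse_kv_list(value: str) -> dict:
    """Split a comma-separated key=value string into a dict.

    Illustrative only -- mirrors the label/annotation format shown
    above; Metaflow does its own parsing internally.
    """
    pairs = (item.split("=", 1) for item in value.split(",") if item)
    return {k.strip(): v.strip() for k, v in pairs}

labels = parse_kv_list("team=ml,environment=production,cost-center=engineering")
print(labels["team"])  # → ml
```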

Advanced Options

METAFLOW_KUBERNETES_PORT
integer
Port number to expose from containers (used with @parallel for multi-node communication).
export METAFLOW_KUBERNETES_PORT=29500
METAFLOW_KUBERNETES_FETCH_EC2_METADATA
boolean
default:"false"
Fetch EC2 instance metadata when running on AWS EKS.
export METAFLOW_KUBERNETES_FETCH_EC2_METADATA=true
METAFLOW_KUBERNETES_SANDBOX_INIT_SCRIPT
string
Shell script to execute before task initialization (for custom environment setup).
export METAFLOW_KUBERNETES_SANDBOX_INIT_SCRIPT='echo "Setting up environment" && pip install custom-package'

Argo Workflows Configuration

METAFLOW_ARGO_WORKFLOWS_UI_URL
string
URL for the Argo Workflows UI.
export METAFLOW_ARGO_WORKFLOWS_UI_URL=https://argo.example.com
METAFLOW_ARGO_WORKFLOWS_KUBERNETES_SECRETS
string
Comma-separated list of secrets to mount in Argo Workflows.
export METAFLOW_ARGO_WORKFLOWS_KUBERNETES_SECRETS="workflow-secrets"
METAFLOW_ARGO_EVENTS_WEBHOOK_URL
string
Webhook URL for Argo Events integration.
export METAFLOW_ARGO_EVENTS_WEBHOOK_URL=http://webhook-eventsource-svc.argo-events:12000/webhook
METAFLOW_ARGO_EVENTS_EVENT_BUS
string
default:"default"
Name of the Argo Events EventBus.
export METAFLOW_ARGO_EVENTS_EVENT_BUS=metaflow-events
METAFLOW_ARGO_EVENTS_SENSOR_NAMESPACE
string
Namespace for Argo Events Sensors.
export METAFLOW_ARGO_EVENTS_SENSOR_NAMESPACE=argo-events

Decorator Parameters

The @kubernetes decorator accepts the following parameters to override defaults:

Resource Specification

@kubernetes(
    cpu=8,                  # Number of CPU cores (int or float)
    memory=32000,           # Memory in MB (int)
    disk=50000,             # Ephemeral disk in MB (int)
    gpu=2,                  # Number of GPUs (int)
    gpu_vendor='nvidia',    # GPU vendor: 'nvidia' or 'amd'
    shared_memory=8192      # Shared memory in MB (int)
)

Container Configuration

@kubernetes(
    image='custom-image:v1.0',              # Docker image
    image_pull_policy='Always',             # Pull policy
    image_pull_secrets=['regcred']          # Image pull secrets (list)
)

Scheduling and Placement

@kubernetes(
    namespace='custom-namespace',           # Kubernetes namespace
    service_account='custom-sa',            # Service account
    node_selector={                         # Node selector (dict)
        'node.kubernetes.io/instance-type': 'g4dn.xlarge',
        'topology.kubernetes.io/zone': 'us-east-1a'
    },
    tolerations=[                           # Tolerations (list of dicts)
        {
            'key': 'dedicated',
            'operator': 'Equal',
            'value': 'ml-workloads',
            'effect': 'NoSchedule'
        }
    ],
    qos='Guaranteed'                        # QoS class: 'Guaranteed' or 'Burstable'
)

Storage and Volumes

@kubernetes(
    persistent_volume_claims={              # PVC mounts (dict)
        'data-pvc': '/mnt/data',
        'models-pvc': '/mnt/models'
    },
    use_tmpfs=True,                         # Enable tmpfs
    tmpfs_size=10240,                       # tmpfs size in MB
    tmpfs_path='/tmp',                      # tmpfs mount path
    tmpfs_tempdir=True                      # Use tmpfs for METAFLOW_TEMPDIR
)

Security

@kubernetes(
    secrets=['api-keys', 'db-creds'],       # Kubernetes secrets (list)
    security_context={                       # Security context (dict)
        'run_as_user': 1000,
        'run_as_group': 1000,
        'run_as_non_root': True,
        'privileged': False
    }
)

Metadata

@kubernetes(
    labels={                                 # Pod labels (dict)
        'team': 'ml',
        'environment': 'production'
    },
    annotations={                            # Pod annotations (dict)
        'prometheus.io/scrape': 'true',
        'prometheus.io/port': '8080'
    }
)

Advanced Options

@kubernetes(
    port=29500,                             # Port for multi-node communication
    compute_pool='high-priority'            # Compute pool (maps to node selector)
)

Complete Configuration Example

Here’s a complete example showing environment variables and decorator usage:

Environment Setup

# config/kubernetes.env

# Cluster configuration
export METAFLOW_KUBERNETES_NAMESPACE=metaflow-prod
export METAFLOW_KUBERNETES_SERVICE_ACCOUNT=metaflow-runner

# Container defaults
export METAFLOW_KUBERNETES_CONTAINER_IMAGE=myregistry.io/metaflow:2.10.0
export METAFLOW_KUBERNETES_CONTAINER_REGISTRY=myregistry.io
export METAFLOW_KUBERNETES_IMAGE_PULL_POLICY=IfNotPresent
export METAFLOW_KUBERNETES_IMAGE_PULL_SECRETS='["regcred"]'

# Resource defaults
export METAFLOW_KUBERNETES_CPU=4
export METAFLOW_KUBERNETES_MEMORY=16384
export METAFLOW_KUBERNETES_DISK=20480

# Storage
export METAFLOW_DATASTORE_SYSROOT_S3=s3://metaflow-prod-bucket/metaflow

# Scheduling
export METAFLOW_KUBERNETES_NODE_SELECTOR="node.kubernetes.io/purpose=metaflow"
export METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"metaflow","operator":"Equal","value":"true","effect":"NoSchedule"}]'

# Security
export METAFLOW_KUBERNETES_SECRETS=default-secrets

# Labels
export METAFLOW_KUBERNETES_LABELS="team=ml,cost-center=engineering"

Flow Implementation

from metaflow import FlowSpec, step, kubernetes, Parameter

class ProductionMLFlow(FlowSpec):
    
    model_version = Parameter('model_version', default='v1.0')
    
    @step
    def start(self):
        # Runs locally or on Argo
        print(f"Starting training for model {self.model_version}")
        self.next(self.prepare)
    
    @kubernetes(
        # Override defaults for data preparation
        cpu=2,
        memory=8192,
        labels={'step': 'data-prep'}
    )
    @step
    def prepare(self):
        # Light data preparation
        self.train_data = load_and_prepare_data()
        self.next(self.train)
    
    @kubernetes(
        # Heavy training workload
        cpu=16,
        memory=65536,
        gpu=4,
        gpu_vendor='nvidia',
        disk=102400,
        node_selector={
            'node.kubernetes.io/instance-type': 'p3.8xlarge',
            'topology.kubernetes.io/zone': 'us-east-1a'
        },
        persistent_volume_claims={
            'training-cache': '/mnt/cache'
        },
        shared_memory=16384,
        labels={'step': 'training', 'gpu': 'true'},
        annotations={'nvidia.com/gpu-memory': '32GB'}
    )
    @step
    def train(self):
        # Train model with GPUs
        self.model = train_model(
            self.train_data,
            version=self.model_version
        )
        self.next(self.evaluate)
    
    @kubernetes(
        cpu=4,
        memory=16384,
        labels={'step': 'evaluation'}
    )
    @step
    def evaluate(self):
        # Evaluate model
        self.metrics = evaluate_model(self.model)
        self.next(self.end)
    
    @step
    def end(self):
        print(f"Training complete. Accuracy: {self.metrics['accuracy']}")

if __name__ == '__main__':
    ProductionMLFlow()

Quality of Service (QoS) Classes

Kubernetes assigns QoS classes based on resource requests and limits:

Guaranteed QoS

@kubernetes(
    cpu=8,
    memory=32000,
    qos='Guaranteed'  # Sets requests = limits
)
@step
def guaranteed_step(self):
    # Guaranteed pods are evicted last under node pressure and are
    # OOM-killed only if they exceed their limits
    pass
Characteristics:
  • CPU and memory limits are set equal to requests
  • Highest priority for scheduling
  • Last to be evicted under resource pressure
  • Best for critical, resource-intensive workloads

Burstable QoS (Default)

@kubernetes(
    cpu=4,
    memory=16000,
    qos='Burstable'  # Sets requests without limits (default)
)
@step
def burstable_step(self):
    # This pod can use more resources than requested
    # if available on the node
    pass
Characteristics:
  • CPU and memory requests set, but can burst above
  • Medium priority for scheduling
  • Can be throttled or evicted if resources are needed
  • Good for most workloads with variable resource usage
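The difference between the two classes comes down to how the pod's resource stanza is filled in. A hand-written sketch of the resulting container resources (a hypothetical helper, not Metaflow code) makes this concrete:

```python
def resources_for(cpu: int, memory_mb: int, qos: str) -> dict:
    """Sketch of how a QoS class shapes a container's resources stanza.

    Hypothetical helper for illustration; Metaflow builds the real
    pod spec internally.
    """
    requests = {"cpu": str(cpu), "memory": f"{memory_mb}M"}
    if qos == "Guaranteed":
        # Guaranteed: limits must equal requests for every resource.
        return {"requests": requests, "limits": dict(requests)}
    # Burstable: requests only; the pod may use spare node capacity.
    return {"requests": requests}

print(resources_for(8, 32000, "Guaranteed"))
```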

Multi-Node Configuration

For distributed workloads using @parallel:
from metaflow import kubernetes, parallel, step

@kubernetes(
    cpu=8,
    memory=32000,
    gpu=1,
    port=29500,  # Port for inter-node communication
    shared_memory=16384  # Required for distributed frameworks
)
@parallel(num_nodes=4)
@step
def distributed_step(self):
    from metaflow import current
    
    # Access multi-node information
    print(f"Node: {current.parallel.node_index}/{current.parallel.num_nodes}")
    print(f"Main IP: {current.parallel.main_ip}")
    
    # Set up distributed training
    setup_distributed(
        rank=current.parallel.node_index,
        world_size=current.parallel.num_nodes,
        master_addr=current.parallel.main_ip,
        master_port=29500
    )
    
    self.next(self.end)

Security Context Configuration

Configure container security context:
@kubernetes(
    security_context={
        'run_as_user': 1000,           # Run as non-root user
        'run_as_group': 1000,          # Primary group
        'run_as_non_root': True,       # Enforce non-root
        'privileged': False,           # No privileged mode
        'allow_privilege_escalation': False
    }
)
@step
def secure_step(self):
    # Runs with restricted permissions
    pass

Troubleshooting Configuration Issues

Symptoms: Pods stuck in ImagePullBackOff or ErrImagePull
Solutions:
  1. Verify image name and registry:
    kubectl get events -n metaflow | grep -i pull
    
  2. Check image pull secrets:
    kubectl get secrets -n metaflow
    kubectl describe serviceaccount metaflow-sa -n metaflow
    
  3. Test image pull manually:
    kubectl run test --image=myregistry.io/metaflow:latest --dry-run=client -o yaml
    
Symptoms: Pods stuck in Pending with reason Quota Exceeded
Solutions:
  1. Check namespace quotas:
    kubectl describe quota -n metaflow
    
  2. View current resource usage:
    kubectl top pods -n metaflow
    kubectl top nodes
    
  3. Request quota increase or reduce resource requests

Symptoms: Pods stuck in Pending with reason Unschedulable
Solutions:
  1. Check node selector and tolerations:
    kubectl describe pod <pod-name> -n metaflow
    
  2. Verify node availability:
    kubectl get nodes -l node.kubernetes.io/instance-type=m5.xlarge
    
  3. Check taints:
    kubectl describe nodes | grep -A5 Taints
    
Symptoms: Pods stuck in ContainerCreating with volume mount errors
Solutions:
  1. Verify PVC exists and is bound:
    kubectl get pvc -n metaflow
    kubectl describe pvc <pvc-name> -n metaflow
    
  2. Check storage class:
    kubectl get storageclass
    
  3. Ensure PVC access mode is compatible:
    kubectl get pvc <pvc-name> -n metaflow -o yaml | grep accessModes
    

Configuration Best Practices

Environment Separation

Use different namespaces and configurations for dev, staging, and production:
# dev.env
export METAFLOW_KUBERNETES_NAMESPACE=metaflow-dev

# prod.env
export METAFLOW_KUBERNETES_NAMESPACE=metaflow-prod

Resource Defaults

Set conservative defaults in environment variables and override in decorators as needed for specific steps.
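The precedence rule is simple: a value set on the decorator wins; otherwise the environment default applies. A minimal sketch of that lookup, assuming a hypothetical `resolve_cpu` helper (Metaflow resolves configuration internally):

```python
import os

def resolve_cpu(decorator_value=None) -> str:
    """Return the effective CPU request: the decorator value if given,
    else the environment default, else a conservative fallback.

    Hypothetical sketch; Metaflow resolves configuration internally.
    """
    if decorator_value is not None:
        return str(decorator_value)
    return os.environ.get("METAFLOW_KUBERNETES_CPU", "1")

os.environ["METAFLOW_KUBERNETES_CPU"] = "4"   # conservative env default
print(resolve_cpu())    # → 4
print(resolve_cpu(16))  # → 16 (step-level override)
```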

Security First

Always configure security context, use secrets for sensitive data, and follow the principle of least privilege.

Labels and Annotations

Use consistent labeling for cost tracking, monitoring, and organization:
export METAFLOW_KUBERNETES_LABELS="team=ml,project=fraud-detection,environment=prod"

Next Steps

Kubernetes Overview

Learn about Kubernetes execution concepts

Argo Workflows

Deploy production workflows with Argo

Resource Management

Optimize resource allocation

Secrets Management

Secure sensitive data
