
Configuration Overview

Metaflow’s Kubernetes integration is configured through environment variables and decorator parameters. This page provides a comprehensive reference for all available configuration options.

Environment Variables

Basic Configuration

METAFLOW_KUBERNETES_NAMESPACE
string
default:"default"
Kubernetes namespace for running Metaflow jobs.
export METAFLOW_KUBERNETES_NAMESPACE=metaflow-prod
METAFLOW_KUBERNETES_SERVICE_ACCOUNT
string
default:"default"
Kubernetes ServiceAccount to use for pod execution. This account needs appropriate RBAC permissions.
export METAFLOW_KUBERNETES_SERVICE_ACCOUNT=metaflow-runner

Container Configuration

METAFLOW_KUBERNETES_CONTAINER_IMAGE
string
Default Docker image for Metaflow tasks. If not specified, a vanilla Python image matching your local Python version is used.
export METAFLOW_KUBERNETES_CONTAINER_IMAGE=myregistry.io/metaflow:latest
METAFLOW_KUBERNETES_CONTAINER_REGISTRY
string
Container registry URL prepended to image names that don’t include a registry.
export METAFLOW_KUBERNETES_CONTAINER_REGISTRY=myregistry.io
METAFLOW_KUBERNETES_IMAGE_PULL_POLICY
string
default:"IfNotPresent"
Image pull policy for containers. Options: Always, IfNotPresent, Never.
export METAFLOW_KUBERNETES_IMAGE_PULL_POLICY=Always
METAFLOW_KUBERNETES_IMAGE_PULL_SECRETS
json
JSON list of image pull secret names for accessing private registries.
export METAFLOW_KUBERNETES_IMAGE_PULL_SECRETS='["regcred", "docker-registry-secret"]'

Resource Defaults

METAFLOW_KUBERNETES_CPU
string
default:"1"
Default CPU request for Kubernetes pods (in cores).
export METAFLOW_KUBERNETES_CPU=2
METAFLOW_KUBERNETES_MEMORY
string
default:"4096"
Default memory request for Kubernetes pods (in MB).
export METAFLOW_KUBERNETES_MEMORY=8192
METAFLOW_KUBERNETES_DISK
string
default:"10240"
Default ephemeral disk request for Kubernetes pods (in MB).
export METAFLOW_KUBERNETES_DISK=20480
METAFLOW_KUBERNETES_GPU_VENDOR
string
default:"nvidia"
Default GPU vendor. Options: nvidia, amd.
export METAFLOW_KUBERNETES_GPU_VENDOR=nvidia

Storage Configuration

METAFLOW_DATASTORE_SYSROOT_S3
string
S3 bucket URL for storing Metaflow artifacts and code packages.
export METAFLOW_DATASTORE_SYSROOT_S3=s3://my-metaflow-bucket/metaflow
METAFLOW_DATASTORE_SYSROOT_AZURE
string
Azure Blob Storage URL for storing Metaflow artifacts.
export METAFLOW_DATASTORE_SYSROOT_AZURE=wasbs://container@account.blob.core.windows.net/metaflow
METAFLOW_DATASTORE_SYSROOT_GS
string
Google Cloud Storage bucket URL for storing Metaflow artifacts.
export METAFLOW_DATASTORE_SYSROOT_GS=gs://my-metaflow-bucket/metaflow

Scheduling and Node Selection

METAFLOW_KUBERNETES_NODE_SELECTOR
string
Comma-separated list of node selector key-value pairs.
export METAFLOW_KUBERNETES_NODE_SELECTOR="node.kubernetes.io/instance-type=m5.xlarge,topology.kubernetes.io/zone=us-east-1a"
METAFLOW_KUBERNETES_TOLERATIONS
json
JSON list of tolerations for scheduling pods on tainted nodes.
export METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"dedicated","operator":"Equal","value":"ml","effect":"NoSchedule"}]'
METAFLOW_KUBERNETES_QOS
string
default:"Burstable"
Default Quality of Service class for pods. Options: Guaranteed, Burstable.
export METAFLOW_KUBERNETES_QOS=Guaranteed

Storage and Volumes

METAFLOW_KUBERNETES_PERSISTENT_VOLUME_CLAIMS
json
JSON object mapping PVC names to mount paths.
export METAFLOW_KUBERNETES_PERSISTENT_VOLUME_CLAIMS='{"data-pvc":"/mnt/data","models-pvc":"/mnt/models"}'
METAFLOW_KUBERNETES_SHARED_MEMORY
integer
Shared memory size in MB for /dev/shm.
export METAFLOW_KUBERNETES_SHARED_MEMORY=8192
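Because the PVC setting is passed as a JSON object, a malformed value fails at parse time. A quick sanity check with Python's json module (an illustrative snippet, not part of Metaflow) confirms the expected shape:

```python
import json

# The PVC setting is a JSON object mapping claim names to mount paths.
raw = '{"data-pvc":"/mnt/data","models-pvc":"/mnt/models"}'
claims = json.loads(raw)

# Each key is an existing PersistentVolumeClaim; each value is an
# absolute path where it will be mounted inside the task container.
for name, path in claims.items():
    assert path.startswith("/"), f"mount path for {name} must be absolute"

print(claims["data-pvc"])  # → /mnt/data
```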

Security

METAFLOW_KUBERNETES_SECRETS
string
Comma-separated list of Kubernetes secret names to mount as environment variables.
export METAFLOW_KUBERNETES_SECRETS="api-keys,database-credentials"

Labels and Annotations

METAFLOW_KUBERNETES_LABELS
string
Comma-separated list of labels in key=value format.
export METAFLOW_KUBERNETES_LABELS="team=ml,environment=production,cost-center=engineering"
METAFLOW_KUBERNETES_ANNOTATIONS
string
Comma-separated list of annotations in key=value format.
export METAFLOW_KUBERNETES_ANNOTATIONS="prometheus.io/scrape=true,prometheus.io/port=8080"
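Both variables use the same comma-separated key=value format. The helper below (illustrative only; Metaflow performs its own parsing internally) shows how such a string decomposes:

```python
def parse_kv_list(value: str) -> dict:
    """Split a comma-separated key=value string into a dict.

    Illustrative only -- mirrors the label/annotation format shown
    above; Metaflow does its own parsing internally.
    """
    pairs = (item.split("=", 1) for item in value.split(",") if item)
    return {k.strip(): v.strip() for k, v in pairs}

labels = parse_kv_list("team=ml,environment=production,cost-center=engineering")
print(labels["team"])  # → ml
```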

Advanced Options

METAFLOW_KUBERNETES_PORT
integer
Port number to expose from containers (used with @parallel for multi-node communication).
export METAFLOW_KUBERNETES_PORT=29500
METAFLOW_KUBERNETES_FETCH_EC2_METADATA
boolean
default:"false"
Fetch EC2 instance metadata when running on AWS EKS.
export METAFLOW_KUBERNETES_FETCH_EC2_METADATA=true
METAFLOW_KUBERNETES_SANDBOX_INIT_SCRIPT
string
Shell script to execute before task initialization (for custom environment setup).
export METAFLOW_KUBERNETES_SANDBOX_INIT_SCRIPT='echo "Setting up environment" && pip install custom-package'

Argo Workflows Configuration

METAFLOW_ARGO_WORKFLOWS_UI_URL
string
URL for the Argo Workflows UI.
export METAFLOW_ARGO_WORKFLOWS_UI_URL=https://argo.example.com
METAFLOW_ARGO_WORKFLOWS_KUBERNETES_SECRETS
string
Comma-separated list of secrets to mount in Argo Workflows.
export METAFLOW_ARGO_WORKFLOWS_KUBERNETES_SECRETS="workflow-secrets"
METAFLOW_ARGO_EVENTS_WEBHOOK_URL
string
Webhook URL for Argo Events integration.
export METAFLOW_ARGO_EVENTS_WEBHOOK_URL=http://webhook-eventsource-svc.argo-events:12000/webhook
METAFLOW_ARGO_EVENTS_EVENT_BUS
string
default:"default"
Name of the Argo Events EventBus.
export METAFLOW_ARGO_EVENTS_EVENT_BUS=metaflow-events
METAFLOW_ARGO_EVENTS_SENSOR_NAMESPACE
string
Namespace for Argo Events Sensors.
export METAFLOW_ARGO_EVENTS_SENSOR_NAMESPACE=argo-events

Decorator Parameters

The @kubernetes decorator accepts the following parameters to override defaults:

Resource Specification

@kubernetes(
    cpu=8,                  # Number of CPU cores (int or float)
    memory=32000,           # Memory in MB (int)
    disk=50000,             # Ephemeral disk in MB (int)
    gpu=2,                  # Number of GPUs (int)
    gpu_vendor='nvidia',    # GPU vendor: 'nvidia' or 'amd'
    shared_memory=8192      # Shared memory in MB (int)
)

Container Configuration

@kubernetes(
    image='custom-image:v1.0',              # Docker image
    image_pull_policy='Always',             # Pull policy
    image_pull_secrets=['regcred']          # Image pull secrets (list)
)

Scheduling and Placement

@kubernetes(
    namespace='custom-namespace',           # Kubernetes namespace
    service_account='custom-sa',            # Service account
    node_selector={                         # Node selector (dict)
        'node.kubernetes.io/instance-type': 'g4dn.xlarge',
        'topology.kubernetes.io/zone': 'us-east-1a'
    },
    tolerations=[                           # Tolerations (list of dicts)
        {
            'key': 'dedicated',
            'operator': 'Equal',
            'value': 'ml-workloads',
            'effect': 'NoSchedule'
        }
    ],
    qos='Guaranteed'                        # QoS class: 'Guaranteed' or 'Burstable'
)

Storage and Volumes

@kubernetes(
    persistent_volume_claims={              # PVC mounts (dict)
        'data-pvc': '/mnt/data',
        'models-pvc': '/mnt/models'
    },
    use_tmpfs=True,                         # Enable tmpfs
    tmpfs_size=10240,                       # tmpfs size in MB
    tmpfs_path='/tmp',                      # tmpfs mount path
    tmpfs_tempdir=True                      # Use tmpfs for METAFLOW_TEMPDIR
)

Security

@kubernetes(
    secrets=['api-keys', 'db-creds'],       # Kubernetes secrets (list)
    security_context={                       # Security context (dict)
        'run_as_user': 1000,
        'run_as_group': 1000,
        'run_as_non_root': True,
        'privileged': False
    }
)

Metadata

@kubernetes(
    labels={                                 # Pod labels (dict)
        'team': 'ml',
        'environment': 'production'
    },
    annotations={                            # Pod annotations (dict)
        'prometheus.io/scrape': 'true',
        'prometheus.io/port': '8080'
    }
)

Advanced Options

@kubernetes(
    port=29500,                             # Port for multi-node communication
    compute_pool='high-priority'            # Compute pool (maps to node selector)
)

Complete Configuration Example

Here’s a complete example showing environment variables and decorator usage:

Environment Setup

# config/kubernetes.env

# Cluster configuration
export METAFLOW_KUBERNETES_NAMESPACE=metaflow-prod
export METAFLOW_KUBERNETES_SERVICE_ACCOUNT=metaflow-runner

# Container defaults
export METAFLOW_KUBERNETES_CONTAINER_IMAGE=myregistry.io/metaflow:2.10.0
export METAFLOW_KUBERNETES_CONTAINER_REGISTRY=myregistry.io
export METAFLOW_KUBERNETES_IMAGE_PULL_POLICY=IfNotPresent
export METAFLOW_KUBERNETES_IMAGE_PULL_SECRETS='["regcred"]'

# Resource defaults
export METAFLOW_KUBERNETES_CPU=4
export METAFLOW_KUBERNETES_MEMORY=16384
export METAFLOW_KUBERNETES_DISK=20480

# Storage
export METAFLOW_DATASTORE_SYSROOT_S3=s3://metaflow-prod-bucket/metaflow

# Scheduling
export METAFLOW_KUBERNETES_NODE_SELECTOR="node.kubernetes.io/purpose=metaflow"
export METAFLOW_KUBERNETES_TOLERATIONS='[{"key":"metaflow","operator":"Equal","value":"true","effect":"NoSchedule"}]'

# Security
export METAFLOW_KUBERNETES_SECRETS=default-secrets

# Labels
export METAFLOW_KUBERNETES_LABELS="team=ml,cost-center=engineering"

Flow Implementation

from metaflow import FlowSpec, step, kubernetes, Parameter

class ProductionMLFlow(FlowSpec):
    
    model_version = Parameter('model_version', default='v1.0')
    
    @step
    def start(self):
        # Runs locally or on Argo
        print(f"Starting training for model {self.model_version}")
        self.next(self.prepare)
    
    @kubernetes(
        # Override defaults for data preparation
        cpu=2,
        memory=8192,
        labels={'step': 'data-prep'}
    )
    @step
    def prepare(self):
        # Light data preparation
        self.train_data = load_and_prepare_data()
        self.next(self.train)
    
    @kubernetes(
        # Heavy training workload
        cpu=16,
        memory=65536,
        gpu=4,
        gpu_vendor='nvidia',
        disk=102400,
        node_selector={
            'node.kubernetes.io/instance-type': 'p3.8xlarge',
            'topology.kubernetes.io/zone': 'us-east-1a'
        },
        persistent_volume_claims={
            'training-cache': '/mnt/cache'
        },
        shared_memory=16384,
        labels={'step': 'training', 'gpu': 'true'},
        annotations={'nvidia.com/gpu-memory': '32GB'}
    )
    @step
    def train(self):
        # Train model with GPUs
        self.model = train_model(
            self.train_data,
            version=self.model_version
        )
        self.next(self.evaluate)
    
    @kubernetes(
        cpu=4,
        memory=16384,
        labels={'step': 'evaluation'}
    )
    @step
    def evaluate(self):
        # Evaluate model
        self.metrics = evaluate_model(self.model)
        self.next(self.end)
    
    @step
    def end(self):
        print(f"Training complete. Accuracy: {self.metrics['accuracy']}")

if __name__ == '__main__':
    ProductionMLFlow()

Quality of Service (QoS) Classes

Kubernetes assigns QoS classes based on resource requests and limits:

Guaranteed QoS

@kubernetes(
    cpu=8,
    memory=32000,
    qos='Guaranteed'  # Sets requests = limits
)
@step
def guaranteed_step(self):
    # Guaranteed pods are evicted last under node pressure and are
    # OOM-killed only if they exceed their limits
    pass
Characteristics:
  • CPU and memory limits are set equal to requests
  • Highest priority for scheduling
  • Last to be evicted under resource pressure
  • Best for critical, resource-intensive workloads

Burstable QoS (Default)

@kubernetes(
    cpu=4,
    memory=16000,
    qos='Burstable'  # Sets requests without limits (default)
)
@step
def burstable_step(self):
    # This pod can use more resources than requested
    # if available on the node
    pass
Characteristics:
  • CPU and memory requests set, but can burst above
  • Medium priority for scheduling
  • Can be throttled or evicted if resources are needed
  • Good for most workloads with variable resource usage
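The difference between the two classes comes down to how the pod's resource stanza is filled in. A hand-written sketch of the resulting container resources (a hypothetical helper, not Metaflow code) makes this concrete:

```python
def resources_for(cpu: int, memory_mb: int, qos: str) -> dict:
    """Sketch of how a QoS class shapes a container's resources stanza.

    Hypothetical helper for illustration; Metaflow builds the real
    pod spec internally.
    """
    requests = {"cpu": str(cpu), "memory": f"{memory_mb}M"}
    if qos == "Guaranteed":
        # Guaranteed: limits must equal requests for every resource.
        return {"requests": requests, "limits": dict(requests)}
    # Burstable: requests only; the pod may use spare node capacity.
    return {"requests": requests}

print(resources_for(8, 32000, "Guaranteed"))
```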

Multi-Node Configuration

For distributed workloads using @parallel:
from metaflow import kubernetes, parallel, step

@kubernetes(
    cpu=8,
    memory=32000,
    gpu=1,
    port=29500,  # Port for inter-node communication
    shared_memory=16384  # Required for distributed frameworks
)
@parallel(num_nodes=4)
@step
def distributed_step(self):
    from metaflow import current
    
    # Access multi-node information
    print(f"Node: {current.parallel.node_index}/{current.parallel.num_nodes}")
    print(f"Main IP: {current.parallel.main_ip}")
    
    # Set up distributed training
    setup_distributed(
        rank=current.parallel.node_index,
        world_size=current.parallel.num_nodes,
        master_addr=current.parallel.main_ip,
        master_port=29500
    )
    
    self.next(self.end)

Security Context Configuration

Configure container security context:
@kubernetes(
    security_context={
        'run_as_user': 1000,           # Run as non-root user
        'run_as_group': 1000,          # Primary group
        'run_as_non_root': True,       # Enforce non-root
        'privileged': False,           # No privileged mode
        'allow_privilege_escalation': False
    }
)
@step
def secure_step(self):
    # Runs with restricted permissions
    pass

Troubleshooting Configuration Issues

Symptoms: Pods stuck in ImagePullBackOff or ErrImagePull
Solutions:
  1. Verify image name and registry:
    kubectl get events -n metaflow | grep -i pull
    
  2. Check image pull secrets:
    kubectl get secrets -n metaflow
    kubectl describe serviceaccount metaflow-sa -n metaflow
    
  3. Test image pull manually:
    kubectl run test --image=myregistry.io/metaflow:latest --dry-run=client -o yaml
    
Symptoms: Pods stuck in Pending with reason Quota Exceeded
Solutions:
  1. Check namespace quotas:
    kubectl describe quota -n metaflow
    
  2. View current resource usage:
    kubectl top pods -n metaflow
    kubectl top nodes
    
  3. Request quota increase or reduce resource requests

Symptoms: Pods stuck in Pending with reason Unschedulable
Solutions:
  1. Check node selector and tolerations:
    kubectl describe pod <pod-name> -n metaflow
    
  2. Verify node availability:
    kubectl get nodes -l node.kubernetes.io/instance-type=m5.xlarge
    
  3. Check taints:
    kubectl describe nodes | grep -A5 Taints
    
Symptoms: Pods stuck in ContainerCreating with volume mount errors
Solutions:
  1. Verify PVC exists and is bound:
    kubectl get pvc -n metaflow
    kubectl describe pvc <pvc-name> -n metaflow
    
  2. Check storage class:
    kubectl get storageclass
    
  3. Ensure PVC access mode is compatible:
    kubectl get pvc <pvc-name> -n metaflow -o yaml | grep accessModes
    

Configuration Best Practices

Environment Separation

Use different namespaces and configurations for dev, staging, and production:
# dev.env
export METAFLOW_KUBERNETES_NAMESPACE=metaflow-dev

# prod.env
export METAFLOW_KUBERNETES_NAMESPACE=metaflow-prod

Resource Defaults

Set conservative defaults in environment variables and override in decorators as needed for specific steps.
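The precedence rule is simple: a value set on the decorator wins; otherwise the environment default applies. A minimal sketch of that lookup, assuming a hypothetical `resolve_cpu` helper (Metaflow resolves configuration internally):

```python
import os

def resolve_cpu(decorator_value=None) -> str:
    """Return the effective CPU request: the decorator value if given,
    else the environment default, else a conservative fallback.

    Hypothetical sketch; Metaflow resolves configuration internally.
    """
    if decorator_value is not None:
        return str(decorator_value)
    return os.environ.get("METAFLOW_KUBERNETES_CPU", "1")

os.environ["METAFLOW_KUBERNETES_CPU"] = "4"   # conservative env default
print(resolve_cpu())    # → 4
print(resolve_cpu(16))  # → 16 (step-level override)
```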

Security First

Always configure security context, use secrets for sensitive data, and follow the principle of least privilege.

Labels and Annotations

Use consistent labeling for cost tracking, monitoring, and organization:
export METAFLOW_KUBERNETES_LABELS="team=ml,project=fraud-detection,environment=prod"

Next Steps

Kubernetes Overview

Learn about Kubernetes execution concepts

Argo Workflows

Deploy production workflows with Argo

Resource Management

Optimize resource allocation

Secrets Management

Secure sensitive data
