Introduction
The Kubernetes integration allows Metaflow to execute individual steps of your flows as Kubernetes Jobs on any Kubernetes cluster. This provides a flexible, cloud-agnostic way to scale your workflows using containerized workloads.
Key Features
Container-Native
Run your workflows in Docker containers with full control over the runtime environment
Resource Management
Request specific CPU, memory, GPU, and disk resources for each step
Cloud Agnostic
Deploy on any Kubernetes cluster: AWS EKS, Azure AKS, GCP GKE, or on-premises
Multi-Node Support
Execute distributed workloads with gang-scheduled multi-node jobs using JobSets
How It Works
When you decorate a step with @kubernetes, Metaflow:
- Packages your code and uploads it to your configured datastore (S3, Azure Blob, or GCS)
- Creates a Kubernetes Job specification with your resource requirements
- Submits the job to your Kubernetes cluster
- Monitors the job execution and streams logs back to you
- Retrieves results from the datastore once the job completes
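To make the second step above concrete, here is a minimal sketch of what a Job specification carrying resource requirements can look like. It is an illustration only, not Metaflow's actual implementation: the helper build_job_manifest, the job name, the image, and the command are all hypothetical, and the manifest Metaflow really generates contains more fields.

```python
import json


def build_job_manifest(name, image, command, cpu, memory_mb, gpu=0, namespace="default"):
    """Return a batch/v1 Job manifest (as a dict) for one containerized task.

    Hypothetical helper for illustration; not part of Metaflow.
    """
    # CPU and memory are expressed as resource requests.
    resources = {"requests": {"cpu": str(cpu), "memory": f"{memory_mb}M"}}
    if gpu:
        # GPUs are extended resources and are declared under limits.
        resources["limits"] = {"nvidia.com/gpu": str(gpu)}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # In this sketch the Job never restarts a failed pod on its own.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "main",
                            "image": image,
                            "command": command,
                            "resources": resources,
                        }
                    ],
                }
            },
        },
    }


# Example values are placeholders.
manifest = build_job_manifest(
    name="hello-flow-start",
    image="python:3.11",
    command=["python", "-c", "print('hello')"],
    cpu=2,
    memory_mb=4096,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON has the same shape as a manifest you could pass to kubectl apply, which is a convenient way to see how per-step resource requirements end up on the container.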
Architecture

Components
- Metaflow Client: Submits jobs and monitors execution from your local machine or CI/CD system.
- Kubernetes Control Plane: Schedules and manages pod lifecycle based on job specifications.
- Worker Pods: Execute your Metaflow tasks in isolated containers with specified resources.
- Datastore: Central storage (S3/Azure/GCS) for code packages, artifacts, and metadata.
Execution Modes
Single-Node Execution
This is the standard execution mode: each step runs in a single Kubernetes pod.
Multi-Node Execution with @parallel
For distributed workloads, combine @kubernetes with @parallel to create gang-scheduled multi-node jobs using Kubernetes JobSets.
Supported Kubernetes Distributions
Metaflow works with any standard Kubernetes cluster:
- AWS: Amazon Elastic Kubernetes Service (EKS)
- Azure: Azure Kubernetes Service (AKS)
- Google Cloud: Google Kubernetes Engine (GKE)
- On-Premises: Self-managed Kubernetes clusters
- Other: DigitalOcean Kubernetes, Linode Kubernetes Engine, etc.
Prerequisites
Configuration
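A minimal sketch of a Kubernetes-related configuration, written in Python so it can sit next to the flow code. The variable names are Metaflow configuration keys; every value (bucket, namespace, service account, image) is a placeholder assumption to replace with your own. Most setups export the same variables in the shell instead, so the sketch also prints the equivalent export lines.

```python
import os
import shlex

# Placeholder values; substitute the ones that match your cluster and datastore.
settings = {
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://my-metaflow-bucket/metaflow",
    "METAFLOW_KUBERNETES_NAMESPACE": "metaflow-jobs",
    "METAFLOW_KUBERNETES_SERVICE_ACCOUNT": "metaflow-runner",
    "METAFLOW_KUBERNETES_CONTAINER_IMAGE": "python:3.11",
}

# Apply the settings to the current process. This has to happen before
# metaflow is imported, because its configuration is read at import time.
os.environ.update(settings)

# Print equivalent shell commands for use in a terminal or CI job.
for key, value in settings.items():
    print(f"export {key}={shlex.quote(value)}")
```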
Configure Metaflow to use your Kubernetes cluster by setting environment variables.
Resource Specification
Specify compute resources for each step through the cpu, memory, gpu, and disk arguments of @kubernetes.
Advanced Features
Custom Node Selection
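Target specific nodes using node selectors (the node_selector argument of @kubernetes).
Tolerations
Schedule on tainted nodes by passing matching tolerations (the tolerations argument).
Persistent Volumes
Mount persistent volumes for shared storage (the persistent_volume_claims argument).
Secrets Management
Access Kubernetes secrets (the secrets argument).
Monitoring and Debugging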
View Running Jobs
Access Logs
Metaflow automatically streams logs during execution. You can also access them directly.
Debug Failed Jobs
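Because steps run as ordinary Kubernetes Jobs and pods, standard kubectl commands can be used to inspect them. The sketch below only assembles those commands; the helper names, the namespace metaflow-jobs, and the pod name are hypothetical, and nothing is executed unless you call run_all yourself.

```python
import shutil
import subprocess


def kubectl_debug_commands(namespace, pod=None):
    """Build kubectl argument lists for inspecting Jobs and pods.

    Hypothetical helper; it wraps standard kubectl subcommands only.
    """
    commands = [
        # Overview of Jobs and pods in the namespace.
        ["kubectl", "get", "jobs", "-n", namespace],
        ["kubectl", "get", "pods", "-n", namespace],
        # Recent events often explain scheduling or image pull failures.
        ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
    ]
    if pod is not None:
        commands += [
            # Pod status, container state, and exit codes.
            ["kubectl", "describe", "pod", pod, "-n", namespace],
            # Container output for the pod.
            ["kubectl", "logs", pod, "-n", namespace],
        ]
    return commands


def run_all(commands):
    """Run each command in order, if kubectl is available on this machine."""
    if shutil.which("kubectl") is None:
        print("kubectl not found on PATH; nothing was run")
        return
    for argv in commands:
        print("$", " ".join(argv))
        subprocess.run(argv, check=False)


# Print the commands for a placeholder namespace and pod name.
for argv in kubectl_debug_commands("metaflow-jobs", pod="my-failed-pod"):
    print(" ".join(argv))
```

Calling run_all(kubectl_debug_commands("metaflow-jobs", pod="my-failed-pod")) would execute the same commands against the cluster in your current kubectl context.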
Best Practices
Use Appropriate Resource Limits
Set realistic resource requests to avoid over-provisioning:
- Start with conservative estimates
- Monitor actual usage with kubectl top pods
- Adjust based on observed requirements
- Use QoS classes effectively (Guaranteed vs Burstable)
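The QoS point in the list above depends on how requests and limits relate. As a rough sketch of the Kubernetes rule for a single-container pod, the hypothetical function below classifies a requests/limits pair. It is a simplification: it compares quantities as plain strings, so it assumes both sides use the same units.

```python
def qos_class(requests, limits):
    """Approximate the Kubernetes QoS class for a single-container pod.

    Hypothetical helper. Only cpu and memory are considered, and
    quantities are compared as strings ("1" and "1000m" would not match).
    """
    tracked = ("cpu", "memory")
    req = {k: v for k, v in requests.items() if k in tracked}
    lim = {k: v for k, v in limits.items() if k in tracked}

    # No cpu or memory requests or limits at all.
    if not req and not lim:
        return "BestEffort"

    for resource in tracked:
        # Guaranteed needs a limit for both cpu and memory.
        if resource not in lim:
            return "Burstable"
        # An omitted request defaults to the limit; otherwise they must match.
        if req.get(resource, lim[resource]) != lim[resource]:
            return "Burstable"
    return "Guaranteed"


print(qos_class({"cpu": "2", "memory": "4096M"}, {"cpu": "2", "memory": "4096M"}))
print(qos_class({"cpu": "1", "memory": "2048M"}, {"memory": "4096M"}))
print(qos_class({}, {}))
```

Guaranteed pods are the last candidates for eviction under node pressure, while Burstable pods can use spare capacity above their requests.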
Container Image Management
Optimize your Docker images:
- Use multi-stage builds to reduce image size
- Cache dependencies in image layers
- Pin specific versions for reproducibility
- Use image pull secrets for private registries
Namespace Organization
Use Kubernetes namespaces effectively:
- Separate production and development workloads
- Apply resource quotas per namespace
- Use NetworkPolicies for isolation
- Implement RBAC for access control
Cost Optimization
Reduce cloud costs:
- Use spot/preemptible instances for fault-tolerant workloads
- Set appropriate timeouts with the @timeout decorator
- Clean up completed jobs regularly
- Use cluster autoscaling
Comparison with AWS Batch
| Feature | Kubernetes | AWS Batch |
|---|---|---|
| Cloud Agnostic | ✅ Yes | ❌ AWS Only |
| Multi-Cloud | ✅ Supported | ❌ No |
| Setup Complexity | Medium | Low |
| Container Management | Full Control | Managed |
| Cost Control | Direct | Through AWS |
| GPU Support | ✅ Yes | ✅ Yes |
| Multi-Node Jobs | ✅ JobSets | ✅ Multi-node parallel jobs |
| Spot Instances | ✅ Yes | ✅ Yes |
Next Steps
Argo Workflows
Deploy production workflows with Argo Workflows orchestrator
Configuration
Explore all configuration options and environment variables
Best Practices
Learn best practices for production deployments
Debugging
Debug and troubleshoot Kubernetes execution issues
