Metaflow makes it easy to scale your data science workflows from local development to production-grade compute infrastructure. You can develop and test locally, then seamlessly execute the same code on powerful cloud resources.

Why Scale?

Data science workflows often need more resources than a laptop can provide:
  • Large datasets that don’t fit in local memory
  • Compute-intensive operations like training ML models
  • Parallel processing across multiple machines
  • GPU acceleration for deep learning
  • Long-running jobs that need dedicated infrastructure

Scaling Approaches

Metaflow provides multiple ways to scale your workflows:

Remote Execution

Run individual steps on cloud compute while keeping control local

AWS Batch

Execute steps on AWS Batch for scalable, managed compute

Kubernetes

Run steps on Kubernetes clusters for container-based orchestration

Distributed Computing

Coordinate multi-node jobs for parallel and distributed workloads

Key Concepts

Decorators for Compute

Metaflow uses Python decorators to specify compute requirements:
from metaflow import FlowSpec, step, batch, resources

class MyFlow(FlowSpec):
    @batch
    @resources(cpu=4, memory=16000)
    @step
    def train(self):
        # This step runs on AWS Batch with 4 CPUs and 16GB RAM
        pass
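To build intuition for what these stacked decorators do, here is a toy, pure-Python sketch. The names `resources`, `batch`, and `resource_spec` below are illustrative only; this is not Metaflow's actual implementation, just the general pattern of decorators recording metadata that a scheduler can read later:

```python
# Toy sketch (NOT Metaflow internals): decorators attach compute
# requirements and a target platform as attributes on the step function.
def resources(**reqs):
    def wrap(func):
        func.resource_spec = dict(reqs)  # record requested cpu/memory/etc.
        return func
    return wrap

def batch(func):
    func.compute_platform = "aws-batch"  # mark where the step should run
    return func

@batch
@resources(cpu=4, memory=16000)
def train():
    pass

# A scheduler could now inspect train.resource_spec and
# train.compute_platform to decide how and where to launch the step.
```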

Portable Resource Specs

The @resources decorator lets you specify requirements independently of the compute platform:
@resources(cpu=2, memory=8000, gpu=1)
@step
def process(self):
    pass
Then choose the platform at runtime:
# Run on AWS Batch
python myflow.py run --with batch

# Run on Kubernetes
python myflow.py run --with kubernetes
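To see why a portable spec helps, here is a hypothetical sketch of how one platform-agnostic spec could be translated into each platform's request format. The helper names and the simplified request shapes are assumptions for illustration, not Metaflow code:

```python
# Toy sketch (hypothetical helpers, simplified request shapes):
# one portable resource spec, two platform-specific translations.
SPEC = {"cpu": 2, "memory": 8000, "gpu": 1}  # memory in MB, as with @resources

def to_batch_job(spec):
    # Simplified AWS Batch containerOverrides-style request
    return {
        "vcpus": spec["cpu"],
        "memory": spec["memory"],
        "resourceRequirements": [{"type": "GPU", "value": str(spec["gpu"])}],
    }

def to_k8s_resources(spec):
    # Simplified Kubernetes resource-requests-style request
    return {
        "requests": {
            "cpu": str(spec["cpu"]),
            "memory": f"{spec['memory']}Mi",
            "nvidia.com/gpu": str(spec["gpu"]),
        }
    }
```

Because the spec itself is platform-neutral, switching backends changes only the translation step, not your flow code.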

Hybrid Execution

You can mix local and remote execution in the same flow:
from metaflow import FlowSpec, step, batch, resources

class HybridFlow(FlowSpec):
    @step
    def start(self):
        # Runs locally
        self.data = load_data()  # placeholder: your data-loading function
        self.next(self.process)

    @batch
    @resources(cpu=16, memory=64000)
    @step
    def process(self):
        # Runs on AWS Batch with 16 CPUs and 64 GB of memory
        self.results = expensive_computation(self.data)  # placeholder
        self.next(self.end)

    @step
    def end(self):
        # Runs locally
        save_results(self.results)  # placeholder

if __name__ == "__main__":
    HybridFlow()
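As a conceptual illustration of hybrid dispatch, a scheduler can route each step by an attached platform tag, defaulting to local execution. This is toy code with hypothetical helper names, not how Metaflow schedules work internally:

```python
# Toy sketch (NOT Metaflow internals): steps tagged with a platform
# are routed remotely; untagged steps default to local execution.
def on_platform(name):
    def wrap(func):
        func.platform = name
        return func
    return wrap

def start():
    return "data"

@on_platform("aws-batch")
def process(data):
    return data.upper()

def end(results):
    return f"saved:{results}"

def run_flow():
    # Build an execution plan: (where, which step)
    return [(getattr(f, "platform", "local"), f.__name__)
            for f in (start, process, end)]
```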

Platform Support

| Feature        | AWS Batch | Kubernetes | Local |
| -------------- | --------- | ---------- | ----- |
| CPU control    | ✓         | ✓          | ✗     |
| Memory control | ✓         | ✓          | ✗     |
| GPU support    | ✓         | ✓          | ✗     |
| Disk size      | Limited   | ✓          | ✗     |
| Multi-node     | ✓         | ✓          | ✗     |
| Auto-scaling   | ✓         | ✓          | ✗     |

Getting Started

1. Define resource requirements

Add @resources decorators to steps that need more compute:
@resources(cpu=4, memory=16000)
@step
def heavy_step(self):
    pass
2. Choose a compute platform

Add @batch or @kubernetes to execute on cloud infrastructure:
@batch
@resources(cpu=4, memory=16000)
@step
def heavy_step(self):
    pass
3. Configure your environment

Set up AWS credentials or Kubernetes access; see the platform-specific guides for details.
4. Run your flow

Execute your flow; decorated steps will run on the specified platform:
python myflow.py run

Best Practices

  • Develop and test locally first. Add compute decorators only to steps that need them; this keeps development fast and costs low.
  • Specify requirements with @resources instead of platform-specific parameters. This makes it easy to switch between AWS Batch and Kubernetes.
  • Monitor actual usage and adjust CPU, memory, and GPU allocations. Over-provisioning wastes money; under-provisioning causes failures.
  • Keep data close to compute: use S3 with AWS Batch and appropriate storage with Kubernetes. Metaflow handles data movement automatically.

Next Steps

Remote Execution

Learn about running steps remotely

Resources Decorator

Deep dive into @resources options

AWS Batch

Set up AWS Batch integration

Kubernetes

Configure Kubernetes execution
