Metaflow makes it easy to scale your data science workflows from local development to production-grade compute infrastructure. You can develop and test locally, then seamlessly execute the same code on powerful cloud resources.

Why Scale?

Data science workflows often need more resources than a laptop can provide:
  • Large datasets that don’t fit in local memory
  • Compute-intensive operations like training ML models
  • Parallel processing across multiple machines
  • GPU acceleration for deep learning
  • Long-running jobs that need dedicated infrastructure

Scaling Approaches

Metaflow provides multiple ways to scale your workflows:

Remote Execution

Run individual steps on cloud compute while keeping control local

AWS Batch

Execute steps on AWS Batch for scalable, managed compute

Kubernetes

Run steps on Kubernetes clusters for container-based orchestration

Distributed Computing

Coordinate multi-node jobs for parallel and distributed workloads

Key Concepts

Decorators for Compute

Metaflow uses Python decorators to specify compute requirements:
from metaflow import FlowSpec, step, batch, resources

class MyFlow(FlowSpec):
    @batch
    @resources(cpu=4, memory=16000)
    @step
    def train(self):
        # This step runs on AWS Batch with 4 CPUs and 16GB RAM
        pass
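To build intuition for what these stacked decorators do, here is a toy, pure-Python sketch. The names `resources`, `batch`, and `resource_spec` below are illustrative only; this is not Metaflow's actual implementation, just the general pattern of decorators recording metadata that a scheduler can read later:

```python
# Toy sketch (NOT Metaflow internals): decorators attach compute
# requirements and a target platform as attributes on the step function.
def resources(**reqs):
    def wrap(func):
        func.resource_spec = dict(reqs)  # record requested cpu/memory/etc.
        return func
    return wrap

def batch(func):
    func.compute_platform = "aws-batch"  # mark where the step should run
    return func

@batch
@resources(cpu=4, memory=16000)
def train():
    pass

# A scheduler could now inspect train.resource_spec and
# train.compute_platform to decide how and where to launch the step.
```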

Portable Resource Specs

The @resources decorator lets you specify requirements independently of the compute platform:
@resources(cpu=2, memory=8000, gpu=1)
@step
def process(self):
    pass
Then choose the platform at runtime:
# Run on AWS Batch
python myflow.py run --with batch

# Run on Kubernetes
python myflow.py run --with kubernetes
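To see why a portable spec helps, here is a hypothetical sketch of how one platform-agnostic spec could be translated into each platform's request format. The helper names and the simplified request shapes are assumptions for illustration, not Metaflow code:

```python
# Toy sketch (hypothetical helpers, simplified request shapes):
# one portable resource spec, two platform-specific translations.
SPEC = {"cpu": 2, "memory": 8000, "gpu": 1}  # memory in MB, as with @resources

def to_batch_job(spec):
    # Simplified AWS Batch containerOverrides-style request
    return {
        "vcpus": spec["cpu"],
        "memory": spec["memory"],
        "resourceRequirements": [{"type": "GPU", "value": str(spec["gpu"])}],
    }

def to_k8s_resources(spec):
    # Simplified Kubernetes resource-requests-style request
    return {
        "requests": {
            "cpu": str(spec["cpu"]),
            "memory": f"{spec['memory']}Mi",
            "nvidia.com/gpu": str(spec["gpu"]),
        }
    }
```

Because the spec itself is platform-neutral, switching backends changes only the translation step, not your flow code.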

Hybrid Execution

You can mix local and remote execution in the same flow:
from metaflow import FlowSpec, step, batch, resources

class HybridFlow(FlowSpec):
    @step
    def start(self):
        # Runs locally
        self.data = load_data()  # placeholder: your data-loading function
        self.next(self.process)

    @batch
    @resources(cpu=16, memory=64000)
    @step
    def process(self):
        # Runs on AWS Batch with 16 CPUs and 64 GB of memory
        self.results = expensive_computation(self.data)  # placeholder
        self.next(self.end)

    @step
    def end(self):
        # Runs locally
        save_results(self.results)  # placeholder

if __name__ == "__main__":
    HybridFlow()
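As a conceptual illustration of hybrid dispatch, a scheduler can route each step by an attached platform tag, defaulting to local execution. This is toy code with hypothetical helper names, not how Metaflow schedules work internally:

```python
# Toy sketch (NOT Metaflow internals): steps tagged with a platform
# are routed remotely; untagged steps default to local execution.
def on_platform(name):
    def wrap(func):
        func.platform = name
        return func
    return wrap

def start():
    return "data"

@on_platform("aws-batch")
def process(data):
    return data.upper()

def end(results):
    return f"saved:{results}"

def run_flow():
    # Build an execution plan: (where, which step)
    return [(getattr(f, "platform", "local"), f.__name__)
            for f in (start, process, end)]
```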

Platform Support

| Feature        | AWS Batch | Kubernetes | Local |
| -------------- | --------- | ---------- | ----- |
| CPU control    | ✓         | ✓          | ✗     |
| Memory control | ✓         | ✓          | ✗     |
| GPU support    | ✓         | ✓          | ✗     |
| Disk size      | Limited   | ✓          | ✗     |
| Multi-node     | ✓         | ✓          | ✗     |
| Auto-scaling   | ✓         | ✓          | ✗     |

Getting Started

1. Define resource requirements

Add @resources decorators to steps that need more compute:
@resources(cpu=4, memory=16000)
@step
def heavy_step(self):
    pass
2. Choose a compute platform

Add @batch or @kubernetes to execute on cloud infrastructure:
@batch
@resources(cpu=4, memory=16000)
@step
def heavy_step(self):
    pass
3. Configure your environment

Set up AWS credentials or Kubernetes access; see the platform-specific guides for details.
4. Run your flow

Execute your flow; decorated steps will run on the specified platform:
python myflow.py run

Best Practices

  • Develop and test locally first. Add compute decorators only to steps that need them; this keeps development fast and costs low.
  • Specify requirements with @resources instead of platform-specific parameters. This makes it easy to switch between AWS Batch and Kubernetes.
  • Monitor actual usage and adjust CPU, memory, and GPU allocations. Over-provisioning wastes money; under-provisioning causes failures.
  • Keep data close to compute: use S3 with AWS Batch and appropriate storage with Kubernetes. Metaflow handles data movement automatically.

Next Steps

Remote Execution

Learn about running steps remotely

Resources Decorator

Deep dive into @resources options

AWS Batch

Set up AWS Batch integration

Kubernetes

Configure Kubernetes execution
