Remote execution allows you to run individual steps of your workflow on cloud infrastructure while keeping development and control on your local machine. This is one of Metaflow’s core features that enables seamless scaling.
## How It Works
When you add a compute decorator such as `@batch` or `@kubernetes` to a step, Metaflow:

1. Packages your code and dependencies
2. Uploads the package to cloud storage (S3, Azure Blob, or GCS)
3. Launches a remote container with your specified resources
4. Executes the step on the remote compute
5. Stores results back to cloud storage
6. Continues with the next step (local or remote)
```python
from metaflow import FlowSpec, step, batch

class RemoteFlow(FlowSpec):

    @step
    def start(self):
        # Runs locally on your laptop
        print("Starting locally")
        self.data = [1, 2, 3, 4, 5]
        self.next(self.process)

    @batch
    @step
    def process(self):
        # Runs remotely on AWS Batch
        print("Processing on AWS Batch")
        self.results = [x * 2 for x in self.data]
        self.next(self.end)

    @step
    def end(self):
        # Runs locally again
        print(f"Results: {self.results}")
```
## Benefits

- **Seamless development**: develop and test locally, then deploy remotely without code changes
- **Resource flexibility**: use powerful machines only when needed, keeping costs down
- **Automatic data flow**: data moves between local and remote steps automatically
- **Hybrid workflows**: mix local and remote execution in the same workflow
## Choosing Between Local and Remote

Decide where each step should run based on its requirements.

Run locally when:

- Fast iteration is needed during development
- Data is small enough to fit in memory
- Computation is light and completes quickly
- Interactive work requires immediate feedback
- Cost matters: avoid cloud charges for simple operations

Run remotely when:

- Memory requirements are large (>16 GB)
- Many CPUs are needed (>4 cores)
- GPU acceleration is required
- Jobs are long-running (>30 minutes)
- Datasets are too large to fit locally
- Production deployment demands reliability
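The checklist above can be mechanized. A small helper illustrates the rule-of-thumb thresholds (`choose_target` is hypothetical, not a Metaflow API):

```python
def choose_target(memory_gb=1, cpus=1, needs_gpu=False, runtime_min=1):
    """Pick an execution target using the rule-of-thumb thresholds above.

    Any single requirement that exceeds a typical laptop pushes the step
    to remote compute; otherwise local iteration is faster and cheaper.
    """
    if needs_gpu or memory_gb > 16 or cpus > 4 or runtime_min > 30:
        return "remote"
    return "local"

print(choose_target(memory_gb=8, cpus=2))    # → local  (small job)
print(choose_target(memory_gb=64, cpus=2))   # → remote (big memory)
```

In practice the decision is expressed by simply adding or omitting `@batch` on the step.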
## Data Transfer

Metaflow automatically handles data transfer between steps:
```python
import numpy as np
from metaflow import FlowSpec, step, batch

class DataTransferFlow(FlowSpec):

    @step
    def start(self):
        # Local step creates data
        self.large_array = np.random.rand(1000000)
        self.next(self.process)

    @batch
    @step
    def process(self):
        # Remote step receives the data automatically
        print(f"Array size: {len(self.large_array)}")
        self.result = self.large_array.sum()
        self.next(self.end)

    @step
    def end(self):
        # Local step gets results back
        print(f"Sum: {self.result}")
```
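Under the hood, each `self.*` artifact is pickled and written to the datastore, then loaded again by the next task. A toy sketch of that round trip, with a plain dict standing in for the real content-addressed cloud store:

```python
import hashlib
import pickle

def store_artifact(obj, datastore):
    """Serialize an object and file it under a content-derived key."""
    blob = pickle.dumps(obj)
    key = hashlib.sha1(blob).hexdigest()
    datastore[key] = blob
    return key

def load_artifact(key, datastore):
    """Deserialize an artifact previously written by store_artifact."""
    return pickle.loads(datastore[key])

fake_store = {}                         # stands in for S3 / Azure Blob / GCS
key = store_artifact([1, 2, 3, 4, 5], fake_store)
print(load_artifact(key, fake_store))   # → [1, 2, 3, 4, 5]
```

This is why artifacts must be picklable, and why passing very large objects between steps incurs real upload and download time.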
## Data Storage

Data flows through cloud storage:

- **AWS Batch**: uses S3 (requires `--datastore=s3`)
- **Kubernetes**: uses S3, Azure Blob, or GCS
- **Local**: uses the local filesystem

Always use `--datastore=s3` (or `azure`/`gs`) when running remote steps; the local datastore does not work with remote execution.
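For example, to run a flow with the S3 datastore (the datastore option goes before the `run` command in the Metaflow CLI; substitute your own flow file):

```bash
python myflow.py --datastore=s3 run --with batch
```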
## Code Packaging

Metaflow automatically packages your code for remote execution:

```bash
# Your local code is packaged automatically
python myflow.py run --with batch
```
### What Gets Packaged

- Your flow script
- Any local Python modules imported by your flow
- Files in the same directory (by default)

### What Doesn't Get Packaged

- Large data files (use cloud storage instead)
- Virtual environments (recreated remotely)
- System libraries (use Docker images)
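By default only common source-code suffixes are packaged; additional file types can be pulled in with the top-level `--package-suffixes` option (check `python myflow.py --help` to confirm the exact syntax in your Metaflow version):

```bash
# Also package .csv and .json files from the flow directory
python myflow.py --package-suffixes=.csv,.json run --with batch
```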
### Custom Packaging

Include specific files or directories:

```python
from metaflow import FlowSpec, step, batch

class CustomPackageFlow(FlowSpec):
    """
    Include: mydata/*.csv, config/
    """

    @batch
    @step
    def start(self):
        # Can access mydata/*.csv and config/ files
        pass
```
## Environment Management

### Conda Environments

Metaflow can recreate your Conda environment remotely:

```python
from metaflow import FlowSpec, step, batch, conda

class CondaFlow(FlowSpec):

    @conda(libraries={'pandas': '2.0.0', 'scikit-learn': '1.3.0'})
    @batch
    @step
    def train(self):
        import pandas as pd
        import sklearn
        # Use the packages on the remote instance
        pass
```
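Flows that use `@conda` are run with the conda environment enabled on the command line (assuming the flow file is named `condaflow.py`):

```bash
python condaflow.py --environment=conda run --with batch
```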
### Docker Images

Specify a custom Docker image with all dependencies preinstalled:

```python
from metaflow import FlowSpec, step, batch

class DockerFlow(FlowSpec):

    @batch(image='myregistry/ml-image:v1.0')
    @step
    def train(self):
        # Runs in the custom Docker container
        pass
```
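A default image for all Batch steps can also be set once through Metaflow configuration instead of per step; `METAFLOW_BATCH_CONTAINER_IMAGE` is the standard variable for AWS Batch:

```bash
export METAFLOW_BATCH_CONTAINER_IMAGE=myregistry/ml-image:v1.0
```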
## Monitoring Remote Execution

### Real-time Logs

Logs stream to your terminal automatically:

```bash
python myflow.py run
# [remote-step/123] Starting remote execution
# [remote-step/123] Processing data...
# [remote-step/123] Complete!
```
### Check Status

Monitor running tasks:

```bash
# List all runs
python myflow.py list runs

# Show a specific run
python myflow.py show 123

# Check logs later
python myflow.py logs 123/remote-step/456
```
## Error Handling

Remote execution includes automatic retries and error handling:

```python
from metaflow import FlowSpec, step, batch, retry

class RobustFlow(FlowSpec):

    @retry(times=3)
    @batch
    @step
    def flaky_step(self):
        # Automatically retried up to 3 times on failure
        risky_operation()
        self.next(self.end)
```
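The semantics of `@retry` can be sketched as a plain decorator. This is a simplified illustration, not Metaflow's implementation, which also handles backoff and scheduler-level retries:

```python
import functools

def retry(times=3):
    """Re-invoke the wrapped function up to `times` extra attempts on failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(times + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise   # out of retries: propagate the error
        return wrapper
    return decorator

calls = {"n": 0}

@retry(times=3)
def flaky():
    # Fails twice, then succeeds: a stand-in for a transient cloud error
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(flaky())   # → ok (succeeds on the third attempt)
```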
### Fallback to Local

Use `@catch` for graceful degradation:

```python
from metaflow import FlowSpec, step, batch, catch

class FallbackFlow(FlowSpec):

    @catch(var='compute_failed')
    @batch
    @step
    def remote_step(self):
        # Try remote execution; on failure the exception is stored
        # in self.compute_failed and the flow continues
        self.result = expensive_computation()
        self.next(self.end)

    @step
    def end(self):
        if self.compute_failed:
            print("Remote execution failed, using cached results")
```
## Configuration

### AWS Batch Setup

Configure AWS Batch access:

```bash
# Set AWS credentials
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret

# Configure Metaflow
export METAFLOW_BATCH_JOB_QUEUE=your-queue
export METAFLOW_DATASTORE_SYSROOT_S3=s3://your-bucket/metaflow
```
### Kubernetes Setup

Configure Kubernetes access:

```bash
# Ensure kubectl is configured
kubectl cluster-info

# Set Metaflow config
export METAFLOW_KUBERNETES_NAMESPACE=metaflow
export METAFLOW_DATASTORE_SYSROOT_S3=s3://your-bucket/metaflow
```

See the platform-specific pages for detailed setup.
## Best Practices

### Process data where it lives

Load large datasets from S3 within remote steps rather than passing them through local steps:

```python
@batch
@step
def process(self):
    # Good: load data directly on the remote instance
    df = pd.read_csv('s3://bucket/large.csv')
    # Bad: don't pass large data from a local step
    # self.df = pd.read_csv('local_large.csv')
```

### Reuse environments

Docker images and Conda environments are cached, so reusing the same specifications across runs speeds up startup.

### Batch related work

Group operations that need similar resources into the same step rather than creating many small remote steps.
### Use foreach for parallelism

Process independent items in parallel across multiple remote workers:

```python
@step
def start(self):
    self.items = list(range(100))
    self.next(self.process, foreach='items')

@batch
@step
def process(self):
    # Each item runs on a separate remote instance
    self.result = expensive_op(self.input)
    self.next(self.join)

@step
def join(self, inputs):
    # Gather the per-item results
    self.results = [i.result for i in inputs]
    self.next(self.end)
```
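The foreach fan-out/fan-in shape is ordinary map-reduce. A local sketch using `concurrent.futures` in place of remote Batch workers (`expensive_op` is a stand-in for real per-item work):

```python
from concurrent.futures import ThreadPoolExecutor

def expensive_op(x):
    # Stand-in for per-item work that would run on a remote instance
    return x * x

items = list(range(10))

# Fan out: each item is processed independently, in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(expensive_op, items))

# Fan in: the join step merges per-item results
total = sum(results)
print(total)   # → 285
```

With `@batch`, the executor is effectively the cloud scheduler, so parallelism is limited by your job queue rather than local cores.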
## Common Patterns

### ETL Pipeline

```python
from metaflow import FlowSpec, step, batch, resources

class ETLFlow(FlowSpec):

    @step
    def start(self):
        # Local: identify data to process
        self.files = list_s3_files('s3://bucket/raw/')
        self.next(self.extract, foreach='files')

    @batch
    @resources(cpu=4, memory=16000)
    @step
    def extract(self):
        # Remote: heavy extraction
        self.data = extract_and_transform(self.input)
        self.next(self.join)

    @step
    def join(self, inputs):
        # Local: merge results
        self.merged = merge_results([i.data for i in inputs])
        self.next(self.end)
```
ML Training
class TrainFlow ( FlowSpec ):
@step
def start ( self ):
# Local: Prepare experiment parameters
self .configs = generate_configs()
self .next( self .train)
@batch ( gpu = 1 , memory = 32000 )
@step
def train ( self ):
# Remote GPU: Train model
self .model = train_model( self .configs)
self .metrics = evaluate( self .model)
self .next( self .end)
@step
def end ( self ):
# Local: Log results
print ( f "Accuracy: { self .metrics[ 'accuracy' ] } " )
## Next Steps

- **AWS Batch**: configure AWS Batch for remote execution
- **Kubernetes**: set up Kubernetes clusters
- **Resources**: manage CPU, memory, and GPU requirements
- **Distributed**: scale to multi-node workloads