
Overview

This guide walks you through configuring Metaflow to work with AWS services, including AWS Step Functions for orchestration and AWS Batch for compute execution.
For a quick automated setup, use the CloudFormation templates provided by Outerbounds. This guide covers manual configuration for advanced users.

Prerequisites

  • AWS account with administrative access
  • AWS CLI installed and configured
  • Python 3.7+ with Metaflow installed
  • S3 bucket for data storage

Quick Setup

Using Metaflow Configure

Run the interactive configuration wizard:
metaflow configure aws
This will guide you through:
  1. AWS credentials setup
  2. S3 datastore configuration
  3. Metadata service connection (optional)
  4. AWS Batch configuration
  5. Step Functions settings

Environment Variables

Alternatively, configure via environment variables:
# S3 Datastore
export METAFLOW_DATASTORE_SYSROOT_S3=s3://my-metaflow-bucket/metaflow
export METAFLOW_DATATOOLS_S3ROOT=s3://my-metaflow-bucket/data

# AWS Batch
export METAFLOW_BATCH_JOB_QUEUE=my-job-queue
export METAFLOW_ECS_S3_ACCESS_IAM_ROLE=arn:aws:iam::123456789:role/MetaflowBatchRole

# Step Functions
export METAFLOW_SFN_IAM_ROLE=arn:aws:iam::123456789:role/MetaflowStepFunctionsRole
export METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE=arn:aws:iam::123456789:role/MetaflowEventsRole
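Before running a flow, it can help to sanity-check that the required variables are actually set in your shell. A minimal sketch (the variable names match the exports above; the check itself is not part of Metaflow):

```python
import os

REQUIRED = [
    "METAFLOW_DATASTORE_SYSROOT_S3",
    "METAFLOW_BATCH_JOB_QUEUE",
    "METAFLOW_ECS_S3_ACCESS_IAM_ROLE",
    "METAFLOW_SFN_IAM_ROLE",
]

def missing_vars(env=None):
    """Return the names of required Metaflow variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing configuration:", ", ".join(missing))
    else:
        print("All required variables are set")
```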

AWS Resources Setup

1. S3 Bucket

Create an S3 bucket for Metaflow data:
aws s3 mb s3://my-metaflow-bucket --region us-east-1
Optionally, attach a bucket policy to scope access, for example granting the Batch task role (created below) access to the bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789:role/MetaflowBatchRole"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-metaflow-bucket",
        "arn:aws:s3:::my-metaflow-bucket/*"
      ]
    }
  ]
}

2. AWS Batch Setup

Compute Environment

Create a compute environment:
aws batch create-compute-environment \
  --compute-environment-name metaflow-compute-env \
  --type MANAGED \
  --state ENABLED \
  --compute-resources '{
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 256,
    "desiredvCpus": 0,
    "instanceTypes": ["optimal"],
    "subnets": ["subnet-12345"],
    "securityGroupIds": ["sg-12345"],
    "instanceRole": "arn:aws:iam::123456789:instance-profile/ecsInstanceRole"
  }' \
  --service-role arn:aws:iam::123456789:role/aws-batch-service-role
Use SPOT instances instead of EC2 for significant cost savings:
{
  "type": "SPOT",
  "bidPercentage": 100,
  ...
}

Job Queue

Create a job queue:
aws batch create-job-queue \
  --job-queue-name metaflow-job-queue \
  --state ENABLED \
  --priority 1 \
  --compute-environment-order order=1,computeEnvironment=metaflow-compute-env

3. DynamoDB Table (for Foreach)

If using foreach steps, create a DynamoDB table:
aws dynamodb create-table \
  --table-name metaflow-step-functions \
  --attribute-definitions \
    AttributeName=pathspec,AttributeType=S \
  --key-schema \
    AttributeName=pathspec,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --time-to-live-specification \
    Enabled=true,AttributeName=ttl
The TTL (time-to-live) attribute automatically cleans up old entries to reduce costs.
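Entries in this table carry an epoch-seconds ttl attribute keyed on pathspec; DynamoDB deletes each item once its timestamp passes. A hypothetical sketch of how such an item is shaped (the make_tracking_item helper and the 7-day retention window are illustrative, not Metaflow's internals):

```python
import time

def make_tracking_item(pathspec, retention_days=7):
    """Build a DynamoDB item keyed on pathspec with an epoch-seconds ttl.

    retention_days is illustrative; Metaflow manages its own entries.
    """
    now = int(time.time())
    return {
        "pathspec": {"S": pathspec},                          # HASH key
        "ttl": {"N": str(now + retention_days * 86400)},      # TTL attribute
    }
```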

4. CloudWatch Log Group (Optional)

For Step Functions execution logging:
aws logs create-log-group \
  --log-group-name /aws/vendedlogs/states/metaflow

IAM Configuration

Batch Execution Role

Create an IAM role for AWS Batch containers:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Attach policies:
aws iam attach-role-policy \
  --role-name MetaflowBatchRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

aws iam attach-role-policy \
  --role-name MetaflowBatchRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
Custom policy for DynamoDB (if using foreach):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:GetItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/metaflow-step-functions"
    }
  ]
}

Step Functions Execution Role

Create role for Step Functions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "states.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Attach policies:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "batch:SubmitJob",
        "batch:DescribeJobs",
        "batch:TerminateJob"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:PutTargets",
        "events:PutRule",
        "events:DescribeRule"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/metaflow-step-functions"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogDelivery",
        "logs:GetLogDelivery",
        "logs:UpdateLogDelivery",
        "logs:DeleteLogDelivery",
        "logs:ListLogDeliveries",
        "logs:PutResourcePolicy",
        "logs:DescribeResourcePolicies",
        "logs:DescribeLogGroups"
      ],
      "Resource": "*"
    }
  ]
}

EventBridge Role (for Scheduled Workflows)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "states:StartExecution",
      "Resource": "arn:aws:states:*:*:stateMachine:*"
    }
  ]
}

Configuration Reference

Required Variables

| Variable | Description | Example |
| --- | --- | --- |
| METAFLOW_DATASTORE_SYSROOT_S3 | S3 path for datastore | s3://bucket/metaflow |
| METAFLOW_SFN_IAM_ROLE | Step Functions IAM role ARN | arn:aws:iam::123456789:role/SFNRole |
| METAFLOW_ECS_S3_ACCESS_IAM_ROLE | Batch container IAM role ARN | arn:aws:iam::123456789:role/BatchRole |
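Both role variables must hold full IAM role ARNs of the form arn:partition:service:region:account:resource. A small sketch that splits such an ARN into its parts, which can help catch truncated or copy-pasted values (the parse_role_arn helper is illustrative):

```python
def parse_role_arn(arn):
    """Split an IAM role ARN into (account_id, role_name); raise on malformed input."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "iam":
        raise ValueError(f"not an IAM ARN: {arn!r}")
    resource = parts[5]
    if not resource.startswith("role/"):
        raise ValueError(f"not a role ARN: {arn!r}")
    return parts[4], resource[len("role/"):]
```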

Optional Variables

S3 Configuration

# Custom S3 endpoint (for S3-compatible storage)
export METAFLOW_S3_ENDPOINT_URL=https://s3.custom-endpoint.com

# Server-side encryption
export METAFLOW_S3_SERVER_SIDE_ENCRYPTION=AES256

# Data tools S3 location
export METAFLOW_DATATOOLS_S3ROOT=s3://bucket/data

# Card artifacts location
export METAFLOW_CARD_S3ROOT=s3://bucket/cards

AWS Batch Configuration

# Default job queue
export METAFLOW_BATCH_JOB_QUEUE=default-queue

# Default container image
export METAFLOW_BATCH_CONTAINER_IMAGE=my-image:latest

# Default container registry
export METAFLOW_BATCH_CONTAINER_REGISTRY=123456789.dkr.ecr.us-east-1.amazonaws.com

# Fargate execution role (for Fargate)
export METAFLOW_ECS_FARGATE_EXECUTION_ROLE=arn:aws:iam::123456789:role/FargateRole

# Enable/disable auto-tagging
export METAFLOW_BATCH_EMIT_TAGS=true

# Default AWS tags
export METAFLOW_BATCH_DEFAULT_TAGS='{"project": "ml", "team": "data"}'
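METAFLOW_BATCH_DEFAULT_TAGS must hold a valid JSON object mapping strings to strings. A quick validation sketch you could run before exporting the value (the validate_tags helper is illustrative):

```python
import json

def validate_tags(raw):
    """Parse a tags string and confirm it is a flat JSON object of strings."""
    tags = json.loads(raw)
    if not isinstance(tags, dict):
        raise ValueError("tags must be a JSON object")
    for k, v in tags.items():
        if not isinstance(k, str) or not isinstance(v, str):
            raise ValueError(f"tag {k!r} must map a string to a string")
    return tags
```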

Step Functions Configuration

# EventBridge IAM role (for schedules)
export METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE=arn:aws:iam::123456789:role/EventsRole

# DynamoDB table (for foreach)
export METAFLOW_SFN_DYNAMO_DB_TABLE=metaflow-step-functions

# CloudWatch log group ARN (for execution logs)
export METAFLOW_SFN_EXECUTION_LOG_GROUP_ARN=arn:aws:logs:us-east-1:123456789:log-group:/aws/vendedlogs/states/metaflow

# S3 path for distributed map outputs
export METAFLOW_SFN_S3_DISTRIBUTED_MAP_OUTPUT_PATH=s3://bucket/distributed-map

# State machine name prefix
export METAFLOW_SFN_STATE_MACHINE_PREFIX=prod

# Enable state machine compression
export METAFLOW_SFN_COMPRESS_STATE_MACHINE=true

Metadata Service

# Metadata service URL (for Metaflow service)
export METAFLOW_SERVICE_URL=https://metadata.example.com

# Internal service URL (for containers)
export METAFLOW_SERVICE_INTERNAL_URL=http://metadata.internal

# Service authentication headers
export METAFLOW_SERVICE_HEADERS='{"Authorization": "Bearer token"}'

# Default metadata backend
export METAFLOW_DEFAULT_METADATA=service

Secrets Management

# Default secrets backend
export METAFLOW_DEFAULT_SECRETS_BACKEND_TYPE=aws-secrets-manager

# Secrets Manager region
export METAFLOW_AWS_SECRETS_MANAGER_DEFAULT_REGION=us-east-1

Region Configuration

Metaflow uses your default AWS region from:
  1. AWS_DEFAULT_REGION environment variable
  2. AWS CLI configuration (~/.aws/config)
  3. EC2 instance metadata (when running on EC2)
Set explicitly:
export AWS_DEFAULT_REGION=us-west-2
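The fallback order above can be sketched as a small resolver. This is only an illustration of the lookup chain: real resolution is handled by the AWS SDK, and the instance-metadata step is omitted here:

```python
import configparser
import os

def resolve_region(env=None, aws_config_path="~/.aws/config"):
    """Resolve the AWS region: env var first, then the CLI config file."""
    env = os.environ if env is None else env
    # 1. AWS_DEFAULT_REGION environment variable
    region = env.get("AWS_DEFAULT_REGION")
    if region:
        return region
    # 2. [default] section of ~/.aws/config
    path = os.path.expanduser(aws_config_path)
    if os.path.exists(path):
        cfg = configparser.ConfigParser()
        cfg.read(path)
        if cfg.has_option("default", "region"):
            return cfg.get("default", "region")
    # 3. On EC2, the SDK would fall back to instance metadata
    return None
```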

VPC Configuration

For private VPC deployments:

Batch Compute Environment

Specify VPC settings when creating compute environment:
{
  "computeResources": {
    "subnets": [
      "subnet-12345",
      "subnet-67890"
    ],
    "securityGroupIds": [
      "sg-12345"
    ]
  }
}

VPC Endpoints

For fully private deployments, create VPC endpoints:
# S3 endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-12345 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-12345

# ECR endpoints
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-12345 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-12345 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api

Container Registry Setup

Amazon ECR

Create a repository:
aws ecr create-repository --repository-name metaflow/my-image
Authenticate Docker:
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  123456789.dkr.ecr.us-east-1.amazonaws.com
Build and push:
docker build -t metaflow/my-image .
docker tag metaflow/my-image:latest \
  123456789.dkr.ecr.us-east-1.amazonaws.com/metaflow/my-image:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/metaflow/my-image:latest
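The fully qualified image reference pushed above follows the standard ECR naming scheme, which is also what METAFLOW_BATCH_CONTAINER_REGISTRY and METAFLOW_BATCH_CONTAINER_IMAGE combine into. A sketch of that composition (the ecr_image_uri helper is illustrative):

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build a fully qualified ECR image URI from its components."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"
```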

Testing Configuration

Verify S3 Access

from metaflow import FlowSpec, step

class TestS3Flow(FlowSpec):
    @step
    def start(self):
        self.data = "test"
        self.next(self.end)
    
    @step
    def end(self):
        print(f"Data: {self.data}")

if __name__ == '__main__':
    TestS3Flow()
Run:
python test_flow.py run --datastore=s3

Verify Batch Access

from metaflow import FlowSpec, step, batch

class TestBatchFlow(FlowSpec):
    @batch(cpu=1, memory=2000)
    @step
    def start(self):
        print("Running on AWS Batch")
        self.next(self.end)
    
    @step
    def end(self):
        pass

if __name__ == '__main__':
    TestBatchFlow()
Run:
python test_batch_flow.py run --with batch

Verify Step Functions

python test_flow.py step-functions create --only-json
This validates configuration without deploying.

Configuration Files

Config File Location

Metaflow configuration is stored in:
  • ~/.metaflowconfig/config.json (user config)
  • Environment variables (take precedence)
Example config:
{
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://my-bucket/metaflow",
  "METAFLOW_BATCH_JOB_QUEUE": "default-queue",
  "METAFLOW_SFN_IAM_ROLE": "arn:aws:iam::123456789:role/SFNRole"
}
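Since environment variables take precedence over config.json, the effective configuration can be modeled as a simple merge. A sketch of that precedence (not Metaflow's actual loader):

```python
import json
import os

def effective_config(config_path, env=None):
    """Merge config.json with METAFLOW_* environment variables (env wins)."""
    env = os.environ if env is None else env
    with open(config_path) as f:
        config = json.load(f)
    overrides = {k: v for k, v in env.items() if k.startswith("METAFLOW_")}
    config.update(overrides)
    return config
```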

Environment-Specific Configs

Manage multiple environments with Metaflow profiles. Running metaflow configure aws --profile <name> writes ~/.metaflowconfig/config_<name>.json; select a profile at runtime:
# Production
export METAFLOW_PROFILE=production

# Staging
export METAFLOW_PROFILE=staging
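Profile selection follows a filename convention: METAFLOW_PROFILE=<name> maps to config_<name>.json under ~/.metaflowconfig, with plain config.json as the default. A sketch of that resolution:

```python
import os

def config_filename(profile=None):
    """Map a Metaflow profile name to its config file path."""
    profile = profile or os.environ.get("METAFLOW_PROFILE")
    name = f"config_{profile}.json" if profile else "config.json"
    return os.path.join(os.path.expanduser("~/.metaflowconfig"), name)
```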

Troubleshooting

Configuration Issues

Issue: “No IAM role found for AWS Step Functions”
Solution: Set METAFLOW_SFN_IAM_ROLE:
export METAFLOW_SFN_IAM_ROLE=arn:aws:iam::123456789:role/StepFunctionsRole
Issue: “Unable to locate credentials”
Solution: Configure the AWS CLI:
aws configure
Issue: “Access Denied” errors
Solution: Verify that the IAM role policies include the necessary permissions for S3, Batch, Step Functions, and DynamoDB.

Network Issues

Issue: Container can’t reach the metadata service
Solution: Check VPC endpoints and security groups, then verify HTTP connectivity from inside the container (the metadata service exposes a ping endpoint):
# Verify connectivity from container
curl $METAFLOW_SERVICE_INTERNAL_URL/ping
Issue: ECR pull failures
Solution: Ensure that:
  • VPC has ECR endpoints
  • Security groups allow HTTPS
  • IAM role has ECR permissions

Best Practices

Separate S3 buckets for different environments:
# Production
s3://prod-metaflow-bucket
# Staging
s3://staging-metaflow-bucket
# Development  
s3://dev-metaflow-bucket
Protect against accidental deletions:
aws s3api put-bucket-versioning \
  --bucket my-metaflow-bucket \
  --versioning-configuration Status=Enabled
Grant only necessary permissions to each role. Avoid wildcards in production.
Configure logging for troubleshooting:
export METAFLOW_SFN_EXECUTION_LOG_GROUP_ARN=arn:aws:logs:...
Use consistent tagging for cost allocation:
export METAFLOW_BATCH_DEFAULT_TAGS='{"Environment": "prod", "Team": "ml"}'

CloudFormation Templates

For automated setup, use CloudFormation templates:

Outerbounds CloudFormation

One-click AWS infrastructure setup with CloudFormation

Next Steps

AWS Batch

Start using AWS Batch for compute

Step Functions

Deploy workflows to Step Functions

Quickstart

Run your first Metaflow flow

Tutorial

Complete tutorial series
