Overview
This guide walks you through configuring Metaflow to work with AWS services, including AWS Step Functions for orchestration and AWS Batch for compute execution.
For a quick automated setup, use the CloudFormation templates provided by Outerbounds. This guide covers manual configuration for advanced users.
Prerequisites
AWS account with administrative access
AWS CLI installed and configured
Python 3.7+ with Metaflow installed
S3 bucket for data storage
Quick Setup
Run the interactive configuration wizard:
metaflow configure aws
This will guide you through:
AWS credentials setup
S3 datastore configuration
Metadata service connection (optional)
AWS Batch configuration
Step Functions settings
Environment Variables
Alternatively, configure via environment variables:
# S3 Datastore
export METAFLOW_DATASTORE_SYSROOT_S3=s3://my-metaflow-bucket/metaflow
export METAFLOW_DATATOOLS_S3ROOT=s3://my-metaflow-bucket/data
# AWS Batch
export METAFLOW_BATCH_JOB_QUEUE=my-job-queue
export METAFLOW_ECS_S3_ACCESS_IAM_ROLE=arn:aws:iam::123456789:role/MetaflowBatchRole
# Step Functions
export METAFLOW_SFN_IAM_ROLE=arn:aws:iam::123456789:role/MetaflowStepFunctionsRole
export METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE=arn:aws:iam::123456789:role/MetaflowEventsRole
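Before deploying, it can help to verify that the required variables are actually set. The helper below is illustrative only (not part of Metaflow); the variable list matches the ones this guide treats as required.

```python
import os

# The variables this guide treats as required (see above).
REQUIRED = [
    "METAFLOW_DATASTORE_SYSROOT_S3",
    "METAFLOW_SFN_IAM_ROLE",
    "METAFLOW_ECS_S3_ACCESS_IAM_ROLE",
]

def missing_vars(env=None):
    """Return the required Metaflow variables absent from env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if name not in env]
```

A deployment script can call `missing_vars()` and fail fast if the list is non-empty.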
AWS Resources Setup
1. S3 Bucket
Create an S3 bucket for Metaflow data:
aws s3 mb s3://my-metaflow-bucket --region us-east-1
Optionally, set a bucket policy to grant the AWS Batch service access to the bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "batch.amazonaws.com"
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-metaflow-bucket",
        "arn:aws:s3:::my-metaflow-bucket/*"
      ]
    }
  ]
}
2. AWS Batch Setup
Compute Environment
Create a compute environment:
aws batch create-compute-environment \
--compute-environment-name metaflow-compute-env \
--type MANAGED \
--state ENABLED \
--compute-resources '{
"type": "EC2",
"minvCpus": 0,
"maxvCpus": 256,
"desiredvCpus": 0,
"instanceTypes": ["optimal"],
"subnets": ["subnet-12345"],
"securityGroupIds": ["sg-12345"],
"instanceRole": "arn:aws:iam::123456789:instance-profile/ecsInstanceRole"
}' \
--service-role arn:aws:iam::123456789:role/aws-batch-service-role
For significant cost savings, use SPOT instances instead of on-demand EC2 in the compute resources:
{
  "type": "SPOT",
  "bidPercentage": 100,
  ...
}
Job Queue
Create a job queue:
aws batch create-job-queue \
--job-queue-name metaflow-job-queue \
--state ENABLED \
--priority 1 \
--compute-environment-order order=1,computeEnvironment=metaflow-compute-env
3. DynamoDB Table (for Foreach)
If using foreach steps, create a DynamoDB table:
aws dynamodb create-table \
  --table-name metaflow-step-functions \
  --attribute-definitions \
    AttributeName=pathspec,AttributeType=S \
  --key-schema \
    AttributeName=pathspec,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
Then enable TTL (create-table does not accept a TTL flag; it is configured separately):
aws dynamodb update-time-to-live \
  --table-name metaflow-step-functions \
  --time-to-live-specification \
    Enabled=true,AttributeName=ttl
The TTL (time-to-live) attribute automatically cleans up old entries to reduce costs.
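The `ttl` attribute holds a Unix epoch timestamp in seconds; DynamoDB deletes an item once the current time passes that value. A sketch of how such a value might be computed (the helper name is ours, not Metaflow's):

```python
import time

def ttl_in_days(days, now=None):
    """Epoch timestamp `days` from `now`, suitable for a DynamoDB TTL attribute."""
    now = int(time.time()) if now is None else now
    return now + days * 24 * 60 * 60

# An item written now that would expire in 7 days:
item = {"pathspec": "MyFlow/123/start/456", "ttl": ttl_in_days(7)}
```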
4. CloudWatch Log Group (Optional)
For Step Functions execution logging:
aws logs create-log-group \
--log-group-name /aws/vendedlogs/states/metaflow
IAM Configuration
Batch Execution Role
Create an IAM role for AWS Batch containers:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
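If you create roles programmatically, trust policies like the one above can be generated rather than hand-written. A minimal sketch (the helper is hypothetical; the document it emits matches the JSON above):

```python
import json

def trust_policy(service):
    """Trust policy allowing the given AWS service to assume the role."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service},
            "Action": "sts:AssumeRole",
        }]
    }, indent=2)
```

The result can be passed as `--assume-role-policy-document` to `aws iam create-role`.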
Attach policies:
aws iam attach-role-policy \
--role-name MetaflowBatchRole \
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy \
--role-name MetaflowBatchRole \
--policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
Custom policy for DynamoDB (if using foreach):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:GetItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/metaflow-step-functions"
    }
  ]
}
Step Functions Execution Role
Create role for Step Functions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "states.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Attach policies:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "batch:SubmitJob",
        "batch:DescribeJobs",
        "batch:TerminateJob"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:PutTargets",
        "events:PutRule",
        "events:DescribeRule"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/metaflow-step-functions"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogDelivery",
        "logs:GetLogDelivery",
        "logs:UpdateLogDelivery",
        "logs:DeleteLogDelivery",
        "logs:ListLogDeliveries",
        "logs:PutResourcePolicy",
        "logs:DescribeResourcePolicies",
        "logs:DescribeLogGroups"
      ],
      "Resource": "*"
    }
  ]
}
EventBridge Role (for Scheduled Workflows)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "states:StartExecution",
      "Resource": "arn:aws:states:*:*:stateMachine:*"
    }
  ]
}
Configuration Reference
Required Variables
METAFLOW_DATASTORE_SYSROOT_S3 - S3 path for the datastore, e.g. s3://bucket/metaflow
METAFLOW_SFN_IAM_ROLE - Step Functions IAM role ARN, e.g. arn:aws:iam::123456789:role/SFNRole
METAFLOW_ECS_S3_ACCESS_IAM_ROLE - Batch container IAM role ARN, e.g. arn:aws:iam::123456789:role/BatchRole
Optional Variables
S3 Configuration
# Custom S3 endpoint (for S3-compatible storage)
export METAFLOW_S3_ENDPOINT_URL=https://s3.custom-endpoint.com
# Server-side encryption
export METAFLOW_S3_SERVER_SIDE_ENCRYPTION=AES256
# Data tools S3 location
export METAFLOW_DATATOOLS_S3ROOT=s3://bucket/data
# Card artifacts location
export METAFLOW_CARD_S3ROOT=s3://bucket/cards
AWS Batch Configuration
# Default job queue
export METAFLOW_BATCH_JOB_QUEUE=default-queue
# Default container image
export METAFLOW_BATCH_CONTAINER_IMAGE=my-image:latest
# Default container registry
export METAFLOW_BATCH_CONTAINER_REGISTRY=123456789.dkr.ecr.us-east-1.amazonaws.com
# Fargate execution role (for Fargate)
export METAFLOW_ECS_FARGATE_EXECUTION_ROLE=arn:aws:iam::123456789:role/FargateRole
# Enable/disable auto-tagging
export METAFLOW_BATCH_EMIT_TAGS=true
# Default AWS tags
export METAFLOW_BATCH_DEFAULT_TAGS='{"project": "ml", "team": "data"}'
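METAFLOW_BATCH_DEFAULT_TAGS must hold a valid JSON object, and a common failure mode is shell quoting that mangles it. A quick illustrative check (not part of Metaflow itself):

```python
import json

raw = '{"project": "ml", "team": "data"}'  # value of METAFLOW_BATCH_DEFAULT_TAGS
tags = json.loads(raw)  # raises ValueError if the quoting is broken

# AWS tag values are strings; verify before jobs are submitted.
assert isinstance(tags, dict)
assert all(isinstance(v, str) for v in tags.values())
```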
Step Functions Configuration
# EventBridge IAM role (for schedules)
export METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE=arn:aws:iam::123456789:role/EventsRole
# DynamoDB table (for foreach)
export METAFLOW_SFN_DYNAMO_DB_TABLE=metaflow-step-functions
# CloudWatch log group ARN (for execution logs)
export METAFLOW_SFN_EXECUTION_LOG_GROUP_ARN=arn:aws:logs:us-east-1:123456789:log-group:/aws/vendedlogs/states/metaflow
# S3 path for distributed map outputs
export METAFLOW_SFN_S3_DISTRIBUTED_MAP_OUTPUT_PATH=s3://bucket/distributed-map
# State machine name prefix
export METAFLOW_SFN_STATE_MACHINE_PREFIX=prod
# Enable state machine compression
export METAFLOW_SFN_COMPRESS_STATE_MACHINE=true
Metadata Service Configuration
# Metadata service URL
export METAFLOW_SERVICE_URL=https://metadata.example.com
# Internal service URL (for containers)
export METAFLOW_SERVICE_INTERNAL_URL=http://metadata.internal
# Service authentication headers
export METAFLOW_SERVICE_HEADERS='{"Authorization": "Bearer token"}'
# Default metadata backend
export METAFLOW_DEFAULT_METADATA=service
Secrets Management
# Default secrets backend
export METAFLOW_DEFAULT_SECRETS_BACKEND_TYPE=aws-secrets-manager
# Secrets Manager region
export METAFLOW_AWS_SECRETS_MANAGER_DEFAULT_REGION=us-east-1
Region Configuration
Metaflow uses your default AWS region from:
AWS_DEFAULT_REGION environment variable
AWS CLI configuration (~/.aws/config)
EC2 instance metadata (when running on EC2)
Set explicitly:
export AWS_DEFAULT_REGION=us-west-2
VPC Configuration
For private VPC deployments:
Batch Compute Environment
Specify VPC settings when creating compute environment:
{
  "computeResources": {
    "subnets": [
      "subnet-12345",
      "subnet-67890"
    ],
    "securityGroupIds": [
      "sg-12345"
    ]
  }
}
VPC Endpoints
For fully private deployments, create VPC endpoints:
# S3 endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345 \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-12345
# ECR endpoints
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.ecr.dkr
aws ec2 create-vpc-endpoint \
--vpc-id vpc-12345 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.ecr.api
Container Registry Setup
Amazon ECR
Create a repository:
aws ecr create-repository --repository-name metaflow/my-image
Authenticate Docker:
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin \
123456789.dkr.ecr.us-east-1.amazonaws.com
Build and push:
docker build -t metaflow/my-image .
docker tag metaflow/my-image:latest \
123456789.dkr.ecr.us-east-1.amazonaws.com/metaflow/my-image:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/metaflow/my-image:latest
Testing Configuration
Verify S3 Access
from metaflow import FlowSpec, step

class TestS3Flow(FlowSpec):

    @step
    def start(self):
        self.data = "test"
        self.next(self.end)

    @step
    def end(self):
        print(f"Data: {self.data}")

if __name__ == '__main__':
    TestS3Flow()
Run:
python test_flow.py run --datastore=s3
Verify Batch Access
from metaflow import FlowSpec, step, batch

class TestBatchFlow(FlowSpec):

    @batch(cpu=1, memory=2000)
    @step
    def start(self):
        print("Running on AWS Batch")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    TestBatchFlow()
Run:
python test_batch_flow.py run --with batch
Verify Step Functions
python test_flow.py step-functions create --only-json
This validates configuration without deploying.
Configuration Files
Config File Location
Metaflow configuration is stored in:
~/.metaflowconfig/config.json (user config)
Environment variables (take precedence)
Example config:
{
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://my-bucket/metaflow",
  "METAFLOW_BATCH_JOB_QUEUE": "default-queue",
  "METAFLOW_SFN_IAM_ROLE": "arn:aws:iam::123456789:role/SFNRole"
}
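The precedence rule (environment variables over the config file) can be sketched as follows. This mirrors the behavior described above and is illustrative, not Metaflow's actual implementation:

```python
import json
import os

def resolve(key, config_path="~/.metaflowconfig/config.json"):
    """Look up a Metaflow setting: environment first, then the config file."""
    if key in os.environ:
        return os.environ[key]
    path = os.path.expanduser(config_path)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f).get(key)
    return None
```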
Environment-Specific Configs
Manage multiple environments:
# Production
export METAFLOW_PROFILE=production
source ~/.metaflow/production.env
# Staging
export METAFLOW_PROFILE=staging
source ~/.metaflow/staging.env
Troubleshooting
Configuration Issues
Issue: "No IAM role found for AWS Step Functions"
Solution: Set METAFLOW_SFN_IAM_ROLE:
export METAFLOW_SFN_IAM_ROLE=arn:aws:iam::123456789:role/StepFunctionsRole
Issue: "Unable to locate credentials"
Solution: Configure AWS CLI credentials:
aws configure
Issue: "Access Denied" errors
Solution: Verify that your IAM role policies include the necessary permissions for S3, Batch, Step Functions, and DynamoDB.
Network Issues
Issue: Container can't reach the metadata service
Solution: Check VPC endpoints and security groups:
# Verify connectivity from container
ping metadata.service.internal
Issue: ECR pull failures
Solution: Ensure that:
VPC has ECR endpoints
Security groups allow HTTPS
IAM role has ECR permissions
Best Practices
Use separate S3 buckets for different environments:
# Production
s3://prod-metaflow-bucket
# Staging
s3://staging-metaflow-bucket
# Development
s3://dev-metaflow-bucket
Enable bucket versioning to protect against accidental deletions:
aws s3api put-bucket-versioning \
  --bucket my-metaflow-bucket \
  --versioning-configuration Status=Enabled
Use least privilege IAM roles
Grant only necessary permissions to each role. Avoid wildcards in production.
Enable CloudWatch logging
Configure logging for troubleshooting:
export METAFLOW_SFN_EXECUTION_LOG_GROUP_ARN=arn:aws:logs:...
Use consistent tagging for cost allocation:
export METAFLOW_BATCH_DEFAULT_TAGS='{"Environment": "prod", "Team": "ml"}'
For automated setup, use the Outerbounds CloudFormation templates for one-click AWS infrastructure provisioning.
Next Steps
AWS Batch: start using AWS Batch for compute
Step Functions: deploy workflows to Step Functions
Quickstart: run your first Metaflow flow
Tutorial: complete tutorial series