Overview
AWS Step Functions is a serverless orchestrator that allows you to deploy Metaflow workflows to production. Once deployed, your workflows run automatically on a schedule or can be triggered on demand, with each step executing on AWS Batch compute resources.
Step Functions deployments require AWS Batch to execute tasks. Make sure you have AWS Batch configured before deploying workflows.
Key Features
Serverless Orchestration
No infrastructure to manage - AWS Step Functions handles workflow coordination automatically
Visual Monitoring
Track workflow execution through the AWS Console with detailed state machine visualizations
Production Tokens
Secure deployment model using production tokens for authorization and namespace isolation
Event-Driven
Schedule workflows with cron expressions or trigger them programmatically
Deploying to Step Functions
Basic Deployment
Deploy your flow to AWS Step Functions using the step-functions create command. This command:
- Compiles your flow into an AWS Step Functions state machine
- Uploads your code package to S3
- Creates AWS Batch job definitions for each step
- Deploys the state machine to your AWS account
- Generates a production token for authorization
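For scripted deployments (for example from a CI job), the create command can be assembled as an argument list. The sketch below is only one way to do that: sfn_argv is a hypothetical helper, myflow.py is a placeholder file name, and the timeout value is an arbitrary example (--workflow-timeout is assumed to take seconds).

```python
import shlex
import sys


def sfn_argv(flow_file, subcommand, *options, python=sys.executable):
    """Build the argument list for a Metaflow step-functions subcommand.

    Hypothetical helper for illustration; it only builds the list and
    does not run anything.
    """
    return [python, flow_file, "step-functions", subcommand, *options]


# Placeholder flow file and an example timeout (assumed to be in seconds).
create_cmd = sfn_argv(
    "myflow.py", "create", "--workflow-timeout", "3600", python="python"
)

# Show the command as it would be typed in a shell.
print(shlex.join(create_cmd))
```

To actually deploy, the list could be passed to subprocess.run(create_cmd, check=True) from a directory that contains the flow file.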
Production Tokens
The first time you deploy a flow, Metaflow generates a production token. This token:
- Creates a unique namespace for your production flow
- Authorizes future deployments and modifications
- Allows team members to collaborate on the same deployment
Deployment Options
Scheduling Workflows
Schedule your workflow using the @schedule decorator.
Workflow Timeout
Set a maximum execution time for your workflow with the --workflow-timeout option.
Maximum Concurrency
Limit parallel execution for foreach steps.
Execution History Logging
Enable CloudWatch logging for detailed execution history. This requires the METAFLOW_SFN_EXECUTION_LOG_GROUP_ARN environment variable to be set.
Distributed Map
For large-scale foreach operations, use distributed map.
Triggering Executions
Manual Trigger
Trigger a deployed workflow manually.
Trigger with Parameters
Pass parameters to your flow execution.
Managing Deployments
List Executions
View all executions of your deployed workflow.
Terminate Execution
Stop a running execution.
Delete Deployment
Remove a workflow deployment from Step Functions.
Advanced Features
State Machine Compression
For flows with long command strings, compress the state machine definition with the --compress-state-machine option.
Custom State Machine Name
Use a custom name for your state machine.
Projects and Branches
For projects, use branches instead of custom names.
Viewing State Machine JSON
Inspect the generated state machine definition.
Configuration
Step Functions deployments use these environment variables:
| Variable | Description | Required |
|---|---|---|
| METAFLOW_SFN_IAM_ROLE | IAM role ARN for Step Functions | Yes |
| METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE | IAM role ARN for EventBridge | For schedules |
| METAFLOW_SFN_DYNAMO_DB_TABLE | DynamoDB table for foreach coordination | For foreach |
| METAFLOW_SFN_EXECUTION_LOG_GROUP_ARN | CloudWatch log group ARN | For logging |
| METAFLOW_SFN_S3_DISTRIBUTED_MAP_OUTPUT_PATH | S3 path for distributed map outputs | For distributed map |
| METAFLOW_SFN_STATE_MACHINE_PREFIX | Prefix for state machine names | Optional |
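The table above can be turned into a simple pre-deployment check. The following is a minimal sketch, not part of Metaflow: missing_variables and the feature labels ("core", "schedule", and so on) are names invented here, and the check only inspects the process environment, so settings that Metaflow picks up from other configuration sources would be reported as missing.

```python
import os

# Variable names are taken from the table above; the feature labels are
# hypothetical keys used only by this sketch.
REQUIRED_BY_FEATURE = {
    "core": ["METAFLOW_SFN_IAM_ROLE"],
    "schedule": ["METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE"],
    "foreach": ["METAFLOW_SFN_DYNAMO_DB_TABLE"],
    "logging": ["METAFLOW_SFN_EXECUTION_LOG_GROUP_ARN"],
    "distributed_map": ["METAFLOW_SFN_S3_DISTRIBUTED_MAP_OUTPUT_PATH"],
}


def missing_variables(features, env=None):
    """Return variables needed by the given features that are unset or empty."""
    env = os.environ if env is None else env
    # "core" is always checked; dict.fromkeys removes duplicates, keeps order.
    wanted = list(dict.fromkeys(["core", *features]))
    missing = []
    for feature in wanted:
        for name in REQUIRED_BY_FEATURE[feature]:
            if not env.get(name):
                missing.append(name)
    return missing


# Example environment with a placeholder role ARN (not a real resource).
example_env = {"METAFLOW_SFN_IAM_ROLE": "arn:aws:iam::123456789012:role/example"}
print(missing_variables(["foreach", "schedule"], env=example_env))
```

Calling missing_variables with no env argument checks the current process environment instead of the example dictionary.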
Limitations
Monitoring and Debugging
AWS Console
Monitor your workflows in the AWS Step Functions console:
- Navigate to AWS Step Functions in your AWS Console
- Find your state machine (named after your flow)
- View execution history and state transitions
- Inspect input/output for each step
Metaflow Client API
Access execution data programmatically through the Metaflow Client API.
CloudWatch Logs
If execution history logging is enabled, view detailed logs in CloudWatch:
- Navigate to CloudWatch Logs in the AWS Console
- Find your log group (specified in METAFLOW_SFN_EXECUTION_LOG_GROUP_ARN)
- Filter by execution ARN or state machine name
Best Practices
Use S3 datastore
Always deploy with --datastore=s3 (default). Step Functions requires S3 for data persistence.
Set appropriate timeouts
Configure timeouts at both the workflow level (--workflow-timeout) and step level (@timeout decorator) to prevent runaway executions.
Monitor execution costs
AWS Step Functions charges per state transition. Optimize your flow structure to minimize unnecessary states.
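Because the charge scales with the number of state transitions, a rough estimate is transitions per run times number of runs times the per-transition price. The sketch below assumes you supply all three figures yourself; estimate_transition_cost is a hypothetical helper, and the numbers in the example are placeholders, not measured values or actual AWS prices.

```python
def estimate_transition_cost(transitions_per_run, runs, price_per_transition):
    """Estimate the state-transition charge for a number of runs.

    All inputs are supplied by the caller; look up the current price for
    your region and workflow type rather than relying on the example value.
    """
    if min(transitions_per_run, runs) < 0 or price_per_transition < 0:
        raise ValueError("inputs must be non-negative")
    return transitions_per_run * runs * price_per_transition


# Placeholder inputs: 40 transitions per run, 30 runs, arbitrary unit price.
monthly = estimate_transition_cost(
    transitions_per_run=40, runs=30, price_per_transition=0.00001
)
print(f"{monthly:.4f}")
```

Comparing the estimate before and after restructuring a flow gives a quick sense of whether removing states is worth the effort.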
Test locally first
Always test your flow locally with python myflow.py run before deploying to Step Functions.
Use tags for organization
Troubleshooting
State Machine Creation Fails
Error: “No IAM role found for AWS Step Functions”
Solution: Set the METAFLOW_SFN_IAM_ROLE environment variable. See the configuration docs.
Foreach Steps Fail
Error: “An AWS DynamoDB table is needed to support foreach”
Solution: Create a DynamoDB table and set METAFLOW_SFN_DYNAMO_DB_TABLE.
State Machine Too Large
Error: State machine definition exceeds the size limit
Solution: Use --compress-state-machine to offload commands to S3.
Parameter Size Limit
Error: “Length of parameter names and values shouldn’t exceed 20480”
Solution: Pass large data through the datastore instead of parameters.
Next Steps
AWS Batch
Configure AWS Batch for task execution
AWS Configuration
Complete AWS setup and IAM configuration
Scheduling
Learn more about scheduling workflows
Monitoring
Monitor production workflows
