Overview
The @batch decorator executes a step on AWS Batch.
Setup
Set up AWS Batch infrastructure
You need:
- An AWS Batch Job Queue
- A Compute Environment
- An IAM role for job execution
- An S3 bucket for Metaflow data
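Once those pieces exist, Metaflow is typically pointed at them through configuration variables (or via metaflow configure aws). A sketch with placeholder resource names, to be replaced with your own queue, role, and bucket:

```shell
# Hypothetical resource names -- substitute your own values.
export METAFLOW_DEFAULT_DATASTORE=s3
export METAFLOW_DATASTORE_SYSROOT_S3=s3://my-metaflow-bucket/metaflow
export METAFLOW_BATCH_JOB_QUEUE=my-job-queue
export METAFLOW_ECS_S3_ACCESS_IAM_ROLE=arn:aws:iam::123456789012:role/my-batch-role
```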
Basic Usage
Simple Batch Step
Specify Resources
Use Custom Docker Image
Decorator Parameters
The @batch decorator accepts many parameters for fine-grained control:
Resource Allocation
Container Configuration
Advanced Options
Full Reference
See the source code for all parameters.
Resource Management
Combining with @resources
Use @resources for portability:
GPU Support
Ensure your job queue is connected to a compute environment with GPU instances (p3, g4, etc.)
AWS Inferentia/Trainium
Multi-Node Execution
AWS Batch supports multi-node parallel jobs.
Environment Configuration
Environment Variables
Set environment variables for AWS Batch steps as needed. Inside each job, AWS Batch itself also exposes:
- AWS_BATCH_JOB_ID: Job ID
- AWS_BATCH_JOB_ATTEMPT: Attempt number
- AWS_BATCH_CE_NAME: Compute environment name
- AWS_BATCH_JQ_NAME: Job queue name
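These variables can be read inside a step for logging or debugging. A small helper (the function name is illustrative; outside Batch the values are simply None):

```python
import os

def batch_context():
    """Collect the AWS Batch-provided environment variables, if present."""
    keys = [
        "AWS_BATCH_JOB_ID",
        "AWS_BATCH_JOB_ATTEMPT",
        "AWS_BATCH_CE_NAME",
        "AWS_BATCH_JQ_NAME",
    ]
    return {k: os.environ.get(k) for k in keys}
```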
Custom Docker Images
Build and use custom images.
Conda Environments
Metaflow can build Conda environments automatically.
Monitoring and Debugging
View Logs
Logs stream automatically.
Check Job Status
AWS Console
Monitor jobs in the AWS Batch Console:
- View job status and logs
- Check resource utilization
- Debug failed jobs
Error Handling
Automatic Retries
Timeout Protection
Spot Instance Handling
AWS Batch can use spot instances for cost savings, and Metaflow automatically handles spot terminations.
Cost Optimization
Use spot instances
Configure your compute environment to use spot instances for up to 90% cost savings. Metaflow handles interruptions gracefully.
Right-size resources
Monitor actual usage and adjust CPU/memory allocations. Over-provisioning wastes money.
Use efficient instance types
Choose instance families based on workload:
- c5: Compute-optimized (CPU-heavy)
- r5: Memory-optimized (large datasets)
- g4: GPU inference
- p3/p4: GPU training
Enable auto-scaling
Configure your compute environment to scale to zero when idle. AWS Batch manages this automatically.
Best Practices
Use @resources for portability
Specify requirements with @resources rather than @batch parameters to easily switch platforms.
Keep Docker images lean
Smaller images start faster and cost less to store. Only include necessary dependencies.
Handle failures gracefully
Use @retry and @catch decorators for robust production workflows.
Monitor costs
Use AWS Cost Explorer to track Batch spending and optimize resource allocation
Troubleshooting
Common Issues
Job stuck in RUNNABLE state
Cause: Compute environment can’t provision resources
Solutions:
- Check compute environment status in AWS console
- Verify IAM roles have correct permissions
- Ensure requested instance types are available in your region
- Check service quotas (vCPU limits)
Job fails immediately
Cause: Container startup failure
Solutions:
- Verify Docker image exists and is accessible
- Check IAM role has ECR pull permissions
- Review container logs in CloudWatch
- Test image locally: docker run your-image
Out of memory errors
Cause: Insufficient memory allocation
Solutions:
- Increase the memory parameter
- Process data in smaller chunks
- Use memory-efficient algorithms
- Consider using memory-optimized instances (r5)
Cannot access S3 data
Cause: Missing IAM permissions
Solutions:
- Verify IAM role has S3 read/write permissions
- Check bucket policy allows access
- Ensure METAFLOW_DATASTORE_SYSROOT_S3 is correct
- Test with: aws s3 ls s3://your-bucket/
Next Steps
Distributed Computing
Scale to multi-node distributed workloads
Resource Management
Master the @resources decorator
Kubernetes
Compare with Kubernetes execution
Remote Execution
Learn more about remote execution concepts
