# Step Operators
Step operators allow you to run individual pipeline steps on custom infrastructure. While an orchestrator defines how and where your entire pipeline runs, a step operator defines how and where a specific step runs.

## Overview
Step operators are useful when:

- A specific step needs GPU resources (e.g., model training)
- A step requires more compute resources than others
- You want to run a step on different infrastructure (e.g., serverless)
- A step has special requirements (e.g., specific hardware, libraries)
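In frameworks that support step operators (ZenML's `@step(step_operator=...)` is one example), a step opts in by naming the operator it should run on. Below is a stdlib-only sketch of that tagging pattern; the decorator and the `gpu_trainer` name are illustrative stand-ins, not a real library API:

```python
from functools import wraps

# Illustrative sketch: a decorator tags a step function with the name of
# the step operator that should execute it. A value of None means the
# step runs wherever the orchestrator runs.
def step(step_operator=None):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        wrapper.step_operator = step_operator
        return wrapper
    return decorator

@step()  # ordinary step: executed directly by the orchestrator
def load_data():
    return [1.0, 2.0, 3.0]

@step(step_operator="gpu_trainer")  # offloaded to custom infrastructure
def train_model(data):
    return sum(data) / len(data)
```

The orchestrator can then inspect each step's `step_operator` attribute and decide whether to run it locally or hand it off.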
## How Step Operators Work
When a pipeline runs:

1. The orchestrator executes most steps normally.
2. When it reaches a step that has a step operator assigned:
   - The orchestrator hands off execution to the step operator.
   - The step operator runs the step on its configured infrastructure.
   - Results are returned to the orchestrator.
3. The orchestrator continues with the next step.
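This handoff can be simulated in a few lines of plain Python; the operator registry and the step functions here are illustrative, not a real orchestrator:

```python
# Minimal simulation of the handoff: the orchestrator runs steps in order,
# delegating any step that names a step operator, and collects the result
# back before continuing with the next step.

def run_on_kubernetes(fn, value):
    # Stand-in for packaging the step and running it as a Kubernetes job.
    return fn(value)

STEP_OPERATORS = {"kubernetes": run_on_kubernetes}

def orchestrate(steps, value=None):
    log = []
    for name, fn, operator in steps:
        if operator is None:
            value = fn(value)          # orchestrator executes the step itself
            log.append((name, "orchestrator"))
        else:
            value = STEP_OPERATORS[operator](fn, value)  # hand off execution
            log.append((name, operator))
    return value, log

steps = [
    ("load", lambda _: [1, 2, 3], None),
    ("train", lambda d: sum(d), "kubernetes"),  # resource-intensive step
    ("report", lambda s: f"total={s}", None),
]
result, log = orchestrate(steps)
# result == "total=6"; only "train" ran via the kubernetes operator
```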
## Available Step Operators
### Kubernetes Step Operator
Runs individual steps as Kubernetes jobs.

**Requirements:**

- Kubernetes cluster access
- Container registry in your stack
- Configured kubectl context

**Best for:**

- Running GPU-intensive training steps
- Steps requiring specific node pools
- Isolated execution environments
- Resource-intensive data processing
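Concretely, a Kubernetes step operator typically wraps the step's container in a Job. Here is a hedged sketch of such a manifest as a Python dict, using real Kubernetes Job API field names but illustrative values (the helper itself is not part of any real step operator):

```python
def step_job_manifest(step_name, image, gpu_count=1, node_selector=None):
    """Sketch of the Kubernetes Job a step operator might submit for one
    step. Field names follow the batch/v1 Job API; values are illustrative."""
    container = {
        "name": step_name,
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
    }
    pod_spec = {"containers": [container], "restartPolicy": "Never"}
    if node_selector:
        pod_spec["nodeSelector"] = node_selector  # pin to a specific node pool
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"step-{step_name}"},
        "spec": {"template": {"spec": pod_spec}, "backoffLimit": 0},
    }

# Example: a training step pinned to a GPU node pool.
manifest = step_job_manifest(
    "train",
    "registry.example.com/pipeline:latest",  # image name is illustrative
    gpu_count=2,
    node_selector={"pool": "gpu-pool"},
)
```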
### SageMaker Step Operator
Runs steps on AWS SageMaker Training Jobs.

**Requirements:**

- AWS account with SageMaker access
- IAM role with SageMaker permissions
- Container registry (ECR)
- S3 artifact store

**Features:**

- Managed infrastructure
- Wide range of instance types
- Spot instance support
- Built-in monitoring
- Auto-scaling capabilities

**Best for:**

- AWS-based ML infrastructure
- GPU training jobs
- Large-scale model training
- Cost optimization with spot instances
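Under the hood, a SageMaker step operator translates a step into a training-job request. The sketch below uses real boto3 `create_training_job` field names, but the helper function and all values are illustrative assumptions, not ZenML internals:

```python
def training_job_request(step_name, image_uri, role_arn, instance_type,
                         use_spot=False, max_runtime_s=3600):
    """Sketch of the request a SageMaker step operator might build for one
    step, using boto3 `create_training_job` field names."""
    request = {
        "TrainingJobName": f"step-{step_name}",
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # pipeline container, e.g. from ECR
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,              # IAM role with SageMaker permissions
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
        "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/artifacts"},
    }
    if use_spot:
        request["EnableManagedSpotTraining"] = True
        # Spot jobs must also bound the total wait (queue + run) time:
        request["StoppingCondition"]["MaxWaitTimeInSeconds"] = max_runtime_s * 2
    return request
```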
### Vertex AI Step Operator
Runs steps on Google Cloud Vertex AI Custom Jobs.

**Requirements:**

- GCP project with Vertex AI enabled
- Service account with Vertex AI permissions
- Container registry (GCR/Artifact Registry)
- GCS artifact store

**Features:**

- Managed ML infrastructure
- GPU and TPU support
- Custom machine types
- Pre-configured ML containers
- Integration with Vertex AI ecosystem

**Best for:**

- GCP-based ML workflows
- GPU/TPU training
- Large-scale distributed training
- Vertex AI platform integration
### Azure ML Step Operator
Runs steps on Azure Machine Learning Compute.

**Requirements:**

- Azure subscription
- Azure ML workspace
- Compute cluster or compute instance
- Azure Container Registry
- Azure Blob Storage artifact store

**Features:**

- Managed compute resources
- Auto-scaling clusters
- GPU and CPU options
- Cost management
- Integration with Azure ML

**Best for:**

- Azure-based infrastructure
- Enterprise Azure deployments
- GPU training on Azure
- Azure ML ecosystem integration
### Modal Step Operator
Runs steps on Modal’s serverless infrastructure.

**Features:**

- Serverless execution
- Pay-per-use pricing
- Fast cold starts
- GPU support
- Automatic scaling

**Best for:**

- Serverless ML workflows
- Sporadic GPU needs
- Cost optimization
- Quick experimentation
## Choosing a Step Operator
| Step Operator | Best For | Pricing Model | GPU Support |
|---|---|---|---|
| Kubernetes | Self-hosted, flexibility | Infrastructure cost | Yes (if cluster has GPUs) |
| SageMaker | AWS infrastructure | Per-second billing | Yes (wide selection) |
| Vertex AI | GCP infrastructure | Per-second billing | Yes (GPUs and TPUs) |
| Azure ML | Azure infrastructure | Per-minute billing | Yes (various SKUs) |
| Modal | Serverless, experimentation | Pay-per-use | Yes (on-demand) |
## Resource Configuration

### Specifying Resources
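Frameworks usually model per-step resources as a settings object attached to the step (ZenML, for example, ships a `ResourceSettings` class with similar fields). The stdlib sketch below mirrors that idea; the class, field names, and defaults are illustrative, not the real API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceSettings:
    """Per-step resource request, in the spirit of ZenML's ResourceSettings.
    Names and defaults are illustrative; check your framework's actual API."""
    cpu_count: float = 1.0
    gpu_count: int = 0
    memory: str = "4GB"

    def memory_gb(self) -> float:
        if not self.memory.endswith("GB"):
            raise ValueError("this sketch only understands GB units")
        return float(self.memory[:-2])

# A GPU-hungry training step asks for more than the pipeline default:
training_resources = ResourceSettings(cpu_count=8, gpu_count=1, memory="32GB")
```

The step operator is responsible for mapping such a request onto concrete infrastructure (a node selector, an instance type, a machine spec).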
### Cloud-Specific Resources

**SageMaker:** in addition to generic resource requests, provider-specific settings (such as an instance type) can be passed through to the step operator.

## Mixed Infrastructure Pipelines

Combine different execution environments in a single pipeline: run lightweight steps on the orchestrator and hand off only the resource-intensive ones to a step operator.

## Best Practices
### Use Step Operators for Resource-Intensive Steps

### Minimize Data Transfer

### Configure Timeouts
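Real step operators enforce timeouts on the remote job itself (for example, a maximum runtime on the training job). This local, stdlib-only sketch just shows the idea of a wall-clock ceiling on a step:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as StepTimeout

def run_with_deadline(step_fn, timeout_s):
    """Run a step and give up once the deadline passes.

    Note: the worker thread keeps running to completion here; a real
    step operator would instead cancel the remote job.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(step_fn)
        try:
            return future.result(timeout=timeout_s)
        except StepTimeout:
            return "timed_out"
    finally:
        pool.shutdown(wait=False)  # do not block on the runaway step

# A fast step finishes; a runaway step is reported as timed out.
fast = run_with_deadline(lambda: "done", timeout_s=2.0)
slow = run_with_deadline(lambda: time.sleep(1.5) or "done", timeout_s=0.2)
```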
### Use Spot/Preemptible Instances
## Monitoring Step Execution

### Check Step Status

### Cloud Console Monitoring
**SageMaker:**

- AWS Console → SageMaker → Training jobs
- View logs in CloudWatch

**Vertex AI:**

- GCP Console → Vertex AI → Custom Jobs
- View logs in Cloud Logging

**Azure ML:**

- Azure Portal → Machine Learning → Experiments
- View logs in the workspace
## Troubleshooting

### Step Operator Not Found

### Permission Errors

### Resource Limits

### Container Build Failures
## Cost Optimization

### Use Spot/Preemptible Instances
Save up to 90% on compute costs.

### Right-Size Resources
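Right-sizing can be automated by picking the cheapest instance type that still satisfies a step's resource request. A toy sketch with an invented catalog (none of these names or prices are real):

```python
# Illustrative instance catalog: (name, vCPUs, memory GB, hourly USD).
CATALOG = [
    ("small",   2,  8, 0.10),
    ("medium",  4, 16, 0.20),
    ("large",   8, 32, 0.40),
    ("xlarge", 16, 64, 0.80),
]

def right_size(cpu_needed, mem_needed_gb):
    """Cheapest catalog entry that satisfies the step's requirements."""
    candidates = [(price, name) for name, cpu, mem, price in CATALOG
                  if cpu >= cpu_needed and mem >= mem_needed_gb]
    if not candidates:
        raise ValueError("no instance type is large enough")
    return min(candidates)[1]

# A step that needs 4 vCPUs and 12 GB should not pay for "xlarge":
# right_size(4, 12) -> "medium"
```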
### Set Timeouts
## Next Steps

- **Stack Components Overview**: learn about other stack components
- **Advanced Pipelines**: build sophisticated ML pipelines
