This page covers SageMaker-specific details. For general AWS setup, see the AWS Integration page.
Installation
- sagemaker>=2.237.3,<3.0.0 - SageMaker SDK
- kubernetes - Kubernetes Python client
- aws-profile-manager - AWS profile management
Components
- SageMaker Orchestrator: Execute complete pipelines as SageMaker Pipelines
- SageMaker Step Operator: Run individual steps as SageMaker Training/Processing jobs
SageMaker Orchestrator
Runs your complete pipeline as a SageMaker Pipeline with Processing or Training steps.
Configuration
- execution_role - IAM role ARN with SageMaker permissions
- region - AWS region (default: from AWS config)
- bucket - S3 bucket for artifacts (default: sagemaker-{region}-{account-id})
- scheduler_role - IAM role ARN for scheduled pipelines
- aws_access_key_id - AWS access key
- aws_secret_access_key - AWS secret key
- aws_profile - AWS profile name
- aws_auth_role_arn - Intermediate role to assume
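The default bucket name follows the pattern sagemaker-{region}-{account-id}. As a minimal sketch of how that name is assembled (the function name `default_artifact_bucket` is hypothetical and not part of any SDK; the 12-digit check reflects the general AWS account ID format):

```python
def default_artifact_bucket(region: str, account_id: str) -> str:
    """Build the default bucket name: sagemaker-{region}-{account-id}."""
    # AWS account IDs are 12-digit strings; keep them as text so leading zeros survive.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"expected a 12-digit AWS account ID, got {account_id!r}")
    return f"sagemaker-{region}-{account_id}"


# Example values only; substitute your own region and account ID.
print(default_artifact_bucket("us-east-1", "012345678901"))
```

Checking the computed name against the S3 console is a quick way to confirm which bucket artifacts will land in when bucket is left unset.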
Step Settings
Customize individual steps with SagemakerOrchestratorSettings:
| Setting | Type | Default | Description |
|---|---|---|---|
| instance_type | str | ml.m5.xlarge / ml.t3.medium | EC2 instance type |
| volume_size_in_gb | int | 30 | EBS volume size (GB) |
| max_runtime_in_seconds | int | 86400 | Max execution time (seconds) |
| execution_role | str | - | Override orchestrator role |
| environment | dict | | Environment variables |
| tags | dict | | AWS tags for the job |
| synchronous | bool | True | Wait for completion |
| use_training_step | bool | True | Use TrainingStep vs ProcessingStep |
| keep_alive_period_in_seconds | int | 300 | Keep instance warm (seconds) |
| input_data_s3_mode | str | "File" | Input data mode |
| input_data_s3_uri | str/dict | None | S3 input data location |
| output_data_s3_mode | str | "EndOfJob" | Output data mode |
| output_data_s3_uri | str/dict | None | S3 output location |
| processor_args | dict | | SageMaker Processor arguments |
| estimator_args | dict | | SageMaker Estimator arguments |
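To make the table concrete, here is a plain-Python sketch that merges per-step overrides with the defaults listed above and rejects misspelled setting names. It deliberately avoids importing the framework: `DEFAULT_STEP_SETTINGS`, `OTHER_STEP_SETTINGS`, and `resolve_step_settings` are hypothetical names, and how the resulting dict is passed on (for example as keyword arguments to SagemakerOrchestratorSettings) is an assumption.

```python
# Defaults copied from the settings table above.
DEFAULT_STEP_SETTINGS = {
    "volume_size_in_gb": 30,
    "max_runtime_in_seconds": 86400,
    "synchronous": True,
    "use_training_step": True,
    "keep_alive_period_in_seconds": 300,
    "input_data_s3_mode": "File",
    "input_data_s3_uri": None,
    "output_data_s3_mode": "EndOfJob",
    "output_data_s3_uri": None,
}

# Settings from the table that have no single fixed default.
OTHER_STEP_SETTINGS = {
    "instance_type", "execution_role", "environment",
    "tags", "processor_args", "estimator_args",
}


def resolve_step_settings(**overrides):
    """Return the table defaults updated with overrides; fail on unknown names."""
    unknown = set(overrides) - set(DEFAULT_STEP_SETTINGS) - OTHER_STEP_SETTINGS
    if unknown:
        raise KeyError(f"unknown step settings: {sorted(unknown)}")
    return {**DEFAULT_STEP_SETTINGS, **overrides}


settings = resolve_step_settings(instance_type="ml.m5.2xlarge", volume_size_in_gb=100)
print(settings["instance_type"], settings["volume_size_in_gb"], settings["synchronous"])
```

Failing fast on unknown keys catches typos such as volume_size_gb before a job is submitted.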
ProcessingStep vs TrainingStep
ProcessingStep (default for processing):
- For data transformation and preprocessing
- No distributed training support
- Lower cost for non-ML workloads
- Default instance: ml.t3.medium
TrainingStep:
- Optimized for ML training
- Supports distributed training
- Managed spot training support
- Keep-alive for faster retries
- Default instance: ml.m5.xlarge
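The two default instance types can be summarized in a small helper. This is a sketch only: `effective_instance_type` is a hypothetical name, and the rule that an explicit instance_type takes precedence over the step-type default is an assumption based on the settings table.

```python
from typing import Optional


def effective_instance_type(
    instance_type: Optional[str] = None,
    use_training_step: bool = True,
) -> str:
    """Pick the instance type a step would run on under the documented defaults."""
    if instance_type is not None:
        # Assumption: an explicit setting wins over the step-type default.
        return instance_type
    # Documented defaults: ml.m5.xlarge for TrainingStep, ml.t3.medium for ProcessingStep.
    return "ml.m5.xlarge" if use_training_step else "ml.t3.medium"


print(effective_instance_type())                         # training step default
print(effective_instance_type(use_training_step=False))  # processing step default
```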
Instance Types
Compute-Optimized:
- ml.c5.xlarge - 4 vCPU, 8 GB RAM
- ml.c5.2xlarge - 8 vCPU, 16 GB RAM
- ml.c5.4xlarge - 16 vCPU, 32 GB RAM
Memory-Optimized:
- ml.r5.xlarge - 4 vCPU, 32 GB RAM
- ml.r5.2xlarge - 8 vCPU, 64 GB RAM
- ml.r5.4xlarge - 16 vCPU, 128 GB RAM
GPU:
- ml.p3.2xlarge - 8 vCPU, 61 GB RAM, 1x V100 (16GB)
- ml.p3.8xlarge - 32 vCPU, 244 GB RAM, 4x V100 (64GB)
- ml.p3.16xlarge - 64 vCPU, 488 GB RAM, 8x V100 (128GB)
- ml.g4dn.xlarge - 4 vCPU, 16 GB RAM, 1x T4 (16GB)
- ml.g5.xlarge - 4 vCPU, 16 GB RAM, 1x A10G (24GB)
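The instance list can be turned into a lookup for choosing a size programmatically. The sketch below encodes only the figures listed here; `InstanceSpec` and `smallest_fit` are hypothetical helpers, and "smallest" means fewest vCPUs and then least RAM, not lowest price (pricing is not covered on this page).

```python
from typing import NamedTuple, Optional


class InstanceSpec(NamedTuple):
    vcpu: int
    ram_gb: int
    gpus: int


# Figures copied from the instance list on this page.
INSTANCES = {
    "ml.c5.xlarge": InstanceSpec(4, 8, 0),
    "ml.c5.2xlarge": InstanceSpec(8, 16, 0),
    "ml.c5.4xlarge": InstanceSpec(16, 32, 0),
    "ml.r5.xlarge": InstanceSpec(4, 32, 0),
    "ml.r5.2xlarge": InstanceSpec(8, 64, 0),
    "ml.r5.4xlarge": InstanceSpec(16, 128, 0),
    "ml.p3.2xlarge": InstanceSpec(8, 61, 1),
    "ml.p3.8xlarge": InstanceSpec(32, 244, 4),
    "ml.p3.16xlarge": InstanceSpec(64, 488, 8),
    "ml.g4dn.xlarge": InstanceSpec(4, 16, 1),
    "ml.g5.xlarge": InstanceSpec(4, 16, 1),
}


def smallest_fit(min_vcpu: int, min_ram_gb: int, gpu: bool = False) -> Optional[str]:
    """Return the listed instance with the fewest vCPUs (then least RAM) that fits."""
    candidates = [
        name for name, spec in INSTANCES.items()
        if spec.vcpu >= min_vcpu and spec.ram_gb >= min_ram_gb and (spec.gpus > 0) == gpu
    ]
    if not candidates:
        return None
    # Sort key ignores cost; ties keep the order of the table above.
    return min(candidates, key=lambda n: (INSTANCES[n].vcpu, INSTANCES[n].ram_gb))


print(smallest_fit(8, 16))        # CPU workload
print(smallest_fit(8, 32, True))  # GPU workload
```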
Distributed Training
Multi-Instance Training:
Data Modes
File Mode (default):
- Downloads data before training
- Full dataset available locally
- Good for small to medium datasets
Pipe Mode (streaming):
- Streams data during training
- Lower latency to start
- Good for large datasets
- Requires data pipeline support in code
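The choice between downloading and streaming maps onto the input_data_s3_mode and input_data_s3_uri settings. A minimal sketch, assuming the streaming mode is selected with the value "Pipe" (SageMaker's name for its streaming input mode); `input_data_settings` and the bucket path are hypothetical:

```python
def input_data_settings(s3_uri: str, stream: bool = False) -> dict:
    """Build the input-data portion of the step settings."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {s3_uri!r}")
    return {
        # "File" downloads the dataset before the job starts (the default);
        # "Pipe" streams it while the job runs (assumed value, see lead-in).
        "input_data_s3_mode": "Pipe" if stream else "File",
        "input_data_s3_uri": s3_uri,
    }


# Hypothetical bucket and prefix, for illustration only.
print(input_data_settings("s3://my-bucket/datasets/train/", stream=True))
```

Streaming only pays off if the training code reads its input incrementally; code that expects the full dataset on local disk should stay on File mode.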
SageMaker Step Operator
Runs individual steps as SageMaker jobs while orchestrating elsewhere.
Configuration
Usage
IAM Permissions
Minimal IAM policy for the execution role:
Complete Example
Best Practices
Use Appropriate Instance Types
Match instance types to workload:
- Preprocessing: ml.t3.medium or ml.m5.xlarge
- Training (CPU): ml.m5.2xlarge or ml.c5.4xlarge
- Training (GPU): ml.p3.2xlarge or ml.g4dn.xlarge
- Large datasets: ml.r5.* (memory-optimized)
Enable Keep-Alive for Rapid Iteration
During development, keep instances warm:
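A keep-alive configuration might look like the sketch below. It uses the keep_alive_period_in_seconds and use_training_step settings from the table; `keep_alive_settings` is a hypothetical helper, and the 3600-second ceiling is SageMaker's limit for warm pools on training jobs rather than something defined by this page.

```python
MAX_KEEP_ALIVE_SECONDS = 3600  # SageMaker warm-pool upper bound (60 minutes)


def keep_alive_settings(minutes: int) -> dict:
    """Build step settings that keep the training instance warm between runs."""
    seconds = minutes * 60
    if not 0 <= seconds <= MAX_KEEP_ALIVE_SECONDS:
        raise ValueError(f"keep-alive must be between 0 and 60 minutes, got {minutes}")
    return {
        # Keep-alive is listed as a TrainingStep feature, so keep that step type on.
        "use_training_step": True,
        "keep_alive_period_in_seconds": seconds,
    }


print(keep_alive_settings(10))
```

Warm instances are billed while they wait, so a shorter period (or 0) is usually the better choice outside active development.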
Use Spot Instances for Training
Managed spot training can cut training costs by up to 90% compared with on-demand instances:
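Spot capacity is requested through the SageMaker Estimator, so one plausible route is the estimator_args setting. The sketch below builds that dict with the Estimator parameters use_spot_instances, max_run, and max_wait; `spot_estimator_args` is a hypothetical helper, and how these values interact with max_runtime_in_seconds is not specified here.

```python
def spot_estimator_args(max_run_seconds: int, max_wait_seconds: int) -> dict:
    """Build estimator_args that request managed spot training."""
    # SageMaker requires the total wait (spot waiting time plus run time)
    # to be at least as long as the allowed run time.
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,    # training time limit
        "max_wait": max_wait_seconds,  # limit including time spent waiting for capacity
    }


# Assumed wiring: pass the dict as the estimator_args step setting.
step_settings = {
    "use_training_step": True,
    "estimator_args": spot_estimator_args(3600, 7200),
}
print(step_settings)
```

Spot jobs can be interrupted, so they suit training code that checkpoints and can resume.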
Tag Resources for Cost Tracking
Use tags to track costs by project/team:
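Tags go in the tags setting as a plain dict. A small sketch, with hypothetical tag keys (project, team) and a hypothetical helper `cost_tags`; the length limits are the general AWS limits for tag keys and values:

```python
MAX_TAG_KEY_LEN = 128    # general AWS limit for tag keys
MAX_TAG_VALUE_LEN = 256  # general AWS limit for tag values


def cost_tags(project: str, team: str) -> dict:
    """Build a tags setting for cost allocation by project and team."""
    tags = {"project": project, "team": team}
    for key, value in tags.items():
        if len(key) > MAX_TAG_KEY_LEN or len(value) > MAX_TAG_VALUE_LEN:
            raise ValueError(f"tag {key!r} exceeds AWS length limits")
    return {"tags": tags}


# Example values only.
print(cost_tags("churn-model", "ml-platform"))
```

For the tags to show up in cost reports, they also need to be activated as cost allocation tags in the AWS Billing console.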
Monitoring and Debugging
View Logs:
- Go to the SageMaker console
- Navigate to Pipelines > Pipeline executions
- Click an execution to see its DAG and logs
- View the CloudWatch logs for each step
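When searching CloudWatch directly, it helps to know which log group a step writes to. SageMaker sends training job logs to /aws/sagemaker/TrainingJobs and processing job logs to /aws/sagemaker/ProcessingJobs; the helper below (`log_group_for_step`, a hypothetical name) simply maps the use_training_step setting to the matching group.

```python
LOG_GROUPS = {
    True: "/aws/sagemaker/TrainingJobs",     # steps run as training jobs
    False: "/aws/sagemaker/ProcessingJobs",  # steps run as processing jobs
}


def log_group_for_step(use_training_step: bool = True) -> str:
    """Return the CloudWatch log group for a step, based on its step type."""
    return LOG_GROUPS[bool(use_training_step)]


print(log_group_for_step())
print(log_group_for_step(use_training_step=False))
```

Within each group, log stream names start with the job name, so filtering by that prefix narrows the view to a single step run.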
Next Steps
- AWS Integration: General AWS integration guide
- Orchestrators: Learn about orchestration concepts
- Remote Execution: Production deployment patterns
- SageMaker Docs: Official SageMaker documentation
