Viewing Production Runs
List Runs via CLI
Each orchestrator provides commands to list runs. For AWS Step Functions, for example: python flow.py step-functions list-runs
Access Runs Programmatically
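A minimal sketch of listing recent runs with Metaflow's Client API. It assumes a configured Metaflow deployment; the flow name "MyFlow" is a placeholder.

```python
# Sketch: listing recent runs through Metaflow's Client API.
# Assumes a configured Metaflow deployment; "MyFlow" is a placeholder flow name.

def list_recent_runs(flow_name, limit=10):
    """Yield (run_id, successful, finished_at) for the most recent runs."""
    from metaflow import Flow, namespace  # lazy import; requires metaflow
    namespace(None)  # include production runs outside the user namespace
    for i, run in enumerate(Flow(flow_name)):  # runs iterate newest-first
        if i >= limit:
            break
        yield run.id, run.successful, run.finished_at

if __name__ == "__main__":
    for run_id, ok, finished in list_recent_runs("MyFlow"):
        print(run_id, "ok" if ok else "FAILED", finished)
```

Calling namespace(None) is what makes production runs visible: scheduled deployments run in a production namespace, not your user namespace.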
Production runs can be queried from Python with Metaflow's Client API.
Query Specific Runs
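A sketch of fetching one run by pathspec and reading an artifact from it. The pathspec "MyFlow/1234" and the artifact name "accuracy" are placeholders.

```python
# Sketch: fetching a specific run and reading an artifact from it.
# The pathspec "MyFlow/1234" and the artifact name "accuracy" are placeholders.

def run_pathspec(flow_name, run_id):
    """Build the "FlowName/run_id" pathspec that Run() expects."""
    return f"{flow_name}/{run_id}"

def read_artifact(pathspec, name):
    """Read an artifact from the run's end step (requires metaflow + datastore access)."""
    from metaflow import Run, namespace  # lazy import
    namespace(None)
    return getattr(Run(pathspec).data, name)

if __name__ == "__main__":
    print(read_artifact(run_pathspec("MyFlow", 1234), "accuracy"))
```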
UI and Dashboards
Metaflow UI
If you have the Metaflow UI service deployed, it provides:
- DAG visualization
- Step execution timeline
- Artifact inspection
- Log viewing
- Cards rendering
Orchestrator UIs
AWS Step Functions Console:
- Navigate to AWS Step Functions in the AWS Console
- View state machine execution history
- Inspect input/output for each state
- View CloudWatch Logs (if enabled)
Argo Workflows UI:
- Access the Argo UI at your configured endpoint
- View workflow templates and executions
- Real-time execution progress
- Pod logs and events
Airflow UI:
- Access the Airflow webserver
- View DAG runs and task instances
- Task logs and XCom data
- Gantt charts and execution timeline
Logs and Debugging
Accessing Logs
Step Functions with CloudWatch: enable execution history logging on the state machine; logs are delivered to a log group such as /aws/vendedlogs/states/<name>.
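A hedged sketch of enabling execution logging with boto3. The ARNs are placeholders, and the call assumes appropriate IAM permissions.

```python
# Sketch: enabling Step Functions execution logging with boto3.
# ARNs are placeholders; assumes boto3, AWS credentials, and IAM permissions.

def logging_configuration(log_group_arn, level="ALL"):
    """loggingConfiguration payload for stepfunctions.update_state_machine."""
    return {
        "level": level,  # ALL, ERROR, FATAL, or OFF
        "includeExecutionData": True,
        "destinations": [
            {"cloudWatchLogsLogGroup": {"logGroupArn": log_group_arn}},
        ],
    }

def enable_execution_logging(state_machine_arn, log_group_arn):
    import boto3  # lazy import
    boto3.client("stepfunctions").update_state_machine(
        stateMachineArn=state_machine_arn,
        loggingConfiguration=logging_configuration(log_group_arn),
    )
```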
Argo Workflows: view logs directly with kubectl, for example kubectl logs <pod-name> -n <namespace>, or with argo logs <workflow-name>.
Programmatic Log Access
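Task logs can also be read through the Client API. A sketch, where the pathspec "MyFlow/1234/train/5678" (flow/run/step/task) is a placeholder:

```python
# Sketch: reading a task's logs through Metaflow's Client API.
# The pathspec "MyFlow/1234/train/5678" (flow/run/step/task) is a placeholder.

def error_lines(log_text):
    """Keep lines that look like errors (a simple heuristic)."""
    return [line for line in log_text.splitlines()
            if "ERROR" in line or "Traceback" in line]

def task_logs(pathspec):
    from metaflow import Task, namespace  # lazy import
    namespace(None)
    task = Task(pathspec)
    return task.stdout, task.stderr

if __name__ == "__main__":
    out, err = task_logs("MyFlow/1234/train/5678")
    print("\n".join(error_lines(err)))
```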
Alerting and Notifications
Argo Workflows Notifications
Configure notifications during deployment; the Argo Workflows integration can notify on success or failure (for example via Slack).
Custom Alerting
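One common pattern is posting a failure alert to a webhook (e.g. a Slack incoming webhook) from the flow. A sketch, where the webhook URL is a placeholder and the requests package is assumed:

```python
# Sketch: posting a failure alert to a webhook (e.g. Slack) from a flow.
# The webhook URL is a placeholder; assumes the `requests` package.
import json

def build_alert(flow_name, run_id, error):
    """Slack-style payload describing a failed run."""
    return {"text": f"{flow_name} run {run_id} failed: {error}"}

def send_alert(webhook_url, payload):
    import requests  # lazy import
    requests.post(
        webhook_url,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        timeout=10,
    )

# Typically called from an error-handling step, e.g. one guarded by @catch.
```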
Alerting logic can also live in the flow itself, for example in an error-handling step guarded by @catch.
AWS CloudWatch Alarms
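A hedged sketch of alarming on failed executions of a deployed state machine. The ARNs are placeholders; boto3 and an SNS topic for notifications are assumed.

```python
# Sketch: alarming on failed Step Functions executions.
# ARNs are placeholders; assumes boto3 and an SNS topic for notifications.

def failed_executions_alarm(state_machine_arn, sns_topic_arn):
    """Parameters for cloudwatch.put_metric_alarm on the ExecutionsFailed metric."""
    return {
        "AlarmName": "stepfunctions-executions-failed",
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,  # evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }

def create_alarm(params):
    import boto3  # lazy import; requires AWS credentials
    boto3.client("cloudwatch").put_metric_alarm(**params)
```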
Set up CloudWatch alarms on Step Functions metrics such as ExecutionsFailed, typically routed to an SNS topic.
Metrics and Observability
Built-in Metrics
Metaflow automatically tracks:
- Execution duration for each step
- Resource usage (CPU, memory)
- Retry attempts
- Success/failure rates
Custom Metrics
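A sketch of recording a custom timing metric inside a step; the artifact names below are placeholders. Because the value is assigned to self, it is persisted as an artifact and can be analyzed later through the Client API.

```python
# Sketch: recording a custom timing metric inside a step.
# Artifact names below are placeholders.
import time

class Timer:
    """Context manager that records elapsed wall-clock seconds."""
    def __enter__(self):
        self._start = time.monotonic()
        return self
    def __exit__(self, *exc):
        self.elapsed = time.monotonic() - self._start

# Inside a Metaflow step you might write:
#     with Timer() as t:
#         self.model = train(self.data)
#     self.train_seconds = t.elapsed  # saved as an artifact, queryable later
```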
Custom metrics can be logged from any step and stored as artifacts for later analysis.
Integration with Monitoring Tools
Datadog: metrics can be forwarded from steps using the datadog Python package (DogStatsD or the HTTP API).
Debugging Failed Runs
Resume Failed Runs
Metaflow allows resuming from failed steps: python flow.py resume --origin-run-id <run-id> reuses the artifacts of the successful steps from the original run.
Inspect Failed Tasks
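A sketch of locating the failing tasks of a run with the Client API; the pathspec passed to inspect_run is a placeholder.

```python
# Sketch: locating the failing tasks of a run (the pathspec is a placeholder).

def failed_tasks(run):
    """Yield (step_name, task_id, exception) for every unsuccessful task."""
    for step in run:
        for task in step:
            if not task.successful:
                yield step.id, task.id, task.exception

def inspect_run(pathspec):
    from metaflow import Run, namespace  # lazy import
    namespace(None)
    return list(failed_tasks(Run(pathspec)))
```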
Debug Mode
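A sketch using standard Python logging, so verbosity can be raised in production without code changes; the logger name and environment variable are arbitrary choices.

```python
# Sketch: leveled debug output inside a step; logger name is arbitrary.
import logging
import os

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO"))
log = logging.getLogger("myflow")

log.debug("row counts: %s", {"train": 1000})  # hidden unless LOG_LEVEL=DEBUG
log.info("starting training")
log.warning("feature X has %d%% nulls", 12)
```

Everything written to stdout/stderr by a task ends up in the task logs described above.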
Add debugging output to your flows with standard Python logging; messages appear in the task logs.
Performance Monitoring
Resource Usage Tracking
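One stdlib-only way to track memory from inside a step is the resource module (POSIX only). A sketch, where the artifact name is a placeholder:

```python
# Sketch: recording peak memory usage of the current process (POSIX only; stdlib).
import resource
import sys

def peak_rss_mb():
    """Peak resident set size of this process, in megabytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but bytes on macOS
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

# Inside a step: self.peak_mem_mb = peak_rss_mb()
```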
Cards for Monitoring
Use the @card decorator for visual monitoring; view generated cards with python flow.py card view <step-name> or in the Metaflow UI.
Best Practices
Tag your runs
Always deploy with tags for easier filtering and analysis:
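Tagged runs are easy to slice with the Client API. A sketch, where the tag and flow names are placeholders:

```python
# Sketch: filtering runs by tag (tag and flow names are placeholders).

def filter_by_tag(runs, tag):
    """Keep runs carrying the given tag."""
    return [run for run in runs if tag in run.tags]

def tagged_production_runs(flow_name, tag):
    from metaflow import Flow, namespace  # lazy import
    namespace(None)
    return filter_by_tag(Flow(flow_name), tag)

# Tags can also be attached after the fact:
#     Run("MyFlow/1234").add_tag("reviewed")
```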
Set up alerting early
Configure failure notifications before deploying to production to catch issues quickly.
Log at appropriate levels
Use different logging levels (INFO, WARNING, ERROR) and only log what’s necessary to avoid log bloat.
Monitor resource usage
Track memory and CPU usage to optimize resource allocation and costs.
Create dashboards
Build dashboards showing key metrics like success rates, execution times, and throughput.
Retain execution history
Keep execution history for debugging and compliance. Configure retention policies appropriately.
Troubleshooting Common Issues
High Failure Rate
- Check recent code changes
- Review error logs for patterns
- Verify upstream data quality
- Check resource limits
- Look for infrastructure issues
Slow Execution
- Profile resource usage
- Check for data volume increases
- Look for external service latency
- Review parallel execution settings
- Consider optimizing expensive steps
Missing Artifacts
- Verify datastore configuration
- Check storage permissions
- Look for cleanup policies
- Ensure artifacts are being saved
- Check for datastore connectivity issues
Next Steps
Debugging Flows
Learn debugging techniques
Cards
Create visual monitoring reports
Configuration
Configure monitoring integrations
