Overview
The dagster resume command re-executes a failed Dagster job while skipping steps that already completed successfully. It uses Metaflow’s --clone-run-id mechanism to reuse existing task outputs, making it efficient to recover from transient failures without re-running expensive computations.
Command Syntax
Options
Metaflow run ID of the failed run to resume (e.g., dagster-abc123). This is the run ID from the original failed execution.
Path to the generated Dagster definitions file. Defaults to <flowname>_dagster.py (lowercased).
Dagster job name. Defaults to the flow name. Must match the job name used during the original run.
Tag for the new Metaflow run. Can be specified multiple times. These tags are added to the resumed run.
Inject a Metaflow step decorator at deploy time. Can be specified multiple times. These decorators apply to the resumed run.
Maximum wall-clock seconds for the entire resumed job run.
Metaflow namespace for the resumed run.
How It Works
When you resume a failed run:
- Temporary Definitions File: A new Dagster definitions file is compiled with ORIGIN_RUN_ID set to the failed run ID
- Step Execution: When each step executes, it passes --clone-run-id to the Metaflow CLI
- Output Reuse: Metaflow checks if the task output already exists in the original run
- Conditional Execution: If the output exists and is valid, the step is skipped; otherwise, it re-executes
- New Run ID: The resumed run gets a new Metaflow run ID (derived from the new Dagster run UUID)
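The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the integration's actual code: the helpers new_run_id and build_step_command, the "dagster-" prefix with a truncated UUID, and the file name my_flow.py are hypothetical, and the step arguments are simplified. Only the --clone-run-id flag, the ORIGIN_RUN_ID name, and the idea of deriving the run ID from the Dagster run UUID come from the description above.

```python
import uuid
from typing import List, Optional


def new_run_id(dagster_run_uuid: str) -> str:
    """Derive a Metaflow run ID from a Dagster run UUID.

    Assumption: "dagster-" plus the first 12 hex characters of the UUID.
    The real derivation may differ.
    """
    return "dagster-" + dagster_run_uuid.replace("-", "")[:12]


def build_step_command(
    flow_file: str,
    step_name: str,
    run_id: str,
    task_id: str,
    origin_run_id: Optional[str] = None,
) -> List[str]:
    """Build a simplified argument list for one Metaflow step invocation."""
    cmd = ["python", flow_file, "step", step_name,
           "--run-id", run_id, "--task-id", task_id]
    # On a resume, point Metaflow at the failed run so it can reuse outputs.
    if origin_run_id is not None:
        cmd += ["--clone-run-id", origin_run_id]
    return cmd


ORIGIN_RUN_ID = "dagster-abc123"  # run ID of the failed run
resumed_run_id = new_run_id(str(uuid.uuid4()))
print(build_step_command("my_flow.py", "train", resumed_run_id, "1", ORIGIN_RUN_ID))
```

Passing origin_run_id=None models a normal (non-resumed) run, where the flag is simply omitted.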
Examples
Basic Resume
Resume a failed run by its Metaflow run ID:
Resume with Custom Definitions File
Specify a different definitions file:
Resume with Additional Tags
Add tags to track the resumed run:
Resume with Decorator Injection
Inject step decorators for the resumed run:
Complete Example
Resume with all options:
Output
On success, the command displays:
Finding the Run ID
To find the Metaflow run ID of a failed run:
Behavior with --clone-run-id
The --clone-run-id flag tells Metaflow to:
- Check if the task output exists in the original run
- If it exists and all inputs match, skip execution and reuse the output
- If it doesn’t exist or inputs changed, execute the step normally
During a resume, steps therefore fall into three groups:
- Completed steps: Skipped automatically, outputs reused from the original run
- Failed steps: Re-executed with the same parameters
- Downstream steps: Only execute if their inputs are available
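The reuse-or-execute rule can be modeled with a small planner. This is a toy sketch, assuming a step's inputs count as unchanged only when every upstream step was itself reused; the function plan_resume and the example step names are hypothetical, and Metaflow's real check operates on task outputs in its datastore rather than on a dictionary.

```python
from typing import Dict, List


def plan_resume(
    upstream: Dict[str, List[str]],
    succeeded: Dict[str, bool],
) -> Dict[str, str]:
    """Return "reuse" or "execute" for each step.

    `upstream` maps each step to its parent steps and must be listed in
    topological order (parents before children).
    """
    plan: Dict[str, str] = {}
    for step, parents in upstream.items():
        # Inputs are treated as unchanged only if every parent was reused.
        inputs_unchanged = all(plan[p] == "reuse" for p in parents)
        if succeeded.get(step, False) and inputs_unchanged:
            plan[step] = "reuse"
        else:
            plan[step] = "execute"
    return plan


# Original run: "start" succeeded, "train" failed, "end" never ran.
dag = {"start": [], "train": ["start"], "end": ["train"]}
original = {"start": True, "train": False}
print(plan_resume(dag, original))
# {'start': 'reuse', 'train': 'execute', 'end': 'execute'}
```

The failed step and everything downstream of it run again, while the completed start step is reused.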
Common Use Cases
Transient Failures
Retry after network or service interruptions:
Debugging Failures
Resume with additional logging or debugging decorators:
Increased Resources
Resume with more compute resources for steps that ran out of memory:
Limitations
- The original run must exist in the Metaflow datastore
- Step outputs must be intact and accessible
- Cannot resume if the flow definition changed significantly (new steps, different DAG structure)
- Parameters from the original run are used; you cannot override them during resume
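Before resuming, it can help to confirm that the flow's structure still matches the failed run. The sketch below is one hypothetical way to do that by comparing two step graphs by hand; the structure_changes helper and the graph representation are assumptions for illustration, and nothing here implies that dagster resume performs this check itself.

```python
from typing import Dict, List, Set, Tuple

# Step name -> names of the steps it transitions to.
Graph = Dict[str, List[str]]


def edges(graph: Graph) -> Set[Tuple[str, str]]:
    """Flatten a graph into a set of (source, destination) pairs."""
    return {(src, dst) for src, dsts in graph.items() for dst in dsts}


def structure_changes(original: Graph, current: Graph) -> List[str]:
    """List the structural differences that could block a resume."""
    problems: List[str] = []
    added = sorted(set(current) - set(original))
    removed = sorted(set(original) - set(current))
    if added:
        problems.append("new steps: " + ", ".join(added))
    if removed:
        problems.append("removed steps: " + ", ".join(removed))
    # Same steps but different wiring is also a changed DAG structure.
    if not added and not removed and edges(original) != edges(current):
        problems.append("transitions between steps changed")
    return problems


failed_run_graph = {"start": ["train"], "train": ["end"], "end": []}
edited_graph = {"start": ["prep"], "prep": ["train"], "train": ["end"], "end": []}
print(structure_changes(failed_run_graph, edited_graph))
# ['new steps: prep']
```

An empty list means the two graphs have the same steps and transitions.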
Next Steps
Create Definitions
Compile a flow to a Dagster definitions file
Trigger Runs
Launch a new Dagster job execution