## How it works
The `dagster resume` command:

- Identifies the failed Metaflow run by its run ID
- Compiles a new Dagster definitions file with `ORIGIN_RUN_ID` set
- Passes `--clone-run-id` to each step subprocess
- Metaflow reuses completed task outputs and re-executes only failed/pending steps
- Creates a new run ID for the resumed execution
## Basic resume workflow
### Identify the failed run

When a Dagster job fails, note the Metaflow run ID from the logs. It follows the pattern `dagster-<hash>`.

### Resume the run

Execute the resume command. This creates a temporary definitions file, executes the resumed run, and cleans up automatically.
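For example, passing the run ID recorded from the logs (example value):

```shell
dagster resume --run-id dagster-abc123def456
```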
## Command syntax
### Required flag
- `--run-id`: Metaflow run ID of the failed run (e.g., `dagster-abc123def456`)
### Optional flags
| Flag | Description | Default |
|---|---|---|
| `--definitions-file` | Path to write resume definitions file | `<FlowName>_dagster.py` |
| `--job-name` | Dagster job name | Flow name (or `project_FlowName`) |
| `--tag` | Tag for the new Metaflow run (repeatable) | None |
| `--with` | Inject step decorator at resume time (repeatable) | None |
| `--workflow-timeout` | Maximum wall-clock seconds for the job | None |
| `--namespace` | Metaflow namespace for the resumed run | Current namespace |
## Saving the resume definitions file
By default, `dagster resume` creates a temporary definitions file and deletes it after execution. To inspect or reuse the resume definitions file, pass `--definitions-file`. The saved file has `ORIGIN_RUN_ID` set, so it passes `--clone-run-id` when executing.
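For example (output file name illustrative):

```shell
dagster resume --run-id dagster-abc123def456 --definitions-file resume_defs.py
```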
## Understanding `--clone-run-id`
The `--clone-run-id` flag is Metaflow's built-in mechanism for resuming runs. When a step subprocess runs with `--clone-run-id <origin-run-id>`, Metaflow:

- Checks whether the task already completed successfully in the origin run
- If yes: reuses the existing task outputs (no re-execution)
- If no: runs the task normally

Completed steps are never re-executed during resume. Only failed or pending steps run again.
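For comparison, Metaflow's standalone `resume` command drives the same clone mechanism directly (flow file name and run ID illustrative):

```shell
python myflow.py resume --origin-run-id dagster-abc123def456
```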
## Step-by-step example
Consider a flow with three steps: `start`, `process`, `end`.
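A minimal sketch of such a flow (flow and file names are illustrative, with a deliberate bug in `process`):

```python
# three_step_flow.py -- illustrative flow, not taken from the original document
from metaflow import FlowSpec, step

class ThreeStepFlow(FlowSpec):

    @step
    def start(self):
        self.data = [1, 2, 3]
        self.next(self.process)

    @step
    def process(self):
        # Bug: divides by zero, so the run fails at this step
        self.result = sum(self.data) / 0
        self.next(self.end)

    @step
    def end(self):
        print(self.result)

if __name__ == "__main__":
    ThreeStepFlow()
```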
### Initial run fails
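Running the compiled Dagster job fails at `process` (file name illustrative, following the default `<FlowName>_dagster.py` convention):

```shell
dagster job execute -f ThreeStepFlow_dagster.py
# fails at the process step; the logs include the Metaflow run ID,
# e.g. dagster-abc123
```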
### Fix the code

Edit `process` to handle the error.
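One possible fix, sketched as a replacement for the failing step (names illustrative):

```python
# Replacement for the process step inside the flow class (illustrative)
@step
def process(self):
    # Fixed: guard against the zero divisor instead of failing
    self.result = sum(self.data) / max(len(self.data), 1)
    self.next(self.end)
```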
### Resume the run
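Resume with the run ID recorded from the failed run:

```shell
dagster resume --run-id dagster-abc123
```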
The `start` step's outputs from the original run (`dagster-abc123`) are reused automatically.
## Resume with modified configuration
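A sketch, reusing the optional flags from the table above (run ID and values illustrative):

```shell
dagster resume --run-id dagster-abc123def456 \
  --tag retry-attempt \
  --with retry \
  --workflow-timeout 3600
```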
You can change tags, timeouts, and decorators during resume.

## Resume in production
For production deployments where runs are triggered via the Dagster UI or API, save the resume definitions file with `--definitions-file` so your deployment can load it instead of relying on the temporary file.

## Limitations
### Cannot resume from arbitrary steps
Resume always starts from the first failed step in topological order. You cannot manually select which steps to re-run.

### Requires original datastore access
The resumed run must have read access to the origin run's datastore. If using a remote datastore (S3, Azure, etc.), ensure credentials are still valid.

### Parameter changes not supported
You cannot change flow parameter values during resume. The resumed run inherits parameters from the origin run's `_parameters` task.
## Checking run status
To verify which steps completed in the origin run, inspect it with the Metaflow client API. Completed tasks report `finished=True` and have retrievable artifacts.
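For example, with the Metaflow client API (flow and run names illustrative):

```python
# List task completion status for every step of the origin run
from metaflow import Run

origin = Run("MyFlow/dagster-abc123def456")
for flow_step in origin.steps():
    for task in flow_step.tasks():
        print(flow_step.id, task.finished)
```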
## Next steps
- **Configuration**: Learn how to configure metadata service, datastore, and runtime settings
- **Retries and Timeouts**: Configure automatic retries and timeouts for individual steps