When a Dagster job run fails, metaflow-dagster can resume from the point of failure, skipping steps that already completed successfully. This saves time and compute resources by reusing existing outputs.

How it works

The dagster resume command:
  1. Identifies the failed Metaflow run by its run ID
  2. Compiles a new Dagster definitions file with ORIGIN_RUN_ID set
  3. Passes --clone-run-id to each step subprocess
  4. Metaflow reuses completed task outputs and re-executes only failed/pending steps
  5. Creates a new run ID for the resumed execution
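The lifecycle above can be sketched in a few lines of Python. This is illustrative only; the function names and the placeholder run ID are hypothetical, not metaflow-dagster's actual internals:

```python
import os
import tempfile

def resume(origin_run_id: str) -> str:
    """Hypothetical sketch of the resume lifecycle (not the real implementation)."""
    # 2. Compile a temporary definitions file with ORIGIN_RUN_ID baked in.
    fd, defs_path = tempfile.mkstemp(suffix="_dagster.py")
    with os.fdopen(fd, "w") as f:
        f.write(f"ORIGIN_RUN_ID = {origin_run_id!r}\n")
    try:
        # 3-5. Execute the Dagster job; each step subprocess receives
        # --clone-run-id <origin_run_id>, and a fresh run ID is minted.
        new_run_id = "dagster-xyz789ghi012"  # placeholder for the new run ID
        return new_run_id
    finally:
        os.remove(defs_path)  # the temporary definitions file is cleaned up

print(resume("dagster-abc123def456"))
```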

Basic resume workflow

1. Identify the failed run

When a Dagster job fails, note the Metaflow run ID from the logs. It follows the pattern dagster-<hash>:
Run ID: dagster-abc123def456
2. Resume the run

Execute the resume command:
python my_flow.py dagster resume --run-id dagster-abc123def456
This creates a temporary definitions file, executes the resumed run, and cleans up automatically.
3. Monitor the resumed run

The command output shows:
Resuming MyFlow from run dagster-abc123def456 as job MyFlow...
Resumed Dagster job MyFlow (origin run: dagster-abc123def456, new run: dagster-xyz789ghi012).

Command syntax

Required flag

python my_flow.py dagster resume --run-id <dagster-run-id>
  • --run-id: Metaflow run ID of the failed run (e.g., dagster-abc123def456)

Optional flags

python my_flow.py dagster resume \
  --run-id dagster-abc123def456 \
  --definitions-file my_flow_resume.py \
  --job-name CustomJobName \
  --tag env:staging \
  --workflow-timeout 7200
  • --definitions-file: Path to write the resume definitions file (default: <FlowName>_dagster.py)
  • --job-name: Dagster job name (default: flow name, or project_FlowName)
  • --tag: Tag for the new Metaflow run; repeatable (default: none)
  • --with: Inject a step decorator at resume time; repeatable (default: none)
  • --workflow-timeout: Maximum wall-clock seconds for the job (default: none)
  • --namespace: Metaflow namespace for the resumed run (default: current namespace)

Saving the resume definitions file

By default, dagster resume creates a temporary file and deletes it after execution. To inspect or reuse the resume definitions file:
python my_flow.py dagster resume \
  --run-id dagster-abc123def456 \
  --definitions-file my_flow_resume.py
The generated file includes:
# Near the top of the file
ORIGIN_RUN_ID: Optional[str] = 'dagster-abc123def456'
This tells every step subprocess to use --clone-run-id when executing.
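Conceptually, each step launch then looks something like the following. This is a sketch with hypothetical helper names; the real command line is assembled by metaflow-dagster internally:

```python
ORIGIN_RUN_ID = "dagster-abc123def456"  # value baked into the resume definitions file

def build_step_command(flow_file: str, step_name: str, run_id: str) -> list:
    """Assemble the subprocess command for one step (illustrative sketch)."""
    cmd = ["python", flow_file, "step", step_name, "--run-id", run_id]
    if ORIGIN_RUN_ID is not None:
        # Presence of ORIGIN_RUN_ID makes every step launch clone-aware.
        cmd += ["--clone-run-id", ORIGIN_RUN_ID]
    return cmd

print(build_step_command("my_flow.py", "start", "dagster-xyz789ghi012"))
```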

Understanding --clone-run-id

The --clone-run-id flag is Metaflow’s built-in mechanism for resuming runs:
  • When a step subprocess runs with --clone-run-id <origin-run-id>, Metaflow:
    • Checks if the task already completed successfully in the origin run
    • If yes: reuses the existing task outputs (no re-execution)
    • If no: runs the task normally
The resumed run shares the same pathspec structure as the origin run, so all artifact references remain valid.
Completed steps are never re-executed during resume. Only failed or pending steps run again.
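The reuse decision itself reduces to a simple membership check, sketched here in plain Python (not Metaflow's actual code):

```python
def needs_execution(step_name, origin_completed):
    """Return True if the step must run again in the resumed run."""
    return step_name not in origin_completed

# Steps that succeeded in the origin run (e.g. discovered via the client API).
origin_completed = {"start"}

for step_name in ["start", "process", "end"]:
    verdict = "re-execute" if needs_execution(step_name, origin_completed) else "reuse outputs"
    print(f"{step_name}: {verdict}")
```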

Step-by-step example

Consider a flow with three steps: start, process, end.

Initial run fails

from metaflow import FlowSpec, step

class DataPipeline(FlowSpec):
    @step
    def start(self):
        self.data = "initial data"
        self.next(self.process)
    
    @step
    def process(self):
        # This step fails
        raise RuntimeError("Processing failed!")
        self.next(self.end)  # required by Metaflow's linter; unreachable until fixed
    
    @step
    def end(self):
        print("Done")
Run output:
Run ID: dagster-abc123
Step 'start': SUCCESS
Step 'process': FAILED
Step 'end': NOT RUN

Fix the code

Edit process to handle the error:
@step
def process(self):
    # Fixed implementation
    self.processed = self.data.upper()
    self.next(self.end)

Resume the run

python data_pipeline.py dagster resume --run-id dagster-abc123
Resume output:
Run ID: dagster-xyz789
Step 'start': SKIPPED (reusing from dagster-abc123)
Step 'process': RUNNING
Step 'process': SUCCESS
Step 'end': RUNNING
Step 'end': SUCCESS
The start step’s outputs from the original run (dagster-abc123) are reused automatically.

Resume with modified configuration

You can change tags, timeouts, and decorators during resume:
python my_flow.py dagster resume \
  --run-id dagster-abc123 \
  --tag retry:1 \
  --workflow-timeout 3600 \
  --with=sandbox
These modifications apply only to the resumed run, not the origin run.

Resume in production

For production deployments where runs are triggered via the Dagster UI or API:
1. Note the failed run ID

Find the Metaflow run ID in the failed job’s logs.
2. Generate the resume definitions file

On the Dagster server or a machine with access to the flow file:
python my_flow.py dagster resume \
  --run-id <failed-run-id> \
  --definitions-file my_flow_resume.py
3. Deploy the resume file

Replace the active definitions file with the resume file:
cp my_flow_resume.py /path/to/dagster/workspace/
4. Trigger the resume job

In the Dagster UI:
  • Navigate to the job
  • Click Launch Run
Or via CLI:
dagster job execute -f my_flow_resume.py -j MyFlow
5. Restore the original definitions

After the resumed run completes, restore the original definitions file:
cp my_flow_dagster.py /path/to/dagster/workspace/
Do not delete or modify the origin run’s datastore files before resuming. The resume mechanism requires access to the original task outputs.

Limitations

Cannot resume from arbitrary steps

Resume always starts from the first failed step in topological order. You cannot manually select which steps to re-run.

Requires original datastore access

The resumed run must have read access to the origin run’s datastore. If using a remote datastore (S3, Azure, etc.), ensure credentials are still valid.

Parameter changes not supported

You cannot change flow parameter values during resume. The resumed run inherits parameters from the origin run’s _parameters task.
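To illustrate this inheritance (a sketch with made-up parameter values, not Metaflow internals): the resumed run copies the origin run's parameter values verbatim, so resume-time overrides are not applied:

```python
def resolve_resume_parameters(origin_params, attempted_overrides):
    """Sketch: resume clones the origin run's _parameters task; overrides are ignored."""
    return dict(origin_params)

origin_params = {"learning_rate": 0.01, "input_path": "s3://bucket/raw"}
resumed = resolve_resume_parameters(origin_params, {"learning_rate": 0.5})
print(resumed["learning_rate"])  # inherited from the origin run, not the override
```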

Checking run status

To verify which steps completed in the origin run:
from metaflow import Flow

run = Flow('MyFlow')['dagster-abc123']
for step in run:
    for task in step:
        print(f"{step.id}/{task.id}: {task.finished}")
Completed tasks show finished=True and have retrievable artifacts.
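Because resume restarts at the first failed step in topological order (as noted under Limitations), a small helper over these finished flags tells you where execution will pick up. This is a sketch that assumes you have already collected per-step status as shown above:

```python
def first_incomplete_step(step_status):
    """Return the first step (in topological order) whose task did not finish."""
    for step_name, finished in step_status:
        if not finished:
            return step_name
    return None  # every step completed; there is nothing left to resume

# Per-step status gathered from the origin run, e.g. with the client loop above.
status = [("start", True), ("process", False), ("end", False)]
print(first_incomplete_step(status))  # → process
```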

Next steps

Configuration

Learn how to configure the metadata service, datastore, and runtime settings

Retries and Timeouts

Configure automatic retries and timeouts for individual steps