When a Dagster job run fails, metaflow-dagster can resume from the point of failure, skipping steps that already completed successfully. This saves time and compute resources by reusing existing outputs.

How it works

The dagster resume command:
  1. Identifies the failed Metaflow run by its run ID
  2. Compiles a new Dagster definitions file with ORIGIN_RUN_ID set
  3. Passes --clone-run-id to each step subprocess
  4. Metaflow reuses completed task outputs and re-executes only failed/pending steps
  5. Creates a new run ID for the resumed execution
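The lifecycle above can be sketched in a few lines of Python. This is illustrative only; the function names and the placeholder run ID are hypothetical, not metaflow-dagster's actual internals:

```python
import os
import tempfile

def resume(origin_run_id: str) -> str:
    """Hypothetical sketch of the resume lifecycle (not the real implementation)."""
    # 2. Compile a temporary definitions file with ORIGIN_RUN_ID baked in.
    fd, defs_path = tempfile.mkstemp(suffix="_dagster.py")
    with os.fdopen(fd, "w") as f:
        f.write(f"ORIGIN_RUN_ID = {origin_run_id!r}\n")
    try:
        # 3-5. Execute the Dagster job; each step subprocess receives
        # --clone-run-id <origin_run_id>, and a fresh run ID is minted.
        new_run_id = "dagster-xyz789ghi012"  # placeholder for the new run ID
        return new_run_id
    finally:
        os.remove(defs_path)  # the temporary definitions file is cleaned up

print(resume("dagster-abc123def456"))
```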

Basic resume workflow

1. Identify the failed run

When a Dagster job fails, note the Metaflow run ID from the logs. It follows the pattern dagster-<hash>:
Run ID: dagster-abc123def456
2. Resume the run

Execute the resume command:
python my_flow.py dagster resume --run-id dagster-abc123def456
This creates a temporary definitions file, executes the resumed run, and cleans up automatically.
3. Monitor the resumed run

The command output shows:
Resuming MyFlow from run dagster-abc123def456 as job MyFlow...
Resumed Dagster job MyFlow (origin run: dagster-abc123def456, new run: dagster-xyz789ghi012).

Command syntax

Required flag

python my_flow.py dagster resume --run-id <dagster-run-id>
  • --run-id: Metaflow run ID of the failed run (e.g., dagster-abc123def456)

Optional flags

python my_flow.py dagster resume \
  --run-id dagster-abc123def456 \
  --definitions-file my_flow_resume.py \
  --job-name CustomJobName \
  --tag env:staging \
  --workflow-timeout 7200
  • --definitions-file: Path to write the resume definitions file (default: <FlowName>_dagster.py)
  • --job-name: Dagster job name (default: flow name, or project_FlowName)
  • --tag: Tag for the new Metaflow run; repeatable (default: none)
  • --with: Inject a step decorator at resume time; repeatable (default: none)
  • --workflow-timeout: Maximum wall-clock seconds for the job (default: none)
  • --namespace: Metaflow namespace for the resumed run (default: current namespace)

Saving the resume definitions file

By default, dagster resume creates a temporary file and deletes it after execution. To inspect or reuse the resume definitions file:
python my_flow.py dagster resume \
  --run-id dagster-abc123def456 \
  --definitions-file my_flow_resume.py
The generated file includes:
# Near the top of the file
ORIGIN_RUN_ID: Optional[str] = 'dagster-abc123def456'
This tells every step subprocess to use --clone-run-id when executing.
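Conceptually, each step launch then looks something like the following. This is a sketch with hypothetical helper names; the real command line is assembled by metaflow-dagster internally:

```python
ORIGIN_RUN_ID = "dagster-abc123def456"  # value baked into the resume definitions file

def build_step_command(flow_file: str, step_name: str, run_id: str) -> list:
    """Assemble the subprocess command for one step (illustrative sketch)."""
    cmd = ["python", flow_file, "step", step_name, "--run-id", run_id]
    if ORIGIN_RUN_ID is not None:
        # Presence of ORIGIN_RUN_ID makes every step launch clone-aware.
        cmd += ["--clone-run-id", ORIGIN_RUN_ID]
    return cmd

print(build_step_command("my_flow.py", "start", "dagster-xyz789ghi012"))
```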

Understanding --clone-run-id

The --clone-run-id flag is Metaflow’s built-in mechanism for resuming runs:
  • When a step subprocess runs with --clone-run-id <origin-run-id>, Metaflow:
    • Checks if the task already completed successfully in the origin run
    • If yes: reuses the existing task outputs (no re-execution)
    • If no: runs the task normally
The resumed run shares the same pathspec structure as the origin run, so all artifact references remain valid.
Completed steps are never re-executed during resume. Only failed or pending steps run again.
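The reuse decision itself reduces to a simple membership check, sketched here in plain Python (not Metaflow's actual code):

```python
def needs_execution(step_name, origin_completed):
    """Return True if the step must run again in the resumed run."""
    return step_name not in origin_completed

# Steps that succeeded in the origin run (e.g. discovered via the client API).
origin_completed = {"start"}

for step_name in ["start", "process", "end"]:
    verdict = "re-execute" if needs_execution(step_name, origin_completed) else "reuse outputs"
    print(f"{step_name}: {verdict}")
```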

Step-by-step example

Consider a flow with three steps: start, process, end.

Initial run fails

from metaflow import FlowSpec, step

class DataPipeline(FlowSpec):
    @step
    def start(self):
        self.data = "initial data"
        self.next(self.process)
    
    @step
    def process(self):
        # This step fails
        raise RuntimeError("Processing failed!")
        self.next(self.end)  # required by Metaflow's linter; unreachable until fixed
    
    @step
    def end(self):
        print("Done")
Run output:
Run ID: dagster-abc123
Step 'start': SUCCESS
Step 'process': FAILED
Step 'end': NOT RUN

Fix the code

Edit process to handle the error:
@step
def process(self):
    # Fixed implementation
    self.processed = self.data.upper()
    self.next(self.end)

Resume the run

python data_pipeline.py dagster resume --run-id dagster-abc123
Resume output:
Run ID: dagster-xyz789
Step 'start': SKIPPED (reusing from dagster-abc123)
Step 'process': RUNNING
Step 'process': SUCCESS
Step 'end': RUNNING
Step 'end': SUCCESS
The start step’s outputs from the original run (dagster-abc123) are reused automatically.

Resume with modified configuration

You can change tags, timeouts, and decorators during resume:
python my_flow.py dagster resume \
  --run-id dagster-abc123 \
  --tag retry:1 \
  --workflow-timeout 3600 \
  --with=sandbox
These modifications apply only to the resumed run, not the origin run.

Resume in production

For production deployments where runs are triggered via the Dagster UI or API:
1. Note the failed run ID

Find the Metaflow run ID in the failed job’s logs.
2. Generate the resume definitions file

On the Dagster server or a machine with access to the flow file:
python my_flow.py dagster resume \
  --run-id <failed-run-id> \
  --definitions-file my_flow_resume.py
3. Deploy the resume file

Replace the active definitions file with the resume file:
cp my_flow_resume.py /path/to/dagster/workspace/
4. Trigger the resume job

In the Dagster UI:
  • Navigate to the job
  • Click Launch Run
Or via CLI:
dagster job execute -f my_flow_resume.py -j MyFlow
5. Restore the original definitions

After the resumed run completes, restore the original definitions file:
cp my_flow_dagster.py /path/to/dagster/workspace/
Do not delete or modify the origin run’s datastore files before resuming. The resume mechanism requires access to the original task outputs.

Limitations

Cannot resume from arbitrary steps

Resume always starts from the first failed step in topological order. You cannot manually select which steps to re-run.

Requires original datastore access

The resumed run must have read access to the origin run’s datastore. If using a remote datastore (S3, Azure, etc.), ensure credentials are still valid.

Parameter changes not supported

You cannot change flow parameter values during resume. The resumed run inherits parameters from the origin run’s _parameters task.
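To illustrate this inheritance (a sketch with made-up parameter values, not Metaflow internals): the resumed run copies the origin run's parameter values verbatim, so resume-time overrides are not applied:

```python
def resolve_resume_parameters(origin_params, attempted_overrides):
    """Sketch: resume clones the origin run's _parameters task; overrides are ignored."""
    return dict(origin_params)

origin_params = {"learning_rate": 0.01, "input_path": "s3://bucket/raw"}
resumed = resolve_resume_parameters(origin_params, {"learning_rate": 0.5})
print(resumed["learning_rate"])  # inherited from the origin run, not the override
```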

Checking run status

To verify which steps completed in the origin run:
from metaflow import Flow

run = Flow('MyFlow')['dagster-abc123']
for step in run:
    for task in step:
        print(f"{step.id}/{task.id}: {task.finished}")
Completed tasks show finished=True and have retrievable artifacts.
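Because resume restarts at the first failed step in topological order (as noted under Limitations), a small helper over these finished flags tells you where execution will pick up. This is a sketch that assumes you have already collected per-step status as shown above:

```python
def first_incomplete_step(step_status):
    """Return the first step (in topological order) whose task did not finish."""
    for step_name, finished in step_status:
        if not finished:
            return step_name
    return None  # every step completed; there is nothing left to resume

# Per-step status gathered from the origin run, e.g. with the client loop above.
status = [("start", True), ("process", False), ("end", False)]
print(first_incomplete_step(status))  # → process
```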

Next steps

Configuration

Learn how to configure the metadata service, datastore, and runtime settings

Retries and Timeouts

Configure automatic retries and timeouts for individual steps