Overview
metaflow-dagster compiles your Metaflow flow’s DAG into a self-contained Dagster definitions file. The compilation process transforms each Metaflow step into a Dagster @op, preserving the execution semantics while enabling Dagster’s orchestration features.
Compilation Process
When you run `python my_flow.py dagster create dagster_defs.py`, the compiler:
- Analyzes the flow graph — Reads the Metaflow FlowSpec and traverses the DAG structure
- Generates Dagster code — Creates a complete Python file with ops, jobs, and helpers
- Embeds metadata — Bakes in datastore settings, environment configuration, and decorators
- Creates definitions — Produces schedules and sensors if decorators like `@schedule` or `@trigger` are present
The generated file is self-contained — it includes all the plumbing needed to run your flow through Dagster without additional configuration.
Step to Op Translation
Each Metaflow step becomes a Dagster `@op` that:
Runs as a subprocess
Every op executes the step via the standard Metaflow CLI, through a `_run_step()` helper function in the generated file (see metaflow_extensions/dagster/plugins/dagster/dagster_compiler.py:356) that:
- Constructs the CLI command with all necessary flags
- Forwards retry counts, tags, and environment variables
- Handles conda environments by swapping the Python interpreter
- Streams logs through Dagster’s log panel
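The command construction described above can be sketched roughly as follows. This is a hedged illustration, not the generated code: the helper name `build_step_command` and the exact flag set are assumptions inferred from this page and Metaflow's standard `step` CLI.

```python
import sys

def build_step_command(flow_file, step_name, run_id, task_id,
                       input_paths, retry_count=0):
    """Sketch of how a generated helper might assemble the Metaflow
    CLI invocation for a single step (flag set inferred, not verbatim)."""
    return [
        sys.executable, flow_file, "step", step_name,
        "--run-id", run_id,
        "--task-id", task_id,
        "--input-paths", input_paths,
        "--retry-count", str(retry_count),
    ]

cmd = build_step_command("my_flow.py", "start", "1700000000", "1",
                         "1700000000/_parameters/0")
# The generated op would then invoke subprocess.run(cmd, env=...),
# streaming stdout/stderr so Dagster's log panel captures them.
```

Building the command as a list (rather than a shell string) avoids quoting issues when pathspecs or tags contain special characters.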
Passes artifacts via --input-paths
Metaflow’s artifact passing mechanism remains intact. The generated op code:
- Receives the parent task’s pathspec (e.g., `run_id/start/1`)
- Passes it to the `metaflow step` command via `--input-paths`
- For join steps, compresses multiple paths using `compress_list()`
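A simplified sketch of that pathspec forwarding: the generated code uses Metaflow's real `compress_list()` for joins, so the plain comma join below is only a stand-in, and the `input_paths_for` helper is hypothetical.

```python
def input_paths_for(step_kind, parent_pathspecs):
    """Return the --input-paths value an op would forward downstream.

    Linear steps forward their single parent pathspec unchanged; join
    steps combine all parents. The real generated code compacts joins
    with metaflow.util.compress_list; a comma join stands in here.
    """
    if step_kind == "join":
        return ",".join(parent_pathspecs)
    return parent_pathspecs[0]

linear = input_paths_for("linear", ["run_id/start/1"])
join = input_paths_for("join", ["run_id/a/2", "run_id/b/3"])
```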
Artifacts are stored in the Metaflow datastore (local filesystem or S3/Azure). Dagster never loads them — it only passes pathspecs between ops.
Preserves decorators
Metaflow decorators like `@retry`, `@timeout`, `@resources`, and `@environment` are translated to Dagster equivalents:
@retry → RetryPolicy
See dagster_compiler.py:866.
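The translation presumably maps `@retry`'s attributes onto Dagster's `RetryPolicy`. A minimal sketch, assuming Metaflow's `@retry(times=..., minutes_between_retries=...)` attributes and Dagster's `RetryPolicy(max_retries=..., delay=...)` constructor; the mapping itself is an assumption, not the compiler's code:

```python
def retry_policy_kwargs(attrs):
    """Map Metaflow @retry attributes to Dagster RetryPolicy kwargs.

    Metaflow's defaults are times=3 and minutes_between_retries=2;
    Dagster's RetryPolicy expects max_retries and a delay in seconds.
    """
    times = int(attrs.get("times", 3))
    minutes = float(attrs.get("minutes_between_retries", 2))
    return {"max_retries": times, "delay": minutes * 60}

kwargs = retry_policy_kwargs({"times": 4, "minutes_between_retries": 1})
# The generated op would then apply dagster.RetryPolicy(**kwargs).
```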
@timeout → op tags
See dagster_compiler.py:882.
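Since Dagster op tags are plain string key/value pairs, the timeout likely gets flattened into seconds. A sketch under that assumption; the tag key `metaflow/timeout_seconds` is hypothetical:

```python
def timeout_tags(attrs):
    """Convert Metaflow @timeout attributes (seconds/minutes/hours)
    into a Dagster op tags dict. Tag key here is hypothetical."""
    total = (int(attrs.get("seconds", 0))
             + 60 * int(attrs.get("minutes", 0))
             + 3600 * int(attrs.get("hours", 0)))
    return {"metaflow/timeout_seconds": str(total)}

tags = timeout_tags({"minutes": 5, "seconds": 30})
```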
@environment → subprocess env vars
See dagster_compiler.py:894. The declared variables are merged into the environment of the `metaflow step` subprocess.
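The merge itself is simple to sketch. A hedged illustration; the real helper may also filter, namespace, or override keys:

```python
import os

def subprocess_env(environment_vars):
    """Overlay @environment's variables on the current process
    environment for the `metaflow step` subprocess (later keys win)."""
    return {**os.environ, **environment_vars}

env = subprocess_env({"MY_SETTING": "1"})
```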
Execution Flow
Here’s what happens when you run a compiled flow:
1. Start op initializes the run
The `op_start` function calls `_run_init()` (dagster_compiler.py:235), which creates the `_parameters` task and passes flow parameters to it through `METAFLOW_INIT_<NAME>` environment variables.
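Parameter seeding can be sketched as follows. The exact encoding is an assumption based only on the `METAFLOW_INIT_<NAME>` naming convention mentioned above:

```python
def init_parameter_env(parameters):
    """Encode flow parameters as METAFLOW_INIT_<NAME> environment
    variables for the _parameters init task (encoding is a sketch)."""
    return {f"METAFLOW_INIT_{name.upper()}": str(value)
            for name, value in parameters.items()}

env = init_parameter_env({"alpha": 0.5, "mode": "fast"})
```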
2. Downstream ops execute sequentially or in parallel
Each op:
- Receives the upstream task pathspec
- Runs `metaflow step <name> --input-paths <upstream_pathspec>`
- Emits the new task pathspec for the next op
3. Artifact metadata is attached to Dagster outputs
After each step completes, the op calls `_add_step_metadata()` (from dagster_compiler.py:518), which:
- Reads `0.data.json` from the local datastore
- Extracts artifact keys (excluding internal ones like `_foreach_num_splits`)
- Emits a retrieval snippet to the Dagster UI
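The retrieval snippet presumably resembles standard Metaflow Client API usage (`Task(pathspec).data.<artifact>`). A sketch of how an op could build it; the helper name and exact snippet format are assumptions:

```python
def retrieval_snippet(flow_name, pathspec, artifact_keys):
    """Build a copy-paste snippet pointing users at their artifacts
    via Metaflow's Client API (exact format is an assumption)."""
    lines = [
        "from metaflow import Task",
        f"task = Task('{flow_name}/{pathspec}')",
    ]
    lines += [f"print(task.data.{key})" for key in artifact_keys]
    return "\n".join(lines)

snippet = retrieval_snippet("MyFlow", "1700000000/start/1", ["alpha"])
```

Surfacing this as Dagster output metadata gives users a one-click path from the Dagster UI back to their data without Dagster ever loading the artifacts itself.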
Generated File Structure
From the compiler header comments (dagster_compiler.py:10-19), the generated file contains the helper functions, one op per step, the job definition, and any schedules or sensors.
Why Subprocess Execution?
Running each step as a subprocess via `metaflow step` ensures:
- Full compatibility with Metaflow’s runtime hooks (decorators like `@conda`, `@sandbox`, `@batch`)
- Artifact isolation — each step writes to its own task directory in the datastore
- Retry semantics — Metaflow’s retry counter is passed correctly via `--retry-count`
- Environment consistency — the same Python interpreter and packages are used as in native Metaflow runs
The generated file never imports your flow class. It only invokes the CLI, so changes to your flow code (like adding prints or assertions) are picked up immediately without regenerating the Dagster definitions.
Next Steps
Graph Shapes
Learn how different Metaflow patterns translate to Dagster ops
Compilation Details
Explore what happens during `dagster create`