# Your First Flow with Conda

This guide will help you create your first Metaflow flow using the enhanced Conda decorator from the Netflix Extensions.

## Create a simple flow
Create a file called `hello_conda.py` with a basic flow that uses different pandas versions:

```python
from metaflow import FlowSpec, step, conda


class HelloCondaFlow(FlowSpec):

    @conda(libraries={"pandas": "1.4.0"}, python=">=3.8,<3.9")
    @step
    def start(self):
        import pandas as pd

        assert pd.__version__ == "1.4.0"
        print("Step 'start': Pandas version is %s" % pd.__version__)
        self.next(self.end)

    @conda(libraries={"pandas": "1.5.0"}, python=">=3.8,<3.9")
    @step
    def end(self):
        import pandas as pd

        assert pd.__version__ == "1.5.0"
        print("Step 'end': Pandas version is %s" % pd.__version__)


if __name__ == "__main__":
    HelloCondaFlow()
```
Each step can have its own isolated environment with different package versions!
## Run the flow

Execute the flow with the `--environment=conda` flag:

```bash
python hello_conda.py --environment=conda run
```
Expected output:

```
Metaflow executing HelloCondaFlow
Resolving 2 environments ... done in 27 seconds.
Workflow starting (run-id 1)
Using existing Conda environment (42a4ed94b63f)
[1/start/12345 (pid 1234)] Task is starting.
[1/start/12345 (pid 1234)] Step 'start': Pandas version is 1.4.0
[1/start/12345 (pid 1234)] Task finished successfully.
Using existing Conda environment (3e07a415e776)
[1/end/12346 (pid 1235)] Task is starting.
[1/end/12346 (pid 1235)] Step 'end': Pandas version is 1.5.0
[1/end/12346 (pid 1235)] Task finished successfully.
Done!
```
The first run will resolve and cache environments. Subsequent runs will reuse cached environments and be much faster!
## Using PyPI Packages

The Netflix Extensions provide a dedicated `@pypi` decorator for pure Python package environments:

```python
from metaflow import FlowSpec, step, pypi


class PyPiFlow(FlowSpec):

    @pypi(packages={"pandas": "1.4.0"}, python=">=3.8,<3.9")
    @step
    def start(self):
        import pandas as pd

        print(f"Using pandas {pd.__version__} from PyPI")
        self.next(self.end)

    @pypi(packages={"pandas": "1.5.0"}, python=">=3.8,<3.9")
    @step
    def end(self):
        import pandas as pd

        print(f"Using pandas {pd.__version__} from PyPI")


if __name__ == "__main__":
    PyPiFlow()
```

```bash
python pypi_example.py --environment=conda run
```
## Mixing Conda and PyPI Packages

You can combine Conda packages (for system libraries) with PyPI packages:

```python
from metaflow import FlowSpec, step, conda, pypi


class MixedPackagesFlow(FlowSpec):

    @conda(libraries={"ffmpeg": ""})
    @pypi(packages={"ffmpeg-python": "0.2.0"})
    @step
    def start(self):
        import ffmpeg
        import subprocess

        # Use the ffmpeg executable from Conda
        result = subprocess.run(["ffmpeg", "-version"], capture_output=True)
        print("FFmpeg is available!")
        # Use the ffmpeg-python library from PyPI
        print(f"ffmpeg-python version: {ffmpeg.__version__}")
        self.next(self.end)

    @step
    def end(self):
        print("Video processing complete!")


if __name__ == "__main__":
    MixedPackagesFlow()
```

Mixing Conda and PyPI packages uses conda-lock for resolution, which may be slower but provides maximum flexibility.
## Flow-Level Decorators

Use `@conda_base` or `@pypi_base` to set default dependencies for all steps:

```python
from metaflow import FlowSpec, step, conda_base, conda


@conda_base(libraries={"numpy": "1.21.5"}, python=">=3.8,<3.9")
class BaseDecoratorFlow(FlowSpec):

    @step
    def start(self):
        import numpy as np

        # This step inherits numpy 1.21.5 from @conda_base
        print(f"NumPy version: {np.__version__}")
        assert np.__version__ == "1.21.5"
        self.next(self.process)

    @conda(libraries={"numpy": "1.21.6"})
    @step
    def process(self):
        import numpy as np

        # This step overrides the base with numpy 1.21.6
        print(f"NumPy version: {np.__version__}")
        assert np.__version__ == "1.21.6"
        self.next(self.end)

    @conda(disabled=True)
    @step
    def end(self):
        # This step runs in your local environment (no Conda)
        print("Running in local environment")


if __name__ == "__main__":
    BaseDecoratorFlow()
```
- `@conda_base` applies default packages to all steps
- Step-level decorators override flow-level settings
- Use `disabled=True` to opt specific steps out of Conda
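`@pypi_base` works the same way for PyPI-only flows. Here is a minimal sketch; the `requests` package and version pin are illustrative, not taken from this guide, and the flow is only runnable under the Metaflow runtime:

```python
from metaflow import FlowSpec, step, pypi_base


# Illustrative: every step inherits this PyPI package set
@pypi_base(packages={"requests": "2.28.0"}, python=">=3.8,<3.9")
class PyPiBaseFlow(FlowSpec):

    @step
    def start(self):
        import requests

        # Inherited from @pypi_base, no step-level decorator needed
        print(f"requests version: {requests.__version__}")
        self.next(self.end)

    @step
    def end(self):
        print("Done!")


if __name__ == "__main__":
    PyPiBaseFlow()
```

As with `@conda_base`, a step-level `@pypi` decorator overrides the flow-level defaults.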
## Using Requirements Files

You can also define environments using traditional `requirements.txt` or `environment.yml` files:

```text
# requirements.txt
numpy==1.21.5
pandas>=1.4.0,<2.0.0
scikit-learn==1.0.2
```
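For comparison, an `environment.yml` covering roughly the same dependencies might look like this (illustrative sketch, not taken from this guide; the channel choice is an assumption):

```yaml
# environment.yml
channels:
  - conda-forge
dependencies:
  - numpy==1.21.5
  - pandas>=1.4.0,<2.0.0
  - scikit-learn==1.0.2
```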
## Resolve and Cache Environments

```bash
# Using requirements.txt
metaflow environment resolve --python ">=3.8,<3.9" -r requirements.txt

# Using environment.yml
metaflow environment resolve --python ">=3.8,<3.9" -f environment.yml
```
Once resolved, these environments are cached and can be reused across flows using named environments.
## Named Environments

Create reusable environments with aliases:

```bash
# Resolve and name an environment
metaflow environment resolve \
    --python ">=3.8,<3.9" \
    --alias my-org/my-team/ml-env:v1 \
    -f environment.yml
```
Use the named environment in your flow:

```python
from metaflow import FlowSpec, step, named_env


class NamedEnvFlow(FlowSpec):

    @named_env(name="my-org/my-team/ml-env:v1")
    @step
    def start(self):
        import numpy as np

        print(f"Using pre-resolved environment with NumPy {np.__version__}")
        self.next(self.end)

    @step
    def end(self):
        print("Done!")


if __name__ == "__main__":
    NamedEnvFlow()
```
Named environments are perfect for:

- Sharing environments across teams
- Ensuring consistent environments across flows
- Quick environment reuse without re-resolution
## Running on Remote Compute

The extension works seamlessly with Metaflow's remote execution:

```bash
# Run on AWS Batch
python hello_conda.py --environment=conda run --with batch

# Run on Kubernetes
python hello_conda.py --environment=conda run --with kubernetes
```
Environments are resolved locally and automatically hydrated on remote nodes. Packages are downloaded from your configured cloud storage (S3/Azure/GCS).
## Inspecting Environments

View detailed information about resolved environments:

```bash
# Show environment for a specific step
metaflow environment show --pathspec HelloCondaFlow/1/start

# Show all environments in a flow
python hello_conda.py --environment=conda environment resolve
```
Example output:

```
### Environment for step start ###
Environment full hash: 42a4ed94b63f12e1:a3b104c4ce221535
Arch: linux-64
Resolved on: 2024-03-09 10:30:15
Resolved by: alice

User-requested packages:
  conda::pandas==1.4.0
  conda::boto3>=1.14.0
  conda::python>=3.8,<3.9

Conda packages installed:
  pandas==1.4.0
  numpy==1.22.3
  python==3.8.17
  ...
```
## Creating Development Environments

Create local Conda environments for debugging:

```bash
metaflow environment create \
    --name my-debug-env \
    --install-notebook \
    --pathspec HelloCondaFlow/1/start
```
This creates:

- A local Conda environment named `my-debug-env`
- A Jupyter kernel with the same name
- Access to all step artifacts
Use this for debugging failed runs or exploring artifacts in the exact environment they were created in!
## Advanced: Pure PyPI with Conda System Packages

Install system tools via Conda while keeping Python packages pure PyPI:

```text
# requirements.txt
--conda-pkg ffmpeg
--conda-pkg git-lfs
ffmpeg-python==0.2.0
transnetv2 @ git+https://github.com/soCzech/TransNetV2.git#main
```
Git repositories and local packages only work when resolving for the same architecture you’re running on (no cross-platform resolution).
## Next Steps

- **Full Documentation**: Explore advanced features and detailed documentation
- **Debug Extension**: Learn about the Jupyter debugging integration
- **Configuration**: Fine-tune performance and behavior
- **Join Slack**: Get help from the Metaflow community
## Common Patterns

### Data Science Stack

```python
from metaflow import FlowSpec, step, conda_base


@conda_base(
    libraries={
        "numpy": "1.21.5",
        "pandas": "1.4.0",
        "scikit-learn": "1.0.2",
        "matplotlib": "3.5.0",
    },
    python=">=3.8,<3.9",
)
class DataScienceFlow(FlowSpec):
    # All steps inherit the data science stack
    pass
```
### Machine Learning with GPU

```python
@pypi(packages={"torch": "1.12.0"})
@step
def train(self):
    import torch

    print(f"CUDA available: {torch.cuda.is_available()}")
    # Training code here
```
### Different Requirements per Step

```python
class MultiStepFlow(FlowSpec):

    @conda(libraries={"pandas": "1.4.0"})
    @step
    def load_data(self):
        # Light environment for data loading
        pass

    @conda(libraries={
        "pandas": "1.4.0",
        "scikit-learn": "1.0.2",
        "xgboost": "1.6.0",
    })
    @step
    def train_model(self):
        # Heavy environment for training
        pass
```

Each step gets exactly what it needs - no more, no less!