
Overview

Metaflow provides several ways to manage Python dependencies:
  • Conda environments for reproducible package sets
  • PyPI packages for individual package installation
  • Docker images (including custom-built ones) for complete environment control
  • The uv package manager for fast dependency resolution

Conda Environments

Using @conda

The @conda decorator creates isolated Conda environments for steps:
from metaflow import FlowSpec, step, conda

class CondaFlow(FlowSpec):
    
    @conda(libraries={'pandas': '2.0.0', 'scikit-learn': '1.3.0'})
    @step
    def start(self):
        import pandas as pd
        import sklearn
        
        print(f"pandas version: {pd.__version__}")
        print(f"sklearn version: {sklearn.__version__}")
        
        self.next(self.end)
    
    @step
    def end(self):
        pass
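Conda-decorated steps take effect when the flow is run with Conda environments enabled on the command line (the filename conda_flow.py is an assumption here):

```shell
# Resolve and cache the per-step Conda environments, then run the flow
python conda_flow.py --environment=conda run
```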

Pinning the Python Version

Pin the Python interpreter version alongside the libraries:
@conda(libraries={'tensorflow': '2.13.0'}, python='3.10')
@step
def train(self):
    import tensorflow as tf
    # Use TensorFlow

Conda Channels

@conda(
    libraries={'pytorch': '2.0.0'},
    channels=['pytorch', 'conda-forge']
)
@step
def pytorch_step(self):
    import torch

PyPI Packages

Using @pypi

Install packages from PyPI:
from metaflow import FlowSpec, step, pypi

class PyPIFlow(FlowSpec):
    
    @pypi(packages={'requests': '2.31.0', 'beautifulsoup4': '4.12.0'})
    @step
    def scrape(self):
        import requests
        from bs4 import BeautifulSoup
        
        response = requests.get('https://example.com')
        soup = BeautifulSoup(response.content, 'html.parser')
        
        self.next(self.end)
    
    @step
    def end(self):
        pass
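As with @conda, the environment mode is enabled at run time (pypi_flow.py is an assumed filename):

```shell
# Build the per-step PyPI environments, then run the flow
python pypi_flow.py --environment=pypi run
```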

PyPI with Index URLs

@pypi(
    packages={'my-private-package': '1.0.0'},
    index_url='https://pypi.mycompany.com/simple'
)
@step
def use_private_package(self):
    import my_private_package

Installing from Git

@pypi(packages={
    'mylib': 'git+https://github.com/user/[email protected]'
})
@step
def git_package(self):
    import mylib

Docker Images

Specifying the Image on a Compute Decorator

A Docker image is selected on the compute decorator, such as @batch or @kubernetes. @conda_base is a flow-level decorator that pins Conda packages and the Python version; it does not take an image:
from metaflow import FlowSpec, step, batch

class DockerFlow(FlowSpec):
    
    @batch(image='continuumio/miniconda3:latest')
    @step
    def start(self):
        # Runs in the specified Docker image on AWS Batch
        self.next(self.end)
    
    @step
    def end(self):
        pass

Custom Docker Images

Build a custom Docker image:
# Dockerfile
FROM python:3.10

# Install system dependencies
RUN apt-get update && apt-get install -y \
    libpq-dev \
    build-essential

# Install Python packages
RUN pip install pandas==2.0.0 \
                numpy==1.24.0 \
                scikit-learn==1.3.0

# Install Metaflow
RUN pip install metaflow
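The image is then built and pushed so that remote tasks can pull it (the registry name and tag are illustrative):

```shell
# Build the image from the Dockerfile above and publish it
docker build -t myregistry/myimage:v1.0 .
docker push myregistry/myimage:v1.0
```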
Use the custom image:
@batch(image='myregistry/myimage:v1.0')
@step
def custom_env(self):
    # Runs in the custom image on AWS Batch
    pass

UV Package Manager

Metaflow supports the modern uv package manager for faster dependency resolution:

Using the uv Environment

In recent Metaflow releases, uv is selected with the --environment=uv command-line option rather than a decorator argument; dependencies are read from the project's pyproject.toml (and uv.lock, if present):
from metaflow import FlowSpec, step

class UVFlow(FlowSpec):
    
    @step
    def start(self):
        # numpy and pandas are declared in pyproject.toml
        import numpy as np
        import pandas as pd
        self.next(self.end)
    
    @step
    def end(self):
        pass
Run the flow with:
python uv_flow.py --environment=uv run
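When running with uv, dependencies come from the project file; a pyproject.toml declaring pinned numpy and pandas might look like this (a sketch with an illustrative project name):

```toml
[project]
name = "uv-flow"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "numpy==1.24.0",
    "pandas==2.0.0",
]
```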

Benefits of UV

  • 10-100x faster than pip for dependency resolution
  • Better conflict resolution
  • Reproducible installs with lock files
  • Compatible with pip workflows

Environment Configuration

requirements.txt

Use a requirements file. @pypi takes a packages dict, so one approach is to parse the file when the flow module is imported (a minimal sketch, assuming pinned name==version lines):
def parse_requirements(path):
    # Build the packages dict @pypi expects from pinned 'name==version' lines
    pins = [l.strip().partition('==') for l in open(path)
            if l.strip() and not l.startswith('#')]
    return {name: version for name, _, version in pins}

@pypi(packages=parse_requirements('requirements.txt'))
@step
def from_requirements(self):
    # Installs all pinned packages from requirements.txt
    pass
requirements.txt
# requirements.txt
pandas==2.0.0
numpy==1.24.0
scikit-learn==1.3.0
requests==2.31.0

environment.yml

Use a Conda environment file as the reference for the decorator pins. @conda takes libraries and python directly; it does not read environment.yml itself:
@conda(libraries={'pandas': '2.0.0', 'numpy': '1.24.0'}, python='3.10')
@step
def from_conda_env(self):
    # Mirrors the pins in environment.yml below
    pass
environment.yml
name: myenv
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pandas=2.0.0
  - numpy=1.24.0
  - pip:
    - requests==2.31.0

Multi-Step Dependencies

Different Dependencies per Step

class MultiEnvFlow(FlowSpec):
    
    # A flow must begin with a step named 'start'
    @pypi(packages={'pandas': '2.0.0'})
    @step
    def start(self):
        import pandas as pd
        self.df = pd.DataFrame({'a': [1, 2, 3]})
        self.next(self.ml_training)
    
    @conda(libraries={'tensorflow': '2.13.0', 'keras': '2.13.0'})
    @step
    def ml_training(self):
        import tensorflow as tf
        # Train model with TensorFlow
        self.next(self.end)
    
    @pypi(packages={'matplotlib': '3.7.0'})
    @step
    def end(self):
        import matplotlib.pyplot as plt
        # Create visualizations
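When several steps must agree on versions (for example, a step that unpickles a DataFrame produced upstream), one approach is to define the shared pins once as a plain dict and merge per-step extras. This is ordinary Python, not a Metaflow feature:

```python
# Shared version pins reused by every step that exchanges artifacts
COMMON_PINS = {'pandas': '2.0.0', 'numpy': '1.24.0'}

def step_packages(extras=None):
    # Merge step-specific packages on top of the shared pins;
    # later keys win, so a step can also override a shared pin
    return {**COMMON_PINS, **(extras or {})}

# A step would then use:
#   @pypi(packages=step_packages({'matplotlib': '3.7.0'}))
print(step_packages({'matplotlib': '3.7.0'}))
```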

Cloud Execution

AWS Batch with Dependencies

@conda(libraries={'pytorch': '2.0.0'}, channels=['pytorch'])
@batch(cpu=8, memory=32000, gpu=1)
@step
def train_on_gpu(self):
    import torch
    
    # Check GPU availability
    assert torch.cuda.is_available()
    
    # Train model (MyModel is a user-defined torch.nn.Module)
    model = MyModel().cuda()

Kubernetes with Dependencies

@kubernetes(
    image='pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime',
    cpu=16, memory=64000, gpu=4
)
@step
def distributed_training(self):
    import torch
    import torch.distributed as dist
    
    # Multi-GPU training

Best Practices

Always specify exact versions for reproducibility:
# Good
@pypi(packages={'pandas': '2.0.0', 'numpy': '1.24.0'})

# Avoid - unpinned, so the resolved version may change between runs
@pypi(packages={'pandas': '>=2.0'})
Conda handles complex dependencies better for scientific packages:
# Good - Conda handles BLAS/LAPACK dependencies
@conda(libraries={'numpy': '1.24.0', 'scipy': '1.11.0'})

# Avoid - pip may have binary compatibility issues
@pypi(packages={'numpy': '1.24.0', 'scipy': '1.11.0'})
Test dependency installation locally before cloud execution:
# Test locally
python flow.py run

# Then deploy to cloud
python flow.py run --with batch
Enable the uv environment for faster package installation:
# Run with uv resolving dependencies from pyproject.toml
python flow.py --environment=uv run
Use a container registry to cache images:
# Reference an image pushed to ECR
@batch(cpu=4, image='123456.dkr.ecr.us-east-1.amazonaws.com/myimage:v1')
@step
def cached_env(self):
    # Image layers are cached by the registry
    pass

Common Patterns

ML Training Environment

@conda(
    libraries={
        'pytorch': '2.0.0',
        'transformers': '4.30.0',
        'datasets': '2.14.0',
        'wandb': '0.15.0'
    },
    channels=['pytorch', 'conda-forge']
)
@batch(cpu=16, memory=64000, gpu=4)
@step
def train_transformer(self):
    import torch
    from transformers import AutoModel, AutoTokenizer
    import wandb
    
    # Initialize W&B
    wandb.init(project='my-project')
    
    # Load model
    model = AutoModel.from_pretrained('bert-base-uncased')
    
    # Train...

Data Science Notebook

@pypi(
    packages={
        'jupyter': '1.0.0',
        'pandas': '2.0.0',
        'matplotlib': '3.7.0',
        'seaborn': '0.12.0',
        'scikit-learn': '1.3.0'
    }
)
@step
def analyze(self):
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Analysis code

Bioinformatics Pipeline

@conda(
    libraries={
        'biopython': '1.81',
        'pysam': '0.21.0',
        'pandas': '2.0.0'
    },
    channels=['bioconda', 'conda-forge']
)
@step
def process_sequences(self):
    from Bio import SeqIO
    import pysam
    
    # Process genomic data

Troubleshooting

Use Conda instead of pip for conflicting packages:
# If you get conflicts with pip
@conda(libraries={
    'numpy': '1.24.0',
    'scipy': '1.11.0',
    'scikit-learn': '1.3.0'
})
Enable uv or use pre-built Docker images:
# Use the uv environment for faster installs
python flow.py --environment=uv run

# Or use a pre-built image
@batch(image='myregistry/myimage:latest')
Use Conda for packages with complex binary dependencies:
# Good for CUDA, MKL, etc.
@conda(libraries={
    'pytorch': '2.0.0',
    'cudatoolkit': '11.7'
}, channels=['pytorch'])

Related Pages

  • Environment Decorator: setting environment variables
  • AWS Batch: running with custom images on Batch
  • Kubernetes: using custom images on Kubernetes
  • Docker Images: building and using Docker images
