Conda v2 is a completely refactored environment management system for Metaflow that provides robust, reproducible dependency management for data science workflows.

The Problem

When you run Metaflow code locally, every library available to your Python interpreter can be imported and used. However, a core benefit of Metaflow is that the same code can run in different environments without modification. That promise breaks down when step code depends on locally installed libraries that may not be available elsewhere.
Reproducibility is a core value of machine learning infrastructure. Without proper dependency management, it’s difficult to:
  • Collaborate on data science projects effectively
  • Reproduce past results reliably
  • Access artifacts created in specific environments months later
Metaflow Conda v2 addresses three critical questions:
  1. Make dependencies available locally: How to make external dependencies available locally during development?
  2. Execute remotely with dependencies: How to execute code remotely on Batch or with Argo/Airflow/StepFunctions with external dependencies?
  3. Ensure reproducibility: How to ensure that anyone can reproduce past results or access artifacts even months later?

Main Improvements Over Standard Conda

Conda v2 provides significant enhancements over the standard Metaflow Conda decorator:

Mixed Package Support

  • Mix Conda and PyPI packages in the same environment
  • Support for a wider range of PyPI package sources (repositories, source tarballs, etc.)
  • Pure PyPI, pure Conda, or mixed mode environments

Command Line Tools

  • Retrieve and re-hydrate any environment used by any previously executed step
  • Resolve environments using standard requirements.txt or environment.yml files
  • Inspect packages present in any environment previously resolved

Named Environments

  • Create named environments for easy sharing
  • Use aliases like Docker tags to reference environments
  • Share environments between flows and team members
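
The Docker-tag analogy can be sketched with a small lookup table: an alias is a readable name that points at one concrete environment and can later be re-pointed at a newer one. Everything here is a hypothetical illustration, not the Conda v2 API: the `AliasRegistry` class, the alias strings, the environment IDs, and the choice to treat a missing tag as `latest` (borrowed from Docker's convention) are all assumptions made for the example.

```python
# Hypothetical sketch: aliases behave like Docker tags pointing at environments.
# AliasRegistry is an in-memory stand-in; it is not part of Metaflow.


class AliasRegistry:
    """Maps human-readable aliases to concrete environment identifiers."""

    def __init__(self):
        self._aliases = {}

    @staticmethod
    def _normalize(alias):
        # Assumption borrowed from Docker: a missing tag means ':latest'.
        name, sep, tag = alias.partition(":")
        return f"{name}:{tag if sep else 'latest'}"

    def tag(self, alias, env_id):
        """Point an alias at an environment, replacing any earlier target."""
        self._aliases[self._normalize(alias)] = env_id

    def lookup(self, alias):
        """Return the environment identifier an alias currently points at."""
        return self._aliases[self._normalize(alias)]


registry = AliasRegistry()
# Made-up identifiers in the metaflow_<req_id>_<full_id> shape.
registry.tag("team/forecasting:v1", "metaflow_abc123_def456")
registry.tag("team/forecasting", "metaflow_abc123_0a1b2c")

print(registry.lookup("team/forecasting:v1"))      # metaflow_abc123_def456
print(registry.lookup("team/forecasting:latest"))  # metaflow_abc123_0a1b2c
```

Because teammates refer to the alias rather than to the hashes, they can share an environment without exchanging the underlying identifiers.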

Performance

  • More efficient parallel resolution and downloading of packages
  • Support for .conda format packages (faster than .tar.bz2)
  • Intelligent caching to S3/Azure/GCS for faster retrieval

Resolver Support

  • Support for conda, mamba, and micromamba resolvers
  • Pure PyPI resolution with pip or uv
  • Mixed mode resolution with conda-lock

Architecture Overview

Three Distinct Phases

Conda v2 clearly separates environment management into three phases:
Resolving the environment: Convert user requirements (e.g., pandas>=1.0) into a fully resolved environment with pinned versions.
  • Takes place on your local machine
  • Does not require downloading packages
  • Creates a reproducible specification
  • Generates two identifiers: req_id (requirements hash) and full_id (resolved packages hash)
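
As a rough illustration of what resolving means, the sketch below pins a loose requirement such as pandas>=1.0 to one concrete version taken from a small, made-up package index. The index contents and the helpers `parse_version` and `pin` are hypothetical and exist only for this example. Real resolution is delegated to tools such as conda, mamba, micromamba, pip, uv, or conda-lock, which also handle transitive dependencies, something this toy does not attempt.

```python
# Toy illustration of resolving: turn a loose requirement into a pinned one.
# AVAILABLE is an invented index; real resolvers query Conda channels or PyPI.
import operator

AVAILABLE = {
    "pandas": ["0.25.3", "1.0.5", "1.5.3", "2.1.0"],
}

OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Convert '1.5.3' into (1, 5, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def pin(name, constraints):
    """Return 'name==version' for the newest version meeting every constraint.

    constraints is a comma-separated string such as '>=1.0,<2.0'.
    """
    checks = []
    for clause in constraints.split(","):
        clause = clause.strip()
        # Check two-character operators first so '>=' is not read as '>'.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                checks.append((OPS[symbol], parse_version(clause[len(symbol):])))
                break
        else:
            raise ValueError(f"unsupported constraint: {clause!r}")
    candidates = [
        version
        for version in AVAILABLE[name]
        if all(op(parse_version(version), bound) for op, bound in checks)
    ]
    if not candidates:
        raise LookupError(f"no version of {name} satisfies {constraints}")
    return f"{name}=={max(candidates, key=parse_version)}"


print(pin("pandas", ">=1.0"))       # pandas==2.1.0
print(pin("pandas", ">=1.0,<2.0"))  # pandas==1.5.3
```

The same requirement can pin to a different version whenever the index changes, which is why the resolved result is recorded separately from the request.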

Environment Identification

Each environment is identified by two hashes:
  • req_id: Hash of user requirements (packages, versions, channels)
  • full_id: Hash of all resolved packages
Environments are named as metaflow_<req_id>_<full_id>.
Multiple resolved environments can share the same req_id if they were resolved at different times or with different underlying package versions available.
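
The two-hash scheme can be sketched as follows. This is a minimal illustration, assuming SHA-1 over a JSON encoding. The actual hash function, the inputs, and the encoding used by Conda v2 are not specified here, and `short_hash` and `env_name` are hypothetical helpers.

```python
# Hypothetical sketch of the req_id / full_id naming scheme.
# SHA-1 over sorted JSON is an assumption chosen only to make hashes stable.
import hashlib
import json


def short_hash(payload):
    """Return a stable hex digest of a JSON-serializable payload."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


def env_name(requirements, channels, resolved_packages):
    """Build a name of the form metaflow_<req_id>_<full_id>."""
    # req_id depends only on what the user asked for.
    req_id = short_hash(
        {"requirements": sorted(requirements), "channels": sorted(channels)}
    )
    # full_id depends on the exact packages the resolver selected.
    full_id = short_hash(sorted(resolved_packages))
    return f"metaflow_{req_id}_{full_id}"


requirements = ["pandas>=1.0"]
channels = ["conda-forge"]

# The same request resolved at two different times (illustrative versions).
earlier = env_name(requirements, channels, ["numpy==1.24.0", "pandas==1.5.3"])
later = env_name(requirements, channels, ["numpy==1.26.0", "pandas==2.1.0"])

_, earlier_req, earlier_full = earlier.split("_")
_, later_req, later_full = later.split("_")
print(earlier_req == later_req)    # True: same user requirements
print(earlier_full == later_full)  # False: different resolved packages
```

Keeping the two hashes separate lets a tool ask both "which environments satisfy this request?" (match on req_id) and "which exact environment did this step use?" (match on both).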

Use Cases

Development and Debugging

  • Create reproducible local environments matching remote execution
  • Debug issues by recreating the exact environment a step ran in
  • Create Jupyter kernels with step-specific dependencies

Production Workflows

  • Ensure consistent environments across local development, remote execution, and production
  • Share environments across team members
  • Version control your environments using named aliases

Multi-Architecture Support

  • Resolve environments for different architectures (linux-64, osx-64, osx-arm64)
  • Develop on Mac, deploy to Linux
  • Cross-platform reproducibility
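
The architecture names above follow Conda's platform naming. As a small sketch, the mapping below translates the values reported by Python's `platform` module into those names, which is the kind of lookup needed to decide which architectures to resolve for when developing on a Mac and deploying to Linux. The helpers `conda_subdir` and `targets_for` are hypothetical, and the table covers only the three architectures listed in this section.

```python
# Sketch: map (operating system, CPU) pairs to Conda-style architecture names.
# Only the three architectures mentioned in this section are included.
import platform

_SUBDIRS = {
    ("Linux", "x86_64"): "linux-64",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
}


def conda_subdir(system=None, machine=None):
    """Return the architecture name, defaulting to the current machine."""
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        return _SUBDIRS[(system, machine)]
    except KeyError:
        raise ValueError(f"unsupported platform: {system}/{machine}") from None


def targets_for(local_arch, remote_arch="linux-64"):
    """List the architectures to resolve: local first, then remote if different."""
    return [local_arch] if local_arch == remote_arch else [local_arch, remote_arch]


# Developing on an Apple Silicon Mac, deploying to Linux:
print(targets_for(conda_subdir("Darwin", "arm64")))  # ['osx-arm64', 'linux-64']
```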

Terminology

Dependency: A Python package you need to use in your code (e.g., numpy, pandas).
Transitive dependencies: Packages that your dependencies depend on, creating a chain of requirements.
Constraints: Version specifications for dependencies (e.g., >=1.5.0, <2.0).
Resolving: The process of determining the full set of packages needed (including all transitive dependencies) and pinning them to specific versions.
Environment: A set of resolved packages that provides the packages requested and all their transitive dependencies. Can be created using venv (PyPI packages only) or conda (PyPI, Conda packages, and non-Python libraries).
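
The relationship between a package you request and the packages it pulls in can be sketched as a graph walk. The dependency graph below is a simplified, hand-written example rather than real package metadata, and `transitive_dependencies` is a hypothetical helper.

```python
# Sketch: collect everything a requested package pulls in, directly or indirectly.
# DEPENDS_ON is a simplified, hand-written graph used only for illustration.
DEPENDS_ON = {
    "pandas": ["numpy", "python-dateutil", "pytz"],
    "python-dateutil": ["six"],
    "numpy": [],
    "pytz": [],
    "six": [],
}


def transitive_dependencies(package, graph):
    """Return every package reachable from `package`, excluding itself."""
    seen = set()
    stack = list(graph[package])
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Follow the chain: dependencies of dependencies are needed too.
        stack.extend(graph.get(current, []))
    return sorted(seen)


print(transitive_dependencies("pandas", DEPENDS_ON))
# ['numpy', 'python-dateutil', 'pytz', 'six']
```

In this toy graph, six is never requested directly, yet it still has to be present in the environment for pandas to import.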

When to Use Conda v2

Conda v2 is a good fit when your code depends on external libraries that are large or that bring in a large set of transitive dependencies.
If you only need a small set of Python files, consider including them directly with your code by placing them in the same directory as your flow file. Metaflow will automatically package them for remote execution.

Next Steps

  • Getting Started: Start using Conda v2 with basic examples
  • Decorators: Learn about all available decorators
  • Named Environments: Share and reuse environments
  • CLI Reference: Command-line tools for environment management
