The Problem
When running Metaflow code locally, all libraries available to your Python interpreter can be imported and used. However, a core benefit of Metaflow is that the same code can run in different environments without modifications. This promise breaks down if step code depends on locally installed libraries that may not be available elsewhere. Reproducibility is a core value of Machine Learning Infrastructure. Without proper dependency management, it’s difficult to:
- Collaborate on data science projects effectively
- Reproduce past results reliably
- Access artifacts created in specific environments months later
The question, then, is how to execute code with external dependencies remotely, whether on Batch or with Argo, Airflow, or Step Functions.
Main Improvements Over Standard Conda
Conda v2 provides significant enhancements over the standard Metaflow Conda decorator:
Mixed Package Support
- Mix Conda and PyPI packages in the same environment
- Support for a wider range of PyPI package sources (repositories, source tarballs, etc.)
- Pure PyPI, pure Conda, or mixed mode environments
Command Line Tools
- Retrieve and re-hydrate any environment used by any previously executed step
- Resolve environments using standard requirements.txt or environment.yml files
- Inspect packages present in any environment previously resolved
Named Environments
- Create named environments for easy sharing
- Use aliases like Docker tags to reference environments
- Share environments between flows and team members
Performance
- More efficient parallel resolution and downloading of packages
- Support for .conda format packages (faster than .tar.bz2)
- Intelligent caching to S3/Azure/GCS for faster retrieval
Resolver Support
- Support for conda, mamba, and micromamba resolvers
- Pure PyPI resolution with pip or uv
- Mixed mode resolution with conda-lock
Architecture Overview
Three Distinct Phases
Conda v2 clearly separates environment management into three phases:
- Resolving
- Caching
- Hydrating
Resolving the environment: Convert user requirements (e.g., pandas>=1.0) into a fully resolved environment with pinned versions.
- Takes place on your local machine
- Does not require downloading packages
- Creates a reproducible specification
- Generates two identifiers: req_id (requirements hash) and full_id (resolved packages hash)
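The two identifiers can be pictured with a small sketch. This is only an illustration under stated assumptions: the helper names (short_hash, req_id, full_id), the use of SHA-256 over canonical JSON, the 12-character truncation, and the example package versions are all hypothetical and are not Conda v2's actual hashing scheme. It shows how the same requirements, resolved at two different times, could keep one req_id while producing two different full_id values.

```python
import hashlib
import json


def short_hash(payload) -> str:
    """Return a stable, truncated hex digest of a JSON-serializable payload."""
    # Sorting keys makes the digest independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def req_id(requirements: dict, channels: list) -> str:
    # Hash of what the user asked for: package constraints and channels.
    return short_hash({"requirements": requirements, "channels": channels})


def full_id(resolved: dict) -> str:
    # Hash of what resolution produced: every package pinned to one version.
    return short_hash(resolved)


requirements = {"pandas": ">=1.0"}
channels = ["conda-forge"]

# Two hypothetical resolutions of the same requirements at different times.
resolved_earlier = {"pandas": "1.5.3", "numpy": "1.24.1"}
resolved_later = {"pandas": "2.0.2", "numpy": "1.24.3"}

rid = req_id(requirements, channels)
env_a = f"metaflow_{rid}_{full_id(resolved_earlier)}"
env_b = f"metaflow_{rid}_{full_id(resolved_later)}"

print(env_a)
print(env_b)
```

Both printed names share the same middle component because the requirements did not change, while the last component differs because the pinned package sets differ.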
Environment Identification
Each environment is identified by two hashes:
- req_id: Hash of user requirements (packages, versions, channels)
- full_id: Hash of all resolved packages
The two hashes are combined in the form metaflow_<req_id>_<full_id>.
Multiple resolved environments can share the same req_id if they were resolved at different times or with different underlying package versions available.
Use Cases
Development and Debugging
- Create reproducible local environments matching remote execution
- Debug issues by recreating the exact environment a step ran in
- Create Jupyter kernels with step-specific dependencies
Production Workflows
- Ensure consistent environments across local development, remote execution, and production
- Share environments across team members
- Version control your environments using named aliases
Multi-Architecture Support
- Resolve environments for different architectures (linux-64, osx-64, osx-arm64)
- Develop on Mac, deploy to Linux
- Cross-platform reproducibility
Terminology
Dependency
A Python package you need to use in your code (e.g., numpy, pandas).
Transitive Dependencies
Packages that your dependencies depend on, creating a chain of requirements.
Constraints
Version specifications for dependencies (e.g., >=1.5.0, <2.0).
Resolving and Locking
The process of determining the full set of packages needed (including all transitive dependencies) and pinning them to specific versions.
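To make the idea concrete, here is a toy resolver. It is a minimal sketch, not how conda, mamba, pip, or conda-lock work internally: the INDEX catalog and its dependency data are made up for illustration, the function names are hypothetical, versions are assumed to be plain dotted integers, and the algorithm is greedy with no backtracking. It takes one user requirement with a constraint, follows the transitive dependency, and returns every package pinned to a single version.

```python
import operator

# Comparison operators supported by this toy constraint syntax.
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       ">": operator.gt, "<": operator.lt}


def parse_version(text: str) -> tuple:
    """Turn '1.5' into (1, 5, 0) so versions compare numerically."""
    parts = [int(part) for part in text.strip().split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def satisfies(version: str, constraint: str) -> bool:
    """Check a version against a spec such as '>=1.5.0,<2.0'."""
    for clause in filter(None, (c.strip() for c in constraint.split(","))):
        # Try two-character operators before one-character ones.
        for symbol in (">=", "<=", "==", ">", "<"):
            if clause.startswith(symbol):
                bound = clause[len(symbol):]
                if not OPS[symbol](parse_version(version), parse_version(bound)):
                    return False
                break
        else:
            raise ValueError(f"unsupported constraint: {clause!r}")
    return True


def resolve(requirements: dict, index: dict) -> dict:
    """Pin each requested package and its transitive dependencies.

    Greedy: always picks the newest version that satisfies every
    constraint collected so far. Real resolvers also backtrack.
    """
    constraints = {name: [spec] for name, spec in requirements.items()}
    pinned = {}
    pending = list(requirements)
    while pending:
        name = pending.pop(0)
        candidates = [v for v in index[name]
                      if all(satisfies(v, s) for s in constraints[name])]
        if not candidates:
            raise RuntimeError(f"no version of {name} satisfies {constraints[name]}")
        best = max(candidates, key=parse_version)
        if pinned.get(name) == best:
            continue  # Already pinned to this version; nothing new to follow.
        pinned[name] = best
        # Queue the chosen version's own dependencies (the transitive ones).
        for dep, spec in index[name][best].items():
            constraints.setdefault(dep, []).append(spec)
            pending.append(dep)
    return pinned


# Made-up catalog: package -> version -> {dependency: constraint}.
INDEX = {
    "pandas": {
        "1.5.3": {"numpy": ">=1.20"},
        "2.0.2": {"numpy": ">=1.21"},
    },
    "numpy": {"1.21.0": {}, "1.24.3": {}},
}

lock = resolve({"pandas": ">=1.5.0,<2.0"}, INDEX)
print(lock)  # {'pandas': '1.5.3', 'numpy': '1.24.3'}
```

The user asked only for pandas, yet the result also pins numpy: that is the transitive dependency being resolved and locked alongside the direct one.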
Environment
A set of resolved packages that provides the requested packages and all their transitive dependencies. An environment can be created using venv (PyPI packages only) or conda (PyPI packages, Conda packages, and non-Python libraries).
When to Use Conda v2
If you need to depend on external libraries that are fairly large and/or have a large set of transitive dependencies, Conda v2 is the solution for you.
Next Steps
- Getting Started: Start using Conda v2 with basic examples
- Decorators: Learn about all available decorators
- Named Environments: Share and reuse environments
- CLI Reference: Command-line tools for environment management
