Create Azure Machine Learning Datasets from Azure Open Datasets
Learn how to bring curated enrichment data into your local or remote machine learning experiments with Azure Machine Learning datasets and Azure Open Datasets.
Overview
With an Azure Machine Learning dataset, you create a reference to the data source location, along with a copy of its metadata. Because datasets are lazily evaluated, and because the data remains in its existing location, you:
- Don’t risk unintentionally changing your original data sources
- Incur no extra storage cost
- Improve ML workflow performance
Prerequisites
You need:
- An Azure subscription. If you don’t have one, create a free account
- An Azure Machine Learning workspace
- The Azure Machine Learning SDK for Python installed:
Some dataset classes have dependencies on the azureml-dataprep package, which is only compatible with 64-bit Python. For Linux users, these classes are supported only on: Debian (8, 9), Fedora (27, 28), Red Hat Enterprise Linux (7, 8), and Ubuntu (14.04, 16.04, 18.04).
Create Datasets with the Python SDK
To create Azure Machine Learning datasets via the Azure Open Datasets classes in the Python SDK, make sure you have installed the azureml-opendatasets package.
Return TabularDataset or FileDataset
Each discrete dataset is represented by a class, and certain classes are available as either:
- FileDataset: references single or multiple files in datastores or public URLs
- TabularDataset: represents data in a tabular format
Example: MNIST Dataset
The MNIST class can return either a TabularDataset or a FileDataset.
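As a minimal sketch, assuming the azureml-opendatasets package is installed, the two return types could be requested as follows. The wrapper `get_mnist` and its `kind` parameter are hypothetical names used only for illustration; `MNIST.get_tabular_dataset()` and `MNIST.get_file_dataset()` are the SDK calls.

```python
def get_mnist(kind: str = "tabular"):
    """Return the MNIST open dataset as a TabularDataset or a FileDataset.

    `get_mnist` and `kind` are illustrative names, not part of the SDK.
    """
    if kind not in ("tabular", "file"):
        raise ValueError(f"kind must be 'tabular' or 'file', got {kind!r}")

    # Imported lazily so this module loads even where the SDK is not installed.
    from azureml.opendatasets import MNIST

    if kind == "tabular":
        return MNIST.get_tabular_dataset()
    return MNIST.get_file_dataset()
```

Calling `get_mnist("file")` should return a FileDataset that references the MNIST files, while `get_mnist()` should return the tabular form.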
Example: Diabetes Dataset
The Diabetes class is only available as a TabularDataset.
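A short sketch of the tabular-only case, again assuming azureml-opendatasets is installed. The function name `load_diabetes` is hypothetical; `Diabetes.get_tabular_dataset()` is the SDK call.

```python
def load_diabetes():
    """Return the Diabetes open dataset as a TabularDataset."""
    # Lazy import: requires the azureml-opendatasets package at call time.
    from azureml.opendatasets import Diabetes

    # Diabetes exposes only the tabular accessor, called on the class itself.
    return Diabetes.get_tabular_dataset()
```

If pandas is available, `load_diabetes().to_pandas_dataframe()` would then materialize the data as a DataFrame.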
Example: Public Holidays Dataset
Access worldwide public holiday data with the PublicHolidays class.
Register Datasets
Register an Azure Machine Learning dataset with your workspace to share it with others and reuse it across experiments. When you register a dataset created from Open Datasets, no data is immediately downloaded, but the data becomes accessible later when requested from a central storage location.
Create Datasets with Azure ML Studio
You can also create Azure Machine Learning datasets from Azure Open Datasets using the Azure Machine Learning studio interface. Datasets created through Azure Machine Learning studio are automatically registered to the workspace.
Step 1: Navigate to Data Assets
In your workspace, select Data in the left navigation. On the Data assets tab, select Create.
Step 2: Configure Dataset Type
- Add a name and optional description for the new data asset
- Select Tabular in the Type dropdown
- Click Next
Step 3: Select Source
- Select From Azure Open Datasets
- Click Next
Step 4: Choose Dataset
Select an available Azure Open Dataset from the list. For example:
- San Francisco Safety Data
- US Labor Force Statistics
- Public Holidays
- NYC Taxi datasets
Step 5: Filter Data (Optional)
Narrow the data with the available filters. For example, for the San Francisco Safety Data dataset, you can filter by date range.
Step 6: Review and Create
- Review the settings for the new data asset
- Make any necessary changes
- Select Create
Direct Access with Azure Storage
You can also access datasets directly from Azure Blob Storage without the Azure Machine Learning SDK.
Using Azure Storage SDK
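The following is a minimal sketch of direct access with the azure-storage-blob package (version 12), using anonymous access to a public container. The helper names are hypothetical, and the account, container, and folder values in `example` are assumptions based on the Public Holidays catalog entry; verify them against the dataset's own page before relying on them.

```python
def account_url(account_name: str) -> str:
    """Build the public blob endpoint URL for a storage account."""
    return f"https://{account_name}.blob.core.windows.net"


def list_parquet_blobs(account_name: str, container_name: str, folder: str):
    """List the names of Parquet blobs under `folder`, using anonymous access."""
    # Lazy import: requires the azure-storage-blob package (v12) at call time.
    from azure.storage.blob import ContainerClient

    container = ContainerClient(
        account_url=account_url(account_name),
        container_name=container_name,
        credential=None,  # public container, so no key or SAS token
    )
    return [
        blob.name
        for blob in container.list_blobs(name_starts_with=folder)
        if blob.name.endswith(".parquet")
    ]


def example():
    # Assumed values for the Public Holidays dataset; check the catalog entry.
    names = list_parquet_blobs(
        "azureopendatastorage", "holidaydatacontainer", "Processed"
    )
    print(names)
```

To read one of the listed files, `ContainerClient.download_blob(name).readall()` returns the raw bytes, which a Parquet reader can then parse.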
Access Datasets in Azure Databricks
Using azureml-opendatasets
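In a Databricks notebook, the Open Datasets classes can return a Spark DataFrame directly. This sketch assumes azureml-opendatasets is installed on the cluster and uses the `NycTlcGreen` class as an example; the wrapper `load_green_taxi_spark` is a hypothetical name.

```python
from datetime import datetime


def load_green_taxi_spark(start_date: datetime, end_date: datetime):
    """Return NYC green taxi trips between the two dates as a Spark DataFrame."""
    if start_date > end_date:
        raise ValueError("start_date must not be later than end_date")

    # Lazy import: requires azureml-opendatasets on the cluster at call time.
    from azureml.opendatasets import NycTlcGreen

    trips = NycTlcGreen(start_date=start_date, end_date=end_date)
    # Needs an active Spark session, which Databricks notebooks provide.
    return trips.to_spark_dataframe()
```

For example, `load_green_taxi_spark(datetime(2018, 5, 1), datetime(2018, 5, 31))` would request one month of trips.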
Using PySpark Direct Access
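Alternatively, Spark can read the Parquet files straight from Blob Storage over the `wasbs://` scheme. The sketch below assumes an existing SparkSession is passed in as `spark` and that the container is public, so an empty SAS token is used; both helper names are hypothetical.

```python
def wasbs_path(container: str, account: str, relative_path: str) -> str:
    """Build a wasbs:// URI for a folder in Azure Blob Storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def read_open_dataset(spark, container: str, account: str,
                      relative_path: str, sas_token: str = ""):
    """Read a Parquet folder from Blob Storage with an existing SparkSession."""
    # Assumption: the container allows public reads, so the token stays empty.
    spark.conf.set(
        f"fs.azure.sas.{container}.{account}.blob.core.windows.net", sas_token
    )
    return spark.read.parquet(wasbs_path(container, account, relative_path))
```

In a Databricks notebook the predefined `spark` object can be passed as the first argument.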
Use Datasets in ML Experiments
Once you’ve created and registered datasets, you can use them in your machine learning experiments.
Best Practices
Use Lazy Evaluation
Datasets use lazy evaluation - data is only loaded when needed. This improves performance and reduces memory usage.
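One way to take advantage of this is to preview a few rows instead of materializing the whole dataset. In this sketch, `preview` is a hypothetical helper; it relies on the `take()` and `to_pandas_dataframe()` methods of TabularDataset.

```python
def preview(dataset, rows: int = 5):
    """Materialize only the first `rows` records of a TabularDataset."""
    # take() returns a new, still-lazy dataset definition;
    # data is read only when to_pandas_dataframe() is called.
    return dataset.take(rows).to_pandas_dataframe()
```

For instance, `preview(Diabetes.get_tabular_dataset(), rows=3)` would return a three-row DataFrame, assuming azureml-opendatasets and pandas are installed.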
Register Datasets
Register datasets in your workspace for versioning, sharing, and reproducibility across experiments.
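A minimal registration sketch follows, assuming the azureml-core and azureml-opendatasets packages and a workspace `config.json` in the working directory. The helper `register_dataset` and the dataset name `diabetes-open-dataset` are hypothetical; `register()` and its `create_new_version` parameter belong to the SDK's dataset classes.

```python
def register_dataset(dataset, workspace, name: str, description: str = ""):
    """Register `dataset` in `workspace`, adding a new version if `name` exists."""
    return dataset.register(
        workspace=workspace,
        name=name,
        description=description,
        create_new_version=True,
    )


def example():
    # Requires azureml-core, azureml-opendatasets, and a workspace config.json.
    from azureml.core import Workspace
    from azureml.opendatasets import Diabetes

    ws = Workspace.from_config()
    registered = register_dataset(
        Diabetes.get_tabular_dataset(), ws, name="diabetes-open-dataset"
    )
    print(registered.name, registered.version)
```

Passing `create_new_version=True` keeps earlier registrations available under the same name, which supports reproducing older experiments.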
Filter Data Early
When using the SDK, filter data at query time rather than loading entire datasets into memory.
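As an illustration, the Open Datasets classes that accept a date range let you restrict the data when you define the dataset. The sketch below assumes azureml-opendatasets is installed and uses the `PublicHolidays` class with its `start_date` and `end_date` arguments; `recent_window` and `load_recent_holidays` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def recent_window(days: int, today: datetime):
    """Return (start, end) datetimes covering the last `days` days up to `today`."""
    if days <= 0:
        raise ValueError("days must be positive")
    return today - timedelta(days=days), today


def load_recent_holidays(days: int = 30):
    """Load public holidays that fall inside the recent window."""
    # Lazy import: requires the azureml-opendatasets package at call time.
    from azureml.opendatasets import PublicHolidays

    start, end = recent_window(days, datetime.today())
    # The date range is passed to the dataset class, so the filter is part of
    # the request instead of a step applied to a fully loaded DataFrame.
    holidays = PublicHolidays(start_date=start, end_date=end)
    return holidays.to_pandas_dataframe()
```

Keeping the window calculation in its own function makes the date logic easy to check without touching the service.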
Choose the Right Format
Use Parquet for large datasets and analytics workloads. Use CSV for compatibility with external tools.
Next Steps
- Train Your First ML Model: Learn how to train models with datasets
- Dataset Catalog: Browse all available datasets
- Public Holidays Dataset: Explore the public holidays dataset
- Python SDK Reference: View full SDK documentation