
What are Azure Open Datasets?

Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions and build more accurate models. Open Datasets are hosted in the cloud on Microsoft Azure. They’re integrated into Azure Machine Learning and readily available to Azure Databricks. You can also access the datasets through APIs and use them in other products, such as Power BI and Azure Data Factory. Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. You can also share your own public datasets through Azure Open Datasets.

Curated and Prepared Datasets

Curated open public datasets in Azure Open Datasets are optimized for consumption in machine learning workflows. Data scientists often spend most of their time cleaning and preparing data for advanced analytics. To save you time, Open Datasets are copied to the Azure cloud, and then preprocessed. At regular intervals, data is pulled from the sources - for example, by an FTP connection to the National Oceanic and Atmospheric Administration (NOAA). Next, the data is parsed into a structured format, and then enriched as needed, with features such as ZIP Code or the locations of the nearest weather stations. Datasets are cohosted with cloud compute in Azure, to make access and manipulation easier.
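The enrichment step described above can be illustrated with a small sketch. This is not the actual Open Datasets pipeline: the station list, coordinates, record fields, and function names are all hypothetical, chosen only to show how a parsed record might be tagged with its nearest weather station by great-circle distance.

```python
import math

# Hypothetical station list (id, latitude, longitude); not real Open Datasets data.
STATIONS = [
    ("station_a", 40.78, -73.97),
    ("station_b", 41.88, -87.63),
    ("station_c", 47.61, -122.33),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Return the great-circle distance in kilometers between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations=STATIONS):
    """Return the id of the station closest to (lat, lon)."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s[1], s[2]))[0]


def enrich(record, stations=STATIONS):
    """Return a copy of a parsed record with a 'nearest_station' field added."""
    enriched = dict(record)
    enriched["nearest_station"] = nearest_station(record["lat"], record["lon"], stations)
    return enriched
```

Calling `enrich({"lat": 40.7, "lon": -74.0})` adds a `nearest_station` field to a copy of the record and leaves the input untouched.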

Available Dataset Categories

Transportation

Taxi trip records from New York City, including yellow and green taxi data with pickup/drop-off times, locations, distances, fares, and passenger counts.

Health and Genomics

COVID-19 datasets tracking cases, deaths, and recoveries worldwide, plus genomics datasets including 1000 Genomes, gnomAD, and ClinVar for genetic variation research.

Labor and Economics

US labor statistics, employment data, consumer price index, and producer price index datasets for economic analysis and forecasting.

Population and Safety

US population data by county and ZIP code, plus 311 safety data from major cities including Boston, Chicago, New York, San Francisco, and Seattle.

Supplemental Datasets

Common datasets for machine learning including MNIST handwritten digits, Diabetes dataset, public holidays data covering 38 countries, and simulated sales data.

Access Methods

With an Azure account, you can access open datasets through code or through the Azure service interface. The data is colocated with Azure cloud compute resources for use in your machine learning solutions.

Python SDK

Access datasets programmatically using the azureml-opendatasets Python package:
```python
from datetime import datetime

from azureml.opendatasets import PublicHolidays
from dateutil.relativedelta import relativedelta

# Select the most recent month of public holiday records.
end_date = datetime.today()
start_date = end_date - relativedelta(months=1)

# Build the dataset for that window and load it into a pandas DataFrame.
hol = PublicHolidays(start_date=start_date, end_date=end_date)
hol_df = hol.to_pandas_dataframe()
```

Azure Machine Learning

Open Datasets are available through the Azure Machine Learning UI and SDK. You can create datasets from Open Datasets and use them in your ML experiments.
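As a minimal sketch of creating a dataset from an Open Dataset in code, the example below loads the Diabetes dataset as an Azure Machine Learning tabular dataset and converts it to a pandas DataFrame. It assumes the `azureml-opendatasets` package is installed and that the `Diabetes` class exposes `get_tabular_dataset()`; check the Python SDK reference for the exact interface in your installed version. The function name `load_diabetes_dataframe` is made up for this illustration.

```python
def load_diabetes_dataframe():
    """Load the Diabetes open dataset as a pandas DataFrame.

    Requires: pip install azureml-opendatasets
    """
    # Imported here so this module can be loaded even when the SDK is absent.
    from azureml.opendatasets import Diabetes

    # Assumption: get_tabular_dataset() returns an Azure ML tabular dataset.
    tabular = Diabetes.get_tabular_dataset()
    return tabular.to_pandas_dataframe()
```

The returned DataFrame can then be passed to an experiment like any other pandas data.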

Azure Databricks

Access datasets in Azure Databricks notebooks using the Python SDK or direct access to Azure Blob Storage.
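Direct Blob Storage access from a notebook might look like the sketch below. The storage account, container, and folder names are assumptions for the Public Holidays dataset and should be confirmed on the dataset's catalog page, as should the Parquet format. `build_wasbs_path` and `read_holidays` are hypothetical helper names, and some clusters may need extra storage configuration before the read succeeds.

```python
def build_wasbs_path(account, container, relative_path):
    """Build a wasbs:// URL for a folder in an Azure Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


# Assumed location of the Public Holidays data; verify against the catalog.
HOLIDAYS_PATH = build_wasbs_path(
    account="azureopendatastorage",
    container="holidaydatacontainer",
    relative_path="Processed",
)


def read_holidays(spark):
    """Read the dataset with an existing SparkSession.

    Databricks notebooks provide a session named `spark`.
    Assumes the data is stored as Parquet.
    """
    return spark.read.parquet(HOLIDAYS_PATH)
```

In a Databricks notebook, `read_holidays(spark)` would return a Spark DataFrame that you can register as a temporary view or convert to pandas.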

Direct Access

You don’t need an Azure account to access Open Datasets - you can access them from any Python environment with or without Spark.

Benefits

No Extra Storage Cost

Datasets are lazily evaluated and data remains in its existing location

Improved Performance

Cohosting data with Azure compute speeds up ML workflows

Data Integrity

No risk of unintentionally changing your original data sources

Regular Updates

Datasets are updated at regular intervals from trusted sources

Request or Contribute Datasets

If you can’t find the data you want, you can request a dataset or contribute one of your own.

Next Steps

Browse Catalog

Explore all available datasets in the catalog

Create Dataset

Learn how to create datasets from Open Datasets

Public Holidays

View the Public Holidays dataset

Python SDK Reference

View the Python SDK documentation
