
Azure Open Datasets Catalog

Improve the accuracy of your machine learning models with publicly available datasets. To save time on data discovery and preparation, use curated datasets that are ready for machine learning projects.

Transportation

| Dataset | Description |
| --- | --- |
| TartanAir: AirSim Simulation Dataset | AirSim autonomous vehicle data generated to solve Simultaneous Localization and Mapping (SLAM) |
| NYC Taxi - Yellow | Yellow taxi trip records with pickup/drop-off dates/times, locations, distances, fares, rate types, payment types, and passenger counts |
| NYC Taxi - Green | Green taxi trip records with pickup/drop-off dates/times, locations, distances, fares, rate types, payment types, and passenger counts |
| NYC Taxi - For-Hire Vehicle | For-Hire Vehicle trip records including dispatching base license number, pickup date, time, and taxi zone location ID |

Health and Genomics

| Dataset | Description |
| --- | --- |
| COVID-19 Data Lake | Collection of COVID-19 related datasets covering testing, patient outcomes, social distancing policy, hospital capacity, and mobility |
| Bing COVID-19 | Daily updates of confirmed, fatal, and recovered COVID-19 cases from all regions worldwide |
| 1000 Genomes | Largest public catalog of human variation and genotype data with 2,504 individuals from 26 populations and 84 million variants |
| gnomAD | Genome Aggregation Database harmonizing exome and genome sequencing data from large-scale projects |
| ClinVar Annotations | Public archive of relationships between human variations and phenotypes |

Labor and Economics

| Dataset | Description |
| --- | --- |
| US Labor Force Statistics | Labor force statistics, participation rates, and civilian noninstitutional population by age, gender, race, and ethnic groups |
| US National Employment Hours and Earnings | Detailed industry estimates of nonfarm employment, hours, and earnings of workers on payrolls |
| US State Employment Hours and Earnings | State-level industry estimates of nonfarm employment, hours, and earnings |
| US Local Area Unemployment Statistics | Monthly and annual employment, unemployment, and labor force data for regions, states, counties, and metropolitan areas |
| US Consumer Price Index | Average change over time in prices paid by urban consumers for consumer goods and services |
| US Producer Price Index - Industry | Average change in selling prices received by domestic producers by industry |
| US Producer Price Index - Commodities | Average change in selling prices received by domestic producers by commodity |

Population and Safety

| Dataset | Description |
| --- | --- |
| US Population by County | US population by gender and race for each county from Census data |
| US Population by ZIP Code | US population by gender and race for each ZIP code from Census data |
| Boston Safety Data | 311 calls reported to the city of Boston (daily updates, Parquet format) |
| Chicago Safety Data | 311 calls reported to the city of Chicago (daily updates, Parquet format) |
| New York City Safety Data | All NYC 311 service requests from 2010 to present (daily updates, Parquet format) |
| San Francisco Safety Data | Fire department calls for service and 311 cases in San Francisco from 2015 to present |
| Seattle Safety Data | Seattle Fire Department 911 dispatches from 2010 to present |

Supplemental and Common Datasets

| Dataset | Description |
| --- | --- |
| Diabetes | 442 samples with 10 features, ideal for getting started with machine learning algorithms |
| OJ Sales Simulated Data | Simulated sales data designed for training thousands of models simultaneously |
| MNIST | Database of handwritten digits with 60,000 training examples and 10,000 test examples |
| Microsoft News Recommendation | Large-scale dataset for news recommendation research and recommender systems |
| Public Holidays | Worldwide public holiday data for 38 countries from 1970 to 2099 |
| Russian Open Speech to Text | Large-scale open speech to text dataset for the Russian language |

Dataset Formats

Datasets are available in multiple formats:
  • Parquet: Optimized columnar format for analytics workloads
  • CSV: Universal compatibility for data analysis tools
  • JSON/JSON-Lines: Flexible format for nested data structures
  • Tabular Dataset: Azure ML dataset type for structured data
  • File Dataset: Azure ML dataset type for file-based workflows
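To make the trade-off between the text formats above concrete, here is a minimal standard-library sketch that round-trips the same records through JSON-Lines and CSV. The records, field names, and helper functions are hypothetical and do not come from any catalog dataset; the point is only that nested values survive JSON-Lines intact, while CSV needs them flattened to strings first.

```python
import csv
import io
import json

# Hypothetical records with a nested list field (not from a real dataset).
records = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
]


def to_json_lines(rows):
    """Serialize one JSON object per line; nested values are preserved."""
    return "\n".join(json.dumps(row) for row in rows)


def from_json_lines(text):
    """Parse JSON-Lines text back into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_csv(rows):
    """Serialize to CSV; the nested list is flattened to a ';'-joined string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "tags"])
    writer.writeheader()
    for row in rows:
        writer.writerow({"id": row["id"], "tags": ";".join(row["tags"])})
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text; every value comes back as a plain string."""
    return list(csv.DictReader(io.StringIO(text)))


round_tripped_jsonl = from_json_lines(to_json_lines(records))
round_tripped_csv = from_csv(to_csv(records))
```

After the round trip, the JSON-Lines records compare equal to the originals, whereas the CSV rows carry only strings, so types and nesting would have to be restored by the reader.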

Access Patterns

Python SDK Access

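from datetime import datetime

from azureml.opendatasets import PublicHolidays
from dateutil.relativedelta import relativedelta

# Request holidays for the one-month window ending today
end_date = datetime.today()
start_date = end_date - relativedelta(months=1)
hol = PublicHolidays(start_date=start_date, end_date=end_date)

# Load the result into a pandas DataFrame
hol_df = hol.to_pandas_dataframe()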
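The start date above is computed with `relativedelta(months=1)`, which steps back one calendar month rather than a fixed number of days. As a rough illustration of that arithmetic, the following standard-library sketch reproduces the same result for a one-month step; `one_month_before` is a hypothetical helper name and is not part of the SDK or of `dateutil`.

```python
import calendar
from datetime import datetime


def one_month_before(moment):
    """Return the same day-of-month in the previous month.

    If the previous month is shorter, the day is clamped to that month's
    last day (for example, March 31 becomes the last day of February).
    """
    if moment.month == 1:
        year, month = moment.year - 1, 12
    else:
        year, month = moment.year, moment.month - 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


# Build a (start, end) window the same way the SDK example does.
end_date = datetime(2024, 3, 31)
start_date = one_month_before(end_date)
```

The clamping matters at month ends: a naive `timedelta(days=30)` would give a different start date for most months.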

Direct Blob Storage Access

import pandas as pd

# Read directly from Azure Blob Storage
df = pd.read_parquet(
    "https://azureopendatastorage.blob.core.windows.net/dataset/..."
)

Azure Databricks Access

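# Configure blob storage access
blob_account_name = "azureopendatastorage"
container_name = "<container-name>"  # placeholder: container of the dataset you want
path = "<relative-path>"  # placeholder: path to the data inside that container
wasbs_path = f"wasbs://{container_name}@{blob_account_name}.blob.core.windows.net/{path}"

# Read with Spark (the `spark` session is predefined in Databricks notebooks)
df = spark.read.parquet(wasbs_path)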
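The direct and Databricks access patterns address the same blob with two URL schemes. The sketch below shows how the two forms relate, assuming the usual account/container/path layout. The helper names are hypothetical, and the container and path values are placeholders rather than real dataset locations.

```python
def blob_https_url(account, container, path):
    """Build the HTTPS URL used for direct reads (for example, with pandas)."""
    return f"https://{account}.blob.core.windows.net/{container}/{path.lstrip('/')}"


def blob_wasbs_url(account, container, path):
    """Build the wasbs:// URL used by Spark for the same blob."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Placeholder container and path; substitute the values for a real dataset.
account = "azureopendatastorage"
container = "example-container"
path = "example/data.parquet"

https_url = blob_https_url(account, container, path)
wasbs_url = blob_wasbs_url(account, container, path)
```

Note that the container moves from the URL path in the HTTPS form to the user-info position (before the `@`) in the `wasbs://` form.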

Getting Started

Create Your First Dataset

Learn how to create Azure ML datasets from Open Datasets

Public Holidays Dataset

Explore the worldwide public holidays dataset

COVID-19 Data

Access COVID-19 tracking data from around the world

Genomics Datasets

Browse genomics datasets for research

Need Help?

For questions or to request additional datasets, contact [email protected].
