Azure Open Datasets Catalog
Improve the accuracy of your machine learning models with publicly available datasets. To save time on data discovery and preparation, use curated datasets that are ready for machine learning projects.
Transportation
| Dataset | Description |
|---|---|
| TartanAir: AirSim Simulation Dataset | AirSim autonomous vehicle data generated to solve Simultaneous Localization and Mapping (SLAM) |
| NYC Taxi - Yellow | Yellow taxi trip records with pickup/drop-off dates/times, locations, distances, fares, rate types, payment types, and passenger counts |
| NYC Taxi - Green | Green taxi trip records with pickup/drop-off dates/times, locations, distances, fares, rate types, payment types, and passenger counts |
| NYC Taxi - For-Hire Vehicle | For-Hire Vehicle trip records including dispatching base license number, pickup date, time, and taxi zone location ID |
Health and Genomics
| Dataset | Description |
|---|---|
| COVID-19 Data Lake | Collection of COVID-19 related datasets covering testing, patient outcomes, social distancing policy, hospital capacity, and mobility |
| Bing COVID-19 | Daily updates of confirmed, fatal, and recovered COVID-19 cases from all regions worldwide |
| 1000 Genomes | Largest public catalog of human variation and genotype data with 2,504 individuals from 26 populations and 84 million variants |
| gnomAD | Genome Aggregation Database harmonizing exome and genome sequencing data from large-scale projects |
| ClinVar Annotations | Public archive of relationships between human variations and phenotypes |
Labor and Economics
| Dataset | Description |
|---|---|
| US Labor Force Statistics | Labor force statistics, participation rates, and civilian noninstitutional population by age, gender, race, and ethnic groups |
| US National Employment Hours and Earnings | Detailed industry estimates of nonfarm employment, hours, and earnings of workers on payrolls |
| US State Employment Hours and Earnings | State-level industry estimates of nonfarm employment, hours, and earnings |
| US Local Area Unemployment Statistics | Monthly and annual employment, unemployment, and labor force data for regions, states, counties, and metropolitan areas |
| US Consumer Price Index | Average change over time in prices paid by urban consumers for consumer goods and services |
| US Producer Price Index - Industry | Average change in selling prices received by domestic producers by industry |
| US Producer Price Index - Commodities | Average change in selling prices received by domestic producers by commodity |
Population and Safety
| Dataset | Description |
|---|---|
| US Population by County | US population by gender and race for each county from Census data |
| US Population by ZIP Code | US population by gender and race for each ZIP code from Census data |
| Boston Safety Data | 311 calls reported to the city of Boston (daily updates, Parquet format) |
| Chicago Safety Data | 311 calls reported to the city of Chicago (daily updates, Parquet format) |
| New York City Safety Data | All NYC 311 service requests from 2010 to present (daily updates, Parquet format) |
| San Francisco Safety Data | Fire department calls for service and 311 cases in San Francisco from 2015 to present |
| Seattle Safety Data | Seattle Fire Department 911 dispatches from 2010 to present |
Supplemental and Common Datasets
| Dataset | Description |
|---|---|
| Diabetes | 442 samples with 10 features, ideal for getting started with machine learning algorithms |
| OJ Sales Simulated Data | Simulated sales data designed for training thousands of models simultaneously |
| MNIST | Database of handwritten digits with 60,000 training examples and 10,000 test examples |
| Microsoft News Recommendation | Large-scale dataset for news recommendation research and recommender systems |
| Public Holidays | Worldwide public holiday data for 38 countries from 1970 to 2099 |
| Russian Open Speech to Text | Large-scale open speech to text dataset for the Russian language |
Dataset Formats
Datasets are available in multiple formats:
- Parquet: Optimized columnar format for analytics workloads
- CSV: Universal compatibility for data analysis tools
- JSON/JSON-Lines: Flexible format for nested data structures
- Tabular Dataset: Azure ML dataset type for structured data
- File Dataset: Azure ML dataset type for file-based workflows
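As a rough illustration of how the two text formats in the list above differ, the sketch below parses similar records from CSV and from JSON Lines using only the Python standard library. The sample rows and the field names (`id`, `city`, `tags`) are hypothetical and are not taken from any dataset in the catalog.

```python
import csv
import io
import json

# Hypothetical sample data for illustration; not taken from any catalog dataset.
CSV_TEXT = "id,city\n1,Boston\n2,Chicago\n"
JSONL_TEXT = (
    '{"id": 1, "city": "Boston", "tags": ["311", "noise"]}\n'
    '{"id": 2, "city": "Chicago", "tags": []}\n'
)


def parse_csv(text):
    """Return CSV rows as dicts; every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json_lines(text):
    """Return one dict per non-empty line; nested values keep their types."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


csv_rows = parse_csv(CSV_TEXT)
jsonl_rows = parse_json_lines(JSONL_TEXT)
```

CSV yields flat, string-valued rows, while each JSON Lines record can carry typed and nested values such as the `tags` list.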
Access Patterns
Python SDK Access
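A minimal sketch of SDK access, assuming the `azureml-opendatasets` package is installed and that its `PublicHolidays` class accepts `start_date` and `end_date` and exposes `to_pandas_dataframe()`. The wrapper function `load_public_holidays` is a hypothetical helper introduced here for illustration, not part of the SDK.

```python
from datetime import datetime


def load_public_holidays(start: datetime, end: datetime):
    """Return Public Holidays rows between start and end as a pandas DataFrame."""
    if start > end:
        raise ValueError("start must not be after end")
    # Imported lazily so this module still loads where the SDK is not installed.
    # Assumption: the package is available via `pip install azureml-opendatasets`.
    from azureml.opendatasets import PublicHolidays

    holidays = PublicHolidays(start_date=start, end_date=end)
    return holidays.to_pandas_dataframe()
```

Calling `load_public_holidays(datetime(2024, 1, 1), datetime(2024, 12, 31))` would download the matching rows, so it needs network access. Narrowing the date range keeps the download small.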
Direct Blob Storage Access
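A sketch of reading Parquet files straight from the public storage account with the `azure-storage-blob` and `pandas` packages. The account name `azureopendatastorage`, the container `holidaydatacontainer`, and the prefix `Processed` are assumptions based on the Public Holidays dataset; check the dataset's own page for the actual values. `account_url` and `read_parquet_blobs` are hypothetical helper names.

```python
import io

# Assumed location of the Public Holidays dataset; verify before relying on it.
ACCOUNT = "azureopendatastorage"
CONTAINER = "holidaydatacontainer"
PREFIX = "Processed"


def account_url(account: str) -> str:
    """Build the HTTPS endpoint for a blob storage account."""
    return f"https://{account}.blob.core.windows.net"


def read_parquet_blobs(account=ACCOUNT, container=CONTAINER, prefix=PREFIX, limit=1):
    """Download up to `limit` Parquet blobs under `prefix` into one DataFrame."""
    # Imported lazily so the helpers above work without these packages.
    import pandas as pd
    from azure.storage.blob import ContainerClient

    # No credential is passed: this assumes the container allows anonymous reads.
    client = ContainerClient(account_url=account_url(account), container_name=container)
    frames = []
    for blob in client.list_blobs(name_starts_with=prefix):
        if not blob.name.endswith(".parquet"):
            continue
        data = client.download_blob(blob.name).readall()
        frames.append(pd.read_parquet(io.BytesIO(data)))
        if len(frames) >= limit:
            break
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

The `limit` parameter caps how many files are downloaded, which is useful for a first look at a large dataset.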
Azure Databricks Access
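In a Databricks notebook, one common pattern is to point Spark at the public container through a `wasbs://` path. The sketch below assumes a `spark` session is already provided by the notebook, that the dataset is stored as Parquet, and that an empty SAS token is enough for public read access. The example values (`azureopendatastorage`, `nyctlc`, `yellow`) are assumptions for the NYC Taxi - Yellow dataset, and both function names are hypothetical.

```python
def wasbs_path(account: str, container: str, relative_path: str) -> str:
    """Build a wasbs:// URL for a path inside a blob container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def read_open_dataset(spark, account, container, relative_path, sas_token=""):
    """Read a Parquet dataset from blob storage into a Spark DataFrame."""
    # Register the (empty) SAS token so Spark can reach the container.
    spark.conf.set(
        f"fs.azure.sas.{container}.{account}.blob.core.windows.net",
        sas_token,
    )
    return spark.read.parquet(wasbs_path(account, container, relative_path))
```

In a notebook cell, `read_open_dataset(spark, "azureopendatastorage", "nyctlc", "yellow")` would return a DataFrame that can be filtered before any rows are collected.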
Getting Started
Create Your First Dataset
Learn how to create Azure ML datasets from Open Datasets
Public Holidays Dataset
Explore the worldwide public holidays dataset
COVID-19 Data
Access COVID-19 tracking data from around the world
Genomics Datasets
Browse genomics datasets for research