
Azure Open Datasets

Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Open Datasets are hosted in the Microsoft Azure cloud, integrated into Azure Machine Learning, and readily accessible from Azure Databricks.

Pre-Processed Data

Cleaned and structured datasets ready for ML workflows

Regular Updates

Automatically refreshed from authoritative sources

Azure Integration

Native support in Azure ML and Databricks

Free Access

No cost to access and use datasets

What are Azure Open Datasets?

Open Datasets are curated public datasets optimized for consumption in machine learning workflows. Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions.
Key Benefit: Data scientists often spend most of their time cleaning and preparing data. Azure Open Datasets are already preprocessed and optimized, saving significant time.

Available Dataset Categories

Transportation

Real-world transportation data for demand forecasting and logistics:
Yellow taxi trip records from the New York City Taxi & Limousine Commission.
Data Includes:
  • Pick-up and drop-off dates/times
  • Pick-up and drop-off locations (latitude/longitude)
  • Trip distances
  • Itemized fares
  • Rate types
  • Payment types
  • Driver-reported passenger counts
Use Cases:
  • Demand forecasting
  • Route optimization
  • Pricing models
  • Urban planning analysis
from azureml.opendatasets import NycTlcYellow
from datetime import datetime

# Load data for specific period
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 1, 31)

yellow_taxi = NycTlcYellow(start_date, end_date)
df = yellow_taxi.to_pandas_dataframe()

print(f"Loaded {len(df)} taxi trips")
print(df.head())
Green taxi trip records serving areas not covered by yellow cabs.
Data Includes:
  • Similar structure to yellow cab data
  • Focus on outer boroughs
  • Trip and fare details
  • Payment information
Use Cases:
  • Underserved area analysis
  • Multi-modal transportation planning
  • Fare comparison studies
from azureml.opendatasets import NycTlcGreen

green_taxi = NycTlcGreen(start_date, end_date)
df = green_taxi.to_pandas_dataframe()

Labor and Economics

Economic indicators and labor statistics:
Comprehensive labor force data from the Bureau of Labor Statistics.
Data Includes:
  • Labor force participation rates
  • Employment/unemployment rates
  • Demographics (age, gender, race, ethnicity)
  • Civilian noninstitutional population
  • Seasonal adjustments
Use Cases:
  • Economic forecasting
  • Workforce planning
  • Regional analysis
  • Demographic trends
from azureml.opendatasets import UsLaborForce

labor_force = UsLaborForce()
df = labor_force.to_pandas_dataframe()

# Filter for specific demographics
women_data = df[df['gender'] == 'Women']
print(women_data.head())
Current Employment Statistics (CES) program data.
Data Includes:
  • Nonfarm employment by industry
  • Average hours worked
  • Average hourly and weekly earnings
  • Production worker data
  • Monthly and annual data
Use Cases:
  • Wage analysis
  • Industry trend forecasting
  • Economic modeling
  • Policy research
from azureml.opendatasets import UsNationalEmployment

employment = UsNationalEmployment()
df = employment.to_pandas_dataframe()

# Analyze tech sector employment
tech_sector = df[df['industry'].str.contains('Technology')]

Weather and Climate

Historical and forecast weather data:
Weather observations from the National Oceanic and Atmospheric Administration.
Data Includes:
  • Temperature (air, dew point)
  • Precipitation
  • Wind speed and direction
  • Atmospheric pressure
  • Visibility
  • Station location
Use Cases:
  • Demand forecasting (energy, retail)
  • Agricultural planning
  • Predictive maintenance
  • Supply chain optimization
from azureml.opendatasets import NoaaIsdWeather
from datetime import datetime

# Get weather for specific location and time
start = datetime(2023, 1, 1)
end = datetime(2023, 12, 31)

weather = NoaaIsdWeather(start, end)
df = weather.to_pandas_dataframe()

# Filter for a specific station
station_data = df[df['stationName'] == 'NEW YORK']

Demographics and Census

Census data including population by geography and demographics.
Data Includes:
  • Population counts by geography
  • Age distributions
  • Gender breakdowns
  • Race and ethnicity
  • Housing statistics
Use Cases:
  • Market segmentation
  • Site selection
  • Resource allocation
  • Demographic forecasting
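This section has no loader snippet, so as a placeholder here is the kind of per-state roll-up you might run once a census frame is loaded. The column names below are illustrative stand-ins, not the dataset's actual schema:

```python
import pandas as pd

# Stand-in for a loaded census dataframe; real column names will differ.
census_df = pd.DataFrame({
    'state': ['NY', 'NY', 'CA', 'CA'],
    'county': ['Kings', 'Queens', 'Los Angeles', 'San Diego'],
    'population': [2_736_074, 2_405_464, 10_014_009, 3_298_634],
})

# Roll county counts up to state level for market-sizing work
state_pop = census_df.groupby('state', as_index=False)['population'].sum()
print(state_pop)
```

The aggregated frame can then be joined to your own data on the state column, e.g. for market segmentation or site selection.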

Public Safety

Public safety incidents and responses.
Data Includes:
  • Incident types and locations
  • Response times
  • Temporal patterns
  • Geographic distribution
Use Cases:
  • Resource optimization
  • Predictive policing (ethical considerations required)
  • Emergency planning
  • Urban safety analysis
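The temporal-pattern use case above can be sketched with plain pandas. The incident frame here is a hand-built stand-in, since real city safety schemas vary:

```python
import pandas as pd

# Stand-in for a loaded safety-incident dataframe; real schemas vary by city.
incidents = pd.DataFrame({
    'category': ['Noise', 'Noise', 'Traffic', 'Traffic', 'Traffic'],
    'datetime': pd.to_datetime([
        '2023-07-01 22:15', '2023-07-01 23:40',
        '2023-07-02 08:05', '2023-07-02 08:30', '2023-07-02 17:55',
    ]),
})

# Temporal pattern: incident counts by hour of day
by_hour = incidents.groupby(incidents['datetime'].dt.hour).size()
print(by_hour)
```

The same groupby pattern extends to day-of-week or geographic bins for resource-optimization work.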

Holidays

Holiday calendars for multiple countries.
Data Includes:
  • Holiday names and dates
  • Country/region
  • Holiday types (national, religious, etc.)
Use Cases:
  • Retail demand forecasting
  • Workforce planning
  • Event scheduling
  • Global business operations
from azureml.opendatasets import PublicHolidays

holidays = PublicHolidays(start_date, end_date)
df = holidays.to_pandas_dataframe()

# Filter for US holidays
us_holidays = df[df['countryOrRegion'] == 'United States']

How to Access Datasets

Python SDK

Access datasets directly in Python:
from azureml.opendatasets import NycTlcYellow, NoaaIsdWeather
from datetime import datetime
import pandas as pd

# Define date range
start = datetime(2023, 1, 1)
end = datetime(2023, 1, 31)

# Load taxi data
taxi = NycTlcYellow(start, end)
taxi_df = taxi.to_pandas_dataframe()

# Load weather data
weather = NoaaIsdWeather(start, end)
weather_df = weather.to_pandas_dataframe()

# Align timestamps to the hour so hourly weather observations actually match
taxi_df['pickup_hour'] = taxi_df['tpepPickupDateTime'].dt.floor('H')
weather_df['weather_hour'] = weather_df['datetime'].dt.floor('H')

# Merge datasets
merged = pd.merge(
    taxi_df,
    weather_df,
    left_on='pickup_hour',
    right_on='weather_hour',
    how='left'
)

print(f"Combined dataset: {len(merged)} rows")

Azure Machine Learning

Register datasets in your workspace:
from azureml.core import Workspace, Dataset
from azureml.opendatasets import NycTlcYellow

# Connect to workspace
ws = Workspace.from_config()

# Load open dataset
taxi = NycTlcYellow(start_date, end_date)
taxi_df = taxi.to_pandas_dataframe()

# Register as tabular dataset
dataset = Dataset.Tabular.register_pandas_dataframe(
    dataframe=taxi_df,
    target=(ws.get_default_datastore(), 'taxi-data'),
    name='nyc-taxi-january-2023',
    description='NYC Yellow Taxi trips for January 2023'
)

print(f"Dataset registered: {dataset.name}")

Azure Databricks

Use datasets in Databricks notebooks:
from azureml.opendatasets import UsLaborForce

# Load dataset
labor = UsLaborForce()
df = labor.to_spark_dataframe()

# Analyze with Spark
df.createOrReplaceTempView("labor_force")

result = spark.sql("""
    SELECT year, AVG(unemployment_rate) as avg_unemployment
    FROM labor_force
    WHERE year >= 2020
    GROUP BY year
    ORDER BY year
""")

result.show()

Azure Notebooks

Access without installation:
# No installation required in Azure Notebooks
from azureml.opendatasets import NoaaIsdWeather
from datetime import datetime

start = datetime(2023, 6, 1)
end = datetime(2023, 6, 30)

weather = NoaaIsdWeather(start, end)
df = weather.to_pandas_dataframe()

# Visualize
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(df['datetime'], df['temperature'])
plt.title('Temperature Trends - June 2023')
plt.xlabel('Date')
plt.ylabel('Temperature (°F)')
plt.show()

Integration Patterns

Enrich ML Models

Combine Open Datasets with your data:
import pandas as pd
from azureml.opendatasets import NoaaIsdWeather, PublicHolidays
from datetime import datetime

# Your business data
sales_df = pd.read_csv('sales_data.csv')
sales_df['date'] = pd.to_datetime(sales_df['date'])

# Load weather data
start = sales_df['date'].min()
end = sales_df['date'].max()
weather = NoaaIsdWeather(start, end)
weather_df = weather.to_pandas_dataframe()

# Load holidays
holidays = PublicHolidays(start, end)
holidays_df = holidays.to_pandas_dataframe()

# Aggregate hourly weather to daily means so it joins cleanly on date
weather_df['date'] = weather_df['datetime'].dt.normalize()
daily_weather = weather_df.groupby('date', as_index=False)[
    ['temperature', 'precipitation']
].mean()

# Merge all data
enriched_df = sales_df.merge(
    daily_weather,
    on='date',
    how='left'
).merge(
    holidays_df[['date', 'holidayName']],
    on='date',
    how='left'
)

# Add features
enriched_df['is_holiday'] = enriched_df['holidayName'].notna()
enriched_df['temp_range'] = pd.cut(
    enriched_df['temperature'],
    bins=[0, 40, 60, 80, 100],
    labels=['cold', 'cool', 'warm', 'hot']
)

print("Enhanced dataset ready for ML training")

Time Series Forecasting

Use historical data for predictions:
from azureml.opendatasets import NycTlcYellow
from datetime import datetime
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Load historical taxi data
taxi = NycTlcYellow(datetime(2022, 1, 1), datetime(2023, 12, 31))
df = taxi.to_pandas_dataframe()

# Feature engineering
df['hour'] = df['tpepPickupDateTime'].dt.hour
df['day_of_week'] = df['tpepPickupDateTime'].dt.dayofweek
df['month'] = df['tpepPickupDateTime'].dt.month

# Aggregate to hourly demand
hourly = df.groupby(
    [df['tpepPickupDateTime'].dt.date, 'hour']
).size().reset_index(name='trip_count')
hourly.columns = ['date', 'hour', 'trip_count']

# Re-derive calendar features at the aggregated level
hourly['date'] = pd.to_datetime(hourly['date'])
hourly['day_of_week'] = hourly['date'].dt.dayofweek
hourly['month'] = hourly['date'].dt.month

# Prepare features and target
X = hourly[['hour', 'day_of_week', 'month']]
y = hourly['trip_count']

# Train model
model = RandomForestRegressor(n_estimators=100)
model.fit(X, y)

# Forecast next week
future_dates = pd.date_range(
    start=datetime.now(),
    periods=24*7,
    freq='H'
)
future_X = pd.DataFrame({
    'hour': future_dates.hour,
    'day_of_week': future_dates.dayofweek,
    'month': future_dates.month
})

predictions = model.predict(future_X)
print(f"Weekly forecast: {predictions.sum():.0f} trips")

Data Processing Pipeline

How Open Datasets are maintained:
1. Source Ingestion: Data is pulled from authoritative sources at regular intervals (e.g., FTP from NOAA).
2. Parsing: Raw data is parsed into structured formats (CSV, Parquet).
3. Enrichment: Additional features are added (e.g., ZIP codes, nearest weather stations).
4. Validation: Data quality checks ensure consistency and completeness.
5. Publishing: Datasets are co-hosted with Azure compute for fast access.

Request or Contribute Datasets

Request a Dataset

Need data not currently available? Email the team with:
  • Dataset name and description
  • Source and licensing
  • Size and update frequency
  • Use cases

Contribute a Dataset

Have a public dataset to share? Provide:
  • Dataset details and links
  • Licensing information
  • Update schedule
  • Expected growth

Best Practices

Tips for working with large datasets:
  • Use date filtering to load only needed data
  • Convert to Parquet format for faster reads
  • Sample during development, full data in production
  • Use Spark for datasets > 1GB
  • Cache frequently used datasets
# Good: Load only needed dates
taxi = NycTlcYellow(start_date, end_date)

# Bad: Load entire history
# taxi = NycTlcYellow()  # Avoid
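The "cache frequently used datasets" tip can be sketched as a small cache-or-fetch helper. This sketch uses pandas' pickle round-trip so it runs without a Parquet engine; prefer `to_parquet`/`read_parquet` once pyarrow is installed. The cache path and the placeholder frame standing in for the download are hypothetical:

```python
import os
import pandas as pd

CACHE_PATH = 'taxi_jan_2023.pkl'  # hypothetical local cache file

def load_taxi_slice():
    """Load the filtered slice from a local cache, fetching only on a miss."""
    if os.path.exists(CACHE_PATH):
        return pd.read_pickle(CACHE_PATH)
    # Cache miss: in real use this is the NycTlcYellow(start, end) download.
    df = pd.DataFrame({'fare': [12.5, 8.0], 'distance': [3.1, 1.4]})
    df.to_pickle(CACHE_PATH)
    return df

df = load_taxi_slice()
print(len(df))
```

Subsequent runs hit the local file and skip the download entirely.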
Always validate Open Dataset data:
  • Check for missing values
  • Verify date ranges
  • Validate geographic coordinates
  • Remove outliers
  • Handle timezone conversions
# Check data quality
print(f"Missing values:\n{df.isnull().sum()}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print(f"Outliers: {df[df['temperature'] > 150].shape[0]}")
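Timezone conversion in particular is easy to get wrong: Open Dataset timestamps are typically published in UTC (NOAA ISD, for example), while business data is usually in local time. A minimal sketch of the conversion, assuming UTC source timestamps:

```python
import pandas as pd

# Assume the dataset's timestamps are UTC; localize, then convert
df = pd.DataFrame({'datetime': pd.to_datetime(['2023-06-01 16:00'])})
df['datetime'] = df['datetime'].dt.tz_localize('UTC')

# Convert to the timezone of your business data before joining
df['local_time'] = df['datetime'].dt.tz_convert('America/New_York')
print(df['local_time'].iloc[0])
```

Doing this before any merge avoids off-by-hours joins around daylight-saving boundaries.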
All Open Datasets are public domain:
  • No personally identifiable information (PII)
  • Aggregated and anonymized data
  • Review licensing for each dataset
  • Cite sources in publications
  • Respect data provider terms

Pricing

Azure Open Datasets are completely free to access and use. You only pay for:
  • Compute resources to process data
  • Storage if you copy datasets
  • Network egress (minimal)

Resources

Dataset Catalog

Browse all available datasets

Python SDK Docs

Complete SDK reference

How-To Guides

Learn with examples

GitHub Samples

Code examples
