Azure Open Datasets
Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. Open Datasets are available in the cloud on Microsoft Azure, integrated into Azure Machine Learning, and readily available to Azure Databricks.
Pre-Processed Data
Cleaned and structured datasets ready for ML workflows
Regular Updates
Automatically refreshed from authoritative sources
Azure Integration
Native support in Azure ML and Databricks
Free Access
No cost to access and use datasets
What are Azure Open Datasets?
Open Datasets are curated public datasets optimized for consumption in machine learning workflows. Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions.
Key Benefit: Data scientists often spend most of their time cleaning and preparing data. Azure Open Datasets are already preprocessed and optimized, saving significant time.
Available Dataset Categories
Transportation
Real-world transportation data for demand forecasting and logistics:
NYC Taxi - Yellow Cab
Yellow taxi trip records from the New York City Taxi & Limousine Commission.
Data Includes:
- Pick-up and drop-off dates/times
- Pick-up and drop-off locations (latitude/longitude)
- Trip distances
- Itemized fares
- Rate types
- Payment types
- Driver-reported passenger counts
Use Cases:
- Demand forecasting
- Route optimization
- Pricing models
- Urban planning analysis
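As a minimal sketch of how these trip records might be loaded, the snippet below assumes the `azureml-opendatasets` Python package is installed and uses its `NycTlcYellow` class with a date window. The helper names `month_window` and `load_yellow_taxi` are hypothetical, and the SDK import is deferred so the date logic can be checked without the package.

```python
from datetime import datetime
from typing import Tuple


def month_window(year: int, month: int) -> Tuple[datetime, datetime]:
    """Return (start, end) datetimes spanning one calendar month."""
    start = datetime(year, month, 1)
    # Roll over to January of the following year after December.
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    return start, end


def load_yellow_taxi(year: int, month: int):
    """Load roughly one month of yellow taxi trips as a pandas DataFrame.

    Assumes `azureml-opendatasets` is installed; it is imported lazily so
    the rest of this module works without it. How the SDK treats the end
    boundary is not verified here, so trim on the pickup column if exact
    bounds matter.
    """
    from azureml.opendatasets import NycTlcYellow

    start, end = month_window(year, month)
    return NycTlcYellow(start_date=start, end_date=end).to_pandas_dataframe()
```

Calling `load_yellow_taxi(2018, 5)` would request a single month, which keeps the download small. In a Spark environment such as Azure Databricks, the same class also exposes `to_spark_dataframe()`.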
NYC Taxi - Green Cab
Green taxi trip records serving areas not covered by yellow cabs.
Data Includes:
- Similar structure to yellow cab data
- Focus on outer boroughs
- Trip and fare details
- Payment information
Use Cases:
- Underserved area analysis
- Multi-modal transportation planning
- Fare comparison studies
Labor and Economics
Economic indicators and labor statistics:
US Labor Force Statistics
Comprehensive labor force data from the Bureau of Labor Statistics.
Data Includes:
- Labor force participation rates
- Employment/unemployment rates
- Demographics (age, gender, race, ethnicity)
- Civilian noninstitutional population
- Seasonal adjustments
Use Cases:
- Economic forecasting
- Workforce planning
- Regional analysis
- Demographic trends
US National Employment Hours and Earnings
Current Employment Statistics (CES) program data.
Data Includes:
- Nonfarm employment by industry
- Average hours worked
- Average hourly and weekly earnings
- Production worker data
- Monthly and annual data
Use Cases:
- Wage analysis
- Industry trend forecasting
- Economic modeling
- Policy research
Weather and Climate
Historical and forecast weather data:
NOAA Weather Data
Weather observations from the National Oceanic and Atmospheric Administration.
Data Includes:
- Temperature (air, dew point)
- Precipitation
- Wind speed and direction
- Atmospheric pressure
- Visibility
- Station location
Use Cases:
- Demand forecasting (energy, retail)
- Agricultural planning
- Predictive maintenance
- Supply chain optimization
Demographics and Census
US Population Data
Census data including population by geography and demographics.
Data Includes:
- Population counts by geography
- Age distributions
- Gender breakdowns
- Race and ethnicity
- Housing statistics
Use Cases:
- Market segmentation
- Site selection
- Resource allocation
- Demographic forecasting
Public Safety
Boston Safety Data
Public safety incidents and responses.
Data Includes:
- Incident types and locations
- Response times
- Temporal patterns
- Geographic distribution
Use Cases:
- Resource optimization
- Predictive policing (ethical considerations required)
- Emergency planning
- Urban safety analysis
Holidays
Public Holidays
Holiday calendars for multiple countries.
Data Includes:
- Holiday names and dates
- Country/region
- Holiday types (national, religious, etc.)
Use Cases:
- Retail demand forecasting
- Workforce planning
- Event scheduling
- Global business operations
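One way a holiday calendar might feed retail demand forecasting is by flagging holiday dates in your own records. The sketch below uses a few hand-written sample rows in place of the real dataset. The field names `countryOrRegion`, `holidayName`, and `date` are assumed to match the published schema, and `add_holiday_flag` and the `sales` records are hypothetical.

```python
from datetime import date

# Illustrative rows standing in for the Public Holidays dataset
# (field names are an assumption about the published schema).
holidays = [
    {"countryOrRegion": "United States", "holidayName": "New Year's Day", "date": date(2024, 1, 1)},
    {"countryOrRegion": "United States", "holidayName": "Independence Day", "date": date(2024, 7, 4)},
]

# Hypothetical daily sales records from your own system.
sales = [
    {"date": date(2024, 7, 3), "units": 120},
    {"date": date(2024, 7, 4), "units": 310},
]


def add_holiday_flag(rows, holiday_rows, country):
    """Return copies of rows with `is_holiday` and `holiday_name` added."""
    # Build a date -> holiday name lookup for the requested country only.
    lookup = {
        h["date"]: h["holidayName"]
        for h in holiday_rows
        if h["countryOrRegion"] == country
    }
    return [
        {**row, "is_holiday": row["date"] in lookup, "holiday_name": lookup.get(row["date"])}
        for row in rows
    ]


enriched = add_holiday_flag(sales, holidays, "United States")
```

The original `sales` rows are left untouched, so the enriched copy can be passed to a model as extra features while the source data stays as loaded.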
How to Access Datasets
Python SDK
Access datasets directly in Python.
Azure Machine Learning
Register datasets in your workspace.
Azure Databricks
Use datasets in Databricks notebooks.
Azure Notebooks
Access without installation.
Integration Patterns
Enrich ML Models
Combine Open Datasets with your data.
Time Series Forecasting
Use historical data for predictions.
Data Processing Pipeline
How Open Datasets are maintained:
Source Ingestion
Data is pulled from authoritative sources at regular intervals (e.g., FTP from NOAA).
Request or Contribute Datasets
Request a Dataset
Need data not currently available? Email the team with:
- Dataset name and description
- Source and licensing
- Size and update frequency
- Use cases
Contribute a Dataset
Have a public dataset to share? Provide:
- Dataset details and links
- Licensing information
- Update schedule
- Expected growth
Best Practices
Performance Optimization
Tips for working with large datasets:
- Use date filtering to load only needed data
- Convert to Parquet format for faster reads
- Sample during development; use full data in production
- Use Spark for datasets larger than 1 GB
- Cache frequently used datasets
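Some of these tips can be sketched in a few lines. The example below is an illustration under stated assumptions, not a prescribed workflow: `sample_rows` and `load_cached` are hypothetical helpers, and `load_cached` assumes pandas with a Parquet engine (such as pyarrow) is installed and that the supplied `loader` returns a pandas DataFrame.

```python
import random
from pathlib import Path


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random subset of a list for development runs."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one row when the input is non-empty.
    k = max(1, round(len(rows) * fraction)) if rows else 0
    return random.Random(seed).sample(rows, k)


def load_cached(cache_path, loader):
    """Read a Parquet cache if present; otherwise call loader() and write one.

    pandas is imported lazily so the sampling helper works without it.
    """
    import pandas as pd

    path = Path(cache_path)
    if path.exists():
        return pd.read_parquet(path)
    frame = loader()
    frame.to_parquet(path, index=False)
    return frame
```

With a DataFrame already in memory, `DataFrame.sample(frac=..., random_state=...)` serves the same purpose as `sample_rows`. Fixing the seed keeps development runs comparable from one session to the next.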
Data Quality
Always validate Open Dataset data:
- Check for missing values
- Verify date ranges
- Validate geographic coordinates
- Remove outliers
- Handle timezone conversions
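A few of these checks might look like the following for a single trip-style record. This is a sketch only: the field names `pickup`, `lat`, and `lon` are simplified placeholders rather than the real column names, and `validate_trip` and `to_utc` are hypothetical helpers. Outlier handling is left out because suitable thresholds depend on the dataset.

```python
from datetime import datetime, timedelta, timezone


def validate_trip(row, start, end):
    """Return a list of problems found in one record (empty if clean)."""
    problems = []
    # Missing values: stop early, since later checks need these fields.
    for field in ("pickup", "lat", "lon"):
        if row.get(field) is None:
            problems.append(f"missing {field}")
    if problems:
        return problems
    # Date range: the pickup must fall inside the requested window.
    if not start <= row["pickup"] < end:
        problems.append("pickup outside requested date range")
    # Geographic coordinates: basic latitude and longitude bounds.
    if not (-90 <= row["lat"] <= 90 and -180 <= row["lon"] <= 180):
        problems.append("coordinates out of range")
    return problems


def to_utc(naive_local, utc_offset_hours):
    """Attach a fixed UTC offset to a naive timestamp and convert to UTC.

    A fixed offset ignores daylight saving time; a time zone database
    (for example via `zoneinfo`) is the better choice for real data.
    """
    local = naive_local.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    return local.astimezone(timezone.utc)
```

Returning a list of problems, rather than raising on the first one, makes it easy to count how often each issue occurs before deciding whether to drop or repair the affected rows.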
Compliance and Privacy
All Open Datasets are public domain:
- No personally identifiable information (PII)
- Aggregated and anonymized data
- Review licensing for each dataset
- Cite sources in publications
- Respect data provider terms
Pricing
Azure Open Datasets are completely free to access and use. You only pay for:
- Compute resources to process data
- Storage if you copy datasets
- Network egress (minimal)
Resources
Dataset Catalog
Browse all available datasets
Python SDK Docs
Complete SDK reference
How-To Guides
Learn with examples
GitHub Samples
Code examples