Genomics Datasets
Azure Open Datasets hosts several large-scale genomics datasets for research in human genetic variation, population genomics, and clinical genetics.
Available Datasets
1000 Genomes
Public catalog of human variation and genotype data with 2,504 individuals from 26 populations
gnomAD
Genome Aggregation Database harmonizing exome and genome sequencing data
ClinVar
Public archive of relationships between human variations and phenotypes
ENCODE
Encyclopedia of DNA elements and functional genomics data
1000 Genomes Project
The 1000 Genomes Project ran between 2008 and 2015 to create the largest public catalog of human variation and genotype data.
Overview
- Individuals: 2,504 from 26 populations
- Variants Identified: 84 million
- Data Volume: Approximately 815 TB
- Update Frequency: Daily
- Format: VCF, Parquet
Storage Location
This dataset is stored in the West US 2 and West Central US Azure regions. We recommend locating compute resources in these regions for affinity.
Data Access URLs
- West US 2: https://dataset1000genomes.blob.core.windows.net/dataset
- West Central US: https://dataset1000genomes-secondary.blob.core.windows.net/dataset
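As a minimal sketch of how these endpoints might be used, the Python below builds a Blob service List Blobs request against the West US 2 URL and parses the XML response with the standard library. It assumes the container permits anonymous listing; if it does not, a SAS token would need to be appended to the query string. The helper names and the `prefix` parameter value are illustrative and not part of the dataset documentation.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Primary (West US 2) endpoint from the list above.
CONTAINER_URL = "https://dataset1000genomes.blob.core.windows.net/dataset"


def build_list_url(container_url, prefix="", max_results=20):
    """Build a 'List Blobs' URL for a container."""
    params = {
        "restype": "container",
        "comp": "list",
        "maxresults": str(max_results),
    }
    if prefix:
        params["prefix"] = prefix
    return container_url.rstrip("/") + "?" + urllib.parse.urlencode(params)


def parse_blob_names(listing_xml):
    """Extract every <Name> element from a List Blobs XML response."""
    root = ET.fromstring(listing_xml)
    return [element.text for element in root.iter("Name")]


def list_blobs(container_url=CONTAINER_URL, prefix="", max_results=20):
    """Fetch one page of blob names (requires network access).

    Assumption: anonymous listing is allowed on the container.
    """
    url = build_list_url(container_url, prefix, max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_blob_names(response.read())
```

Calling `list_blobs()` from a machine in one of the two storage regions should return the first page of blob names. Swapping in the West Central US URL only requires passing it as `container_url`.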
Key Publications
- Pilot Analysis: A map of human genome variation from population-scale sequencing - Nature 467, 1061-1073 (2010)
- Phase 1 Analysis: An integrated map of genetic variation from 1,092 human genomes - Nature 491, 56-65 (2012)
- Phase 3 Analysis: A global reference for human genetic variation - Nature 526, 68-74 (2015)
Parquet Format Access
The dataset is now available in optimized Parquet format for faster analysis.
Use Terms
Following the final publications, data from the 1000 Genomes Project is publicly available without embargo. Use of the data should be cited per details in the 1000 Genomes FAQ.
Genome Aggregation Database (gnomAD)
The Genome Aggregation Database (gnomAD) aggregates and harmonizes exome and genome sequencing data from large-scale sequencing projects.
Overview
- Data Volume: Approximately 30 TB
- Update Frequency: With each gnomAD release
- Hosted By: Broad Institute collaboration
- Storage Region: East US
Data Access
Storage Account: https://datasetgnomad.blob.core.windows.net/dataset/
The data is available publicly without restrictions. Use the AzCopy tool for bulk operations such as:
- Viewing the VCFs in release 3.0
- Downloading all VCFs recursively
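The two AzCopy operations can be sketched from Python as follows. `azcopy list` and `azcopy copy ... --recursive` are standard AzCopy commands, but the path `release/3.0/vcf/genomes` inside the container is an assumption made for illustration; list the container root first to confirm the actual layout. The helper names are hypothetical.

```python
import shlex
import subprocess

# Storage account URL from this section.
GNOMAD_CONTAINER = "https://datasetgnomad.blob.core.windows.net/dataset/"

# Assumed location of the release 3.0 VCFs; verify before relying on it.
RELEASE_3_VCF_PATH = "release/3.0/vcf/genomes"


def azcopy_list_command(path=RELEASE_3_VCF_PATH):
    """Return the argument list for listing blobs under a path."""
    return ["azcopy", "list", GNOMAD_CONTAINER + path]


def azcopy_download_command(destination, path=RELEASE_3_VCF_PATH):
    """Return the argument list for a recursive download to a local directory."""
    return ["azcopy", "copy", GNOMAD_CONTAINER + path, destination, "--recursive"]


def as_shell(argv):
    """Render an argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)


def run(argv):
    """Execute the command; requires AzCopy on PATH and network access."""
    subprocess.run(argv, check=True)


# Print the commands rather than running them, so they can be reviewed first.
print(as_shell(azcopy_list_command()))
print(as_shell(azcopy_download_command("./gnomad")))
```

Printing the commands before executing them is deliberate: a recursive download of a multi-terabyte release is worth double-checking. Pass either argument list to `run` once the path has been confirmed.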
Parquet Format (NEW)
Parquet versions of the gnomAD v2.1.1 VCF files (exomes and genomes) are now available. As with the VCFs, you can:
- View the Parquet files
- Download the Parquet files
Access with Azure Storage Explorer
You can also use Azure Storage Explorer to browse the list of files in the gnomAD release.
Use Terms
Data is available without restrictions. For more information and citation details, visit the gnomAD about page.
Contact
For questions or feedback, contact the gnomAD team.
ClinVar Annotations
ClinVar is a freely accessible public archive of reports about relationships between human variations and phenotypes.
Overview
- Source: National Library of Medicine
- Update Frequency: Daily
- Storage Regions: West US 2, West Central US
- Content: Clinical interpretations of genetic variants
Data Access URLs
- West US 2: https://datasetclinvar.blob.core.windows.net/dataset
- West Central US: https://datasetclinvar-secondary.blob.core.windows.net/dataset
Key Features
- Reports with supporting evidence about human variations and phenotypes
- Relationships between human variation and observed health status
- History of interpretations
- Broader set of clinical interpretations for genomics workflows
Data Source
This dataset is a mirror of the National Library of Medicine ClinVar FTP resource.
Documentation
Python Access Example
Download Specific Files
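One way to fetch a single file from the mirror is a plain HTTPS download with the Python standard library, sketched below. The base URLs are the ones listed under Data Access URLs. The relative path in the usage note is hypothetical: it assumes the mirror keeps the directory layout of the upstream ClinVar FTP site, which should be checked by browsing the container. The sketch also assumes anonymous read access.

```python
import shutil
import urllib.request

CLINVAR_PRIMARY = "https://datasetclinvar.blob.core.windows.net/dataset"
CLINVAR_SECONDARY = "https://datasetclinvar-secondary.blob.core.windows.net/dataset"


def blob_url(relative_path, base_url=CLINVAR_PRIMARY):
    """Join a container URL and a blob path without doubling slashes."""
    return base_url.rstrip("/") + "/" + relative_path.lstrip("/")


def download(relative_path, local_path, base_url=CLINVAR_PRIMARY):
    """Stream one blob to a local file (requires network access)."""
    url = blob_url(relative_path, base_url)
    with urllib.request.urlopen(url, timeout=60) as response:
        with open(local_path, "wb") as out:
            # copyfileobj streams in chunks, so large files are not held in memory.
            shutil.copyfileobj(response, out)
    return local_path
```

For example, `download("vcf_GRCh38/clinvar.vcf.gz", "clinvar.vcf.gz")` would save the GRCh38 VCF locally if that path exists in the mirror. Passing `base_url=CLINVAR_SECONDARY` targets the West Central US copy instead.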
Use Terms
Data is available without restrictions. For more information, see Accessing and using data in ClinVar.
Contact
For questions or feedback, email the ClinVar team.
Common Analysis Patterns
Query Variant Data
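A minimal sketch of a region query over VCF text, using only the standard library. The three records in `SAMPLE_VCF` are made up for illustration and are not taken from any of the datasets; real files are large and usually gzip-compressed, so a production workflow would more likely use a dedicated VCF library or the Parquet copies.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    info: dict


def parse_info(field):
    """Turn 'AF=0.1;DB' into {'AF': '0.1', 'DB': True}."""
    info = {}
    if field == ".":
        return info
    for item in field.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf(lines):
    """Yield Variant records, skipping header and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, _filter, info = cols[:8]
        yield Variant(chrom, int(pos), ref, alt, parse_info(info))


def query_region(variants, chrom, start, end):
    """Return variants on `chrom` with start <= pos <= end (1-based, inclusive)."""
    return [v for v in variants if v.chrom == chrom and start <= v.pos <= end]


# Made-up records for illustration only.
SAMPLE_VCF = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t1000\t.\tA\tG\t.\tPASS\tAF=0.25;EUR_AF=0.10;AFR_AF=0.40
1\t2500\t.\tC\tT\t.\tPASS\tAF=0.01;EUR_AF=0.02;AFR_AF=0.00
2\t1500\t.\tG\tA\t.\tPASS\tAF=0.50;EUR_AF=0.55;AFR_AF=0.45
"""

hits = query_region(parse_vcf(SAMPLE_VCF.splitlines()), "1", 500, 2000)
print([(v.chrom, v.pos, v.ref, v.alt) for v in hits])  # [('1', 1000, 'A', 'G')]
```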
Compare Populations
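A population comparison can be sketched as ranking variants by the difference in allele frequency between two groups. The example assumes each record carries per-population frequency fields named `EUR_AF` and `AFR_AF`, as in the INFO column of 1000 Genomes phase 3 VCFs; the records themselves are invented, and the function name and `min_delta` threshold are illustrative.

```python
def allele_frequency_differences(records, pop_a, pop_b, min_delta=0.0):
    """Rank variants by |AF(pop_a) - AF(pop_b)|, largest first.

    `records` is an iterable of dicts with a 'variant' label and one
    frequency field per population (values may be strings or floats).
    """
    results = []
    for record in records:
        delta = float(record[pop_a]) - float(record[pop_b])
        if abs(delta) >= min_delta:
            # Round to hide floating-point noise in the reported value.
            results.append((record["variant"], round(delta, 6)))
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


# Invented frequencies for illustration only.
RECORDS = [
    {"variant": "1:1000:A:G", "EUR_AF": "0.10", "AFR_AF": "0.40"},
    {"variant": "1:2500:C:T", "EUR_AF": "0.02", "AFR_AF": "0.00"},
    {"variant": "2:1500:G:A", "EUR_AF": "0.55", "AFR_AF": "0.45"},
]

top = allele_frequency_differences(RECORDS, "EUR_AF", "AFR_AF", min_delta=0.05)
print(top)  # [('1:1000:A:G', -0.3), ('2:1500:G:A', 0.1)]
```

A raw frequency difference is the simplest possible contrast. Studies of population differentiation typically use statistics such as F_ST instead, which this sketch does not attempt.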
Variant Effect Analysis
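"Variant effect analysis" covers many workflows. The sketch below takes a narrow reading: tallying clinical significance (`CLNSIG`) and molecular consequence (`MC`) from the INFO column of ClinVar VCF records. Both keys appear in ClinVar VCF files, but the INFO strings here are invented, and the `"missing"` bucket is a convention of this example rather than a ClinVar value.

```python
from collections import Counter


def info_value(info_field, key):
    """Return the value of `key` in a VCF INFO string, or None if absent."""
    for item in info_field.split(";"):
        name, sep, value = item.partition("=")
        if sep and name == key:
            return value
    return None


def consequence_terms(info_field):
    """Extract term names from MC values such as 'SO:0001583|missense_variant'."""
    raw = info_value(info_field, "MC")
    if raw is None:
        return []
    return [entry.split("|", 1)[1] for entry in raw.split(",") if "|" in entry]


def summarize(info_fields):
    """Count records by clinical significance and by molecular consequence."""
    by_significance = Counter()
    by_consequence = Counter()
    for info in info_fields:
        by_significance[info_value(info, "CLNSIG") or "missing"] += 1
        by_consequence.update(consequence_terms(info))
    return by_significance, by_consequence


# Invented INFO strings for illustration only.
SAMPLE_INFO = [
    "CLNSIG=Pathogenic;MC=SO:0001587|nonsense",
    "CLNSIG=Benign;MC=SO:0001819|synonymous_variant",
    "CLNSIG=Pathogenic;MC=SO:0001583|missense_variant",
    "MC=SO:0001583|missense_variant",
]

significance, consequence = summarize(SAMPLE_INFO)
print(significance.most_common())
print(consequence.most_common())
```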
Data Formats
VCF (Variant Call Format)
Standard format for storing gene sequence variations.
Parquet
Optimized columnar format for analytics:
- Faster query performance
- Better compression
- Schema evolution support
- Column pruning and predicate pushdown
Use Cases
Population Genetics Research
Study genetic variation across different populations to understand human evolution and migration patterns.
Clinical Variant Interpretation
Assess the clinical significance of genetic variants found in patient samples for diagnosis and treatment.
Drug Target Discovery
Identify genetic variants associated with disease to discover new drug targets and therapeutic approaches.
Genome-Wide Association Studies (GWAS)
Analyze genetic variants across the genome to identify associations with traits and diseases.
Variant Frequency Analysis
Compare allele frequencies across populations to understand genetic diversity and identify rare variants.
Tools and Resources
AzCopy
Command-line tool for bulk data transfer.
Azure Storage Explorer
GUI application for browsing and managing Azure Storage.
Python Libraries
Recommended libraries for genomics analysis:
Performance Optimization
- Parquet vs VCF
- Region Affinity
- Distributed Processing
Compared with VCF, Parquet format can provide significant performance improvements for analytical queries, depending on the workload:
- 10-100x faster query performance
- 50-80% better compression
- Column-level operations without reading entire files
Citations
1000 Genomes
gnomAD
ClinVar
Next Steps
Create ML Dataset
Learn how to create Azure ML datasets from genomics data
Browse Catalog
Explore other available datasets
Public Holidays
View public holidays dataset
COVID-19 Data
Access COVID-19 tracking data