Genomics Datasets

Azure Open Datasets hosts several large-scale genomics datasets for research in human genetic variation, population genomics, and clinical genetics.

Available Datasets

1000 Genomes

Public catalog of human variation and genotype data with 2,504 individuals from 26 populations

gnomAD

Genome Aggregation Database harmonizing exome and genome sequencing data

ClinVar

Public archive of relationships between human variations and phenotypes

ENCODE

Encyclopedia of DNA elements and functional genomics data

1000 Genomes Project

The 1000 Genomes Project ran between 2008 and 2015 to create the largest public catalog of human variation and genotype data.

Overview

  • Individuals: 2,504 from 26 populations
  • Variants Identified: 84 million
  • Data Volume: Approximately 815 TB
  • Update Frequency: Daily
  • Format: VCF, Parquet

Storage Location

This dataset is stored in the West US 2 and West Central US Azure regions. We recommend locating compute resources in these regions for affinity.

Data Access URLs

  • West US 2: https://dataset1000genomes.blob.core.windows.net/dataset
  • West Central US: https://dataset1000genomes-secondary.blob.core.windows.net/dataset
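To inspect a container before downloading anything, one option is the Blob Storage List Blobs REST operation. The sketch below builds the listing URL for either regional endpoint and parses blob names out of the XML response. It assumes the container permits anonymous listing, which this page does not confirm. The function names and the region keys are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Container URLs copied from the list above; the dictionary keys are illustrative.
CONTAINERS = {
    "westus2": "https://dataset1000genomes.blob.core.windows.net/dataset",
    "westcentralus": "https://dataset1000genomes-secondary.blob.core.windows.net/dataset",
}


def build_list_url(region, prefix="", max_results=100):
    """Build a List Blobs request URL for the chosen regional container."""
    query = {"restype": "container", "comp": "list", "maxresults": str(max_results)}
    if prefix:
        query["prefix"] = prefix
    return CONTAINERS[region] + "?" + urllib.parse.urlencode(query)


def parse_blob_names(xml_body):
    """Extract blob names from a List Blobs XML response (str or bytes)."""
    root = ET.fromstring(xml_body)
    return [name.text for name in root.findall("./Blobs/Blob/Name")]


def list_blobs(region, prefix=""):
    """Fetch one page of blob names. This performs a network request."""
    with urllib.request.urlopen(build_list_url(region, prefix)) as response:
        return parse_blob_names(response.read())
```

Calling `list_blobs("westus2")` would return the first page of names if anonymous listing is allowed. Larger listings need the continuation marker from the response, which this sketch does not handle.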

Key Publications

  1. Pilot Analysis: A map of human genome variation from population-scale sequencing - Nature 467, 1061-1073 (2010)
  2. Phase 1 Analysis: An integrated map of genetic variation from 1,092 human genomes - Nature 491, 56-65 (2012)
  3. Phase 3 Analysis: A global reference for human genetic variation - Nature 526, 68-74 (2015)

Parquet Format Access

The dataset is now available in optimized Parquet format for faster analysis:

import pandas as pd
import pyarrow.parquet as pq

# Read parquet files directly
df = pd.read_parquet(
    "https://dataset1000genomes.blob.core.windows.net/dataset/parquet/..."
)

For more details on the Parquet conversion, see the genomicsnotebook repository.

Use Terms

Following the final publications, data from the 1000 Genomes Project is publicly available without embargo. Use of the data should be cited per details in the 1000 Genomes FAQ.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) aggregates and harmonizes exome and genome sequencing data from large-scale sequencing projects.

Overview

  • Data Volume: Approximately 30 TB
  • Update Frequency: With each gnomAD release
  • Hosted By: Broad Institute collaboration
  • Storage Region: East US

Data Access

Storage Account: https://datasetgnomad.blob.core.windows.net/dataset/

The data is publicly available without restrictions. Use the AzCopy tool for bulk operations:

View VCFs in Release 3.0

azcopy ls https://datasetgnomad.blob.core.windows.net/dataset/release/3.0/vcf/genomes

Download All VCFs Recursively

azcopy cp --recursive=true https://datasetgnomad.blob.core.windows.net/dataset/release/3.0/vcf/genomes .

Parquet Format (NEW)

Parquet format of gnomAD v2.1.1 VCF files (exomes and genomes) is now available:

View Parquet Files

azcopy ls https://datasetgnomadparquet.blob.core.windows.net/dataset

Download Parquet Files

azcopy cp --recursive=true https://datasetgnomadparquet.blob.core.windows.net/dataset .
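When transfers are scripted from Python, the AzCopy commands above can be assembled as argument lists and handed to `subprocess`. This is a minimal sketch, assuming `azcopy` is on the PATH. The helper names are hypothetical, and only the `ls`, `cp`, and `--recursive=true` forms shown above are used.

```python
import shlex
import subprocess

GNOMAD_PARQUET = "https://datasetgnomadparquet.blob.core.windows.net/dataset"


def azcopy_ls(source):
    """Argument list for listing the contents of a container or path."""
    return ["azcopy", "ls", source]


def azcopy_cp(source, destination=".", recursive=True):
    """Argument list for copying from source to destination."""
    command = ["azcopy", "cp"]
    if recursive:
        command.append("--recursive=true")
    command += [source, destination]
    return command


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)


# Dry run: prints the commands without starting a transfer.
run(azcopy_ls(GNOMAD_PARQUET))
run(azcopy_cp(GNOMAD_PARQUET))
```

Keeping the default at `dry_run=True` makes it harder to start a multi-terabyte download by accident.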

Access with Azure Storage Explorer

You can also use Azure Storage Explorer to browse the list of files in the gnomAD release.

Use Terms

Data is available without restrictions. For more information and citation details, visit the gnomAD about page.

Contact

For questions or feedback, contact the gnomAD team.

ClinVar Annotations

ClinVar is a freely accessible public archive of reports about relationships between human variations and phenotypes.

Overview

  • Source: National Library of Medicine
  • Update Frequency: Daily
  • Storage Regions: West US 2, West Central US
  • Content: Clinical interpretations of genetic variants

Data Access URLs

  • West US 2: https://datasetclinvar.blob.core.windows.net/dataset
  • West Central US: https://datasetclinvar-secondary.blob.core.windows.net/dataset

Key Features

  • Reports with supporting evidence about human variations and phenotypes
  • Relationships between human variation and observed health status
  • History of interpretations
  • Broader set of clinical interpretations for genomics workflows

Data Source

This dataset is a mirror of the National Library of Medicine ClinVar FTP resource.

Python Access Example

from azureml.core import Dataset
import os
import pandas as pd

# Access ClinVar dataset
reference_dataset = Dataset.File.from_files(
    'https://datasetclinvar.blob.core.windows.net/dataset'
)
mount = reference_dataset.mount()

REF_DIR = '/dataset'
path = mount.mount_point + REF_DIR

with mount:
    print(os.listdir(path))

Download Specific Files

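# BlockBlobService belongs to the legacy azure-storage-blob SDK (v2.1 and earlier);
# it was removed in v12, so pin azure-storage-blob==2.1.0 to run this snippet.
from azure.storage.blob import BlockBlobService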

blob_service_client = BlockBlobService(
    account_name='datasetclinvar',
    sas_token='sv=2019-02-02&se=2050-01-01T08%3A00%3A00Z&si=prod&sr=c&sig=qFPPwPba1RmBvaffkzkLuzabYU5dZstSTgMwxuLNME8%3D'
)

blob_service_client.get_blob_to_path(
    'dataset',
    'ClinVarFullRelease_00-latest.xml.gz.md5',
    './ClinVarFullRelease_00-latest.xml.gz.md5'
)

Use Terms

Data is available without restrictions. For more information, see Accessing and using data in ClinVar.

Contact

For questions or feedback, contact the ClinVar team at the National Library of Medicine.

Common Analysis Patterns

Query Variant Data

import pandas as pd
import pyarrow.parquet as pq

# Read variant data from parquet
df = pd.read_parquet("path/to/variants.parquet")

# Filter by chromosome
chr1_variants = df[df['CHROM'] == '1']

# Filter by allele frequency
common_variants = df[df['AF'] > 0.01]

# Filter by quality score
high_quality = df[df['QUAL'] > 30]

Compare Populations

# Compare allele frequencies across populations
population_comparison = df.groupby('POP').agg({
    'AF': 'mean',
    'AC': 'sum',
    'AN': 'sum'
})

print("Population Statistics:")
print(population_comparison)
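Averaging per-variant AF values weights every variant equally, regardless of how many alleles were called at each site. A pooled frequency, the summed allele count divided by the summed allele number, is an alternative summary. The sketch below contrasts the two. It assumes a long-format table with one row per variant per population and the POP, AC, and AN columns used above. The rows are made up for illustration.

```python
import pandas as pd

# Illustrative, made-up rows: one row per (variant, population).
df = pd.DataFrame(
    {
        "POP": ["AFR", "AFR", "EUR", "EUR"],
        "AC": [10, 30, 5, 15],
        "AN": [100, 200, 100, 100],
    }
)
df["AF"] = df["AC"] / df["AN"]

# Per-population mean of per-variant AF, plus summed counts.
summary = df.groupby("POP").agg(
    mean_AF=("AF", "mean"),
    AC=("AC", "sum"),
    AN=("AN", "sum"),
)

# Pooled frequency: total alternate alleles over total called alleles.
summary["pooled_AF"] = summary["AC"] / summary["AN"]
print(summary)
```

The two summaries diverge whenever AN differs between variants, as it does for the AFR rows here.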

Variant Effect Analysis

# Analyze variant consequences
consequence_counts = df['Consequence'].value_counts()

print("\nTop 10 Variant Consequences:")
print(consequence_counts.head(10))

# Filter for high-impact variants
high_impact = df[
    df['IMPACT'].isin(['HIGH', 'MODERATE'])
]

Data Formats

VCF (Variant Call Format)

Standard format for storing gene sequence variations:

##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO                    FORMAT  SAMPLE
1       10177   .       A       AC      100     PASS    AC=2130;AF=0.425319     GT      0|1
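To show how the columns of a VCF data line map to fields, here is a minimal parser sketch using only the standard library. It assumes tab-delimited input, as in real VCF files, and keeps INFO values as strings. The function names are hypothetical. For real files, a dedicated parser such as pysam or scikit-allel is likely a better choice.

```python
def parse_info(info_field):
    """Turn 'AC=2130;AF=0.425319' into a dict; flag-style keys map to True."""
    info = {}
    if info_field == ".":
        return info
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf_line(line, sample_names):
    """Parse one tab-delimited VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),  # multiple alternate alleles are comma-separated
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt,
        "INFO": parse_info(info),
        "samples": {},
    }
    if len(fields) > 9:
        # FORMAT keys and per-sample values are both colon-separated.
        keys = fields[8].split(":")
        for name, raw in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, raw.split(":")))
    return record


# The example record from above, written with explicit tabs.
line = "1\t10177\t.\tA\tAC\t100\tPASS\tAC=2130;AF=0.425319\tGT\t0|1"
record = parse_vcf_line(line, ["SAMPLE"])
print(record["INFO"]["AF"], record["samples"]["SAMPLE"]["GT"])
```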

Parquet

Optimized columnar format for analytics:
  • Faster query performance
  • Better compression
  • Schema evolution support
  • Column pruning and predicate pushdown

Use Cases

  • Study genetic variation across different populations to understand human evolution and migration patterns.
  • Assess the clinical significance of genetic variants found in patient samples for diagnosis and treatment.
  • Identify genetic variants associated with disease to discover new drug targets and therapeutic approaches.
  • Analyze genetic variants across the genome to identify associations with traits and diseases.
  • Compare allele frequencies across populations to understand genetic diversity and identify rare variants.

Tools and Resources

AzCopy

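Command-line tool for bulk data transfer:

# Install AzCopy (extracts the azcopy binary into the current directory)
wget https://aka.ms/downloadazcopy-v10-linux
tar -xvf downloadazcopy-v10-linux --strip-components=1

# Use AzCopy to download data (the full dataset is approximately 815 TB;
# narrow the source path before running)
./azcopy copy "https://dataset1000genomes.blob.core.windows.net/dataset/*" "." --recursive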

Azure Storage Explorer

GUI application for browsing and managing Azure Storage.

Python Libraries

Recommended libraries for genomics analysis:

pip install pandas pyarrow pysam biopython scikit-allel

Performance Optimization

Compared with text VCF, Parquet format can provide significant performance improvements, although actual gains depend on the query and the data:
  • 10-100x faster query performance
  • 50-80% better compression
  • Column-level operations without reading entire files

Citations

1000 Genomes

The 1000 Genomes Project Consortium.
A global reference for human genetic variation.
Nature 526, 68-74 (2015).
https://doi.org/10.1038/nature15393

gnomAD

Karczewski, K.J., Francioli, L.C., Tiao, G. et al.
The mutational constraint spectrum quantified from variation in 141,456 humans.
Nature 581, 434–443 (2020).
https://doi.org/10.1038/s41586-020-2308-7

ClinVar

Landrum, M.J., Lee, J.M., Benson, M. et al.
ClinVar: improving access to variant interpretations and supporting evidence.
Nucleic Acids Res. 46(D1):D1062-D1067 (2018).
https://doi.org/10.1093/nar/gkx1153
