
Supported Data Formats

HeartMAP supports standard single-cell RNA-seq data formats used in the bioinformatics community.

Primary Format: AnnData (.h5ad)

HeartMAP primarily works with AnnData objects stored in .h5ad (HDF5) format, which is the standard format used by Scanpy and the single-cell analysis ecosystem.
AnnData is a Python package for handling annotated data matrices in memory and on disk. It’s specifically designed for single-cell data and provides efficient storage and manipulation.

Required Data Structure

Your .h5ad file should contain:
  • Expression matrix (.X): Cells × Genes count data (AnnData stores cells as rows and genes as columns)
  • Cell metadata (.obs): Cell annotations, quality metrics
  • Gene metadata (.var): Gene names, feature information
  • Chamber information: For multi-chamber analysis, include chamber labels (RA, RV, LA, LV)

Example Data Loading

import scanpy as sc

# Read your data
adata = sc.read_h5ad('your_heart_data.h5ad')

# Verify data structure
print(f"Cells: {adata.n_obs}")
print(f"Genes: {adata.n_vars}")
print(f"Cell metadata columns (.obs): {adata.obs.columns.tolist()}")

Reference Dataset: Single Cell Portal SCP498

HeartMAP was developed and validated using the comprehensive human heart dataset available at Single Cell Portal.

Dataset Details

Human Heart Cell Atlas

Single Cell Portal: SCP498
  • Total cells: 287,269 single cells
  • Donors: 7 healthy human hearts
  • Chambers: All four cardiac chambers (RA, RV, LA, LV)
  • Technology: Single-cell RNA-sequencing

Chamber Distribution

The SCP498 dataset contains cells from all four cardiac chambers:
| Chamber | Cell Count | Percentage |
|---|---|---|
| Right Atrium (RA) | ~81,500 | 28.4% |
| Left Ventricle (LV) | ~77,500 | 27.0% |
| Left Atrium (LA) | ~75,800 | 26.4% |
| Right Ventricle (RV) | ~52,300 | 18.2% |
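
To produce the same kind of summary for your own data, a minimal sketch along these lines may help. The helper name `chamber_distribution` is hypothetical and not part of HeartMAP; it accepts any iterable of labels, so a column such as `adata.obs["chamber"]` could be passed directly.

```python
from collections import Counter


def chamber_distribution(labels):
    """Return {chamber: (count, percent)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        chamber: (n, round(100 * n / total, 1))
        for chamber, n in counts.most_common()
    }


# Toy labels; with real data, pass e.g. adata.obs["chamber"].
labels = ["RA"] * 3 + ["LV"] * 3 + ["LA"] * 2 + ["RV"] * 2
for chamber, (n, pct) in chamber_distribution(labels).items():
    print(f"{chamber}: {n} cells ({pct}%)")
```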

Chamber-Specific Markers

Key molecular markers identified in the dataset:
  • NPPA: Atrial natriuretic peptide
  • MIR100HG: MicroRNA host gene
  • MYL7: Myosin light chain 7
  • MYL4: Myosin light chain 4
  • PDE4D: Phosphodiesterase 4D
  • NEAT1: Nuclear paraspeckle assembly transcript 1
  • MYH7: Myosin heavy chain 7
  • FHL2: Four and a half LIM domains 2
  • C15orf41: Chromosome 15 open reading frame 41
  • PCDH7: Protocadherin 7
  • NPPA: Atrial natriuretic peptide
  • ELN: Elastin
  • MYL7: Myosin light chain 7
  • EBF2: Early B-cell factor 2
  • RORA: RAR-related orphan receptor A
  • CD36: Fatty acid translocase
  • LINC00486: Long intergenic non-protein coding RNA
  • FHL2: Four and a half LIM domains 2
  • RP11-532N4.2: Long non-coding RNA
  • MYH7: Myosin heavy chain 7

Preparing Your Own Data

Data Requirements

For optimal results with HeartMAP, your data should meet these criteria:
1. Quality Control

  • Minimum genes per cell: 200 (configurable)
  • Minimum cells per gene: 3 (configurable)
  • Quality filtered for doublets and low-quality cells
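
The two thresholds above can be sketched on a plain NumPy matrix as follows. This is an illustration of the filtering logic, not HeartMAP's own implementation; `qc_filter` is a hypothetical helper, and doublet detection is not covered. On an AnnData object, Scanpy's `sc.pp.filter_cells(adata, min_genes=200)` and `sc.pp.filter_genes(adata, min_cells=3)` apply the same kind of thresholds.

```python
import numpy as np


def qc_filter(counts, min_genes=200, min_cells=3):
    """Filter a dense cells x genes count matrix.

    Returns the filtered matrix plus boolean masks of kept cells and genes.
    """
    counts = np.asarray(counts)
    # Keep cells that express at least `min_genes` genes.
    keep_cells = (counts > 0).sum(axis=1) >= min_genes
    filtered = counts[keep_cells]
    # Then keep genes detected in at least `min_cells` of the remaining cells.
    keep_genes = (filtered > 0).sum(axis=0) >= min_cells
    return filtered[:, keep_genes], keep_cells, keep_genes


# Toy example with small thresholds so the effect is visible.
toy = np.array([[1, 0, 2, 0],
                [3, 1, 0, 0],
                [0, 0, 0, 1],
                [2, 4, 1, 0]])
filtered, keep_cells, keep_genes = qc_filter(toy, min_genes=2, min_cells=2)
print(filtered.shape)
```

Cells are filtered first and genes second, so a gene detected only in removed cells is dropped as well.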

2. Data Format

  • Stored as .h5ad file
  • Raw or normalized counts in .X
  • Cell metadata in .obs
  • Gene names in .var
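
Because `.X` may hold either raw or normalized values, it can be useful to check which one a file contains before analysis. The heuristic below is only a sketch under the assumption that raw counts are non-negative integers; `looks_like_raw_counts` is a hypothetical helper, not a HeartMAP function.

```python
import numpy as np


def looks_like_raw_counts(values):
    """Heuristic: raw counts are non-negative whole numbers.

    Pass a dense array such as adata.X, or the stored non-zero values
    (adata.X.data) when the matrix is a SciPy sparse matrix.
    """
    values = np.asarray(values, dtype=float).ravel()
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


raw = np.array([[0, 3], [1, 7]])
print(looks_like_raw_counts(raw))            # integer counts
print(looks_like_raw_counts(np.log1p(raw)))  # log-transformed values
```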

3. Chamber Annotation

For multi-chamber analysis, include chamber labels:
  • Column name: chamber or region in .obs
  • Labels: RA, RV, LA, LV (standard abbreviations)
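
A quick pre-flight check for the annotation can be sketched like this. Both helper names are hypothetical; the functions work on a pandas DataFrame such as `adata.obs` or on a plain dict mapping column names to lists of labels.

```python
VALID_CHAMBERS = {"RA", "RV", "LA", "LV"}


def find_chamber_column(obs, candidates=("chamber", "region")):
    """Return the first candidate column name present in `obs`."""
    for name in candidates:
        if name in obs:
            return name
    raise KeyError(f"none of {candidates} found in cell metadata")


def unexpected_chamber_labels(obs):
    """Return the sorted labels that are not one of RA, RV, LA, LV."""
    column = find_chamber_column(obs)
    # Convert to str so missing values (NaN) sort alongside text labels.
    return sorted({str(value) for value in obs[column]} - VALID_CHAMBERS)


obs = {"region": ["RA", "LV", "left ventricle", "RV"]}
print(unexpected_chamber_labels(obs))
```

An empty result means every cell carries one of the four standard abbreviations; anything else lists the labels that would need to be renamed first.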

Converting Other Formats

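The example below converts a Seurat object (R) to .h5ad using anndata2ri and rpy2:

import anndata2ri
from rpy2.robjects import r

# Enable automatic SingleCellExperiment -> AnnData conversion in rpy2
# (requires an R installation with Seurat and SingleCellExperiment)
anndata2ri.activate()

# Load Seurat object from R
r('library(Seurat)')
r('seurat_obj <- readRDS("your_data.rds")')

# Convert Seurat -> SingleCellExperiment in R; anndata2ri hands it back as AnnData
adata = r('as.SingleCellExperiment(seurat_obj)')

# Save as h5ad
adata.write_h5ad('converted_data.h5ad')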

Example Datasets

Mock Data for Testing

HeartMAP includes functionality to generate mock data for testing:
from heartmap import Config
from heartmap.pipelines import ComprehensivePipeline

# Create configuration with test mode
config = Config.default()
config.data.test_mode = True

# Pipeline will generate mock data automatically
pipeline = ComprehensivePipeline(config)
results = pipeline.run('mock_data', 'test_results/')

Data Size Guidelines

Choose appropriate dataset sizes based on your system:
| System RAM | Recommended Cells | Recommended Genes | Analysis Time |
|---|---|---|---|
| 8 GB | 10,000 | 2,000 | 5-10 minutes |
| 16 GB | 30,000 | 4,000 | 10-20 minutes |
| 32 GB | 50,000 | 5,000 | 20-30 minutes |
| 64 GB+ | 100,000+ | 10,000+ | 30-60 minutes |
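
The guidelines can be turned into a small lookup, sketched below. The function names are hypothetical, and the memory estimate assumes a dense float32 matrix; it covers only the expression matrix itself, while the analysis needs considerably more working memory.

```python
# (system RAM in GB, recommended cells, recommended genes), from the table above.
SIZE_GUIDELINES = [
    (8, 10_000, 2_000),
    (16, 30_000, 4_000),
    (32, 50_000, 5_000),
    (64, 100_000, 10_000),  # the table lists these as lower bounds ("+")
]


def recommended_size(ram_gb):
    """Return (cells, genes) for the largest tier that fits in `ram_gb`."""
    fitting = [(cells, genes) for ram, cells, genes in SIZE_GUIDELINES if ram <= ram_gb]
    if not fitting:
        raise ValueError("the guidelines start at 8 GB of RAM")
    return fitting[-1]


def dense_matrix_gb(cells, genes, bytes_per_value=4):
    """Memory of a dense cells x genes matrix in GB (10**9 bytes)."""
    return cells * genes * bytes_per_value / 1e9


cells, genes = recommended_size(16)
print(cells, genes, dense_matrix_gb(cells, genes))
```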

Data Integrity

Checksums

HeartMAP supports SHA-256 checksums for data verification:
# Generate checksums for your data
python utils/sha256_checksum.py generate data/raw

# Verify data integrity
python utils/sha256_checksum.py verify data/raw data/raw/checksums.txt
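
For readers who want to check a single file from Python, a SHA-256 digest can be computed with the standard library as sketched below. This is not the code of `utils/sha256_checksum.py`, whose internals and checksum file format are not shown here; the function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large .h5ad files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use constant regardless of file size, which matters for multi-gigabyte datasets.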

Reproducibility

All analyses use fixed random seeds to ensure reproducible results:
  • Random sampling: seed = 42
  • Clustering algorithms: random_state = 42
  • Communication analysis: seed = 42
You can customize the random seed in your configuration file to use a different seed while maintaining reproducibility.
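
The effect of a fixed seed can be illustrated with a small subsampling sketch. The helper `subsample_indices` is hypothetical and only demonstrates the principle; how HeartMAP passes the seed internally is not shown here. Scanpy functions such as `sc.tl.leiden` accept a `random_state` argument for the same purpose.

```python
import numpy as np

SEED = 42


def subsample_indices(n_cells, n_keep, seed=SEED):
    """Pick `n_keep` distinct cell indices; the same seed gives the same pick."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_cells, size=n_keep, replace=False))


first = subsample_indices(1_000, 10)
second = subsample_indices(1_000, 10)
print(np.array_equal(first, second))
```

Creating a fresh generator from the seed inside the function, rather than relying on global state, keeps the result independent of whatever random calls ran earlier.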

Need Help?

If you’re having trouble with your data format:
