Supported Data Formats
HeartMAP supports standard single-cell RNA-seq data formats used in the bioinformatics community.
Primary Format: AnnData (.h5ad)
HeartMAP primarily works with AnnData objects stored in .h5ad (HDF5-based) format, the standard format used by Scanpy and the wider single-cell analysis ecosystem.
AnnData is a Python package for handling annotated data matrices in memory and on disk. It’s specifically designed for single-cell data and provides efficient storage and manipulation.
Required Data Structure
Your .h5ad file should contain:
- Expression matrix (.X): Cells × Genes count data (AnnData stores cells as rows and genes as columns)
- Cell metadata (.obs): Cell annotations, quality metrics
- Gene metadata (.var): Gene names, feature information
- Chamber information: For multi-chamber analysis, include chamber labels (RA, RV, LA, LV)
Example Data Loading
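A minimal sketch of loading and inspecting a file, assuming the `anndata` package is installed. The path `data/heart.h5ad` and the `chamber` column name are placeholders, not names HeartMAP is known to require, and `load_h5ad` and `summarize` are illustrative helpers rather than part of HeartMAP.

```python
def load_h5ad(path):
    """Read an .h5ad file into an AnnData object."""
    # Imported lazily: summarize() only needs attribute access, so it works
    # on any AnnData-like object even where anndata is not installed.
    import anndata as ad
    return ad.read_h5ad(path)


def summarize(adata, chamber_key="chamber"):
    """Return basic dimensions and the chamber labels found in .obs.

    chamber_key is a hypothetical column name; use whatever your data uses.
    """
    chambers = []
    if chamber_key in adata.obs:  # membership test checks .obs column names
        chambers = sorted(set(adata.obs[chamber_key]))
    return {"n_cells": adata.n_obs, "n_genes": adata.n_vars, "chambers": chambers}


# Example usage (placeholder path):
# adata = load_h5ad("data/heart.h5ad")
# print(summarize(adata))
```

For a four-chamber dataset, the `chambers` entry of the summary should list LA, LV, RA, and RV; an empty list means the chamber column is missing or named differently.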
Reference Dataset: Single Cell Portal SCP498
HeartMAP was developed and validated using the comprehensive human heart dataset available at Single Cell Portal.
Dataset Details
Human Heart Cell Atlas
Single Cell Portal: SCP498
- Total cells: 287,269 single cells
- Donors: 7 healthy human hearts
- Chambers: All four cardiac chambers (RA, RV, LA, LV)
- Technology: Single-cell RNA-sequencing
Chamber Distribution
The SCP498 dataset contains cells from all four cardiac chambers:
| Chamber | Cell Count | Percentage |
|---|---|---|
| Right Atrium (RA) | ~81,500 | 28.4% |
| Left Ventricle (LV) | ~77,500 | 27.0% |
| Left Atrium (LA) | ~75,800 | 26.4% |
| Right Ventricle (RV) | ~52,300 | 18.2% |
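The percentages in the table are consistent with dividing each chamber's cell count by the total of 287,269 cells. A minimal sketch of the same calculation for your own data, assuming chamber labels are available as a plain sequence (for an AnnData object, something like `adata.obs["chamber"]`, where the column name is an assumption). The function name `chamber_distribution` is illustrative.

```python
from collections import Counter


def chamber_distribution(labels):
    """Return {chamber: (count, percent)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        chamber: (n, round(100 * n / total, 1))
        for chamber, n in counts.most_common()
    }


# Toy labels for illustration only (not SCP498 data).
labels = ["RA", "RA", "LV", "LA", "RV", "RA", "LV", "LA"]
distribution = chamber_distribution(labels)
print(distribution)
```

A strongly skewed distribution is worth noting before any cross-chamber comparison, since small chambers contribute fewer cells to each cluster.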
Chamber-Specific Markers
Key molecular markers identified in the dataset:
Right Atrium (RA)
- NPPA: Atrial natriuretic peptide
- MIR100HG: MicroRNA host gene
- MYL7: Myosin light chain 7
- MYL4: Myosin light chain 4
- PDE4D: Phosphodiesterase 4D
Right Ventricle (RV)
- NEAT1: Nuclear paraspeckle assembly transcript 1
- MYH7: Myosin heavy chain 7
- FHL2: Four and a half LIM domains 2
- C15orf41: Chromosome 15 open reading frame 41
- PCDH7: Protocadherin 7
Left Atrium (LA)
- NPPA: Atrial natriuretic peptide
- ELN: Elastin
- MYL7: Myosin light chain 7
- EBF2: Early B-cell factor 2
- RORA: RAR-related orphan receptor A
Left Ventricle (LV)
- CD36: Fatty acid translocase
- LINC00486: Long intergenic non-protein coding RNA
- FHL2: Four and a half LIM domains 2
- RP11-532N4.2: Long non-coding RNA
- MYH7: Myosin heavy chain 7
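Before running a chamber-level analysis, it can help to confirm that these markers are present in your gene annotation. The sketch below copies the marker lists above into a dictionary and reports which symbols are absent from a gene list. `CHAMBER_MARKERS` and `missing_markers` are illustrative names, not part of HeartMAP.

```python
# Marker genes per chamber, copied from the lists above.
CHAMBER_MARKERS = {
    "RA": ["NPPA", "MIR100HG", "MYL7", "MYL4", "PDE4D"],
    "RV": ["NEAT1", "MYH7", "FHL2", "C15orf41", "PCDH7"],
    "LA": ["NPPA", "ELN", "MYL7", "EBF2", "RORA"],
    "LV": ["CD36", "LINC00486", "FHL2", "RP11-532N4.2", "MYH7"],
}


def missing_markers(gene_names, markers=CHAMBER_MARKERS):
    """Return {chamber: [markers absent from gene_names]} for chambers with gaps."""
    available = set(gene_names)
    missing = {}
    for chamber, genes in markers.items():
        absent = [gene for gene in genes if gene not in available]
        if absent:
            missing[chamber] = absent
    return missing


# Toy gene list for illustration; with AnnData you would pass adata.var_names.
toy_genes = ["NPPA", "MYL7", "MYL4", "PDE4D", "MIR100HG", "MYH7"]
report = missing_markers(toy_genes)
print(report)
```

Symbols for non-coding transcripts such as RP11-532N4.2 can differ between annotation releases, so a reported gap may reflect naming rather than a truly absent gene.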
Preparing Your Own Data
Data Requirements
For optimal results with HeartMAP, your data should meet these criteria:
Quality Control
- Minimum genes per cell: 200 (configurable)
- Minimum cells per gene: 3 (configurable)
- Quality filtered for doublets and low-quality cells
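The two thresholds can be illustrated on a plain count matrix. This is a sketch using NumPy under the assumption that rows are cells and columns are genes; it is not HeartMAP's own filtering code, and `qc_filter` is a hypothetical helper. On an AnnData object, Scanpy's `sc.pp.filter_cells(adata, min_genes=200)` and `sc.pp.filter_genes(adata, min_cells=3)` apply the same kind of thresholds. Doublet removal needs a dedicated tool and is not covered here.

```python
import numpy as np


def qc_filter(counts, min_genes=200, min_cells=3):
    """Filter a cells x genes count matrix by the two thresholds above.

    Returns the filtered matrix plus boolean masks of kept cells and genes.
    """
    counts = np.asarray(counts)
    # Keep cells that express (count > 0) at least min_genes genes.
    keep_cells = (counts > 0).sum(axis=1) >= min_genes
    filtered = counts[keep_cells]
    # Among the remaining cells, keep genes detected in at least min_cells cells.
    keep_genes = (filtered > 0).sum(axis=0) >= min_cells
    return filtered[:, keep_genes], keep_cells, keep_genes


# Tiny toy matrix with lowered thresholds so the effect is visible.
toy = np.array([
    [1, 0, 2, 0],
    [0, 0, 0, 1],
    [3, 1, 0, 0],
    [1, 1, 1, 0],
])
filtered, keep_cells, keep_genes = qc_filter(toy, min_genes=2, min_cells=2)
print(filtered.shape)
```

Cells are filtered first, so a gene detected only in discarded cells is dropped as well.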
Data Format
- Stored as .h5ad file
- Raw or normalized counts in .X
- Cell metadata in .obs
- Gene names in .var
Converting Other Formats
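A hedged sketch of converting common inputs to .h5ad with Scanpy's readers (`sc.read_10x_mtx`, `sc.read_10x_h5`, `sc.read_csv`) and `AnnData.write_h5ad`. The helper names `detect_format` and `convert_to_h5ad` are illustrative, and treating every `.h5` file as 10x Genomics output is an assumption of this example.

```python
from pathlib import Path


def detect_format(src):
    """Guess the input format from the path and return a short label."""
    src = Path(src)
    if src.is_dir():
        return "10x_mtx"  # directory holding matrix, barcodes, and features files
    suffix = src.suffix.lower()
    if suffix == ".h5":
        return "10x_h5"  # assumption: .h5 means a 10x Genomics HDF5 file
    if suffix == ".csv":
        return "csv"
    raise ValueError(f"Unrecognized input format: {src}")


def convert_to_h5ad(src, dst):
    """Read src with the matching Scanpy reader and write it to dst as .h5ad."""
    import scanpy as sc  # imported lazily; only needed for the actual conversion

    readers = {
        "10x_mtx": sc.read_10x_mtx,
        "10x_h5": sc.read_10x_h5,
        "csv": sc.read_csv,  # treats rows as cells; transpose if genes are rows
    }
    adata = readers[detect_format(src)](src)
    adata.write_h5ad(dst)
    return adata


# Example usage (placeholder paths):
# convert_to_h5ad("filtered_feature_bc_matrix/", "data/heart.h5ad")
```

After conversion, add any chamber labels to `.obs` before saving if you plan a multi-chamber analysis.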
Example Datasets
Mock Data for Testing
HeartMAP includes functionality to generate mock data for testing.
Data Size Guidelines
Choose appropriate dataset sizes based on your system:
| System RAM | Recommended Cells | Recommended Genes | Analysis Time |
|---|---|---|---|
| 8GB | 10,000 | 2,000 | 5-10 minutes |
| 16GB | 30,000 | 4,000 | 10-20 minutes |
| 32GB | 50,000 | 5,000 | 20-30 minutes |
| 64GB+ | 100,000+ | 10,000+ | 30-60 minutes |
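The table can be turned into a small lookup when choosing a subsampling target in a script. This is an illustrative sketch: `SIZE_GUIDELINES` and `recommended_size` are hypothetical names, and the behavior below 8 GB is an assumption, since the table gives no guidance there.

```python
# (minimum RAM in GB, recommended cells, recommended genes), largest tier first.
# Values are the lower bounds from the table above.
SIZE_GUIDELINES = [
    (64, 100_000, 10_000),
    (32, 50_000, 5_000),
    (16, 30_000, 4_000),
    (8, 10_000, 2_000),
]


def recommended_size(ram_gb):
    """Return (cells, genes) for the largest tier that fits in ram_gb."""
    for min_ram, cells, genes in SIZE_GUIDELINES:
        if ram_gb >= min_ram:
            return cells, genes
    # The table gives no guidance below 8 GB; falling back to the smallest
    # tier is an assumption of this sketch.
    return SIZE_GUIDELINES[-1][1:]


print(recommended_size(24))  # a 24 GB machine falls in the 16 GB tier
```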
Data Integrity
Checksums
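Verifying a data file against a published SHA-256 digest needs only Python's standard library. The sketch below is not HeartMAP's own verification routine; `sha256sum` and `verify` are illustrative helper names, and the temporary file stands in for a real dataset.

```python
import hashlib
import os
import tempfile


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's digest matches expected_hex (case-insensitive)."""
    return sha256sum(path) == expected_hex.strip().lower()


# Self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256sum(tmp.name)
demo_ok = verify(tmp.name, demo_digest.upper())
os.remove(tmp.name)
print(demo_digest, demo_ok)
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte .h5ad files.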
HeartMAP supports SHA-256 checksums for data verification.
Reproducibility
All analyses use fixed random seeds to ensure reproducible results:
- Random sampling: seed = 42
- Clustering algorithms: random_state = 42
- Communication analysis: seed = 42
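The effect of a fixed seed can be shown with the standard library alone. This sketch is not HeartMAP's sampling code: `sample_cells` is a hypothetical helper and the cell IDs are made up. With Scanpy, the analogous control is the `random_state` argument accepted by functions such as `sc.tl.leiden`.

```python
import random

SEED = 42  # the value quoted in the list above


def sample_cells(cell_ids, n, seed=SEED):
    """Draw n cell IDs without replacement using a dedicated seeded generator."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return rng.sample(list(cell_ids), n)


cells = [f"cell_{i}" for i in range(1000)]
first = sample_cells(cells, 5)
second = sample_cells(cells, 5)
print(first == second)  # True: same seed and same input give the same draw
```

Using a local generator rather than seeding the global one keeps the result independent of any other code that draws random numbers in the same session.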
Need Help?
If you’re having trouble with your data format:
- Check the Troubleshooting guide
- Review the FAQ for common data questions
- Visit GitHub Issues for support