Overview
HeartMAP uses a centralized YAML-based configuration system that controls all aspects of data processing, analysis, and model behavior. This allows you to customize behavior without modifying code.
Configuration Architecture
The configuration is organized into four main sections:
Data Config
Quality control, filtering, normalization, and preprocessing parameters
Analysis Config
Clustering, dimensionality reduction, and marker gene detection settings
Model Config
Pipeline selection, memory limits, and GPU usage
Path Config
Input/output directory structure
Configuration Classes
HeartMAP uses Python dataclasses for type-safe configuration.
Data Configuration
Controls data preprocessing and quality control.
DataConfig Parameters
Minimum number of genes expressed per cell (for QC filtering)
Usage: Filters low-quality cells with few detected genes.
Typical range: 100-500
- Lower (100-200): Permissive, keeps more cells
- Higher (300-500): Stringent, removes low-quality cells
Minimum number of cells expressing a gene (for gene filtering)
Usage: Removes genes detected in very few cells.
Typical range: 3-10
- Lower (3): Keep rare genes
- Higher (5-10): Focus on commonly expressed genes
Maximum number of cells to analyze (random subset if exceeded)
Usage: Memory optimization for large datasets.
Recommendations by RAM:
- 8GB RAM: 10,000-15,000
- 16GB RAM: 30,000-40,000
- 32GB RAM: 50,000-60,000
- 64GB+ RAM: 100,000+
Maximum number of genes to analyze (keeps most variable)
Usage: Further memory reduction and noise filtering.
Recommendations:
- 8GB RAM: 2,000-3,000
- 16GB RAM: 4,000-5,000
- 32GB+ RAM: 5,000-10,000
Genes are selected based on variability after n_top_genes filtering
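The RAM recommendations above can be turned into a small lookup. The following is a minimal sketch, not part of HeartMAP: the function name `suggest_subset_limits` and the table layout are assumptions made for illustration, and the ranges are copied from the two lists above.

```python
from typing import Dict, Optional, Tuple

Range = Tuple[int, Optional[int]]

# (minimum RAM in GB, suggested range), highest tier first.
# Values are copied from the recommendation lists in this section.
CELL_LIMITS = [
    (64, (100_000, None)),   # 64GB+: 100,000 or more, no stated upper bound
    (32, (50_000, 60_000)),
    (16, (30_000, 40_000)),
    (8, (10_000, 15_000)),
]
GENE_LIMITS = [
    (32, (5_000, 10_000)),   # 32GB+
    (16, (4_000, 5_000)),
    (8, (2_000, 3_000)),
]


def _lookup(table, ram_gb: float) -> Range:
    """Return the range for the highest tier that ram_gb satisfies."""
    for min_ram, limits in table:
        if ram_gb >= min_ram:
            return limits
    raise ValueError(f"No recommendation listed for {ram_gb} GB of RAM")


def suggest_subset_limits(ram_gb: float) -> Dict[str, Range]:
    """Hypothetical helper: map available RAM to suggested subset ranges."""
    return {
        "max_cells_subset": _lookup(CELL_LIMITS, ram_gb),
        "max_genes_subset": _lookup(GENE_LIMITS, ram_gb),
    }
```

For a 16GB machine this returns the 30,000-40,000 cell range and the 4,000-5,000 gene range; picking a value inside the range is still a judgment call that depends on the dataset.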
Target sum for normalization (counts per cell after normalization)
Usage: Normalizes library size differences between cells.
Standard: 10,000 (1e4), the value conventionally used in scanpy workflows
Alternative: 1,000,000 (1e6) for CPM (counts per million)
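To make the effect of the target sum concrete, here is a minimal sketch of library-size normalization for a single cell in plain Python. It illustrates the arithmetic only and is not HeartMAP's or scanpy's implementation; the function name `normalize_to_target_sum` is an assumption.

```python
from typing import List, Sequence


def normalize_to_target_sum(counts: Sequence[float], target_sum: float = 1e4) -> List[float]:
    """Scale one cell's gene counts so that they add up to target_sum."""
    total = sum(counts)
    if total == 0:
        # An empty cell cannot be rescaled; return zeros unchanged.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [c * scale for c in counts]


# A cell with 100 total counts, rescaled to 10,000 counts.
cell = [10, 30, 60]
normalized = normalize_to_target_sum(cell, target_sum=1e4)
```

After scaling, every cell has the same total, so differences between cells reflect relative expression and not sequencing depth. With `target_sum=1e6` the same function yields counts per million.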
Number of highly variable genes to select
Usage: Focuses analysis on informative genes.
Typical range: 1,000-5,000
- Lower (1,000-2,000): Fast, focuses on most variable
- Higher (3,000-5,000): Comprehensive, captures more biology
Random seed for reproducibility
Usage: Ensures identical results across runs.
Keep at 42 for consistency with published HeartMAP results
Enable test mode (uses minimal data for quick validation)
Usage: Fast testing and debugging.
When true:
- Uses only 1,000 cells
- Uses only 500 genes
- Skips some visualizations
Example: Data Config
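As a rough illustration of how the data settings described above could be grouped in a dataclass, consider the sketch below. It is not HeartMAP's actual `DataConfig`: the field names `max_cells_subset`, `max_genes_subset`, `n_top_genes`, and `test_mode` appear in this document, while `min_genes`, `min_cells`, `target_sum`, and `random_seed` are assumed names, and most default values are illustrative picks from the typical ranges.

```python
from dataclasses import dataclass, asdict
from typing import Optional, Tuple


@dataclass
class DataConfigSketch:
    """Illustrative stand-in for a data configuration section."""
    min_genes: int = 200                       # assumed name; value from the 100-500 range
    min_cells: int = 3                         # assumed name; value from the 3-10 range
    max_cells_subset: Optional[int] = 30_000   # illustrative value
    max_genes_subset: Optional[int] = 4_000    # illustrative value
    target_sum: float = 1e4                    # assumed name; conventional value
    n_top_genes: int = 2_000                   # illustrative value
    random_seed: int = 42                      # assumed name; 42 per this document
    test_mode: bool = False

    def __post_init__(self) -> None:
        # Reject values that cannot be meaningful for filtering or scaling.
        for name in ("min_genes", "min_cells", "n_top_genes"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
        if self.target_sum <= 0:
            raise ValueError("target_sum must be positive")

    def effective_limits(self) -> Tuple[Optional[int], Optional[int]]:
        """Return (cells, genes) limits; test mode caps them at 1,000 and 500."""
        if self.test_mode:
            return 1_000, 500
        return self.max_cells_subset, self.max_genes_subset


# Example: a laptop-sized run, and the same settings in test mode.
laptop = DataConfigSketch(max_cells_subset=10_000, max_genes_subset=2_000)
quick = DataConfigSketch(test_mode=True)
```

Validating in `__post_init__` means a bad value fails when the configuration is built, not partway through an analysis.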
Analysis Configuration
Controls clustering, dimensionality reduction, and marker detection.
AnalysisConfig Parameters
Number of principal components to compute
Usage: Dimensionality reduction before clustering.
Typical range: 30-100
- Lower (30-50): Faster, may miss subtle patterns
- Higher (50-100): Captures more variance
Number of neighbors for graph construction
Usage: Neighborhood graph for clustering and UMAP.
Typical range: 10-30
- Lower (5-10): More granular clusters
- Higher (20-30): Broader clusters
Number of PCs to use for neighborhood graph
Usage: How many PCs to use from PCA (must be ≤ n_components_pca).
Typical range: 20-50
- Lower (20-30): Focuses on major variation
- Higher (40-50): Includes subtle variation
Leiden clustering resolution parameter
Usage: Controls cluster granularity.
Typical range: 0.1-2.0
- Lower (0.1-0.5): Fewer, broader clusters
- Higher (0.8-2.0): Many small clusters
Number of marker genes to identify per cluster/chamber
Usage: Marker gene detection and annotation.
Typical range: 10-50
- Lower (10-20): Top markers only
- Higher (30-50): Comprehensive marker list
Use Leiden algorithm (vs. Louvain)
Usage: Leiden is generally preferred over Louvain.
Recommendation: Keep as true
Use LIANA for ligand-receptor (L-R) analysis (if available)
Usage: Enables LIANA integration for communication analysis.
Fallback: Uses the built-in L-R database if LIANA is not installed
Example: Analysis Config
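The analysis settings have one hard constraint stated above: the number of PCs used for the neighborhood graph must not exceed `n_components_pca`. A minimal sketch of a pre-flight check follows. The function `check_analysis_settings` and the argument names `n_pcs`, `n_neighbors`, and `resolution` are assumptions for illustration; only `n_components_pca` is named in this document.

```python
from typing import List


def check_analysis_settings(
    n_components_pca: int,
    n_pcs: int,          # assumed name: PCs used for the neighborhood graph
    n_neighbors: int,    # assumed name: neighbors for graph construction
    resolution: float,   # assumed name: Leiden resolution
) -> List[str]:
    """Return a list of human-readable problems; empty means the settings are consistent."""
    problems = []
    if n_pcs > n_components_pca:
        problems.append(
            f"n_pcs ({n_pcs}) exceeds n_components_pca ({n_components_pca})"
        )
    if n_neighbors < 1:
        problems.append("n_neighbors must be at least 1")
    if resolution <= 0:
        problems.append("resolution must be greater than 0")
    return problems


# Consistent settings produce no problems.
ok = check_analysis_settings(n_components_pca=50, n_pcs=40, n_neighbors=15, resolution=0.5)
# Asking for more PCs than were computed is reported.
bad = check_analysis_settings(n_components_pca=30, n_pcs=50, n_neighbors=15, resolution=0.5)
```

Returning all problems at once, instead of raising on the first, lets a user fix a configuration file in one pass.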
Model Configuration
Controls pipeline behavior and resource usage.
ModelConfig Parameters
Type of analysis pipeline to run
Options:
- "basic": BasicPipeline
- "communication": AdvancedCommunicationPipeline
- "multi_chamber": MultiChamberPipeline
- "comprehensive": ComprehensivePipeline
Save intermediate results during analysis
Usage: Enables resuming from checkpoints and debugging.
When true: Saves data after each major step
When false: Only saves final results (saves disk space)
Use GPU acceleration (if available)
Usage: Speeds up certain operations (PCA, neighbors).
Requirements:
- CUDA-enabled GPU
- rapids-singlecell installed
Most users should keep as false (CPU is sufficient)
Batch size for processing (future use)
Usage: Reserved for batch processing features.
Current: Not actively used, keep as null
Maximum memory usage in GB
Usage: Automatic memory limit enforcement (future feature).
Current: Not actively used, manage via max_cells_subset instead
Example: Model Config
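The pipeline type is a plain string that selects one of four pipeline classes. The sketch below shows one way such a lookup could work, using the option strings and class names listed in this section. The function `pipeline_class_name` is hypothetical, and the class names are kept as strings so the example runs without HeartMAP installed.

```python
# Option string -> pipeline class name, as listed under "Options" above.
PIPELINE_CLASS_NAMES = {
    "basic": "BasicPipeline",
    "communication": "AdvancedCommunicationPipeline",
    "multi_chamber": "MultiChamberPipeline",
    "comprehensive": "ComprehensivePipeline",
}


def pipeline_class_name(pipeline_type: str) -> str:
    """Hypothetical helper: resolve a pipeline type string to its class name."""
    try:
        return PIPELINE_CLASS_NAMES[pipeline_type]
    except KeyError:
        valid = ", ".join(sorted(PIPELINE_CLASS_NAMES))
        raise ValueError(
            f"Unknown pipeline type {pipeline_type!r}; expected one of: {valid}"
        ) from None


selected = pipeline_class_name("multi_chamber")
```

Failing fast on an unknown string catches typos such as "multichamber" before any data is loaded.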
Path Configuration
Defines directory structure for data and results.
PathConfig Parameters
Base directory for all data
Directory for raw input files (.h5ad)
Directory for processed/intermediate data
Directory for analysis results
Directory for generated plots and visualizations
Example: Path Config
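The five directories described above could be modeled as follows. This is a sketch under stated assumptions, not HeartMAP's `PathConfig`: every field name and default path here is hypothetical, chosen only to mirror the five descriptions.

```python
from dataclasses import dataclass, fields
from pathlib import Path
from typing import List


@dataclass
class PathConfigSketch:
    """Illustrative path section; all names and defaults are assumptions."""
    data_dir: str = "data"                      # base directory for all data
    raw_data_dir: str = "data/raw"              # raw input files (.h5ad)
    processed_data_dir: str = "data/processed"  # processed/intermediate data
    results_dir: str = "results"                # analysis results
    figures_dir: str = "figures"                # plots and visualizations

    def create_directories(self, root) -> List[Path]:
        """Create every configured directory under root and return the paths."""
        created = []
        for f in fields(self):
            directory = Path(root) / getattr(self, f.name)
            directory.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
            created.append(directory)
        return created
```

Calling `PathConfigSketch().create_directories(".")` before a run ensures that output steps do not fail on a missing folder.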
Complete Configuration
The main Config class combines all four configuration sections.
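A minimal sketch of how four dataclass sections can be composed into one top-level object is shown below. It is illustrative only: the class names, the section keys (`data`, `analysis`, `model`, `paths`), the `from_dict` method, and the default values are assumptions, and each section carries a single field to keep the example short.

```python
from dataclasses import dataclass, field, asdict
from typing import Any, Dict


@dataclass
class DataSection:
    test_mode: bool = False


@dataclass
class AnalysisSection:
    n_components_pca: int = 50      # illustrative default


@dataclass
class ModelSection:
    save_intermediate: bool = True  # illustrative default


@dataclass
class PathSection:
    results_dir: str = "results"    # hypothetical field name


@dataclass
class ConfigSketch:
    """Top-level object holding one instance of each section."""
    data: DataSection = field(default_factory=DataSection)
    analysis: AnalysisSection = field(default_factory=AnalysisSection)
    model: ModelSection = field(default_factory=ModelSection)
    paths: PathSection = field(default_factory=PathSection)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "ConfigSketch":
        """Build from a nested dict, e.g. the result of parsing a YAML file.

        Missing sections fall back to defaults; unknown keys raise TypeError.
        """
        return cls(
            data=DataSection(**(raw.get("data") or {})),
            analysis=AnalysisSection(**(raw.get("analysis") or {})),
            model=ModelSection(**(raw.get("model") or {})),
            paths=PathSection(**(raw.get("paths") or {})),
        )


config = ConfigSketch.from_dict({"data": {"test_mode": True}})
```

Because each section is its own dataclass, a misspelled key in the input is rejected by the constructor instead of being silently ignored.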
Default Configuration
This is the complete default configuration shipped with HeartMAP:
config.yaml
Loading Configuration
Saving Configuration
Memory Optimization Presets
Pre-configured settings for different hardware:
- 8GB RAM (Laptop)
- 16GB RAM (Workstation)
- 32GB RAM (Server)
- 64GB+ RAM (HPC)
config.yaml
- Memory: ~6GB
- Runtime: 5-10 minutes
- Suitable for: Initial exploration, testing
Common Configuration Patterns
Quick Test Run
config_test.yaml
Communication-Focused Analysis
config_comm.yaml
Chamber-Focused Analysis
config_chamber.yaml
High-Resolution Clustering
config_highres.yaml
Configuration in Practice
Example Workflow
Reproducible Research
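One simple way to keep a run reproducible is to store the exact settings next to the results. The sketch below writes a settings dictionary to a JSON file in the results directory. It is a generic pattern, not a HeartMAP API: the function name `save_config_snapshot`, the file name `config_used.json`, and the example keys are assumptions, and JSON is used here only because it needs no third-party package.

```python
import json
from pathlib import Path


def save_config_snapshot(settings: dict, results_dir, filename: str = "config_used.json") -> Path:
    """Write the settings used for a run into the results directory."""
    target_dir = Path(results_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # sort_keys gives a stable file, which keeps version-control diffs small.
    target.write_text(json.dumps(settings, indent=2, sort_keys=True), encoding="utf-8")
    return target


def load_config_snapshot(path) -> dict:
    """Read a snapshot back, e.g. to rerun an analysis with identical settings."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Committing the snapshot together with the results makes it possible to tell later which parameters, including the random seed, produced a given figure.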
Troubleshooting
Out of Memory Errors
Symptoms: Python crashes, “MemoryError”, system freezes
Solutions:
- Reduce max_cells_subset (e.g., from 50000 to 30000)
- Reduce max_genes_subset (e.g., from 5000 to 3000)
- Reduce n_top_genes (e.g., from 2000 to 1500)
- Set save_intermediate: false to reduce disk I/O
- Close other applications
Too Many/Few Clusters
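Too many clusters: lower the Leiden clustering resolution (values of 0.1-0.5 give fewer, broader clusters).
Too few clusters: raise the Leiden clustering resolution (values of 0.8-2.0 give many small clusters).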
Slow Performance
Solutions:
- Enable test mode for quick validation
- Reduce dataset size
- Use fewer PCA components
Configuration File Not Found
Error: FileNotFoundError: config.yaml
Solutions:
- Use an absolute path to the configuration file
- Use the default configuration
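Both solutions can be combined in a small guard that resolves the path first and signals when to fall back to defaults. This is a minimal sketch using only the standard library; the function name `find_config_file` is hypothetical, and how defaults are obtained in HeartMAP is not shown here.

```python
from pathlib import Path
from typing import Optional


def find_config_file(path: str = "config.yaml") -> Optional[Path]:
    """Return the absolute path of the config file, or None if it does not exist."""
    candidate = Path(path).expanduser().resolve()
    return candidate if candidate.is_file() else None


config_path = find_config_file("config.yaml")
if config_path is None:
    # The caller would fall back to the default configuration at this point.
    source = "defaults"
else:
    source = str(config_path)
```

Resolving to an absolute path up front also makes error messages unambiguous, since a relative path depends on the directory the script was started from.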
Next Steps
Quick Start
Apply configuration in your first analysis
Pipelines
Understand how pipelines use configuration
API Reference
Complete configuration API documentation
Examples
See configuration examples for different use cases
Best Practices
Do’s:
- Start with default configuration
- Adjust memory settings for your hardware
- Save configuration files with results
- Use version control for config files
- Document why you changed parameters
- Test with test_mode: true first