The DataConfig class defines the configuration parameters for data preprocessing, filtering, and quality control in HeartMAP.
Class Definition
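As a rough sketch, a configuration class with the fields documented on this page could be written as a Python dataclass. The class name `DataConfigSketch`, the field types, and every default marked "assumed" are illustrative assumptions chosen to match the field descriptions, not the values shipped with HeartMAP.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class DataConfigSketch:
    """Hypothetical stand-in for DataConfig; types and defaults are assumptions."""

    min_genes: int = 200                     # assumed default
    min_cells: int = 3                       # assumed default
    max_cells_subset: Optional[int] = None   # None means "use all cells"
    max_genes_subset: Optional[int] = None   # None means "use all genes"
    target_sum: float = 1e4                  # standard value named on this page
    n_top_genes: int = 2000                  # assumed default
    random_seed: int = 42                    # assumed default
    test_mode: bool = False                  # assumed default


default_cfg = DataConfigSketch()

# Override only the fields that differ from the defaults.
small_cfg = replace(default_cfg, max_cells_subset=50_000, test_mode=True)
print(small_cfg)
```

Keeping the optional limits as `None` by default means a caller has to opt in to subsampling explicitly.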
Constructor
Configuration Fields
min_genes
Minimum number of genes expressed for a cell to be retained. Cells with fewer genes are filtered out during quality control.
min_cells
Minimum number of cells in which a gene must be expressed to be retained. Genes expressed in fewer cells are filtered out.
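The two thresholds can be illustrated on a tiny count matrix. This is a minimal pure-Python sketch, not HeartMAP's implementation: the helper `filter_counts` is hypothetical, "expressed" is assumed to mean a count greater than zero, and cells are assumed to be filtered before genes.

```python
from typing import List


def filter_counts(
    counts: List[List[int]], min_genes: int, min_cells: int
) -> List[List[int]]:
    """Filter a cells x genes count matrix: drop sparse cells, then rare genes."""
    # Keep cells that express (count > 0) at least `min_genes` genes.
    kept_cells = [row for row in counts if sum(c > 0 for c in row) >= min_genes]
    if not kept_cells:
        return []
    n_genes = len(kept_cells[0])
    # Keep genes expressed in at least `min_cells` of the remaining cells.
    kept_genes = [
        g
        for g in range(n_genes)
        if sum(row[g] > 0 for row in kept_cells) >= min_cells
    ]
    return [[row[g] for g in kept_genes] for row in kept_cells]


counts = [
    [1, 0, 2, 0],  # 2 genes expressed
    [3, 1, 1, 0],  # 3 genes expressed
    [0, 0, 0, 5],  # 1 gene expressed, dropped when min_genes=2
    [2, 4, 0, 0],  # 2 genes expressed
]
filtered = filter_counts(counts, min_genes=2, min_cells=2)
print(filtered)
```

In this toy input the third cell is removed first, which leaves the last gene with no expressing cells, so it is removed as well.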
max_cells_subset
Maximum number of cells to use for analysis. If set, randomly subsamples cells to this number. Useful for limiting memory usage with large datasets. If null, uses all cells.
max_genes_subset
Maximum number of genes to use for analysis. If set, keeps only the top highly variable genes. Useful for limiting memory usage. If null, uses all genes (after filtering).
target_sum
Target sum for normalization. Each cell’s counts are normalized to this total count. Standard value is 10,000 (1e4).
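The arithmetic behind total-count normalization is a single scale factor per cell. The sketch below assumes plain Python lists and a hypothetical helper named `normalize_total`; it illustrates the calculation only and is not HeartMAP's code.

```python
from typing import List, Sequence


def normalize_total(cell_counts: Sequence[float], target_sum: float = 1e4) -> List[float]:
    """Scale one cell's counts so that they sum to `target_sum`."""
    total = sum(cell_counts)
    if total == 0:
        # An empty cell cannot be rescaled; leave it as zeros.
        return [0.0 for _ in cell_counts]
    scale = target_sum / total
    return [c * scale for c in cell_counts]


# A cell with 10 total counts is scaled by 10,000 / 10 = 1,000.
normalized = normalize_total([2, 3, 5], target_sum=1e4)
print(normalized)
```

Because every cell ends up with the same total, differences in sequencing depth no longer dominate comparisons between cells.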
n_top_genes
Number of highly variable genes to select for downstream analysis. These genes show the most variation across cells and are most informative for clustering.
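To make "most variation across cells" concrete, the sketch below ranks genes by plain population variance and keeps the top `n_top_genes`. This is a deliberate simplification: single-cell toolkits typically rank genes by a normalized dispersion measure rather than raw variance, and the helper `top_variable_genes` is hypothetical.

```python
from statistics import pvariance
from typing import List


def top_variable_genes(matrix: List[List[float]], n_top_genes: int) -> List[int]:
    """Return the column indices of the `n_top_genes` highest-variance genes."""
    n_genes = len(matrix[0])
    # Variance of each gene (column) across all cells (rows).
    variances = [pvariance([row[g] for row in matrix]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    # Return the selected indices in their original column order.
    return sorted(ranked[:n_top_genes])


matrix = [
    [1.0, 5.0, 0.0],
    [1.0, 0.0, 2.0],
    [1.0, 9.0, 1.0],
]
selected = top_variable_genes(matrix, n_top_genes=2)
print(selected)
```

The first gene is constant across cells, so it carries no information for clustering and is the one left out.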
random_seed
Random seed for reproducibility. Ensures consistent results across runs for stochastic operations like subsampling and PCA.
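The effect of a fixed seed on subsampling can be sketched with the standard library. The helper `subsample_indices` is hypothetical and assumes cells are addressed by integer index; it shows that the same seed selects the same cells on every run.

```python
import random
from typing import List, Optional


def subsample_indices(
    n_cells: int, max_cells_subset: Optional[int], random_seed: int
) -> List[int]:
    """Pick at most `max_cells_subset` cell indices, reproducibly."""
    # No limit, or a limit larger than the dataset: keep every cell.
    if max_cells_subset is None or max_cells_subset >= n_cells:
        return list(range(n_cells))
    # A dedicated generator avoids touching the global random state.
    rng = random.Random(random_seed)
    return sorted(rng.sample(range(n_cells), max_cells_subset))


first = subsample_indices(1000, 100, random_seed=42)
second = subsample_indices(1000, 100, random_seed=42)
print(first == second)
```

Using a local `random.Random` instance rather than `random.seed` keeps the subsample independent of any other code that draws random numbers.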
test_mode
Enable test mode for faster processing with reduced data. When true, uses smaller subsets and fewer iterations for quick testing.
Usage Examples
Default Configuration
Custom Configuration
Memory-Efficient Configuration
Quality Control Configuration
Using with Main Config
Loading from YAML
Best Practices
Quality Control
- min_genes: Use 200-500 depending on data quality. Higher values filter out more low-quality cells.
- min_cells: Use 3-10. Higher values remove rare genes but may filter out important markers.
Memory Management
- Set max_cells_subset when working with datasets > 100k cells
- Set max_genes_subset when memory is limited (< 16 GB RAM)
- max_cells_subset subsamples cells randomly, so set random_seed for reproducibility; max_genes_subset keeps the top highly variable genes instead
Feature Selection
- n_top_genes: 2000-3000 works well for most datasets
- Increase for heterogeneous datasets with many cell types
- Decrease for homogeneous datasets or memory constraints
Normalization
- target_sum: 1e4 (10,000) is standard
- Some protocols use 1e6, which yields counts per million (CPM) on the same scale as TPM values
Testing
- Enable test_mode during development for faster iteration
- Always disable for production analysis
See Also
- Config - Main configuration class
- AnalysisConfig - Analysis configuration
- ModelConfig - Model configuration