Overview
DataProcessor provides a complete pipeline for processing single-cell RNA-seq data, from raw file loading through normalization, quality control, and dimensionality reduction. It integrates DataLoader and DataValidator to handle the entire preprocessing workflow.
Class: DataProcessor
Constructor
Configuration object containing data processing parameters, file paths, and analysis settings
Attributes
Configuration object used for all processing operations
Internal DataLoader instance for loading and preprocessing operations
Methods
process_from_raw
Complete processing pipeline from raw data file to analysis-ready AnnData object.
Path to raw data file. Supported formats:
- .h5ad - Scanpy/AnnData format
- .h5 - 10x Genomics HDF5 format
- .csv - CSV matrix (genes as rows)
Whether to save intermediate processing steps:
- preprocessed.h5ad - After basic filtering
- qc_calculated.h5ad - After QC metrics
- scaled.h5ad - After memory scaling
- normalized.h5ad - After normalization
- processed_with_neighbors.h5ad - Final output
Processed AnnData object with:
- Normalized and log-transformed expression values in .X
- Quality control metrics in .obs
- PCA coordinates in .obsm['X_pca']
- Neighborhood graph in .obsp['connectivities']
Pipeline Steps
- Load raw data - Read file and validate structure
- Basic preprocessing - Filter low-quality cells/genes (min_genes, min_cells from config)
- QC calculation - Compute mitochondrial, ribosomal, hemoglobin percentages
- Memory scaling - Subsample if dataset exceeds max_cells_subset or max_genes_subset
- Normalization - Normalize to target_sum, log1p transform
- Sanitization - Remove NaN/Inf values before PCA
- PCA - Compute principal components (arpack solver)
- Neighbors - Build k-nearest neighbors graph (n_neighbors=15, n_pcs=40)
Example
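The pipeline steps above can be approximated directly with Scanpy calls. This is a minimal sketch, not DataProcessor's actual implementation: the function name `run_pipeline_sketch`, its default thresholds, the human gene-name prefixes used for the QC flags, and the choice to replace NaN/Inf with zeros are all assumptions made for illustration. Only the arpack solver, n_neighbors=15, and n_pcs=40 come from the step list.

```python
def run_pipeline_sketch(adata, min_genes=200, min_cells=3, target_sum=1e4,
                        max_cells=None, seed=0, n_neighbors=15, n_pcs=40):
    """Approximate the documented steps on an in-memory AnnData object."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import numpy as np
    import scanpy as sc
    from scipy import sparse

    # Basic preprocessing: drop low-quality cells and rarely detected genes.
    sc.pp.filter_cells(adata, min_genes=min_genes)
    sc.pp.filter_genes(adata, min_cells=min_cells)

    # QC calculation: flag gene groups by name (assumes human gene symbols).
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
    adata.var["hb"] = adata.var_names.str.contains("^HB[^(P)]")
    sc.pp.calculate_qc_metrics(
        adata, qc_vars=["mt", "ribo", "hb"],
        percent_top=None, log1p=False, inplace=True,
    )

    # Memory scaling: randomly keep at most max_cells cells.
    # (A gene cap is omitted here; how genes are chosen is not documented.)
    if max_cells is not None and adata.n_obs > max_cells:
        rng = np.random.default_rng(seed)
        keep = np.sort(rng.choice(adata.n_obs, size=max_cells, replace=False))
        adata = adata[keep].copy()

    # Normalization: scale each cell to target_sum counts, then log1p.
    sc.pp.normalize_total(adata, target_sum=target_sum)
    sc.pp.log1p(adata)

    # Sanitization: replace NaN/Inf with zeros before PCA (assumed strategy).
    if sparse.issparse(adata.X):
        adata.X.data = np.nan_to_num(adata.X.data, nan=0.0, posinf=0.0, neginf=0.0)
    else:
        adata.X = np.nan_to_num(adata.X, nan=0.0, posinf=0.0, neginf=0.0)

    # PCA and neighbors with the settings listed in the pipeline steps.
    sc.pp.pca(adata, svd_solver="arpack")
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)
    return adata
```

After a call such as `adata = run_pipeline_sketch(adata)`, the result should carry the same kinds of fields described for process_from_raw: QC columns in `.obs`, `X_pca` in `.obsm`, and `connectivities` in `.obsp`.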
create_test_dataset
Create a small test dataset by random subsampling.
Full dataset to subsample from
Number of cells to sample (will not exceed dataset size)
Subset AnnData object with randomly sampled cells. Uses random seed from config for reproducibility.
Example
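The sampling behavior described above (a cap at the dataset size and a fixed seed for reproducibility) can be sketched with the standard library alone. The helper name `sample_cell_indices` and the seed value are hypothetical; the real method presumably reads its seed from the config object and returns an AnnData subset rather than indices.

```python
import random

def sample_cell_indices(n_obs, n_cells, seed):
    """Return sorted row indices for a random subset of at most n_cells cells."""
    n_keep = min(n_cells, n_obs)   # never request more cells than exist
    rng = random.Random(seed)      # local generator: same seed gives same subset
    return sorted(rng.sample(range(n_obs), n_keep))

idx = sample_cell_indices(n_obs=1000, n_cells=50, seed=42)
repeat = sample_cell_indices(n_obs=1000, n_cells=50, seed=42)
capped = sample_cell_indices(n_obs=10, n_cells=50, seed=42)
# With an AnnData object, the subset would then be taken as adata[idx].copy().
```

Using a local `random.Random` instance keeps the sampling reproducible without touching the global random state.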
Helper Classes
DataValidator
Validate data integrity and format.
Methods
verify_checksum
Path to file to verify
Expected SHA256 checksum
True if checksums match, False otherwise
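A SHA256 comparison of this kind can be written with Python's `hashlib`. The sketch below is an illustration of the technique, not the library's code: the function name `sha256_matches` and the chunk size are assumptions, and the file is read in chunks so that large data files do not need to fit in memory.

```python
import hashlib
import os
import tempfile

def sha256_matches(path, expected_checksum, chunk_size=1 << 20):
    """Return True if the file's SHA256 hex digest equals expected_checksum."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_checksum.strip().lower()

# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
expected = hashlib.sha256(b"example payload").hexdigest()
matches = sha256_matches(tmp.name, expected)
mismatch = sha256_matches(tmp.name, "0" * 64)
os.remove(tmp.name)
```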
AnnData object to validate
Whether to check for QC metric columns (n_genes_by_counts, total_counts)
Returns a tuple:
- First element: True if valid, False if issues found
- Second element: List of validation issue descriptions
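The return contract described above, a validity flag paired with a list of issue descriptions, might look like the following. The function name `validate_sketch`, the `check_qc` parameter name, and the specific checks are assumptions; DataValidator's actual checks are not listed here. The function relies only on `n_obs`, `n_vars`, and `obs.columns`, so a simple stand-in object is enough to demonstrate it.

```python
from types import SimpleNamespace

QC_COLUMNS = ("n_genes_by_counts", "total_counts")

def validate_sketch(adata, check_qc=True):
    """Return (is_valid, issues) for an AnnData-like object."""
    issues = []
    if adata.n_obs == 0:
        issues.append("dataset contains no cells")
    if adata.n_vars == 0:
        issues.append("dataset contains no genes")
    if check_qc:
        # Report any expected QC metric column that is absent from .obs.
        missing = [c for c in QC_COLUMNS if c not in adata.obs.columns]
        if missing:
            issues.append("missing QC columns: " + ", ".join(missing))
    return len(issues) == 0, issues

# Stand-in object exposing the same attributes as AnnData.
demo = SimpleNamespace(n_obs=10, n_vars=5,
                       obs=SimpleNamespace(columns=["total_counts"]))
is_valid, issues = validate_sketch(demo)
```

Returning the issue list alongside the flag lets the caller log or display every problem at once instead of stopping at the first one.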
DataLoader
Load and preprocess data with fine-grained control.
Methods
load_raw_data
Path to data file (.h5ad, .h5, or .csv)
Run validation checks after loading
QC metrics added to adata.obs:
- n_genes_by_counts, total_counts
- pct_counts_mt, pct_counts_ribo, pct_counts_hb
Subsamples the dataset if it exceeds max_cells_subset or max_genes_subset from config.
normalize_and_scale
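Going by the Normalization step of the pipeline (normalize to target_sum, then apply log1p), the arithmetic can be sketched in plain Python. Whether normalize_and_scale applies any further scaling is not stated here, so treat this as an illustration of that one step only. The function name `normalize_log1p` and the toy counts are hypothetical; on an AnnData object the same step is usually done with Scanpy's `sc.pp.normalize_total` followed by `sc.pp.log1p`.

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with zero total counts stay all-zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized

counts = [[1, 3, 0], [10, 0, 10]]   # two cells, three genes
result = normalize_log1p(counts, target_sum=100)
```

Before the log transform every cell sums to target_sum, which removes differences in sequencing depth between cells; zero counts remain zero after log1p.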
Configuration
Key configuration parameters from the Config object:
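The parameters named elsewhere on this page can be gathered into a small stand-in for experimentation. This sketch is not the real Config class: the class name `ExampleConfig`, every default value, and the `random_seed` field name are placeholders chosen for illustration. Only the names min_genes, min_cells, target_sum, max_cells_subset, and max_genes_subset are taken from the pipeline description.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExampleConfig:
    """Hypothetical stand-in for Config; all defaults are placeholders."""
    min_genes: int = 200                     # cells with fewer detected genes are dropped
    min_cells: int = 3                       # genes seen in fewer cells are dropped
    target_sum: float = 1e4                  # per-cell total after normalization
    max_cells_subset: Optional[int] = None   # None means no cell cap (assumption)
    max_genes_subset: Optional[int] = None   # None means no gene cap (assumption)
    random_seed: int = 42                    # hypothetical field name for the seed

config = ExampleConfig(max_cells_subset=50_000)
```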
File Formats
Input Formats
| Format | Extension | Description | Reader |
|---|---|---|---|
| AnnData | .h5ad | Scanpy native format | sc.read_h5ad() |
| 10x HDF5 | .h5 | 10x Genomics output | sc.read_10x_h5() |
| CSV | .csv | Gene × Cell matrix | sc.read_csv() |
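The table above maps each extension to a Scanpy reader, which suggests a simple dispatch on the file suffix. The sketch below is an assumption about how such a loader could be written, not DataLoader's code: the names `reader_name_for` and `load_sketch` are hypothetical, and the transpose for CSV input is inferred from the "Gene × Cell" description, since AnnData stores cells as rows.

```python
from pathlib import Path

# Extension to Scanpy reader name, as listed in the table above.
READERS = {".h5ad": "read_h5ad", ".h5": "read_10x_h5", ".csv": "read_csv"}

def reader_name_for(path):
    """Pick the Scanpy reader name for a file based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix or path}")
    return READERS[suffix]

def load_sketch(path):
    """Load a file with the matching Scanpy reader (requires scanpy)."""
    import scanpy as sc  # lazy import keeps the dispatch logic dependency-free
    name = reader_name_for(path)
    adata = getattr(sc, name)(path)
    if name == "read_csv":
        adata = adata.T  # genes x cells on disk, cells x genes in AnnData (assumed)
    return adata

chosen = [reader_name_for(p) for p in ("sample.h5ad", "filtered.h5", "counts.CSV")]
```

Raising `ValueError` for unknown extensions is one reasonable choice; the exception types the library itself raises are not documented here.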
Output Structure
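Processed AnnData contains normalized, log-transformed expression values in .X, QC metrics in .obs, PCA coordinates in .obsm['X_pca'], and the neighborhood graph in .obsp['connectivities'].
Error Handling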
Common exceptions:
Best Practices
- Always use configuration objects - Don’t hardcode parameters
- Save intermediate files - Enable debugging of pipeline stages
- Check validation issues - Review warnings before analysis
- Use test datasets - Prototype on small subsets first
- Monitor memory - Set max_cells_subset for large datasets
See Also
- LigandReceptorDatabase - Load L-R interaction databases
- Config - Configuration management
- Scanpy documentation - Underlying preprocessing library