Basic Analysis with BasicPipeline
The BasicPipeline provides essential single-cell analysis capabilities, including quality control, normalization, clustering, and cell type annotation.
Overview
The BasicPipeline performs:
- Data loading and quality control filtering
- Normalization and scaling
- PCA and neighborhood graph computation
- Leiden clustering for cell type identification
- UMAP visualization
- Quality metrics reporting
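The steps listed above map onto standard scanpy calls. The following is a minimal sketch of that sequence, not BasicPipeline's actual implementation: the helper name `run_basic_steps`, the `target_sum=1e4` and `max_value=10` values, and the default PCA and neighbor settings are assumptions chosen for illustration. Leiden clustering in scanpy also needs the optional `leidenalg` package, and the quality metrics report is left out.

```python
def run_basic_steps(adata, min_genes=200, min_cells=3, resolution=0.5):
    """Hypothetical helper: run the overview steps with plain scanpy calls."""
    import scanpy as sc  # imported here so the sketch can be defined without loading scanpy

    # Quality control filtering
    sc.pp.filter_cells(adata, min_genes=min_genes)
    sc.pp.filter_genes(adata, min_cells=min_cells)

    # Normalization and scaling (target_sum and max_value are assumed values)
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.scale(adata, max_value=10)

    # PCA and neighborhood graph
    sc.tl.pca(adata)
    sc.pp.neighbors(adata)

    # Leiden clustering writes adata.obs['leiden']; UMAP writes adata.obsm['X_umap']
    sc.tl.leiden(adata, resolution=resolution)
    sc.tl.umap(adata)
    return adata
```

The function modifies `adata` in place and returns it, so it can be called on the object produced by `sc.read_h5ad`.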
import scanpy as sc
# Load your data
adata = sc.read_h5ad('your_data.h5ad')
print(f"Loaded {adata.n_obs:,} cells × {adata.n_vars:,} genes")
from heartmap import Config
# Load default configuration
config = Config.default()
# Customize parameters
config.data.min_genes = 200 # Minimum genes per cell
config.data.min_cells = 3 # Minimum cells per gene
config.data.max_cells_subset = 50000 # For memory optimization
config.analysis.resolution = 0.5 # Clustering resolution
# Update output paths
config.update_paths('./my_analysis')
config.create_directories()
from heartmap.pipelines import BasicPipeline
# Initialize the pipeline
pipeline = BasicPipeline(config)
# Run complete analysis
results = pipeline.run(
    data_path='data/raw/heart_data.h5ad',
    output_dir='results/basic'
)
print("Basic pipeline completed!")
# Access the processed AnnData object
adata = results['adata']
# View cluster assignments
cluster_labels = results['results']['cluster_labels']
print(f"Identified {len(set(cluster_labels))} cell clusters")
# Examine cluster distribution
import pandas as pd
cluster_counts = pd.Series(adata.obs['leiden']).value_counts()
print("\nCluster sizes:")
print(cluster_counts)
from pathlib import Path
import matplotlib.pyplot as plt
from matplotlib.image import imread
# Load and display UMAP
fig_path = Path('results/basic/figures/umap_clusters.png')
if fig_path.exists():
    img = imread(fig_path)
    plt.figure(figsize=(10, 8))
    plt.imshow(img)
    plt.axis('off')
    plt.title('Cell Type Clusters')
    plt.show()
import scanpy as sc
# QC metrics are stored in adata.obs
print("QC Metrics:")
print(f"Mean genes per cell: {adata.obs['n_genes'].mean():.0f}")
print(f"Mean UMI per cell: {adata.obs['total_counts'].mean():.0f}")
if 'pct_counts_mt' in adata.obs.columns:
    print(f"Mean mitochondrial %: {adata.obs['pct_counts_mt'].mean():.2f}%")
# Visualize QC distributions
sc.pl.violin(adata, ['n_genes', 'total_counts'],
             jitter=0.4, multi_panel=True)
# Annotated data is saved automatically
# Load it for downstream analysis:
adata_annotated = sc.read_h5ad('results/basic/annotated_data.h5ad')
# Export cluster assignments to CSV
cluster_df = pd.DataFrame({
    'cell_id': adata.obs_names,
    'cluster': adata.obs['leiden'],
    'n_genes': adata.obs['n_genes'],
    'n_counts': adata.obs['total_counts']
})
cluster_df.to_csv('results/basic/cluster_assignments.csv', index=False)
Complete Working Example
Here’s a complete script from data loading to visualization:
Expected Output Structure
The BasicPipeline creates the following output structure:
Configuration Options
min_genes: Minimum genes per cell for QC filtering
min_cells: Minimum cells per gene for QC filtering
max_cells_subset: Maximum cells to keep (for memory optimization)
Target sum for normalization
resolution: Leiden clustering resolution (higher = more clusters)
Number of principal components to use
Number of neighbors for graph construction
Best Practices
Memory Management
For large datasets (>100K cells), adjust max_cells_subset to fit your RAM:
- 8GB RAM: Use max_cells_subset=10000
- 16GB RAM: Use max_cells_subset=30000
- 32GB+ RAM: Use max_cells_subset=50000 or higher
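The tiers above can be turned into a small lookup. This is a sketch only: `suggest_max_cells_subset` is a hypothetical helper, not part of heartmap, and treating each tier as a lower bound (so anything under 16GB gets the smallest value) is an assumption.

```python
def suggest_max_cells_subset(ram_gb):
    """Return a max_cells_subset value following the RAM tiers listed above."""
    if ram_gb >= 32:
        return 50000
    if ram_gb >= 16:
        return 30000
    # Assumption: anything below 16 GB uses the smallest documented tier
    return 10000


for ram in (8, 16, 32):
    print(f"{ram} GB RAM -> max_cells_subset={suggest_max_cells_subset(ram)}")
```

The returned value can then be assigned to config.data.max_cells_subset before the pipeline is created.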
Clustering Resolution
The resolution parameter controls cluster granularity:
- Low (0.2-0.5): Broader cell types
- Medium (0.5-1.0): Standard cell types
- High (1.0-2.0): Fine-grained subtypes
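As a quick sanity check before a run, a resolution value can be mapped back to the bands above. The helper `describe_resolution` is hypothetical, and because the listed ranges share their endpoints, assigning 0.5 and 1.0 to the higher band is an assumption.

```python
def describe_resolution(resolution):
    """Map a Leiden resolution value to the granularity bands listed above."""
    if not 0.2 <= resolution <= 2.0:
        return "outside documented range"
    if resolution < 0.5:
        return "broader cell types"
    if resolution < 1.0:
        # Assumption: the shared endpoint 0.5 belongs to the medium band
        return "standard cell types"
    # Assumption: the shared endpoint 1.0 belongs to the high band
    return "fine-grained subtypes"


for value in (0.3, 0.5, 1.5):
    print(f"resolution={value}: {describe_resolution(value)}")
```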
Quality Control
Adjust QC thresholds based on your tissue:
- min_genes: 200 (standard), 500 (strict)
- Filter out cells with a high mitochondrial percentage (>20%) to improve data quality
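The two thresholds above combine into a single per-cell check. The sketch below uses made-up example cells in plain Python to show the logic; `passes_qc` and the `max_pct_mt` parameter name are hypothetical and not part of heartmap.

```python
def passes_qc(n_genes, pct_counts_mt, min_genes=200, max_pct_mt=20.0):
    """Return True if a cell meets both QC thresholds."""
    return n_genes >= min_genes and pct_counts_mt <= max_pct_mt


# Made-up example cells, for illustration only
cells = [
    {"cell_id": "cell_a", "n_genes": 1500, "pct_counts_mt": 4.2},
    {"cell_id": "cell_b", "n_genes": 150, "pct_counts_mt": 3.0},   # too few genes
    {"cell_id": "cell_c", "n_genes": 900, "pct_counts_mt": 35.0},  # high mitochondrial %
]
kept = [c["cell_id"] for c in cells if passes_qc(c["n_genes"], c["pct_counts_mt"])]
print(kept)
```

On an AnnData object, the same comparisons can be applied to the adata.obs['n_genes'] and adata.obs['pct_counts_mt'] columns as a boolean mask.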
Common Issues
Pipeline fails with memory error
Reduce max_cells_subset and max_genes_subset in the config:
Too many/few clusters identified
Adjust the clustering resolution:
UMAP not computed
The pipeline computes UMAP automatically. If missing, run manually:
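One way to do this with scanpy is sketched below. `ensure_umap` is a hypothetical helper name, and the sketch assumes `adata` has already been normalized and reduced with PCA; `sc.pp.neighbors` and `sc.tl.umap` are standard scanpy calls that write to `adata.uns['neighbors']` and `adata.obsm['X_umap']`.

```python
def ensure_umap(adata):
    """Compute UMAP coordinates only if missing; return True if they were computed."""
    if "X_umap" in adata.obsm:
        return False  # coordinates already present, nothing to do

    import scanpy as sc  # imported lazily: only needed when UMAP is actually missing

    # UMAP requires a neighborhood graph; build it if the pipeline did not store one
    if "neighbors" not in adata.uns:
        sc.pp.neighbors(adata)
    sc.tl.umap(adata)
    return True
```

After `ensure_umap(adata)` returns, `sc.pl.umap(adata, color='leiden')` can be used to plot the clusters.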
Next Steps
- Communication Analysis: Analyze cell-cell interactions
- Multi-Chamber Analysis: Chamber-specific patterns
- API Reference: Detailed API documentation