Overview
The feature_engineering module provides functions that transform raw genomic data into normalized, machine-learning-ready features. It handles normalization, imputation, categorical encoding, and dataset merging.
Core Functions
build_features
Orchestrate the complete feature engineering pipeline.
Parameters:
- Initial DataFrame with raw genomic features from multiple data sources.
Returns:
- pd.DataFrame: Processed DataFrame with normalized features ready for modeling.
Pipeline Steps:
- Calculate delta scores for length features
- Build categorical features (one-hot encoding)
- Build APPRIS features (group normalization)
- Build ALT-Corsair features (normalized conservation)
- Build PhyloCSF features (evolutionary scores)
- Build QPfam features (domain impacts)
- Build QSplice features (splice junction support)
- Apply unity range normalization to all norm_* features
- Reorder columns for consistency
load_data
Load and merge genomic data from multiple annotation sources.
Parameters:
- Configuration dictionary containing file paths for all data sources.
- Genome assembly version (e.g., "GRCh38", "GRCh37").
- Annotation release identifier (e.g., "g27" for GENCODE v27, "e90" for Ensembl 90).
Returns:
- pd.DataFrame: Merged DataFrame containing all loaded data sources joined by transcript_id. PAR (Pseudoautosomal Region) transcripts are automatically removed.
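The join on transcript_id can be sketched with plain pandas. This is an illustration under stated assumptions, not the module's own code: the helper name left_join_all and the toy frames are hypothetical, and the empty frame stands in for a source that was not supplied.

```python
from functools import reduce

import pandas as pd


def left_join_all(frames, key="transcript_id"):
    """Left-join every frame onto the first one using a shared key column."""
    return reduce(lambda left, right: left.merge(right, on=key, how="left"), frames)


# Toy inputs (hypothetical values, for illustration only).
annotation = pd.DataFrame({
    "transcript_id": ["t1", "t2", "t3"],
    "gene_id": ["g1", "g1", "g2"],
})
appris = pd.DataFrame({"transcript_id": ["t1", "t3"], "firestar": [4.0, 1.0]})

# An empty frame that still carries the expected columns, so the merged
# result keeps those columns (filled with NaN) even without any rows.
qsplice = pd.DataFrame({
    "transcript_id": pd.Series(dtype=str),
    "RNA2sj": pd.Series(dtype=float),
    "RNA2sj_cds": pd.Series(dtype=float),
})

merged = left_join_all([annotation, appris, qsplice])
print(merged)
```

A left join keeps every transcript from the annotation frame, so transcripts missing from a score source end up with NaN in that source's columns rather than being dropped.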
Loaded Datasets:
- Genome annotations (GTF)
- APPRIS functional scores
- ALT-Corsair conservation scores
- QPfam domain scores
- QSplice junction coverage
- PhyloCSF evolutionary scores
- Reference transcript annotations
- Protein sequences
If the path for a data source is given as "-", an empty DataFrame with the expected columns is created for it.
Feature Builder Functions
build_appris_features
Normalize APPRIS functional module scores within gene groups.
Parameters:
- DataFrame containing APPRIS features: crash_p, crash_m, firestar, matador3d, spade, thump, corsair.
Returns:
- pd.DataFrame: DataFrame with added normalized columns: norm_crash_p, norm_crash_m, norm_firestar, norm_matador3d, norm_spade, norm_thump, norm_corsair.
Normalization:
- Standard features: Min-max normalization within each gene group
- Corsair: Normalized with max cap at 4 (cross-species conservation)
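A minimal sketch of per-gene min-max normalization with an optional cap, in plain pandas. The function name group_min_max, the gene_id grouping column, and the choice to clip values before normalizing are assumptions made for illustration; the module's own utility may apply the cap differently.

```python
from typing import Optional

import pandas as pd


def group_min_max(df: pd.DataFrame, column: str, group: str = "gene_id",
                  cap: Optional[float] = None) -> pd.Series:
    """Min-max normalize `column` within each `group`, capping values first if `cap` is set."""
    values = df[column].astype(float)
    if cap is not None:
        values = values.clip(upper=cap)
    grouped = values.groupby(df[group])
    lo = grouped.transform("min")
    hi = grouped.transform("max")
    span = hi - lo
    # A gene whose isoforms all share one score has span 0; map it to 0.0
    # instead of NaN (a choice made for this sketch).
    return ((values - lo) / span.where(span != 0)).fillna(0.0)


# Toy scores (hypothetical values).
scores = pd.DataFrame({
    "gene_id": ["g1", "g1", "g1", "g2", "g2"],
    "firestar": [0.0, 2.0, 8.0, 3.0, 3.0],
    "corsair": [0.0, 2.0, 8.0, 3.0, 3.0],
})
scores["norm_firestar"] = group_min_max(scores, "firestar")
scores["norm_corsair"] = group_min_max(scores, "corsair", cap=4.0)
print(scores)
```

With the cap, a corsair score of 8 is treated as 4, so the middle isoform of g1 moves from 0.25 to 0.5 of the gene's range.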
build_categorical_features
One-hot encode categorical annotation features.
Parameters:
- DataFrame containing categorical features: tsl (transcript support level), level (annotation confidence).
Returns:
- pd.DataFrame: DataFrame with one-hot encoded columns:
  - tsl_0 through tsl_6: Transcript support levels (1-5, plus 6 for NA)
  - level_0 through level_3: GENCODE annotation confidence levels
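One way to get a fixed set of indicator columns regardless of which values occur in the data is to declare the categories up front. The sketch below uses pandas only; the helper name one_hot and the category ranges passed to it are assumptions chosen to mirror the column names listed above.

```python
import pandas as pd


def one_hot(df: pd.DataFrame, column: str, categories) -> pd.DataFrame:
    """Replace `column` with one indicator column per declared category."""
    as_category = pd.Series(
        pd.Categorical(df[column], categories=list(categories)),
        index=df.index,
    )
    # Declared but unobserved categories still get an all-zero column.
    dummies = pd.get_dummies(as_category, prefix=column, dtype=int)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)


# Toy annotations (hypothetical values); tsl 6 encodes "NA" as described above.
annotations = pd.DataFrame({
    "transcript_id": ["t1", "t2", "t3"],
    "tsl": [1, 5, 6],
    "level": [1, 2, 3],
})
encoded = one_hot(annotations, "tsl", range(0, 7))
encoded = one_hot(encoded, "level", range(0, 4))
print(encoded.columns.tolist())
```

Declaring the categories keeps the column layout stable across datasets, which matters when a trained model expects the same feature order.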
build_corsair_alt_features
Normalize ALT-Corsair alternative isoform conservation scores.
Parameters:
- DataFrame containing the corsair_alt feature.
Returns:
- pd.DataFrame: DataFrame with an added norm_corsair_alt column, normalized within gene groups (max cap at 0.25).
build_phylocsf_features
Normalize and impute PhyloCSF evolutionary conservation scores.
Parameters:
- DataFrame containing PhyloCSF features: ScorePerCodon, RelBranchLength, PhyloCSF_Psi.
Returns:
- pd.DataFrame: DataFrame with normalized columns: norm_ScorePerCodon, norm_RelBranchLength, norm_PhyloCSF_Psi.
Processing:
- Fill missing values with gene-level minimum
- Impute remaining NaNs at 3rd percentile
- Group-normalize within genes
- Apply fragment correction for partial transcripts
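The first three processing steps can be sketched as follows. This is an illustration, not the module's implementation: the function name, the gene_id grouping column, and taking the 3rd percentile over the whole column are assumptions, and the fragment correction step is left out because its rule is not described here.

```python
import pandas as pd


def impute_then_normalize(df: pd.DataFrame, column: str, group: str = "gene_id",
                          percentile: float = 0.03) -> pd.Series:
    """Fill NaNs with the gene minimum, then a low percentile, then min-max per gene."""
    values = df[column].astype(float)
    # Step 1: a missing score takes the lowest observed score of its gene.
    gene_min = values.groupby(df[group]).transform("min")
    values = values.fillna(gene_min)
    # Step 2: genes with no observed score at all fall back to a low
    # percentile of the column (assumed to be computed column-wide).
    values = values.fillna(values.quantile(percentile))
    # Step 3: min-max normalization within each gene; constant genes map to 0.0.
    grouped = values.groupby(df[group])
    lo = grouped.transform("min")
    span = grouped.transform("max") - lo
    return ((values - lo) / span.where(span != 0)).fillna(0.0)


# Toy scores (hypothetical values); gene "c" has no observed score.
phylo = pd.DataFrame({
    "gene_id": ["a", "a", "a", "b", "b", "c"],
    "ScorePerCodon": [1.0, None, 3.0, 2.0, 4.0, None],
})
phylo["norm_ScorePerCodon"] = impute_then_normalize(phylo, "ScorePerCodon")
print(phylo)
```

Imputing with low values rather than the mean treats a missing conservation score as weak evidence, which is the conservative reading of the steps listed above.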
build_qpfam_featues
Normalize QPfam domain loss/gain scores and compute SPADE loss.
Parameters:
- DataFrame containing QPfam features: Lost_residues_pfam, Gain_residues_pfam, spade, gene_name.
Returns:
- pd.DataFrame: DataFrame with normalized features:
  - norm_Lost_residues_pfam: Normalized lost residues (values below 10 set to 0)
  - norm_Gain_residues_pfam: Normalized gained residues
  - norm_spade_loss: Inverted SPADE loss (max - current, capped at 50)
Lost_residues_pfam and Gain_residues_pfam columns are dropped, keeping only normalized values.
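The two pre-normalization rules mentioned above (zeroing small losses and the capped "max - current" SPADE loss) might look like this in pandas. The helper names are hypothetical, and taking the maximum per gene_name is an assumption; the subsequent normalization is not shown.

```python
import pandas as pd


def spade_loss(df: pd.DataFrame, group: str = "gene_name", cap: float = 50.0) -> pd.Series:
    """Difference between the gene's best SPADE score and each isoform's score, capped."""
    gene_max = df.groupby(group)["spade"].transform("max")
    return (gene_max - df["spade"]).clip(upper=cap)


def zero_small_losses(lost: pd.Series, threshold: float = 10) -> pd.Series:
    """Set values below `threshold` to 0 and leave the rest unchanged."""
    return lost.where(lost >= threshold, 0)


# Toy QPfam table (hypothetical values).
qpfam = pd.DataFrame({
    "gene_name": ["A", "A", "A"],
    "spade": [100.0, 80.0, 20.0],
    "Lost_residues_pfam": [3, 10, 25],
})
qpfam["spade_loss"] = spade_loss(qpfam)
qpfam["lost_filtered"] = zero_small_losses(qpfam["Lost_residues_pfam"])
print(qpfam)
```

The cap keeps a single isoform with a very large domain loss from dominating the scale once the values are normalized.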
build_qsplice_features
Normalize QSplice junction support scores.
Parameters:
- DataFrame containing QSplice features: RNA2sj, RNA2sj_cds.
Returns:
- pd.DataFrame: DataFrame with normalized columns:
  - norm_RNA2sj: Normalized RNA-to-splice junction score
  - norm_RNA2sj_cds: Normalized CDS-specific junction score
Processing:
- Fill NaN values with 0
- Group-normalize within genes
- Apply fragment correction
Complete Pipeline Example
Normalization Utilities
The feature engineering functions rely on utility functions from trifid.utils.utils:
- group_normalization: Min-max normalization within gene groups
- one_hot_encoding: Convert categorical variables to binary columns
- impute: Fill missing values using percentile or other strategies
- fragments_correction: Adjust scores for partial transcripts
- delta_score: Calculate difference from maximum within group
- unity_ranger: Ensure all normalized features are in [0, 1] range
- merge_dataframes: Left join multiple DataFrames on transcript_id
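To make two of these descriptions concrete, here is a rough pandas sketch of a "difference from the group maximum" score and of forcing norm_* columns into [0, 1]. Both helpers are hypothetical stand-ins written for illustration: the sign convention (max minus value) and the use of clipping rather than rescaling are assumptions, not the behavior of delta_score or unity_ranger.

```python
import pandas as pd


def delta_from_group_max(df: pd.DataFrame, column: str, group: str = "gene_id") -> pd.Series:
    """How far each row falls below the largest value in its group."""
    return df.groupby(group)[column].transform("max") - df[column]


def clip_to_unit_range(df: pd.DataFrame, prefix: str = "norm_") -> pd.DataFrame:
    """Return a copy with every column starting with `prefix` clipped to [0, 1]."""
    out = df.copy()
    cols = [c for c in out.columns if c.startswith(prefix)]
    out[cols] = out[cols].clip(lower=0.0, upper=1.0)
    return out


# Toy table (hypothetical values).
table = pd.DataFrame({
    "gene_id": ["g1", "g1", "g2"],
    "length": [100, 60, 50],
    "norm_x": [-0.2, 0.5, 1.3],
})
table["length_delta"] = delta_from_group_max(table, "length")
bounded = clip_to_unit_range(table)
print(bounded)
```

Columns without the norm_ prefix, such as length here, are left untouched by the range step.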
Feature Categories
Functional Annotations (APPRIS):
- norm_firestar: Functional residue annotations
- norm_matador3d: 3D structure predictions
- norm_spade: Domain predictions
- norm_thump: Transmembrane predictions
- norm_crash_p, norm_crash_m: Signal peptide scores

- norm_corsair: Cross-species alignment
- norm_corsair_alt: Alternative isoform conservation
- norm_ScorePerCodon, norm_PhyloCSF_Psi: Evolutionary scores

- norm_RNA2sj, norm_RNA2sj_cds: Splice junction coverage
- tsl_1 through tsl_5: Transcript support levels

- norm_Lost_residues_pfam, norm_Gain_residues_pfam: Pfam domain changes
- norm_spade_loss: Domain prediction delta

- CCDS, basic, StartEnd_NF: Binary annotation flags
- nonsense_mediated_decay: NMD prediction
- level_1 through level_3: Annotation confidence