Feature Engineering

Overview

The feature_engineering module provides functions to transform raw genomic data into normalized, machine learning-ready features. It handles normalization, imputation, categorical encoding, and dataset merging.

Core Functions

build_features

Orchestrate the complete feature engineering pipeline.

from trifid.data.feature_engineering import build_features

df_processed = build_features(df_raw)

pd.DataFrame

required

Initial DataFrame with raw genomic features from multiple data sources

Returns: pd.DataFrame Processed DataFrame with normalized features ready for modeling.

Pipeline Steps:

Calculate delta scores for length features
Build categorical features (one-hot encoding)
Build APPRIS features (group normalization)
Build ALT-Corsair features (normalized conservation)
Build PhyloCSF features (evolutionary scores)
Build QPfam features (domain impacts)
Build QSplice features (splice junction support)
Apply unity range normalization to all norm_* features
Reorder columns for consistency
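
The steps above run in sequence, each taking the DataFrame produced by the previous one. A minimal sketch of that chaining pattern follows. It is not the module's implementation: `apply_steps`, `add_norm_score`, and `reorder_columns` are hypothetical stand-ins, and the toy `score` column is invented for illustration.

```python
import pandas as pd

def apply_steps(df, steps):
    """Run each step on the output of the previous one and return the result."""
    for step in steps:
        df = step(df)
    return df

# Hypothetical stand-ins for the documented build_* functions
def add_norm_score(df):
    # assign() returns a new DataFrame, so the input is left untouched
    return df.assign(norm_score=df["score"] / df["score"].max())

def reorder_columns(df):
    # Keep identifier/raw columns first, then norm_* columns in sorted order
    norm_cols = sorted(c for c in df.columns if c.startswith("norm_"))
    other_cols = [c for c in df.columns if not c.startswith("norm_")]
    return df[other_cols + norm_cols]

raw = pd.DataFrame({"transcript_id": ["t1", "t2"], "score": [2.0, 4.0]})
processed = apply_steps(raw, [add_norm_score, reorder_columns])
```

Because every step returns a new DataFrame, the order of the list is the only thing that defines the pipeline, which makes it easy to test one step in isolation.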

Example:

import pandas as pd
from trifid.data.feature_engineering import load_data, build_features
from trifid.utils import utils

# Load raw data
config = utils.parse_yaml("config/config.yaml")
df_raw = load_data(config, assembly="GRCh38", release="g27")

# Process features
df_features = build_features(df_raw)

# Result contains normalized features prefixed with 'norm_'
print(df_features.filter(regex='^norm_').columns.tolist())
# Output: ['norm_crash_p', 'norm_crash_m', 'norm_firestar', ...]

load_data

Load and merge genomic data from multiple annotation sources.

from trifid.data.feature_engineering import load_data
import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)

df = load_data(
    config=config,
    assembly="GRCh38",
    release="g27"
)

config

dict

required

Configuration dictionary containing file paths for all data sources. Structure:

{
  "genomes": {
    "GRCh38": {
      "g27": {
        "annotation": "path/to/annotation.gtf.gz",
        "appris_data": "path/to/appris_data.txt",
        "corsair_alt": "path/to/corsair_alt.tsv.gz",
        "qpfam": "path/to/qpfam.tsv.gz",
        "qsplice": "path/to/qsplice.tsv.gz",
        "phylocsf": "path/to/phylocsf.tsv.gz",
        "reference": "path/to/reference.tsv.gz",
        "sequences": "path/to/sequences.fa.gz"
      }
    }
  }
}

assembly

str

required

Genome assembly version (e.g., "GRCh38", "GRCh37")

release

str

required

Annotation release identifier (e.g., "g27" for GENCODE v27, "e90" for Ensembl 90)

Returns: pd.DataFrame Merged DataFrame containing all loaded data sources joined by transcript_id. PAR (Pseudoautosomal Region) transcripts are automatically removed.

Loaded Datasets:

Genome annotations (GTF)
APPRIS functional scores
ALT-Corsair conservation scores
QPfam domain scores
QSplice junction coverage
PhyloCSF evolutionary scores
Reference transcript annotations
Protein sequences
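
The join and PAR filtering described above can be sketched with plain pandas. This is an illustration under stated assumptions, not the code inside load_data: `merge_on_transcript` and `drop_par` are hypothetical helpers, the transcript IDs are made up, and the `_PAR_Y` suffix is assumed to be the marker for PAR transcripts (the convention used in GENCODE identifiers).

```python
from functools import reduce

import pandas as pd

def merge_on_transcript(frames, key="transcript_id"):
    """Left-join every frame onto the first one, in order."""
    return reduce(lambda left, right: left.merge(right, on=key, how="left"), frames)

def drop_par(df, key="transcript_id", suffix="_PAR_Y"):
    """Remove rows whose identifier carries the assumed PAR suffix."""
    return df[~df[key].str.endswith(suffix)].reset_index(drop=True)

# Toy stand-ins for three of the loaded sources
annotation = pd.DataFrame({
    "transcript_id": ["ENST01.1", "ENST02.1", "ENST02.1_PAR_Y"],
    "gene_name": ["G1", "G2", "G2"],
})
appris = pd.DataFrame({"transcript_id": ["ENST01.1"], "firestar": [3]})
qsplice = pd.DataFrame({"transcript_id": ["ENST02.1"], "RNA2sj": [0.8]})

merged = drop_par(merge_on_transcript([annotation, appris, qsplice]))
```

A left join keeps every annotated transcript even when a source has no row for it; the missing scores surface as NaN and are handled later by the feature builders.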

Graceful Fallback: If a data source path contains "-", an empty DataFrame with expected columns is created.
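
A minimal sketch of this fallback, assuming tab-separated inputs. `read_or_empty` is a hypothetical helper, not a trifid function, and it treats only a bare "-" as the placeholder, which is a narrower check than the wording above.

```python
import pandas as pd

PLACEHOLDER = "-"  # assumption: a bare dash marks an unavailable source

def read_or_empty(path, columns, **read_kwargs):
    """Read a TSV, or return an empty frame with the expected columns."""
    if path.strip() == PLACEHOLDER:
        # No file to read: keep the schema so later merges still find the columns
        return pd.DataFrame(columns=columns)
    return pd.read_csv(path, sep="\t", **read_kwargs)

empty_qsplice = read_or_empty("-", ["transcript_id", "RNA2sj", "RNA2sj_cds"])
```

Keeping the expected columns means downstream code can select `RNA2sj` without a KeyError even when the source is absent.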

Feature Builder Functions

build_appris_features

Normalize APPRIS functional module scores within gene groups.

from trifid.data.feature_engineering import build_appris_features

df = build_appris_features(df)

pd.DataFrame

required

DataFrame containing APPRIS features: crash_p, crash_m, firestar, matador3d, spade, thump, corsair

Returns: pd.DataFrame DataFrame with added normalized columns: norm_crash_p, norm_crash_m, norm_firestar, norm_matador3d, norm_spade, norm_thump, norm_corsair

Normalization:

Standard features: Min-max normalization within each gene group
Corsair: Normalized with max cap at 4 (cross-species conservation)
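
To make the two rules concrete, here is a small sketch in plain pandas. It is one plausible reading, not the library's code: `minmax_within_gene` and `capped_scale` are hypothetical names, grouping by `gene_name` is an assumption, the cap is interpreted as "clip at 4, then divide by 4", and genes with a single distinct score are mapped to 0.0 by choice.

```python
import pandas as pd

def minmax_within_gene(df, column, group_col="gene_name"):
    """Scale `column` to [0, 1] separately inside each gene."""
    grouped = df.groupby(group_col)[column]
    low = grouped.transform("min")
    span = grouped.transform("max") - low
    # A gene whose isoforms all share one score has span 0; emit 0.0 instead of NaN
    scaled = (df[column] - low) / span.where(span != 0)
    return scaled.fillna(0.0)

def capped_scale(series, cap):
    """Clip at `cap`, then divide by it, so anything at or above the cap maps to 1."""
    return series.clip(upper=cap) / cap

toy = pd.DataFrame({
    "gene_name": ["A", "A", "A", "B", "B"],
    "firestar": [0, 2, 4, 3, 3],
    "corsair": [1.0, 2.0, 8.0, 0.0, 4.0],
})
toy["norm_firestar"] = minmax_within_gene(toy, "firestar")
toy["norm_corsair"] = capped_scale(toy["corsair"], cap=4)
```

Normalizing within a gene makes scores comparable between isoforms of the same gene, whereas a fixed cap keeps the scale identical across genes.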

build_categorical_features

One-hot encode categorical annotation features.

from trifid.data.feature_engineering import build_categorical_features

df = build_categorical_features(df)

pd.DataFrame

required

DataFrame containing categorical features: tsl (transcript support level), level (annotation confidence)

Returns: pd.DataFrame DataFrame with one-hot encoded columns:

tsl_0 through tsl_6: Transcript support levels (1-5, plus 6 for NA)
level_0 through level_3: GENCODE annotation confidence levels

Handling Missing Data: If a categorical column is empty, NaN-filled columns are created for all categories.
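
The encoding and the missing-data rule can be illustrated with `pandas.get_dummies` on a categorical with a fixed category list. This is a sketch only: `one_hot_fixed` is a hypothetical helper, and the category values passed to it are chosen for the example rather than taken from the library.

```python
import numpy as np
import pandas as pd

def one_hot_fixed(series, categories, prefix):
    """One-hot encode with a fixed category set so the output columns never vary."""
    columns = [f"{prefix}_{c}" for c in categories]
    if series.isna().all():
        # Nothing observed: return NaN-filled columns for every category
        return pd.DataFrame(np.nan, index=series.index, columns=columns)
    typed = series.astype(pd.CategoricalDtype(categories=categories))
    # Unobserved categories still get a column (all zeros)
    return pd.get_dummies(typed, prefix=prefix, dtype=int)

level = pd.Series([1, 2, 2])
encoded = one_hot_fixed(level, categories=[1, 2, 3], prefix="level")
all_missing = one_hot_fixed(pd.Series([np.nan, np.nan]), [1, 2, 3], "level")
```

Fixing the category list up front keeps the feature matrix the same width for every assembly and release, which matters when a trained model expects a fixed set of columns.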

build_corsair_alt_features

Normalize ALT-Corsair alternative isoform conservation scores.

from trifid.data.feature_engineering import build_corsair_alt_features

df = build_corsair_alt_features(df)

pd.DataFrame

required

DataFrame containing corsair_alt feature

Returns: pd.DataFrame DataFrame with added norm_corsair_alt column normalized within gene groups (max cap at 0.25).

build_phylocsf_features

Normalize and impute PhyloCSF evolutionary conservation scores.

from trifid.data.feature_engineering import build_phylocsf_features

df = build_phylocsf_features(df)

pd.DataFrame

required

DataFrame containing PhyloCSF features: ScorePerCodon, RelBranchLength, PhyloCSF_Psi

Returns: pd.DataFrame DataFrame with normalized columns: norm_ScorePerCodon, norm_RelBranchLength, norm_PhyloCSF_Psi

Processing:

Fill missing values with gene-level minimum
Impute remaining NaNs at 3rd percentile
Group-normalize within genes
Apply fragment correction for partial transcripts
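
The first two processing steps can be sketched as follows. This is an illustration, not the module's code: `fill_then_impute` is a hypothetical name, grouping by `gene_name` is an assumption, and the 3rd percentile is assumed to be taken over the observed values of the whole column. Group normalization and fragment correction are left out.

```python
import numpy as np
import pandas as pd

def fill_then_impute(df, column, group_col="gene_name", percentile=3):
    """Fill gaps with the gene minimum, then with a low column-wide percentile."""
    gene_min = df.groupby(group_col)[column].transform("min")
    filled = df[column].fillna(gene_min)
    # Genes with no observed score at all are still NaN at this point
    fallback = np.nanpercentile(df[column].to_numpy(dtype=float), percentile)
    return filled.fillna(fallback)

toy = pd.DataFrame({
    "gene_name": ["A", "A", "A", "B", "B", "C"],
    "ScorePerCodon": [2.0, np.nan, 6.0, np.nan, np.nan, 10.0],
})
toy["ScorePerCodon"] = fill_then_impute(toy, "ScorePerCodon")
```

In the toy data the gap in gene A takes that gene's minimum (2.0), while gene B has no observed score and falls back to the column-wide percentile.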

build_qpfam_featues

Normalize QPfam domain loss/gain scores and compute SPADE loss.

from trifid.data.feature_engineering import build_qpfam_featues

df = build_qpfam_featues(df)

pd.DataFrame

required

DataFrame containing QPfam features: Lost_residues_pfam, Gain_residues_pfam, spade, gene_name

Returns: pd.DataFrame DataFrame with normalized features:

norm_Lost_residues_pfam: Normalized lost residues (values below 10 set to 0)
norm_Gain_residues_pfam: Normalized gained residues
norm_spade_loss: Inverted SPADE loss (max - current, capped at 50)

Note: Raw Lost_residues_pfam and Gain_residues_pfam columns are dropped, keeping only normalized values.
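
The thresholding and the SPADE loss described above can be sketched in a few lines. The helper names are hypothetical, grouping by `gene_name` is an assumption based on the required columns, and any normalization applied afterwards is not shown.

```python
import pandas as pd

def zero_small_losses(lost, threshold=10):
    """Set lost-residue counts below `threshold` to 0 (NaN also becomes 0 here)."""
    return lost.where(lost >= threshold, 0)

def spade_loss(df, group_col="gene_name", cap=50):
    """Distance of each isoform's SPADE score from the gene maximum, capped."""
    gene_max = df.groupby(group_col)["spade"].transform("max")
    return (gene_max - df["spade"]).clip(upper=cap)

toy = pd.DataFrame({
    "gene_name": ["A", "A", "A", "B", "B"],
    "spade": [100, 30, 90, 20, 20],
    "Lost_residues_pfam": [3, 10, 25, 0, 9],
})
toy["spade_loss"] = spade_loss(toy)
toy["Lost_residues_pfam"] = zero_small_losses(toy["Lost_residues_pfam"])
```

The isoform with the highest SPADE score in a gene gets a loss of 0, so larger values mark isoforms that lose more domain signal relative to the best isoform of the same gene.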

build_qsplice_features

Normalize QSplice junction support scores.

from trifid.data.feature_engineering import build_qsplice_features

df = build_qsplice_features(df)

pd.DataFrame

required

DataFrame containing QSplice features: RNA2sj, RNA2sj_cds

Returns: pd.DataFrame DataFrame with normalized columns:

norm_RNA2sj: Normalized RNA-to-splice junction score
norm_RNA2sj_cds: Normalized CDS-specific junction score

Processing:

Fill NaN values with 0
Group-normalize within genes
Apply fragment correction

Complete Pipeline Example

import yaml
from trifid.data.feature_engineering import load_data, build_features
from trifid.utils import utils
import pandas as pd

# Load configuration
config = utils.parse_yaml("config/config.yaml")

# Load all data sources
df_raw = load_data(
    config=config,
    assembly="GRCh38",
    release="g27"
)

print(f"Loaded {len(df_raw)} transcripts")
print(f"Raw columns: {df_raw.columns.tolist()}")

# Apply feature engineering
df_processed = build_features(df_raw)

# Check normalized features
norm_features = [col for col in df_processed.columns if col.startswith('norm_')]
print(f"Generated {len(norm_features)} normalized features")

# Select specific features for modeling
feature_config = utils.parse_yaml("config/features.yaml")
selected_features = feature_config['features']
df_model = df_processed[selected_features]

# Save processed dataset
df_model.to_csv(
    "data/trifid_processed.tsv.gz",
    sep="\t",
    compression="gzip",
    index=False
)

Normalization Utilities

The feature engineering functions rely on utility functions from trifid.utils.utils:

group_normalization: Min-max normalization within gene groups
one_hot_encoding: Convert categorical variables to binary columns
impute: Fill missing values using percentile or other strategies
fragments_correction: Adjust scores for partial transcripts
delta_score: Calculate difference from maximum within group
unity_ranger: Ensure all normalized features are in [0, 1] range
merge_dataframes: Left join multiple DataFrames on transcript_id
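
Since every norm_* feature is expected to end up in [0, 1], a quick check on the processed DataFrame can catch problems before modeling. The helper below is a hypothetical sketch in plain pandas, not part of trifid.utils, and the toy columns are invented.

```python
import numpy as np
import pandas as pd

def columns_outside_unit_range(df, prefix="norm_"):
    """Return the prefixed columns holding any value below 0 or above 1."""
    norm = df.filter(regex=f"^{prefix}")
    below = norm.min() < 0
    above = norm.max() > 1
    # NaN values are ignored by min()/max(), so they never count as out of range
    return sorted(norm.columns[(below | above).to_numpy()])

toy = pd.DataFrame({
    "norm_a": [0.0, 0.5, 1.0],
    "norm_b": [0.2, 1.3, np.nan],
    "score": [5, 10, 15],
})
offenders = columns_outside_unit_range(toy)
clipped = toy.assign(norm_b=toy["norm_b"].clip(0, 1))
```

An empty result means all normalized features are within range; a non-empty one names the columns to inspect.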

Feature Categories

Functional Annotations (APPRIS):

norm_firestar: Functional residue annotations
norm_matador3d: 3D structure predictions
norm_spade: Domain predictions
norm_thump: Transmembrane predictions
norm_crash_p, norm_crash_m: Signal peptide scores

Conservation (Cross-species):

norm_corsair: Cross-species alignment
norm_corsair_alt: Alternative isoform conservation
norm_ScorePerCodon, norm_PhyloCSF_Psi: Evolutionary scores

Experimental Support:

norm_RNA2sj, norm_RNA2sj_cds: Splice junction coverage
tsl_1 through tsl_5: Transcript support levels

Domain Structure:

norm_Lost_residues_pfam, norm_Gain_residues_pfam: Pfam domain changes
norm_spade_loss: Domain prediction delta

Quality Flags:

CCDS, basic, StartEnd_NF: Binary annotation flags
nonsense_mediated_decay: NMD prediction
level_1 through level_3: Annotation confidence
