How TRIFID Works

Overview

TRIFID (TRanscript Isoform Functional IDentification) is a machine learning tool that predicts the functional potential of alternative splice isoforms. It uses a Random Forest classifier trained on more than 45 features drawn from RNA-seq data, conservation scores, protein domain annotations, and structural predictions.

TRIFID predicts whether a transcript isoform is likely to be functional (capable of producing a stable, functional protein) or non-functional (likely degraded or producing non-functional proteins).

The ML Pipeline

TRIFID’s prediction workflow consists of four stages:

1. Data Loading and Feature Engineering

The pipeline begins by integrating data from multiple sources to create a comprehensive feature set:

# From trifid/data/feature_engineering.py:180
import pandas as pd

def load_data(config: dict, assembly: str, release: str) -> pd.DataFrame:
    """Load and combine the data sources used to build the feature set.

    Sources:
      - Genome annotation (GTF/GFF)
      - APPRIS scores
      - CORSAIR conservation
      - QPfam domain analysis
      - QSplice junction support
      - PhyloCSF evolutionary scores
    """
    ...  # function body not shown in this excerpt

The loaded data then goes through several feature transformations:

Group Normalization: Features are normalized within each gene to capture relative differences between isoforms (see trifid/utils/utils.py:359)
Delta Scoring: Calculates length and score differences relative to the principal isoform (see trifid/utils/utils.py:127)
Fragment Correction: Adjusts scores for incomplete transcript fragments (see trifid/utils/utils.py:173)
One-Hot Encoding: Categorical features like TSL (Transcript Support Level) are encoded (see trifid/utils/utils.py:480)
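Three of the transformations above can be illustrated on a toy table. This is a minimal sketch with pandas, not the code in trifid/utils/utils.py: the column names (`conservation`, `is_principal`, `tsl`) and the choice to normalize by the per-gene maximum are assumptions made for illustration, and fragment correction is left out because its rule is not described here.

```python
import pandas as pd

# Toy table with two genes; all column names are hypothetical.
df = pd.DataFrame({
    "gene_id": ["g1", "g1", "g1", "g2", "g2"],
    "transcript_id": ["t1", "t2", "t3", "t4", "t5"],
    "length": [500, 420, 150, 300, 300],
    "conservation": [0.9, 0.6, 0.3, 0.8, 0.4],
    "is_principal": [True, False, False, True, False],
    "tsl": ["1", "2", "5", "1", "NA"],
})

# Group normalization (assumed form): divide each score by the gene's maximum.
gene_max = df.groupby("gene_id")["conservation"].transform("max")
df["norm_conservation"] = df["conservation"] / gene_max

# Delta scoring: length difference relative to the gene's principal isoform.
principal_length = df.loc[df["is_principal"]].set_index("gene_id")["length"]
df["length_delta"] = df["length"] - df["gene_id"].map(principal_length)

# One-hot encoding of the categorical TSL column.
df = pd.get_dummies(df, columns=["tsl"], prefix="tsl")

print(df[["transcript_id", "norm_conservation", "length_delta"]])
```

After these steps every isoform is described relative to the other isoforms of its gene, which is what lets a single model compare isoforms across genes of very different length and conservation.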

2. Model Training

TRIFID uses a Random Forest Classifier from scikit-learn as its core predictive model:

# From trifid/models/train.py:86
custom_model = RandomForestClassifier(
    n_estimators=400,
    class_weight=None,
    max_features=7,
    min_samples_leaf=7,
    random_state=123
)

Why Random Forest?

Random Forests suit TRIFID because they:

Handle complex, non-linear relationships between features
Reduce overfitting by averaging many decorrelated trees
Offer interpretability through feature importance scores
Work well with mixed data types (continuous and one-hot encoded categorical)
Are relatively robust to outliers; missing values are imputed before training (see Handling Multiple Assemblies)
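To make the training step and the feature importance scores concrete, here is a minimal sketch that fits a classifier with the hyperparameters shown above. The data is synthetic (generated with scikit-learn's `make_classification`), so it stands in for the real isoform feature matrix only in shape.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix: 600 "isoforms", 45 features.
X, y = make_classification(
    n_samples=600, n_features=45, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Same hyperparameters as the excerpt from trifid/models/train.py.
model = RandomForestClassifier(
    n_estimators=400,
    class_weight=None,
    max_features=7,
    min_samples_leaf=7,
    random_state=123,
)
model.fit(X_train, y_train)

# Probability of the positive class for each held-out sample.
proba = model.predict_proba(X_test)[:, 1]

# Impurity-based importances, one value per feature, summing to 1.
importances = model.feature_importances_
top5 = importances.argsort()[::-1][:5]
print("Top feature indices:", top5)
```

Impurity-based importances are a quick first look; they can favor high-cardinality features, so they are best read alongside other interpretability methods.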

3. Nested Cross-Validation

TRIFID employs a rigorous nested cross-validation strategy for model selection (see trifid/models/select.py:209):

Outer Loop: 5-fold stratified cross-validation for model evaluation
Inner Loop: 10-fold stratified cross-validation for hyperparameter tuning
Metric: Matthews Correlation Coefficient (MCC) as the primary selection criterion

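Because hyperparameters are tuned only on the inner folds, the outer-loop scores give a less biased estimate of how the selected model performs on unseen data, and they make overfitting to the tuning data easier to detect.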
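The nested scheme can be sketched with scikit-learn by wrapping a `GridSearchCV` (inner loop) inside `cross_val_score` (outer loop). This is an illustration under assumptions, not the code in trifid/models/select.py: the parameter grid is hypothetical, the data is synthetic, and the forest is kept small so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data standing in for the labelled training set.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, random_state=0
)

# MCC as the selection and evaluation metric.
mcc = make_scorer(matthews_corrcoef)

# Inner loop: 10-fold stratified CV for hyperparameter tuning.
inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Outer loop: 5-fold stratified CV for model evaluation.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical grid; the grid TRIFID actually searches is not shown here.
param_grid = {"max_features": [3, 7], "min_samples_leaf": [3, 7]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=123),
    param_grid,
    scoring=mcc,
    cv=inner_cv,
)

# Each outer fold reruns the full inner search on its own training portion.
outer_scores = cross_val_score(search, X, y, scoring=mcc, cv=outer_cv)
print("Outer-fold MCC:", outer_scores.round(3))
print("Mean MCC:", outer_scores.mean().round(3))
```

The outer folds never influence the hyperparameter choice, which is the property that separates nested cross-validation from a single tuned cross-validation run.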

4. Prediction Generation

Once trained, the model generates two key scores for each transcript:

# From trifid/utils/utils.py:267
def generate_trifid_metrics(df, features, model):
    # Raw TRIFID score: predicted probability of the functional class (0-1)
    df['trifid_score'] = model.predict_proba(features)[:, 1]

    # Gene-normalized score: divide by the gene's highest raw score,
    # with the divisor floored at 0.5
    df['norm_trifid_score'] = df.groupby('gene_id')['trifid_score'].transform(
        lambda x: x / max(0.5, x.max())
    )

TRIFID Scores Explained

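trifid_score: Raw probability (0-1) that the isoform is functional
norm_trifid_score: trifid_score divided by the highest trifid_score in the same gene, with the divisor floored at 0.5
Scores ≥ 0.5 generally indicate functional isoforms
Normalization highlights the most functional isoform per gene: the top isoform reaches a normalized score of 1 only if its raw score is at least 0.5, so genes in which every isoform scores below 0.5 keep all normalized scores below 1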
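A small worked example shows the effect of the 0.5 floor in the normalization formula. The scores below are made up; only the `transform` expression is taken from the excerpt above.

```python
import pandas as pd

# Made-up raw scores for two genes.
df = pd.DataFrame({
    "gene_id": ["g1", "g1", "g2", "g2"],
    "transcript_id": ["t1", "t2", "t3", "t4"],
    "trifid_score": [0.9, 0.45, 0.3, 0.15],
})

# Same expression as in generate_trifid_metrics.
df["norm_trifid_score"] = df.groupby("gene_id")["trifid_score"].transform(
    lambda x: x / max(0.5, x.max())
)

print(df)
```

In gene g1 the divisor is the gene maximum (0.9), so the scores become 1.0 and 0.5. In gene g2 the maximum (0.3) is below the floor, so the divisor is 0.5 and the scores become 0.6 and 0.3: a gene with no confidently functional isoform is not promoted to a normalized score of 1.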

Prediction Workflow

Input Preparation

Provide genome assembly, release version, and transcript annotations

Feature Extraction

TRIFID extracts 45+ features from multiple data sources

Feature Engineering

Normalization, delta scoring, and encoding transformations are applied

Model Inference

The trained Random Forest model predicts functional probability

Score Generation

Both raw and normalized TRIFID scores are calculated

Output

Results are saved with transcript identifiers and metadata

Handling Multiple Assemblies

TRIFID supports predictions across multiple genome assemblies and species (see trifid/models/predict.py:32):

Human: GRCh38, GRCh37 (both Ensembl and RefSeq)
Mouse: GRCm39, GRCm38
Rat: Rnor_6.0
Zebrafish: GRCz11
Pig: Sscrofa11.1
Chimp: Pan_tro_3.0
Other species: Chicken (GRCg6a), Cow (ARS-UCD1.2), Fly (BDGP6), Worm (WBcel235)

Feature availability varies by species. Some features (e.g., PhyloCSF, RNA2sj) are only available for certain assemblies. TRIFID handles missing features by imputing with appropriate default values (-1 or 0).
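The imputation step can be pictured as follows. This is a sketch only: the source states that defaults of -1 or 0 are used, but which default applies to which feature is an assumption here, as is the `expected_features` list.

```python
import numpy as np
import pandas as pd

# Hypothetical list of features the model expects.
expected_features = ["length", "PhyloCSF", "RNA2sj"]
# Hypothetical mapping of feature to default value.
default_values = {"PhyloCSF": -1, "RNA2sj": 0}

# An assembly where PhyloCSF is unavailable and RNA2sj is partly missing.
df = pd.DataFrame({
    "transcript_id": ["t1", "t2"],
    "length": [500, 320],
    "RNA2sj": [0.8, np.nan],
})

# Add any absent feature columns (filled with NaN), in the expected order.
df = df.reindex(columns=["transcript_id"] + expected_features)
# Replace missing values column by column with the chosen defaults.
df = df.fillna(value=default_values)

print(df)
```

Using a sentinel such as -1 keeps "feature not available" distinguishable from a genuine zero, which a tree-based model can then split on.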

Training Data

The model is trained on curated isoform sets from GENCODE with experimentally validated functional status:

# From trifid/models/train.py:66
import pandas as pd

# The training set is a gzipped TSV, so the tab separator is set explicitly
df_training_set = pd.read_csv('data/model/training_set_initial.g27.tsv.gz', sep='\t')
df_training_set.loc[df_training_set['state'].str.contains('F'), 'label'] = 1  # Functional
df_training_set.loc[df_training_set['state'].str.contains('U'), 'label'] = 0  # Non-functional

Labels are derived from:

Proteomics evidence
APPRIS principal isoform annotations
Experimental validation studies
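The labelling logic in the excerpt above can be tried on a toy table. The `state` values used here ("F", "U", "FU", "X") are invented for illustration; the real file may use different codes.

```python
import pandas as pd

# Toy stand-in for the training set; state codes are hypothetical.
toy = pd.DataFrame({
    "transcript_id": ["t1", "t2", "t3", "t4"],
    "state": ["F", "U", "FU", "X"],
})

# Same two assignments as in the excerpt, applied in the same order.
toy.loc[toy["state"].str.contains("F"), "label"] = 1  # functional
toy.loc[toy["state"].str.contains("U"), "label"] = 0  # non-functional

# Rows matching neither pattern stay unlabelled (NaN) and are dropped here.
labelled = toy.dropna(subset=["label"]).astype({"label": int})

print(labelled)
print(labelled["label"].value_counts())
```

Two behaviors are worth noting: a state containing both letters ends up with label 0 because the second assignment overwrites the first, and a state containing neither letter receives no label at all.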

Performance Metrics

TRIFID evaluates model performance using comprehensive metrics (see trifid/models/select.py:178):

Accuracy: Fraction of isoforms classified correctly
AUC: Area under the ROC curve
Balanced Accuracy: Mean of the per-class recall, which corrects for class imbalance
F1 Score: Harmonic mean of precision and recall
MCC (Matthews Correlation Coefficient): Primary metric for model selection
Precision/Recall: Trade-off between false positives and false negatives

MCC is particularly valuable for TRIFID because it provides a balanced measure even when functional and non-functional isoforms are imbalanced in the dataset.
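The following sketch illustrates why MCC is informative under class imbalance. It compares two made-up classifiers on a made-up label set with 90 functional and 10 non-functional isoforms, using the metric functions from `sklearn.metrics`; AUC is left out because it needs predicted probabilities rather than hard labels.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)


def report(y_true, y_pred):
    """Collect the label-based metrics listed above in one dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }


# Imbalanced toy labels: 90 functional (1), 10 non-functional (0).
y_true = [1] * 90 + [0] * 10

# Classifier A labels everything as functional.
pred_a = [1] * 100
# Classifier B finds 80 of the 90 positives and 8 of the 10 negatives.
pred_b = [1] * 80 + [0] * 10 + [0] * 8 + [1] * 2

report_a = report(y_true, pred_a)
report_b = report(y_true, pred_b)
for name, rep in [("A", report_a), ("B", report_b)]:
    print(name, {k: round(v, 3) for k, v in rep.items()})
```

Classifier A has the higher accuracy even though it never identifies a non-functional isoform, while its MCC is 0. Classifier B scores slightly lower on accuracy but clearly higher on MCC and balanced accuracy, which matches the intuition that it is the more useful model.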

Next Steps

Model Architecture

Deep dive into Random Forest hyperparameters and model selection

Predictive Features

Explore all 45+ features used in predictions

Interpretability

Learn how SHAP values explain individual predictions

Quick Start

Start making predictions with TRIFID
