
Overview

TRIFID (TRanscript Isoform Functional IDentification) is a machine learning tool that predicts the functional potential of alternative splice isoforms. It uses a Random Forest classifier trained on 45+ features drawn from RNA-seq data, conservation scores, protein domain annotations, and structural predictions.
TRIFID predicts whether a transcript isoform is likely to be functional (capable of producing a stable, functional protein) or non-functional (likely to be degraded or to yield a non-functional protein).

The ML Pipeline

TRIFID’s prediction workflow consists of several interconnected stages:

1. Data Loading and Feature Engineering

The pipeline begins by integrating data from multiple sources to create a comprehensive feature set:
```python
# From trifid/data/feature_engineering.py:180
def load_data(config: dict, assembly: str, release: str) -> pd.DataFrame:
    # Loads multiple data sources:
    # - Genome annotation (GTF/GFF)
    # - APPRIS scores
    # - CORSAIR conservation
    # - QPfam domain analysis
    # - QSplice junction support
    # - PhyloCSF evolutionary scores
```
The system performs sophisticated feature transformations including:
  • Group Normalization: Features are normalized within each gene to capture relative differences between isoforms (see trifid/utils/utils.py:359)
  • Delta Scoring: Calculates length and score differences relative to the principal isoform (see trifid/utils/utils.py:127)
  • Fragment Correction: Adjusts scores for incomplete transcript fragments (see trifid/utils/utils.py:173)
  • One-Hot Encoding: Categorical features like TSL (Transcript Support Level) are encoded (see trifid/utils/utils.py:480)
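The first two transformations can be sketched with pandas. This is a minimal illustration, not TRIFID's exact implementation (which lives in trifid/utils/utils.py); it assumes min-max scaling within each gene for group normalization, and uses the gene's top-scoring isoform as a stand-in for the principal isoform in delta scoring:

```python
import pandas as pd

# Toy scores for two hypothetical genes.
df = pd.DataFrame({
    "gene_id": ["G1", "G1", "G1", "G2", "G2"],
    "score":   [10.0, 5.0, 0.0, 4.0, 2.0],
})

# Group normalization: min-max scale each feature within its gene,
# so isoforms are compared to their siblings, not genome-wide.
grouped = df.groupby("gene_id")["score"]
span = (grouped.transform("max") - grouped.transform("min")).replace(0, 1)
df["norm_score"] = (df["score"] - grouped.transform("min")) / span

# Delta scoring: difference from the gene's top-scoring isoform,
# used here as a stand-in for the principal isoform.
df["delta_score"] = df["score"] - grouped.transform("max")
```

Within-gene scaling matters because an absolute score that is low genome-wide can still mark the best isoform of its own gene.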

2. Model Training

TRIFID uses a Random Forest Classifier from scikit-learn as its core predictive model:
```python
# From trifid/models/train.py:86
custom_model = RandomForestClassifier(
    n_estimators=400,
    class_weight=None,
    max_features=7,
    min_samples_leaf=7,
    random_state=123
)
```
Random Forests are ideal for TRIFID because they:
  • Handle complex, non-linear relationships between features
  • Resist overfitting through ensemble averaging
  • Offer interpretability through feature importance scores
  • Work well with mixed continuous and categorical data
  • Are robust to outliers and, once imputed, to missing values
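A quick sketch of how such a classifier is fit and queried, using a random stand-in feature matrix rather than TRIFID's real training data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in feature matrix: 200 transcripts x 7 features.
rng = np.random.default_rng(123)
X = rng.normal(size=(200, 7))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in functional labels

model = RandomForestClassifier(
    n_estimators=400, max_features=7, min_samples_leaf=7, random_state=123
)
model.fit(X, y)

proba = model.predict_proba(X)[:, 1]      # P(functional) per transcript
importances = model.feature_importances_  # per-feature contribution, sums to 1
```

The `feature_importances_` vector is what later enables ranking which evidence types drive functionality calls.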

3. Nested Cross-Validation

TRIFID employs a rigorous nested cross-validation strategy for model selection (see trifid/models/select.py:209):
  • Outer Loop: 5-fold stratified cross-validation for model evaluation
  • Inner Loop: 10-fold stratified cross-validation for hyperparameter tuning
  • Metric: Matthews Correlation Coefficient (MCC) as the primary selection criterion
This guards against overfitting during model selection and yields an unbiased estimate of how well the model generalizes to unseen data.
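The scheme can be sketched with scikit-learn. This is a minimal illustration on synthetic data, with a smaller forest and parameter grid than TRIFID's actual search in trifid/models/select.py:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the labeled isoform training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

mcc = make_scorer(matthews_corrcoef)
inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop: hyperparameter search scored by MCC.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=123),
    param_grid={"min_samples_leaf": [3, 7], "max_features": [3, 7]},
    scoring=mcc,
    cv=inner,
)

# Outer loop: unbiased performance estimate for the tuned model.
outer_scores = cross_val_score(search, X, y, scoring=mcc, cv=outer)
```

Because hyperparameters are tuned only on inner folds, the outer-fold scores never reward choices made on the data they evaluate.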

4. Prediction Generation

Once trained, the model generates two key scores for each transcript:
```python
# From trifid/utils/utils.py:267
def generate_trifid_metrics(df, features, model):
    # Raw TRIFID score (0-1 probability of being functional)
    df['trifid_score'] = model.predict_proba(features)[:, 1]

    # Normalized score: divide by the gene's best score, floored at 0.5
    # so genes without a high-scoring isoform are not inflated
    df['norm_trifid_score'] = df.groupby('gene_id')['trifid_score'].transform(
        lambda x: x / max(0.5, x.max())
    )
```

TRIFID Scores Explained

  • trifid_score: Raw probability (0-1) that the isoform is functional
  • norm_trifid_score: Gene-normalized score comparing isoforms within the same gene
  • Scores ≥ 0.5 generally indicate functional isoforms
  • Normalization helps identify the most functional isoform per gene
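A small worked example of the normalization, using two hypothetical genes. The 0.5 floor in the denominator keeps a gene whose best isoform scores poorly from being inflated to 1.0:

```python
import pandas as pd

# Hypothetical raw scores for two genes.
df = pd.DataFrame({
    "gene_id": ["G1", "G1", "G2", "G2"],
    "trifid_score": [0.9, 0.45, 0.2, 0.1],
})

# Divide each isoform's score by max(0.5, best score in its gene).
df["norm_trifid_score"] = df.groupby("gene_id")["trifid_score"].transform(
    lambda x: x / max(0.5, x.max())
)
# G1: 0.9 -> 1.0, 0.45 -> 0.5 (divided by the gene max, 0.9)
# G2: 0.2 -> 0.4, 0.1 -> 0.2 (divided by the 0.5 floor, not 0.2)
```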

Prediction Workflow

1. Input Preparation: Provide the genome assembly, release version, and transcript annotations
2. Feature Extraction: TRIFID extracts 45+ features from multiple data sources
3. Feature Engineering: Normalization, delta scoring, and encoding transformations are applied
4. Model Inference: The trained Random Forest model predicts functional probability
5. Score Generation: Both raw and normalized TRIFID scores are calculated
6. Output: Results are saved with transcript identifiers and metadata

Handling Multiple Assemblies

TRIFID supports predictions across multiple genome assemblies and species (see trifid/models/predict.py:32):
  • Human: GRCh38, GRCh37 (both Ensembl and RefSeq)
  • Mouse: GRCm39, GRCm38
  • Rat: Rnor_6.0
  • Zebrafish: GRCz11
  • Pig: Sscrofa11.1
  • Chimp: Pan_tro_3.0
  • Other species: Chicken (GRCg6a), Cow (ARS-UCD1.2), Fly (BDGP6), Worm (WBcel235)
Feature availability varies by species. Some features (e.g., PhyloCSF, RNA2sj) are only available for certain assemblies. TRIFID handles missing features by imputing with appropriate default values (-1 or 0).
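This kind of imputation can be sketched with pandas. The column names and default mapping below are illustrative assumptions, not TRIFID's actual schema:

```python
import pandas as pd

# Illustrative defaults: score-like features get -1 ("no data"),
# count-like features get 0 (no supporting evidence).
defaults = {"phylocsf_score": -1, "rna2sj_reads": 0}

df = pd.DataFrame({
    "transcript_id": ["T1", "T2"],
    "phylocsf_score": [2.3, None],  # partially available feature
})

# Create any feature column this assembly lacks, then fill gaps.
for col, default in defaults.items():
    if col not in df.columns:
        df[col] = default
df = df.fillna(defaults)
```

Using a sentinel like -1 for score features keeps "no data" distinguishable from a genuine score of 0.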

Training Data

The model is trained on curated isoform sets from GENCODE with experimentally validated functional status:
```python
# From trifid/models/train.py:66
df_training_set = pd.read_csv('data/model/training_set_initial.g27.tsv.gz', sep='\t')
df_training_set.loc[df_training_set['state'].str.contains('F'), 'label'] = 1  # Functional
df_training_set.loc[df_training_set['state'].str.contains('U'), 'label'] = 0  # Non-functional
```
Labels are derived from:
  • Proteomics evidence
  • APPRIS principal isoform annotations
  • Experimental validation studies

Performance Metrics

TRIFID evaluates model performance using comprehensive metrics (see trifid/models/select.py:178):
  • Accuracy: Overall prediction correctness
  • AUC: Area under the ROC curve
  • Balanced Accuracy: Accounts for class imbalance
  • F1 Score: Harmonic mean of precision and recall
  • MCC (Matthews Correlation Coefficient): Primary metric for model selection
  • Precision/Recall: Trade-off between false positives and false negatives
MCC is particularly valuable for TRIFID because it provides a balanced measure even when functional and non-functional isoforms are imbalanced in the dataset.
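A toy example of why MCC is preferred over plain accuracy under class imbalance:

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Imbalanced toy set: 9 functional isoforms, 1 non-functional.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
y_pred = [1] * 10  # a degenerate classifier that always says "functional"

acc = accuracy_score(y_true, y_pred)     # 0.9 -- looks strong
mcc = matthews_corrcoef(y_true, y_pred)  # 0.0 -- reveals no real skill
```

MCC stays at 0 for any classifier that ignores the minority class, which is exactly the failure mode accuracy hides.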

Next Steps

Model Architecture

Deep dive into Random Forest hyperparameters and model selection

Predictive Features

Explore all 45+ features used in predictions

Interpretability

Learn how SHAP values explain individual predictions

Quick Start

Start making predictions with TRIFID
