Overview
TRIFID (TRanscript Isoform Functional IDentification) is a machine learning tool that predicts the functional potential of alternative splice isoforms. The system uses a Random Forest classifier trained on 45+ multi-dimensional features encompassing RNA-seq data, conservation scores, protein domain annotations, and structural predictions.
TRIFID predicts whether a transcript isoform is likely to be functional (capable of producing a stable, functional protein) or non-functional (likely degraded or producing non-functional proteins).
The ML Pipeline
TRIFID’s prediction workflow consists of several interconnected stages:
1. Data Loading and Feature Engineering
The pipeline begins by integrating data from multiple sources to create a comprehensive feature set:
- Group Normalization: Features are normalized within each gene to capture relative differences between isoforms (see trifid/utils/utils.py:359)
- Delta Scoring: Calculates length and score differences relative to the principal isoform (see trifid/utils/utils.py:127)
- Fragment Correction: Adjusts scores for incomplete transcript fragments (see trifid/utils/utils.py:173)
- One-Hot Encoding: Categorical features like TSL (Transcript Support Level) are encoded (see trifid/utils/utils.py:480)
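Three of the transformations above can be sketched with pandas. This is an illustration under stated assumptions, not the code in trifid/utils/utils.py: the column names, the toy values, the min-max form of the group normalization, and the helper names `group_normalize` and `delta_to_principal` are all hypothetical. Fragment correction is not covered here.

```python
import pandas as pd

# Toy isoform table. Column names are illustrative, not TRIFID's real schema.
isoforms = pd.DataFrame({
    "gene_id": ["G1", "G1", "G1", "G2", "G2"],
    "transcript_id": ["T1", "T2", "T3", "T4", "T5"],
    "length": [500, 350, 120, 800, 790],
    "conservation": [0.90, 0.60, 0.20, 0.70, 0.75],
    "is_principal": [True, False, False, True, False],
    "tsl": ["1", "2", "5", "1", "3"],
})


def group_normalize(frame, column, group="gene_id"):
    """Min-max scale `column` within each gene (assumed form of the normalization)."""
    grouped = frame.groupby(group)[column]
    low, high = grouped.transform("min"), grouped.transform("max")
    span = high - low
    # where() turns zero spans into NaN so the division stays well defined;
    # genes whose isoforms all share one value are mapped to 1.0.
    return ((frame[column] - low) / span.where(span > 0)).fillna(1.0)


def delta_to_principal(frame, column, group="gene_id"):
    """Difference between each isoform's value and its gene's principal isoform."""
    principal = frame.loc[frame["is_principal"]].set_index(group)[column]
    return frame[column] - frame[group].map(principal)


features = isoforms.assign(
    norm_conservation=group_normalize(isoforms, "conservation"),
    delta_length=delta_to_principal(isoforms, "length"),
    delta_conservation=delta_to_principal(isoforms, "conservation"),
)

# One-hot encode the categorical TSL column into tsl_1, tsl_2, ... indicators.
features = pd.get_dummies(features, columns=["tsl"], prefix="tsl")
print(features.drop(columns=["is_principal"]))
```

The sketch assumes exactly one principal isoform per gene; a gene with none would yield missing deltas and would need separate handling.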
2. Model Training
TRIFID uses a Random Forest Classifier from scikit-learn as its core predictive model.
Why Random Forest?
Random Forests are well suited to TRIFID because they:
- Handle complex, non-linear relationships between features
- Reduce the risk of overfitting by averaging predictions over many trees
- Offer interpretability through feature importance scores
- Work well with mixed data types (continuous and categorical)
- Are relatively robust to outliers and, with suitable handling, to missing values
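As a rough illustration of the model described in this step, the following sketch fits a scikit-learn `RandomForestClassifier` and reads back class probabilities and feature importances. The data is synthetic (generated with `make_classification`), and the hyperparameters are placeholders rather than TRIFID's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table: 300 "isoforms" with 45 features.
X, y = make_classification(
    n_samples=300, n_features=45, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder hyperparameters, chosen only to keep the example small.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Probability of the positive ("functional") class for each held-out sample.
proba = model.predict_proba(X_test)[:, 1]

# Indices of the five features with the highest impurity-based importance.
top_features = model.feature_importances_.argsort()[::-1][:5]
print("mean predicted probability:", proba.mean())
print("top feature indices:", top_features.tolist())
```

Impurity-based importances are a quick first look at which features drive the forest; they can favor high-cardinality features, so they are best read alongside other interpretability tools.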
3. Nested Cross-Validation
TRIFID employs a rigorous nested cross-validation strategy for model selection (see trifid/models/select.py:209):
- Outer Loop: 5-fold stratified cross-validation for model evaluation
- Inner Loop: 10-fold stratified cross-validation for hyperparameter tuning
- Metric: Matthews Correlation Coefficient (MCC) as the primary selection criterion
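The nested scheme above can be sketched with scikit-learn by wrapping a `GridSearchCV` (inner loop) inside `cross_val_score` (outer loop), both scored with MCC. This is an assumption-laden illustration, not the code in trifid/models/select.py: the data is synthetic and the parameter grid is deliberately tiny.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data standing in for the labeled isoform set.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=6, random_state=0
)

mcc = make_scorer(matthews_corrcoef)

# Inner loop: 10-fold stratified CV used only to pick hyperparameters.
inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Outer loop: 5-fold stratified CV used only to estimate performance.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Illustrative grid; the real search space is not specified here.
search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [3, None]},
    scoring=mcc,
    cv=inner_cv,
)

# Each outer fold reruns the full inner search on its own training split,
# so the held-out outer fold never influences hyperparameter choice.
outer_scores = cross_val_score(search, X, y, scoring=mcc, cv=outer_cv)
print("MCC per outer fold:", outer_scores.round(3))
print("mean MCC:", outer_scores.mean().round(3))
```

Keeping tuning and evaluation in separate loops is what makes the outer-fold MCC an unbiased estimate; tuning and scoring on the same folds would tend to overstate performance.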
4. Prediction Generation
Once trained, the model generates two key scores for each transcript:
TRIFID Scores Explained
- trifid_score: Raw probability (0-1) that the isoform is functional
- norm_trifid_score: Gene-normalized score comparing isoforms within the same gene
- Scores ≥ 0.5 generally indicate functional isoforms
- Normalization helps identify the most functional isoform per gene
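The relationship between the two scores can be illustrated with pandas. The exact normalization TRIFID applies is not stated here, so this sketch assumes one simple possibility: dividing each score by the highest score in its gene. The scores and identifiers are made up.

```python
import pandas as pd

# Made-up raw scores; in practice these would come from the trained model.
scores = pd.DataFrame({
    "gene_id": ["G1", "G1", "G1", "G2", "G2"],
    "transcript_id": ["T1", "T2", "T3", "T4", "T5"],
    "trifid_score": [0.92, 0.46, 0.05, 0.60, 0.30],
})

# ASSUMPTION: normalize by the best-scoring isoform of each gene, so the top
# isoform per gene gets 1.0. TRIFID's actual formula may differ.
gene_max = scores.groupby("gene_id")["trifid_score"].transform("max")
scores["norm_trifid_score"] = scores["trifid_score"] / gene_max

# Apply the 0.5 rule of thumb to the raw probability.
scores["likely_functional"] = scores["trifid_score"] >= 0.5

# Highest-scoring isoform per gene.
best_rows = scores.groupby("gene_id")["trifid_score"].idxmax()
top_isoforms = scores.loc[best_rows, "transcript_id"].tolist()

print(scores)
print("top isoform per gene:", top_isoforms)
```

Under this assumption the raw score answers "is this isoform functional at all?", while the normalized score answers "how does it compare with the best isoform of the same gene?".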
Prediction Workflow
Handling Multiple Assemblies
TRIFID supports predictions across multiple genome assemblies and species (see trifid/models/predict.py:32):
- Human: GRCh38, GRCh37 (both Ensembl and RefSeq)
- Mouse: GRCm39, GRCm38
- Rat: Rnor_6.0
- Zebrafish: GRCz11
- Pig: Sscrofa11.1
- Chimp: Pan_tro_3.0
- Other species: Chicken (GRCg6a), Cow (ARS-UCD1.2), Fly (BDGP6), Worm (WBcel235)
Training Data
The model is trained on curated isoform sets from GENCODE with experimentally validated functional status:
- Proteomics evidence
- APPRIS principal isoform annotations
- Experimental validation studies
Performance Metrics
TRIFID evaluates model performance using comprehensive metrics (see trifid/models/select.py:178):
- Accuracy: Overall prediction correctness
- AUC: Area under the ROC curve
- Balanced Accuracy: Accounts for class imbalance
- F1 Score: Harmonic mean of precision and recall
- MCC (Matthews Correlation Coefficient): Primary metric for model selection
- Precision/Recall: Trade-off between false positives and false negatives
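Each of the metrics listed above has a direct counterpart in `sklearn.metrics`. The sketch below computes them on a small hand-made set of labels and probabilities; the values are invented for illustration and the 0.5 threshold is the rule of thumb mentioned earlier, not necessarily what trifid/models/select.py uses.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented labels (1 = functional) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.90, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.20, 0.10, 0.05]

# Hard predictions at the 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    # AUC is computed from the probabilities, not the thresholded labels.
    "auc": roc_auc_score(y_true, y_prob),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
}

for name, value in metrics.items():
    print(f"{name:>18}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, which is one reason it is a reasonable primary criterion when the functional and non-functional classes are imbalanced.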
Next Steps
- Model Architecture: Deep dive into Random Forest hyperparameters and model selection
- Predictive Features: Explore all 45+ features used in predictions
- Interpretability: Learn how SHAP values explain individual predictions
- Quick Start: Start making predictions with TRIFID