TRIFID uses machine learning to predict splice isoform functionality. This guide covers the complete model training workflow, from preparing training data to hyperparameter optimization.
Overview
The training process involves:
Preparing a labeled training set
Selecting features for the model
Choosing a training mode (pretrained, custom, or model selection)
Evaluating model performance
Saving the trained model
Preparing Training Data
TRIFID requires labeled transcripts with functional annotations.
Your training set should be a TSV file with these columns:
transcript_id      state         evidence
ENST00000380152    FUNCTIONAL    Principal isoform
ENST00000544455    UNFUNCTIONAL  No protein evidence
ENST00000496384    NEUTRAL       Uncertain annotation
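Before training, it can help to load the file and verify its shape. A minimal sketch, using an inline string in place of a real file path (the column names are taken from the example above):

```python
import io

import pandas as pd

# Inline stand-in for a real training-set TSV file
tsv = (
    "transcript_id\tstate\tevidence\n"
    "ENST00000380152\tFUNCTIONAL\tPrincipal isoform\n"
    "ENST00000544455\tUNFUNCTIONAL\tNo protein evidence\n"
    "ENST00000496384\tNEUTRAL\tUncertain annotation\n"
)
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Basic sanity checks before handing the table to the training script
assert list(df.columns) == ["transcript_id", "state", "evidence"]
assert df["state"].isin(["FUNCTIONAL", "UNFUNCTIONAL", "NEUTRAL"]).all()
print(df["state"].value_counts())
```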
Creating Labels
TRIFID uses binary classification:
Label 1 (Functional): Transcripts with experimental evidence of function
Label 0 (Non-functional): Transcripts predicted to be non-functional
The training script (trifid/models/train.py:67-69) automatically converts:
df_training_set.loc[df_training_set["state"].str.contains("F"), "label"] = 1
df_training_set.loc[df_training_set["state"].str.contains("U"), "label"] = 0
Transcripts labeled as “NEUTRAL” are excluded from training to avoid ambiguous examples.
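The exclusion order matters for the substring checks above: NEUTRAL also contains a "U", so it must be dropped before labels are assigned. A sketch of the same logic with an explicit mapping (toy data, not the real training set):

```python
import pandas as pd

df = pd.DataFrame({
    "transcript_id": ["ENST00000380152", "ENST00000544455", "ENST00000496384"],
    "state": ["FUNCTIONAL", "UNFUNCTIONAL", "NEUTRAL"],
})

# Drop ambiguous NEUTRAL transcripts first
df = df[df["state"] != "NEUTRAL"].copy()

# Explicit state -> label mapping avoids substring pitfalls
# (UNFUNCTIONAL also contains "F")
df["label"] = df["state"].map({"FUNCTIONAL": 1, "UNFUNCTIONAL": 0})
print(df[["transcript_id", "label"]])
```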
Merging with Features
The training data is merged with your TRIFID database:
# From trifid/models/train.py:62-64
df_features = pd.read_csv(
    os.path.join("data", "genomes", "GRCh38", "g27", "trifid_db.tsv.gz"),
    sep="\t", compression="gzip"
)
df_training_set = pd.read_csv(
    os.path.join("data", "model", "training_set_initial.g27.tsv.gz"), sep="\t"
)
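The merge itself is not shown in the excerpt; assuming transcript_id is the shared key, it might look like the following sketch (toy frames standing in for trifid_db.tsv.gz and the training set, with a hypothetical unlabeled transcript to show what the join drops):

```python
import pandas as pd

df_features = pd.DataFrame({
    "transcript_id": ["ENST00000380152", "ENST00000544455", "ENST00000999999"],
    "length_delta_score": [0.95, 0.40, 0.10],
})
df_training_set = pd.DataFrame({
    "transcript_id": ["ENST00000380152", "ENST00000544455"],
    "label": [1, 0],
})

# Inner join keeps only labeled transcripts that also have feature rows
df_merged = df_training_set.merge(df_features, on="transcript_id", how="inner")
print(df_merged)
```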
Feature Selection
Select features that will be used for training.
Essential Features
Core features for good performance:
# Structural
- feature: "length_delta_score"
  category: "Structural"

# Domain integrity
- feature: "norm_spade"
  category: "APPRIS"
- feature: "pfam_score"
  category: "Domains"

# Splicing support
- feature: "norm_RNA2sj_cds"
  category: "Splicing"

# Conservation
- feature: "norm_ScorePerCodon"
  category: "PhyloCSF"
Loading Features
The training script loads feature names from your config:
# From trifid/models/train.py:59-60
df_features = pd.DataFrame(utils.parse_yaml(args.features))
features = df_features[~df_features["category"].str.contains("Identifier")]["feature"].values
Start with 10-15 features. Adding too many can lead to overfitting, especially with small training sets.
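For illustration, here is the same Identifier filter applied to a hand-built stand-in for the parsed YAML (utils.parse_yaml is assumed to return a list of feature/category records like this):

```python
import pandas as pd

# Hand-built stand-in for the parsed features YAML
parsed = [
    {"feature": "transcript_id", "category": "Identifier"},
    {"feature": "length_delta_score", "category": "Structural"},
    {"feature": "norm_spade", "category": "APPRIS"},
    {"feature": "pfam_score", "category": "Domains"},
]
df_features = pd.DataFrame(parsed)

# Identifier columns are excluded; everything else is a model feature
features = df_features[~df_features["category"].str.contains("Identifier")]["feature"].values
print(features)
```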
Training Modes
TRIFID offers three training modes to suit different needs.
Mode 1: Train with Pretrained Model
Use an existing model as a starting point:
python -m trifid.models.train \
--features config/features.yaml \
--pretrained \
--seed 123
This loads a saved model and evaluates it on your data without retraining:
# From trifid/models/train.py:98-106
pretrained_model = pickle.load(open(os.path.join("models", "selected_model.pkl"), "rb"))
model = Classifier(
    model=pretrained_model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col="label",
    random_state=args.seed,
)
Use when:
You want to evaluate TRIFID’s default model on your data
Fine-tuning is not necessary
Mode 2: Train Custom Model
Train a model with specified hyperparameters:
python -m trifid.models.train \
--features config/features.yaml \
--custom \
--seed 123
This defines a Random Forest with fixed hyperparameters:
# From trifid/models/train.py:85-96
custom_model = RandomForestClassifier(
    n_estimators=400,
    class_weight=None,
    max_features=7,
    min_samples_leaf=7,
    random_state=args.seed
)
model = Classifier(
    model=custom_model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col="label",
    random_state=args.seed,
)
model.save_model(outdir="models")
Use when:
You know optimal hyperparameters from previous experiments
You want fast training without optimization
Mode 3: Model Selection (Recommended)
Perform nested cross-validation to find the best model:
python -m trifid.models.train \
--features config/features.yaml \
--model_selection \
--seed 123
This runs an extensive hyperparameter search:
# From trifid/models/train.py:79-83
ms = ModelSelection(
    df_training_set,
    features_col=df_training_set[features],
    target_col="label",
    random_state=args.seed
)
model = ms.get_best_model(outdir="models")
Use when:
Training a production model
You have sufficient computational resources
Maximum performance is critical
Model selection can take several hours depending on your dataset size and available cores.
Nested Cross-Validation
The model selection process uses nested CV to avoid overfitting.
Architecture
Outer loop: Performance estimation
5-fold stratified split for unbiased performance estimation:

# From trifid/models/select.py:346-348
def _outer_cv(self, shuffle: bool = False):
    cv = StratifiedKFold(n_splits=self.n_outer_splits, shuffle=shuffle,
                         random_state=self.random_state)
    return cv
Inner loop: Hyperparameter optimization
10-fold cross-validation for hyperparameter selection:

# From trifid/models/select.py:350-352
def _inner_cv(self, shuffle: bool = False):
    cv = StratifiedKFold(n_splits=self.n_inner_splits, shuffle=shuffle,
                         random_state=self.random_state)
    return cv
Grid search
Tests multiple hyperparameter combinations:

# From trifid/models/select.py:429-438
"Random Forest": {
    "model": RandomForestClassifier(
        n_estimators=400,
        random_state=self.random_state,
        n_jobs=-1
    ),
    "grid1": [{
        "min_samples_leaf": list(range(5, 15)),
    }]
}
Evaluation Metrics
Multiple metrics are computed to assess model quality:
# From trifid/models/select.py:367-378
scores = {
    "Accuracy": accuracy_score(target, predictions),
    "AUC": roc_auc_score(target, probs),
    "Average Precision Score": average_precision_score(target, probs),
    "Balanced Accuracy": balanced_accuracy_score(target, predictions),
    "F1 Score": f1_score(target, predictions),
    "Log Loss": -1 * log_loss(target, probs),
    "MCC": matthews_corrcoef(target, predictions),
    "Precision": precision_score(target, predictions),
    "Recall": recall_score(target, predictions),
}
TRIFID uses Matthews Correlation Coefficient (MCC) as the primary metric for model selection, as it’s robust to class imbalance.
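A quick illustration of why: on imbalanced labels, a classifier that always predicts the majority class can still post high accuracy, while MCC correctly reports zero predictive power.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Imbalanced toy labels: nine negatives, one positive
y_true = [0] * 9 + [1]
# A degenerate model that always predicts the majority class
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))     # 0.9 -- looks deceptively good
print(matthews_corrcoef(y_true, y_pred))  # 0.0 -- no predictive power
```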
Hyperparameter Tuning
For custom models, key hyperparameters to tune:
Random Forest Parameters
n_estimators: Number of trees in the forest
Default: 400
Range: 100-1000
Higher values → better performance but slower
min_samples_leaf: Minimum samples required at leaf nodes
Default: 7
Range: 5-15
Higher values → prevent overfitting
max_features: Number of features to consider for splits
Default: 7
Range: 5-10
Lower values → more diversity between trees
class_weight: Handle class imbalance
Options: None, 'balanced'
Use 'balanced' if you have imbalanced classes
Example: Custom Hyperparameters
from sklearn.ensemble import RandomForestClassifier

# Optimized for small training sets
model = RandomForestClassifier(
    n_estimators=500,
    min_samples_leaf=10,
    max_features=8,
    class_weight='balanced',
    random_state=123,
    n_jobs=-1  # Use all CPU cores
)
Training Output
The training process generates several outputs.
Saved Model File
models/
├── selected_model.pkl # Best model from selection
├── custom_model.pkl # Custom trained model
└── model_selection_2026-03-04.tsv.gz # Results summary
Load a trained model:
import pickle

with open('models/selected_model.pkl', 'rb') as f:
    model = pickle.load(f)
Training Logs
Model selection creates detailed logs:
models/model_selection_2026-03-04T10-30-45.log
Example log output:
TRIFID Nested CV (5 outer folds - 10 inner folds) Model Selection
Training instances: 1200 (Test: 300)
Random State seed: 123
Processors: 20
----------
Algorithm: Random Forest
Inner loop:
(1)
Best params: {'min_samples_leaf': 7}
Train MCC: 0.8234
Test MCC: 0.7891
...
Avg. MCC (on validation folds): 0.801 +/- 0.023
Access model performance:
from trifid.models.select import Classifier

model = Classifier(
    model=your_model,
    df=df_training_set,
    features_col=features,
    target_col="label",
    random_state=123
)

# Get metrics
print(model.evaluate)
print(model.classification_report)
print(model.confusion_matrix)
Output:
metric
Accuracy             0.8234
AUC                  0.8912
Balanced Accuracy    0.8156
F1 Score             0.7923
MCC                  0.6891
Training Set Requirements
Minimum Sample Size
Recommended minimum: 500 labeled transcripts
Better performance: 1000+ transcripts
Optimal: 2000+ transcripts with diverse functional states
Class Balance
Aim for reasonable balance between classes:
# Check class distribution
print(df_training_set['label'].value_counts())
# Output:
# 1    650  # Functional
# 0    550  # Non-functional
If imbalanced, use:
# From trifid/utils/utils.py:98-110
def balanced_training_set(df: pd.DataFrame, seed: int = 1) -> pd.DataFrame:
    return pd.concat([
        df[df["label"] == 1],
        df[df["label"] == 0].sample(df[df["label"] == 1].shape[0],
                                    random_state=seed)
    ]).reset_index(drop=True)

df_balanced = balanced_training_set(df_training_set)
Validation Strategies
Cross-Validation
Evaluate model stability:
model = Classifier(...)  # Your trained model

# 5-fold cross-validation
cv_results = model.cross_validate
print(cv_results)
Gene-Level Splitting
For a more conservative estimate, split by genes rather than transcripts:
model = Classifier(
    model=your_model,
    df=df_training_set,
    features_col=features,
    target_col="label",
    random_state=123,
    split_by_gene=True  # Ensures all isoforms of a gene stay together
)
This prevents data leakage when genes have multiple isoforms.
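A comparable split can be sketched with scikit-learn's GroupKFold, grouping transcripts by their parent gene (how split_by_gene works internally is not shown here, and the gene IDs below are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: six transcripts from three genes, two isoforms each
X = np.arange(12).reshape(6, 2)
y = np.array([1, 1, 0, 0, 1, 0])
genes = np.array(["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_C"])

# Each gene's isoforms land entirely in train or entirely in test
splits = list(GroupKFold(n_splits=3).split(X, y, groups=genes))
for train_idx, test_idx in splits:
    assert set(genes[train_idx]).isdisjoint(genes[test_idx])
    print("test genes:", sorted(set(genes[test_idx])))
```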
Troubleshooting
Overfitting
Symptoms:
High training accuracy (above 95%) but low test accuracy (below 70%)
Large gap between train and validation MCC
Solutions:
Increase min_samples_leaf
Reduce number of features
Add more training data
Use class_weight='balanced'
Underfitting
Symptoms:
Low training accuracy (below 75%)
Similar train and test performance but both poor
Solutions:
Add more informative features
Decrease min_samples_leaf
Increase n_estimators
Check for missing values in features
Long Training Time
Solutions:
Reduce hyperparameter grid size
Use fewer outer/inner CV folds
Decrease n_estimators
Use more CPU cores (n_jobs=-1)
Memory Errors
Solutions:
Train on a subset of features
Reduce n_estimators
Process in batches
Use a machine with more RAM
Best Practices
Reserve 20-30% of labeled data for final testing, completely separate from training and model selection.
Always set and document the random seed for reproducibility: python -m trifid.models.train --seed 42
Save models with descriptive names: model.save_model(outdir="models", name="trifid_v1_grch38_mcc0.82.pkl")
Don’t rely on accuracy alone. Check MCC, AUC, and F1 score together.
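As a sketch of the first tip above, a stratified hold-out split reserves a final test set before any model selection happens (synthetic data standing in for your labeled feature matrix):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled feature matrix
X, y = make_classification(n_samples=1000, weights=[0.55], random_state=42)

# Reserve 25% as a final test set, stratified to preserve class balance;
# model selection should only ever see X_train / y_train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```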
Next Steps
Make Predictions: Apply your trained model to score isoforms
Interpret Results: Understand TRIFID scores and SHAP explanations