TRIFID uses machine learning to predict splice isoform functionality. This guide covers the complete model training workflow, from preparing training data to hyperparameter optimization.

Overview

The training process involves:
  1. Preparing a labeled training set
  2. Selecting features for the model
  3. Choosing a training mode (pretrained, custom, or model selection)
  4. Evaluating model performance
  5. Saving the trained model

Preparing Training Data

TRIFID requires labeled transcripts with functional annotations.

Training Set Format

Your training set should be a TSV file with these columns:
transcript_id    state           evidence
ENST00000380152  FUNCTIONAL      Principal isoform
ENST00000544455  UNFUNCTIONAL    No protein evidence
ENST00000496384  NEUTRAL         Uncertain annotation
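
A quick way to sanity-check this file before training is to load it and verify the expected columns and states. This is a sketch assuming pandas, using an in-memory stand-in for the TSV; the column names follow the table above:

```python
import io

import pandas as pd

# Toy stand-in for the training-set TSV described above
raw = (
    "transcript_id\tstate\tevidence\n"
    "ENST00000380152\tFUNCTIONAL\tPrincipal isoform\n"
    "ENST00000544455\tUNFUNCTIONAL\tNo protein evidence\n"
    "ENST00000496384\tNEUTRAL\tUncertain annotation\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Verify required columns and allowed states before training
required = {"transcript_id", "state", "evidence"}
assert required.issubset(df.columns), f"missing columns: {required - set(df.columns)}"
assert df["state"].isin({"FUNCTIONAL", "UNFUNCTIONAL", "NEUTRAL"}).all()
print(df["state"].value_counts())
```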

Creating Labels

TRIFID uses binary classification:
  • Label 1 (Functional): Transcripts with experimental evidence of function
  • Label 0 (Non-functional): Transcripts predicted to be non-functional
The training script (trifid/models/train.py:67-69) automatically converts:
df_training_set.loc[df_training_set["state"].str.contains("F"), "label"] = 1
df_training_set.loc[df_training_set["state"].str.contains("U"), "label"] = 0
Transcripts labeled as “NEUTRAL” are excluded from training to avoid ambiguous examples. Note that the order of the two assignments matters: “UNFUNCTIONAL” also contains an “F” and is first set to 1, then overwritten to 0 by the second line.
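
The same logic can be written as an explicit mapping, which makes the NEUTRAL exclusion visible and avoids the substring overlap. A sketch (not the project's own code), assuming pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "transcript_id": ["ENST00000380152", "ENST00000544455", "ENST00000496384"],
    "state": ["FUNCTIONAL", "UNFUNCTIONAL", "NEUTRAL"],
})

# Drop ambiguous NEUTRAL examples, then map states to binary labels
df = df[df["state"] != "NEUTRAL"].copy()
df["label"] = df["state"].map({"FUNCTIONAL": 1, "UNFUNCTIONAL": 0})
print(df[["transcript_id", "label"]])
```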

Merging with Features

The training data is merged with your TRIFID database:
# From trifid/models/train.py:62-64
df_features = pd.read_csv(
    os.path.join("data", "genomes", "GRCh38", "g27", "trifid_db.tsv.gz"), 
    sep="\t", compression="gzip"
)

df_training_set = pd.read_csv(
    os.path.join("data", "model", "training_set_initial.g27.tsv.gz"), sep="\t"
)
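
The merge itself joins the two tables on the transcript identifier. A minimal sketch with toy frames (the `transcript_id` join key and an inner join are assumptions; adjust to your schema):

```python
import pandas as pd

# Toy stand-ins for the two tables loaded above
df_features = pd.DataFrame({
    "transcript_id": ["ENST00000380152", "ENST00000544455", "ENST00000496384"],
    "pfam_score": [0.91, 0.12, 0.55],
})
df_training_set = pd.DataFrame({
    "transcript_id": ["ENST00000380152", "ENST00000544455"],
    "label": [1, 0],
})

# Inner join keeps only labeled transcripts that have feature values
df_merged = df_training_set.merge(df_features, on="transcript_id", how="inner")
print(df_merged)
```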

Feature Selection

Select features that will be used for training.

Essential Features

Core features for good performance:
config/features.yaml
# Structural
- feature: "length_delta_score"
  category: "Structural"
  
# Domain integrity
- feature: "norm_spade"
  category: "APPRIS"
- feature: "pfam_score"
  category: "Domains"

# Splicing support  
- feature: "norm_RNA2sj_cds"
  category: "Splicing"

# Conservation
- feature: "norm_ScorePerCodon"
  category: "PhyloCSF"

Loading Features

The training script loads feature names from your config:
# From trifid/models/train.py:59-60
df_features = pd.DataFrame(utils.parse_yaml(args.features))
features = df_features[~df_features["category"].str.contains("Identifier")]["feature"].values
Start with 10-15 features. Adding too many can lead to overfitting, especially with small training sets.
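
The Identifier filter above can be reproduced on a toy feature list. A sketch, with an "Identifier" row standing in for the ID columns your YAML would mark:

```python
import pandas as pd

# Toy stand-in for the parsed features.yaml
df_features = pd.DataFrame([
    {"feature": "transcript_id", "category": "Identifier"},
    {"feature": "length_delta_score", "category": "Structural"},
    {"feature": "pfam_score", "category": "Domains"},
])

# Keep every feature whose category is not an identifier column
features = df_features[~df_features["category"].str.contains("Identifier")]["feature"].values
print(features)
```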

Training Modes

TRIFID offers three training modes to suit different needs.

Mode 1: Train with Pretrained Model

Use an existing model as a starting point:
python -m trifid.models.train \
  --features config/features.yaml \
  --pretrained \
  --seed 123
This loads a saved model and evaluates it on your data without retraining:
# From trifid/models/train.py:98-106
pretrained_model = pickle.load(open(os.path.join("models", "selected_model.pkl"), "rb"))
model = Classifier(
    model=pretrained_model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col="label",
    random_state=args.seed,
)
Use when:
  • You want to evaluate TRIFID’s default model on your data
  • Fine-tuning is not necessary

Mode 2: Train Custom Model

Train a model with specified hyperparameters:
python -m trifid.models.train \
  --features config/features.yaml \
  --custom \
  --seed 123
This defines a Random Forest with fixed hyperparameters:
# From trifid/models/train.py:85-96
custom_model = RandomForestClassifier(
    n_estimators=400,
    class_weight=None,
    max_features=7,
    min_samples_leaf=7,
    random_state=args.seed
)

model = Classifier(
    model=custom_model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col="label",
    random_state=args.seed,
)
model.save_model(outdir="models")
Use when:
  • You know optimal hyperparameters from previous experiments
  • You want fast training without optimization

Mode 3: Model Selection

Perform nested cross-validation to find the best model:
python -m trifid.models.train \
  --features config/features.yaml \
  --model_selection \
  --seed 123
This runs an extensive hyperparameter search:
# From trifid/models/train.py:79-83
ms = ModelSelection(
    df_training_set,
    features_col=df_training_set[features],
    target_col="label",
    random_state=args.seed
)
model = ms.get_best_model(outdir="models")
Use when:
  • Training a production model
  • You have sufficient computational resources
  • Maximum performance is critical
Model selection can take several hours depending on your dataset size and available cores.

Nested Cross-Validation

The model selection process uses nested CV to avoid overfitting.

Architecture

1. Outer loop: Performance estimation

5-fold stratified split for unbiased performance estimation:
# From trifid/models/select.py:346-348
def _outer_cv(self, shuffle: bool = False):
    cv = StratifiedKFold(n_splits=self.n_outer_splits, shuffle=shuffle, 
                       random_state=self.random_state)
    return cv
2. Inner loop: Hyperparameter optimization

10-fold cross-validation for hyperparameter selection:
# From trifid/models/select.py:350-352
def _inner_cv(self, shuffle: bool = False):
    cv = StratifiedKFold(n_splits=self.n_inner_splits, shuffle=shuffle,
                       random_state=self.random_state)
    return cv
3. Grid search

Tests multiple hyperparameter combinations:
# From trifid/models/select.py:429-438
"Random Forest": {
    "model": RandomForestClassifier(
        n_estimators=400, 
        random_state=self.random_state,
        n_jobs=-1
    ),
    "grid1": [{
        "min_samples_leaf": list(range(5, 15)),
    }]
}

Evaluation Metrics

Multiple metrics are computed to assess model quality:
# From trifid/models/select.py:367-378
scores = {
    "Accuracy": accuracy_score(target, predictions),
    "AUC": roc_auc_score(target, probs),
    "Average Precision Score": average_precision_score(target, probs),
    "Balanced Accuracy": balanced_accuracy_score(target, predictions),
    "F1 Score": f1_score(target, predictions),
    "Log Loss": -1 * (log_loss(target, probs)),
    "MCC": matthews_corrcoef(target, predictions),
    "Precision": precision_score(target, predictions),
    "Recall": recall_score(target, predictions),
}
TRIFID uses Matthews Correlation Coefficient (MCC) as the primary metric for model selection, as it’s robust to class imbalance.
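
For intuition, MCC can be computed directly from confusion-matrix counts. A small illustrative implementation (sklearn's matthews_corrcoef does this for you):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Worked example: 10 / sqrt(5 * 6 * 4 * 5) ~= 0.4082
print(round(mcc(tp=4, tn=3, fp=1, fn=2), 4))  # 0.4082
```

Perfect prediction gives 1, random guessing about 0, and total disagreement -1, regardless of how imbalanced the classes are.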

Hyperparameter Tuning

For custom models, key hyperparameters to tune:

Random Forest Parameters

n_estimators: Number of trees in the forest
  • Default: 400
  • Range: 100-1000
  • Higher values → better performance but slower
min_samples_leaf: Minimum samples required at leaf nodes
  • Default: 7
  • Range: 5-15
  • Higher values → prevent overfitting
max_features: Number of features to consider for splits
  • Default: 7
  • Range: 5-10
  • Lower values → more diversity between trees
class_weight: Handle class imbalance
  • Options: None, 'balanced'
  • Use 'balanced' if you have imbalanced classes

Example: Custom Hyperparameters

from sklearn.ensemble import RandomForestClassifier

# Optimized for small training sets
model = RandomForestClassifier(
    n_estimators=500,
    min_samples_leaf=10,
    max_features=8,
    class_weight='balanced',
    random_state=123,
    n_jobs=-1  # Use all CPU cores
)

Training Output

The training process generates several outputs.

Saved Model File

models/
├── selected_model.pkl              # Best model from selection
├── custom_model.pkl                # Custom trained model
└── model_selection_2026-03-04.tsv.gz  # Results summary
Load a trained model:
import pickle

with open('models/selected_model.pkl', 'rb') as f:
    model = pickle.load(f)
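
Once loaded, the model is a regular scikit-learn classifier. A self-contained sketch of the round trip, using a toy model standing in for selected_model.pkl:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a toy model standing in for models/selected_model.pkl
X, y = make_classification(n_samples=100, n_features=5, random_state=123)
clf = RandomForestClassifier(n_estimators=20, random_state=123).fit(X, y)

# Round-trip through pickle, as the training script does with the model file
blob = pickle.dumps(clf)
model = pickle.loads(blob)

# Column 1 of predict_proba is the probability of the positive (functional) class
probs = model.predict_proba(X[:3])[:, 1]
print(probs)
```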

Training Logs

Model selection creates detailed logs:
models/model_selection_2026-03-04T10-30-45.log
Example log output:
TRIFID Nested CV (5 outer folds - 10 inner folds) Model Selection
Training instances: 1200 (Test: 300)
Random State seed: 123
Processors: 20
----------

Algorithm: Random Forest
  Inner loop:
  (1)
  Best params: {'min_samples_leaf': 7}
  Train MCC: 0.8234
  Test MCC: 0.7891
  ...
  
  Avg. MCC (on validation folds): 0.801 +/- 0.023

Performance Metrics

Access model performance:
from trifid.models.select import Classifier

model = Classifier(
    model=your_model,
    df=df_training_set,
    features_col=features,
    target_col="label",
    random_state=123
)

# Get metrics
print(model.evaluate)
print(model.classification_report)
print(model.confusion_matrix)
Output:
                          metric
Accuracy                   0.8234
AUC                        0.8912
Balanced Accuracy          0.8156
F1 Score                   0.7923
MCC                        0.6891

Training Set Requirements

Minimum Sample Size

  • Recommended minimum: 500 labeled transcripts
  • Better performance: 1000+ transcripts
  • Optimal: 2000+ transcripts with diverse functional states

Class Balance

Aim for reasonable balance between classes:
# Check class distribution
print(df_training_set['label'].value_counts())

# Output:
# 1    650  # Functional
# 0    550  # Non-functional
If imbalanced, use:
# From trifid/utils/utils.py:98-110
def balanced_training_set(df: pd.DataFrame, seed: int = 1) -> pd.DataFrame:
    return pd.concat([
        df[df["label"] == 1],
        df[df["label"] == 0].sample(df[df["label"] == 1].shape[0], 
                                    random_state=seed)
    ]).reset_index(drop=True)

df_balanced = balanced_training_set(df_training_set)
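
A quick check that the helper produced equal class counts, using the function as defined above on a toy frame:

```python
import pandas as pd

def balanced_training_set(df: pd.DataFrame, seed: int = 1) -> pd.DataFrame:
    # Downsample the majority (label 0) class to match the label 1 count
    return pd.concat([
        df[df["label"] == 1],
        df[df["label"] == 0].sample(df[df["label"] == 1].shape[0],
                                    random_state=seed)
    ]).reset_index(drop=True)

# Toy imbalanced set: 3 functional vs 6 non-functional
df = pd.DataFrame({"label": [1, 1, 1, 0, 0, 0, 0, 0, 0]})
df_balanced = balanced_training_set(df)
print(df_balanced["label"].value_counts())
```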

Validation Strategies

Cross-Validation

Evaluate model stability:
model = Classifier(...)  # Your trained model

# 5-fold cross-validation
cv_results = model.cross_validate
print(cv_results)

Gene-Level Splitting

For a more conservative estimate, split by genes rather than transcripts:
model = Classifier(
    model=your_model,
    df=df_training_set,
    features_col=features,
    target_col="label",
    random_state=123,
    split_by_gene=True  # Ensures all isoforms of a gene stay together
)
This prevents data leakage when genes have multiple isoforms.
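
The same idea can be reproduced with scikit-learn's GroupKFold, using gene IDs as the grouping variable. A sketch on toy data (the gene labels here are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 transcripts from 3 genes
X = np.arange(12).reshape(6, 2)
y = np.array([1, 0, 1, 0, 1, 0])
genes = np.array(["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_C"])

# All isoforms of a gene land in the same fold, preventing leakage
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=genes):
    assert set(genes[train_idx]).isdisjoint(genes[test_idx])
    print("test genes:", sorted(set(genes[test_idx])))
```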

Troubleshooting

Overfitting

Symptoms:
  • High training accuracy (above 95%) but low test accuracy (below 70%)
  • Large gap between train and validation MCC
Solutions:
  1. Increase min_samples_leaf
  2. Reduce number of features
  3. Add more training data
  4. Use class_weight='balanced'

Underfitting

Symptoms:
  • Low training accuracy (below 75%)
  • Similar train and test performance but both poor
Solutions:
  1. Add more informative features
  2. Decrease min_samples_leaf
  3. Increase n_estimators
  4. Check for missing values in features
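
The missing-value check in step 4 can be as simple as summing NaNs per feature column. A sketch assuming pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pfam_score": [0.9, np.nan, 0.4],
    "norm_spade": [0.8, 0.7, 0.6],
})

# Per-feature count of missing values; non-zero rows need imputation or dropping
missing = df.isna().sum()
print(missing[missing > 0])
```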

Long Training Time

Solutions:
  • Reduce hyperparameter grid size
  • Use fewer outer/inner CV folds
  • Decrease n_estimators
  • Use more CPU cores (n_jobs=-1)

Memory Errors

Solutions:
  • Train on a subset of features
  • Reduce n_estimators
  • Process in batches
  • Use a machine with more RAM

Best Practices

Reserve 20-30% of labeled data for final testing, completely separate from training and model selection.
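
One way to carve out that hold-out set, sketched with scikit-learn's train_test_split on synthetic data; stratifying preserves the class ratio in both halves:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 25% for final testing, stratified to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```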
Always set and document the random seed for reproducibility:
python -m trifid.models.train --seed 42
Save models with descriptive names:
model.save_model(outdir="models", name="trifid_v1_grch38_mcc0.82.pkl")
Don’t rely on accuracy alone. Check MCC, AUC, and F1 score together.

Next Steps

Make Predictions

Apply your trained model to score isoforms

Interpret Results

Understand TRIFID scores and SHAP explanations
