TRIFID uses machine learning to predict splice isoform functionality. This guide covers the complete model training workflow, from preparing training data to hyperparameter optimization.

Overview

The training process involves:
  1. Preparing a labeled training set
  2. Selecting features for the model
  3. Choosing a training mode (pretrained, custom, or model selection)
  4. Evaluating model performance
  5. Saving the trained model

Preparing Training Data

TRIFID requires labeled transcripts with functional annotations.

Training Set Format

Your training set should be a TSV file with these columns:
transcript_id    state           evidence
ENST00000380152  FUNCTIONAL      Principal isoform
ENST00000544455  UNFUNCTIONAL    No protein evidence
ENST00000496384  NEUTRAL         Uncertain annotation
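
A quick way to sanity-check this file before training is to load it and verify the expected columns and states. This is a sketch assuming pandas, using an in-memory stand-in for the TSV; the column names follow the table above:

```python
import io

import pandas as pd

# Toy stand-in for the training-set TSV described above
raw = (
    "transcript_id\tstate\tevidence\n"
    "ENST00000380152\tFUNCTIONAL\tPrincipal isoform\n"
    "ENST00000544455\tUNFUNCTIONAL\tNo protein evidence\n"
    "ENST00000496384\tNEUTRAL\tUncertain annotation\n"
)
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Verify required columns and allowed states before training
required = {"transcript_id", "state", "evidence"}
assert required.issubset(df.columns), f"missing columns: {required - set(df.columns)}"
assert df["state"].isin({"FUNCTIONAL", "UNFUNCTIONAL", "NEUTRAL"}).all()
print(df["state"].value_counts())
```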

Creating Labels

TRIFID uses binary classification:
  • Label 1 (Functional): Transcripts with experimental evidence of function
  • Label 0 (Non-functional): Transcripts predicted to be non-functional
The training script (trifid/models/train.py:67-69) automatically converts:
df_training_set.loc[df_training_set["state"].str.contains("F"), "label"] = 1
df_training_set.loc[df_training_set["state"].str.contains("U"), "label"] = 0
Transcripts labeled as “NEUTRAL” are excluded from training to avoid ambiguous examples. Note that the order of the two assignments matters: “UNFUNCTIONAL” also contains an “F” and is first set to 1, then overwritten to 0 by the second line.
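
The same logic can be written as an explicit mapping, which makes the NEUTRAL exclusion visible and avoids the substring overlap. A sketch (not the project's own code), assuming pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "transcript_id": ["ENST00000380152", "ENST00000544455", "ENST00000496384"],
    "state": ["FUNCTIONAL", "UNFUNCTIONAL", "NEUTRAL"],
})

# Drop ambiguous NEUTRAL examples, then map states to binary labels
df = df[df["state"] != "NEUTRAL"].copy()
df["label"] = df["state"].map({"FUNCTIONAL": 1, "UNFUNCTIONAL": 0})
print(df[["transcript_id", "label"]])
```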

Merging with Features

The training data is merged with your TRIFID database:
# From trifid/models/train.py:62-64
df_features = pd.read_csv(
    os.path.join("data", "genomes", "GRCh38", "g27", "trifid_db.tsv.gz"), 
    sep="\t", compression="gzip"
)

df_training_set = pd.read_csv(
    os.path.join("data", "model", "training_set_initial.g27.tsv.gz"), sep="\t"
)
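
The merge itself joins the two tables on the transcript identifier. A minimal sketch with toy frames (the `transcript_id` join key and an inner join are assumptions; adjust to your schema):

```python
import pandas as pd

# Toy stand-ins for the two tables loaded above
df_features = pd.DataFrame({
    "transcript_id": ["ENST00000380152", "ENST00000544455", "ENST00000496384"],
    "pfam_score": [0.91, 0.12, 0.55],
})
df_training_set = pd.DataFrame({
    "transcript_id": ["ENST00000380152", "ENST00000544455"],
    "label": [1, 0],
})

# Inner join keeps only labeled transcripts that have feature values
df_merged = df_training_set.merge(df_features, on="transcript_id", how="inner")
print(df_merged)
```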

Feature Selection

Select features that will be used for training.

Essential Features

Core features for good performance:
config/features.yaml
# Structural
- feature: "length_delta_score"
  category: "Structural"
  
# Domain integrity
- feature: "norm_spade"
  category: "APPRIS"
- feature: "pfam_score"
  category: "Domains"

# Splicing support  
- feature: "norm_RNA2sj_cds"
  category: "Splicing"

# Conservation
- feature: "norm_ScorePerCodon"
  category: "PhyloCSF"

Loading Features

The training script loads feature names from your config:
# From trifid/models/train.py:59-60
df_features = pd.DataFrame(utils.parse_yaml(args.features))
features = df_features[~df_features["category"].str.contains("Identifier")]["feature"].values
Start with 10-15 features. Adding too many can lead to overfitting, especially with small training sets.
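
The Identifier filter above can be reproduced on a toy feature list. A sketch, with an "Identifier" row standing in for the ID columns your YAML would mark:

```python
import pandas as pd

# Toy stand-in for the parsed features.yaml
df_features = pd.DataFrame([
    {"feature": "transcript_id", "category": "Identifier"},
    {"feature": "length_delta_score", "category": "Structural"},
    {"feature": "pfam_score", "category": "Domains"},
])

# Keep every feature whose category is not an identifier column
features = df_features[~df_features["category"].str.contains("Identifier")]["feature"].values
print(features)
```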

Training Modes

TRIFID offers three training modes to suit different needs.

Mode 1: Train with Pretrained Model

Use an existing model as a starting point:
python -m trifid.models.train \
  --features config/features.yaml \
  --pretrained \
  --seed 123
This loads a saved model and evaluates it on your data without retraining:
# From trifid/models/train.py:98-106
pretrained_model = pickle.load(open(os.path.join("models", "selected_model.pkl"), "rb"))
model = Classifier(
    model=pretrained_model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col="label",
    random_state=args.seed,
)
Use when:
  • You want to evaluate TRIFID’s default model on your data
  • Fine-tuning is not necessary

Mode 2: Train Custom Model

Train a model with specified hyperparameters:
python -m trifid.models.train \
  --features config/features.yaml \
  --custom \
  --seed 123
This defines a Random Forest with fixed hyperparameters:
# From trifid/models/train.py:85-96
custom_model = RandomForestClassifier(
    n_estimators=400,
    class_weight=None,
    max_features=7,
    min_samples_leaf=7,
    random_state=args.seed
)

model = Classifier(
    model=custom_model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col="label",
    random_state=args.seed,
)
model.save_model(outdir="models")
Use when:
  • You know optimal hyperparameters from previous experiments
  • You want fast training without optimization

Mode 3: Model Selection

Perform nested cross-validation to find the best model:
python -m trifid.models.train \
  --features config/features.yaml \
  --model_selection \
  --seed 123
This runs an extensive hyperparameter search:
# From trifid/models/train.py:79-83
ms = ModelSelection(
    df_training_set,
    features_col=df_training_set[features],
    target_col="label",
    random_state=args.seed
)
model = ms.get_best_model(outdir="models")
Use when:
  • Training a production model
  • You have sufficient computational resources
  • Maximum performance is critical
Model selection can take several hours depending on your dataset size and available cores.

Nested Cross-Validation

The model selection process uses nested CV to avoid overfitting.

Architecture

1. Outer loop: Performance estimation

5-fold stratified split for unbiased performance estimation:
# From trifid/models/select.py:346-348
def _outer_cv(self, shuffle: bool = False):
    cv = StratifiedKFold(n_splits=self.n_outer_splits, shuffle=shuffle, 
                       random_state=self.random_state)
    return cv
2. Inner loop: Hyperparameter optimization

10-fold cross-validation for hyperparameter selection:
# From trifid/models/select.py:350-352
def _inner_cv(self, shuffle: bool = False):
    cv = StratifiedKFold(n_splits=self.n_inner_splits, shuffle=shuffle,
                       random_state=self.random_state)
    return cv
3. Grid search

Tests multiple hyperparameter combinations:
# From trifid/models/select.py:429-438
"Random Forest": {
    "model": RandomForestClassifier(
        n_estimators=400, 
        random_state=self.random_state,
        n_jobs=-1
    ),
    "grid1": [{
        "min_samples_leaf": list(range(5, 15)),
    }]
}

Evaluation Metrics

Multiple metrics are computed to assess model quality:
# From trifid/models/select.py:367-378
scores = {
    "Accuracy": accuracy_score(target, predictions),
    "AUC": roc_auc_score(target, probs),
    "Average Precision Score": average_precision_score(target, probs),
    "Balanced Accuracy": balanced_accuracy_score(target, predictions),
    "F1 Score": f1_score(target, predictions),
    "Log Loss": -1 * (log_loss(target, probs)),
    "MCC": matthews_corrcoef(target, predictions),
    "Precision": precision_score(target, predictions),
    "Recall": recall_score(target, predictions),
}
TRIFID uses Matthews Correlation Coefficient (MCC) as the primary metric for model selection, as it’s robust to class imbalance.
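
For intuition, MCC can be computed directly from confusion-matrix counts. A small illustrative implementation (sklearn's matthews_corrcoef does this for you):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Worked example: 10 / sqrt(5 * 6 * 4 * 5) ~= 0.4082
print(round(mcc(tp=4, tn=3, fp=1, fn=2), 4))  # 0.4082
```

Perfect prediction gives 1, random guessing about 0, and total disagreement -1, regardless of how imbalanced the classes are.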

Hyperparameter Tuning

For custom models, key hyperparameters to tune:

Random Forest Parameters

n_estimators: Number of trees in the forest
  • Default: 400
  • Range: 100-1000
  • Higher values → better performance but slower
min_samples_leaf: Minimum samples required at leaf nodes
  • Default: 7
  • Range: 5-15
  • Higher values → prevent overfitting
max_features: Number of features to consider for splits
  • Default: 7
  • Range: 5-10
  • Lower values → more diversity between trees
class_weight: Handle class imbalance
  • Options: None, 'balanced'
  • Use 'balanced' if you have imbalanced classes

Example: Custom Hyperparameters

from sklearn.ensemble import RandomForestClassifier

# Optimized for small training sets
model = RandomForestClassifier(
    n_estimators=500,
    min_samples_leaf=10,
    max_features=8,
    class_weight='balanced',
    random_state=123,
    n_jobs=-1  # Use all CPU cores
)

Training Output

The training process generates several outputs.

Saved Model File

models/
├── selected_model.pkl              # Best model from selection
├── custom_model.pkl                # Custom trained model
└── model_selection_2026-03-04.tsv.gz  # Results summary
Load a trained model:
import pickle

with open('models/selected_model.pkl', 'rb') as f:
    model = pickle.load(f)
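
Once loaded, the model is a regular scikit-learn classifier. A self-contained sketch of the round trip, using a toy model standing in for selected_model.pkl:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a toy model standing in for models/selected_model.pkl
X, y = make_classification(n_samples=100, n_features=5, random_state=123)
clf = RandomForestClassifier(n_estimators=20, random_state=123).fit(X, y)

# Round-trip through pickle, as the training script does with the model file
blob = pickle.dumps(clf)
model = pickle.loads(blob)

# Column 1 of predict_proba is the probability of the positive (functional) class
probs = model.predict_proba(X[:3])[:, 1]
print(probs)
```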

Training Logs

Model selection creates detailed logs:
models/model_selection_2026-03-04T10-30-45.log
Example log output:
TRIFID Nested CV (5 outer folds - 10 inner folds) Model Selection
Training instances: 1200 (Test: 300)
Random State seed: 123
Processors: 20
----------

Algorithm: Random Forest
  Inner loop:
  (1)
  Best params: {'min_samples_leaf': 7}
  Train MCC: 0.8234
  Test MCC: 0.7891
  ...
  
  Avg. MCC (on validation folds): 0.801 +/- 0.023

Performance Metrics

Access model performance:
from trifid.models.select import Classifier

model = Classifier(
    model=your_model,
    df=df_training_set,
    features_col=features,
    target_col="label",
    random_state=123
)

# Get metrics
print(model.evaluate)
print(model.classification_report)
print(model.confusion_matrix)
Output:
                          metric
Accuracy                   0.8234
AUC                        0.8912
Balanced Accuracy          0.8156
F1 Score                   0.7923
MCC                        0.6891

Training Set Requirements

Minimum Sample Size

  • Recommended minimum: 500 labeled transcripts
  • Better performance: 1000+ transcripts
  • Optimal: 2000+ transcripts with diverse functional states

Class Balance

Aim for reasonable balance between classes:
# Check class distribution
print(df_training_set['label'].value_counts())

# Output:
# 1    650  # Functional
# 0    550  # Non-functional
If imbalanced, use:
# From trifid/utils/utils.py:98-110
def balanced_training_set(df: pd.DataFrame, seed: int = 1) -> pd.DataFrame:
    return pd.concat([
        df[df["label"] == 1],
        df[df["label"] == 0].sample(df[df["label"] == 1].shape[0], 
                                    random_state=seed)
    ]).reset_index(drop=True)

df_balanced = balanced_training_set(df_training_set)
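
A quick check that the helper produced equal class counts, using the function as defined above on a toy frame:

```python
import pandas as pd

def balanced_training_set(df: pd.DataFrame, seed: int = 1) -> pd.DataFrame:
    # Downsample the majority (label 0) class to match the label 1 count
    return pd.concat([
        df[df["label"] == 1],
        df[df["label"] == 0].sample(df[df["label"] == 1].shape[0],
                                    random_state=seed)
    ]).reset_index(drop=True)

# Toy imbalanced set: 3 functional vs 6 non-functional
df = pd.DataFrame({"label": [1, 1, 1, 0, 0, 0, 0, 0, 0]})
df_balanced = balanced_training_set(df)
print(df_balanced["label"].value_counts())
```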

Validation Strategies

Cross-Validation

Evaluate model stability:
model = Classifier(...)  # Your trained model

# 5-fold cross-validation
cv_results = model.cross_validate
print(cv_results)

Gene-Level Splitting

For a more conservative estimate, split by genes rather than transcripts:
model = Classifier(
    model=your_model,
    df=df_training_set,
    features_col=features,
    target_col="label",
    random_state=123,
    split_by_gene=True  # Ensures all isoforms of a gene stay together
)
This prevents data leakage when genes have multiple isoforms.
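
The same idea can be reproduced with scikit-learn's GroupKFold, using gene IDs as the grouping variable. A sketch on toy data (the gene labels here are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 6 transcripts from 3 genes
X = np.arange(12).reshape(6, 2)
y = np.array([1, 0, 1, 0, 1, 0])
genes = np.array(["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_C"])

# All isoforms of a gene land in the same fold, preventing leakage
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=genes):
    assert set(genes[train_idx]).isdisjoint(genes[test_idx])
    print("test genes:", sorted(set(genes[test_idx])))
```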

Troubleshooting

Overfitting

Symptoms:
  • High training accuracy (above 95%) but low test accuracy (below 70%)
  • Large gap between train and validation MCC
Solutions:
  1. Increase min_samples_leaf
  2. Reduce number of features
  3. Add more training data
  4. Use class_weight='balanced'

Underfitting

Symptoms:
  • Low training accuracy (below 75%)
  • Similar train and test performance but both poor
Solutions:
  1. Add more informative features
  2. Decrease min_samples_leaf
  3. Increase n_estimators
  4. Check for missing values in features
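
The missing-value check in step 4 can be as simple as summing NaNs per feature column. A sketch assuming pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pfam_score": [0.9, np.nan, 0.4],
    "norm_spade": [0.8, 0.7, 0.6],
})

# Per-feature count of missing values; non-zero rows need imputation or dropping
missing = df.isna().sum()
print(missing[missing > 0])
```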

Long Training Time

Solutions:
  • Reduce hyperparameter grid size
  • Use fewer outer/inner CV folds
  • Decrease n_estimators
  • Use more CPU cores (n_jobs=-1)

Memory Errors

Solutions:
  • Train on a subset of features
  • Reduce n_estimators
  • Process in batches
  • Use a machine with more RAM

Best Practices

Reserve 20-30% of labeled data for final testing, completely separate from training and model selection.
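
One way to carve out that hold-out set, sketched with scikit-learn's train_test_split on synthetic data; stratifying preserves the class ratio in both halves:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 25% for final testing, stratified to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```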
Always set and document the random seed for reproducibility:
python -m trifid.models.train --seed 42
Save models with descriptive names:
model.save_model(outdir="models", name="trifid_v1_grch38_mcc0.82.pkl")
Don’t rely on accuracy alone. Check MCC, AUC, and F1 score together.

Next Steps

Make Predictions

Apply your trained model to score isoforms

Interpret Results

Understand TRIFID scores and SHAP explanations
