Overview

The train module trains TRIFID models using one of three protocols: a custom model with fixed hyperparameters, retraining of a pretrained model, or automated model selection via nested cross-validation.

Command-Line Usage

python -m trifid.models.train \
    --features config/features.yaml \
    --pretrained

Command-Line Arguments

-c, --custom (flag, default: False)
    Train and save a customized model with specific hyperparameters.
-f, --features (string, default: "config/features.yaml")
    Path to the YAML file describing the selected features.
-m, --model_selection (flag, default: False)
    Run a nested CV model selection protocol, then train and save the best model.
-p, --pretrained (flag, default: False)
    Retrain TRIFID starting from a previously trained model.
-s, --seed (integer, default: 123)
    Random seed for reproducibility.
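
The arguments above map naturally onto an argparse parser. The following is a minimal sketch that mirrors the documented flags and defaults; the function name build_parser and the help strings are assumptions for illustration, not the module's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="trifid.models.train")
    parser.add_argument("-c", "--custom", action="store_true",
                        help="Train and save a custom model")
    parser.add_argument("-f", "--features", default="config/features.yaml",
                        help="Path to the features YAML file")
    parser.add_argument("-m", "--model_selection", action="store_true",
                        help="Run nested CV model selection")
    parser.add_argument("-p", "--pretrained", action="store_true",
                        help="Retrain a previously trained model")
    parser.add_argument("-s", "--seed", type=int, default=123,
                        help="Random seed for reproducibility")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--pretrained", "--seed", "42"])
print(args.pretrained, args.custom, args.features, args.seed)
# prints: True False config/features.yaml 42
```

Flags declared with action="store_true" default to False, which matches the defaults listed above.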

Main Function

main()

Executes the training workflow based on command-line arguments.
from trifid.models.train import main

main()
The function performs the following steps:
  1. Parses command-line arguments
  2. Loads feature configuration from YAML
  3. Loads the TRIFID database and training set
  4. Executes one of three training protocols:
    • Model Selection: Uses ModelSelection class for nested CV
    • Custom Model: Trains a RandomForestClassifier with specified hyperparameters
    • Pretrained Model: Loads and retrains an existing model
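
Step 4 can be pictured as a small dispatch on the parsed flags. The sketch below is hypothetical: the function name choose_protocol, the precedence order (the order in which the protocols are listed above), and the error raised when no flag is set are all assumptions, and the real module may behave differently.

```python
from types import SimpleNamespace


def choose_protocol(args) -> str:
    """Return the name of the training protocol to run.

    Assumption: the first flag that is set wins, checked in the order
    model selection, custom, pretrained.
    """
    if args.model_selection:
        return "model_selection"
    if args.custom:
        return "custom"
    if args.pretrained:
        return "pretrained"
    # Assumed behavior: refuse to run when no protocol is selected.
    raise ValueError("Select one of --model_selection, --custom or --pretrained")


# SimpleNamespace stands in for the parsed command-line arguments.
demo_args = SimpleNamespace(model_selection=False, custom=True, pretrained=False)
print(choose_protocol(demo_args))  # prints: custom
```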

Training Protocols

Model Selection Protocol

When --model_selection is specified:
from trifid.models.select import ModelSelection

ms = ModelSelection(
    df_training_set,
    features_col=df_training_set[features].columns,
    target_col="label",
    random_state=123
)
model = ms.get_best_model(outdir="models")

Custom Model Training

When --custom is specified:
from sklearn.ensemble import RandomForestClassifier
from trifid.models.select import Classifier

custom_model = RandomForestClassifier(
    n_estimators=400,
    class_weight=None,
    max_features=7,
    min_samples_leaf=7,
    random_state=123
)

model = Classifier(
    model=custom_model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col="label",
    random_state=123
)
model.save_model(outdir="models")

Pretrained Model Training

When --pretrained is specified:
import os
import pickle

from trifid.models.select import Classifier

# Load the previously selected model; only unpickle files you trust.
with open(os.path.join("models", "selected_model.pkl"), "rb") as fh:
    pretrained_model = pickle.load(fh)

model = Classifier(
    model=pretrained_model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col="label",
    random_state=123
)

Default Hyperparameters

The default custom model uses the following hyperparameters:
  • n_estimators: 400
  • class_weight: None
  • max_features: 7
  • min_samples_leaf: 7
  • random_state: Specified by --seed (default: 123)
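
One way to keep these defaults in a single place is to collect them in a dictionary keyed by the RandomForestClassifier parameter names, with the seed supplied by the caller. The helper name default_rf_params is hypothetical and only illustrates how --seed would feed random_state.

```python
def default_rf_params(seed: int = 123) -> dict:
    """Keyword arguments for the documented default custom model."""
    return {
        "n_estimators": 400,
        "class_weight": None,
        "max_features": 7,
        "min_samples_leaf": 7,
        "random_state": seed,  # taken from --seed
    }


params = default_rf_params(seed=42)
# With scikit-learn installed, the model would be built as:
#   from sklearn.ensemble import RandomForestClassifier
#   custom_model = RandomForestClassifier(**params)
print(params["random_state"])  # prints: 42
```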

Input Data

Training Set Format

The training set should be a TSV file with:
  • Feature columns as specified in the features YAML
  • A state column containing:
    • “F” entries (labeled as 1 - functional)
    • “U” entries (labeled as 0 - non-functional)
  • Additional metadata columns: added, comment
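
The state-to-label mapping described above can be sketched with the standard library alone. The sample rows, the transcript_id and feature_a column names, and the strict check for unexpected states are assumptions made for illustration; only the state, added, and comment columns and the "label" target name come from this page.

```python
import csv
import io

# Tiny in-memory stand-in for a training set TSV (made-up values).
tsv_text = (
    "transcript_id\tfeature_a\tstate\tadded\tcomment\n"
    "tx1\t0.91\tF\tv1\texample\n"
    "tx2\t0.12\tU\tv1\texample\n"
)

STATE_TO_LABEL = {"F": 1, "U": 0}  # functional -> 1, non-functional -> 0

rows = []
for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
    state = row["state"]
    if state not in STATE_TO_LABEL:
        # Assumed policy: fail loudly on anything other than F or U.
        raise ValueError(f"Unexpected state value: {state!r}")
    row["label"] = STATE_TO_LABEL[state]
    rows.append(row)

print([(r["transcript_id"], r["label"]) for r in rows])
# prints: [('tx1', 1), ('tx2', 0)]
```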

TRIFID Database

The database is located at data/genomes/GRCh38/g27/trifid_db.tsv.gz and contains:
  • All feature values for transcript isoforms
  • Identifier columns
  • Annotation metadata
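
Because the database is a gzip-compressed TSV, it can be inspected without decompressing it on disk. The sketch below uses only the standard library and runs against a throwaway file; the helper name read_tsv_gz and the demo column names are assumptions, not the real schema.

```python
import csv
import gzip
import os
import tempfile


def read_tsv_gz(path: str) -> list[dict]:
    """Read a gzip-compressed TSV into a list of row dictionaries."""
    with gzip.open(path, mode="rt", newline="") as fh:
        return list(csv.DictReader(fh, delimiter="\t"))


# Demonstrate on a temporary file with made-up columns.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_db.tsv.gz")
    with gzip.open(demo_path, mode="wt", newline="") as fh:
        fh.write("transcript_id\tfeature_a\ntx1\t0.5\n")
    demo_rows = read_tsv_gz(demo_path)

print(demo_rows[0]["transcript_id"])  # prints: tx1
```

Reading every row into a list is fine for a quick look but may use a lot of memory on the full database. pandas.read_csv with sep="\t" also reads .gz files directly and is likely the more practical choice for analysis.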

Output

The training module generates:
  • selected_model.pkl: Serialized trained model
  • training_set_final.g27.tsv.gz: Processed training set
  • model_selection_*.tsv.gz: Model selection results (if using --model_selection)
  • model_selection_*.log: Training log file
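
A saved selected_model.pkl can be read back with pickle. The sketch below shows the round trip using a plain dictionary as a stand-in for a trained model; the helper name load_model is hypothetical, and the default models directory is taken from the examples on this page.

```python
import os
import pickle
import tempfile


def load_model(outdir: str = "models", name: str = "selected_model.pkl"):
    """Unpickle a saved model. Only load pickle files from trusted sources."""
    with open(os.path.join(outdir, name), "rb") as fh:
        return pickle.load(fh)


# Round-trip demo: a dict stands in for the trained model object.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "selected_model.pkl"), "wb") as fh:
        pickle.dump({"kind": "placeholder"}, fh)
    restored = load_model(outdir=tmp)

print(restored)  # prints: {'kind': 'placeholder'}
```

Unpickling a real model generally requires the same libraries, ideally in compatible versions, to be importable in the loading environment.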

Example Workflows

Train with Model Selection

python -m trifid.models.train \
    --features config/features.yaml \
    --model_selection \
    --seed 42

Train Custom Model

python -m trifid.models.train \
    --features config/features.yaml \
    --custom \
    --seed 42

Retrain Pretrained Model

python -m trifid.models.train \
    --features config/features.yaml \
    --pretrained \
    --seed 42
See the select module for details on:
  • Classifier: Model training and evaluation wrapper
  • ModelSelection: Nested cross-validation model selection
  • Splitter: Train/test splitting utilities
