Overview
The train module provides functionality for training TRIFID models using one of three protocols: a custom model, a pretrained model, or automated model selection via nested cross-validation.
Command-Line Usage
Command-Line Arguments
- --custom: Train and save a customized model with specific hyperparameters
- Features file: Path to the YAML file describing the selected features
- --model_selection: Perform a nested CV model selection protocol, training and saving the best model
- --pretrained: Train TRIFID with a previously trained model
- --seed: Random seed for reproducibility (default: 123)
Main Function
main()
Executes the training workflow based on command-line arguments:
- Parses command-line arguments
- Loads feature configuration from YAML
- Loads the TRIFID database and training set
- Executes one of three training protocols:
  - Model Selection: Uses the ModelSelection class for nested CV
  - Custom Model: Trains a RandomForestClassifier with specified hyperparameters
  - Pretrained Model: Loads and retrains an existing model
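The dispatch described above can be sketched with argparse. This is only an illustration under stated assumptions, not the module's actual parser: the option name --features, the use of boolean switches, and the mutually exclusive group are all hypothetical, while --custom, --model_selection, --pretrained, and --seed (default 123) come from this page.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser; the real module's parser may differ."""
    parser = argparse.ArgumentParser(description="Train a TRIFID model (sketch).")
    # ASSUMPTION: "--features" is a hypothetical name for the features YAML option.
    parser.add_argument("--features", default="features.yaml",
                        help="Path to the YAML file describing the selected features")
    # ASSUMPTION: the three protocols are modeled as mutually exclusive switches.
    protocol = parser.add_mutually_exclusive_group()
    protocol.add_argument("--custom", action="store_true",
                          help="Train a custom model with specific hyperparameters")
    protocol.add_argument("--model_selection", action="store_true",
                          help="Run nested CV model selection")
    protocol.add_argument("--pretrained", action="store_true",
                          help="Retrain a previously trained model")
    parser.add_argument("--seed", type=int, default=123,
                        help="Random seed for reproducibility")
    return parser


def choose_protocol(args: argparse.Namespace) -> str:
    """Return the name of the protocol selected on the command line."""
    if args.model_selection:
        return "model_selection"
    if args.custom:
        return "custom"
    if args.pretrained:
        return "pretrained"
    return "none"


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--custom", "--seed", "7"])
print(choose_protocol(args), args.seed)
```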
Training Protocols
Model Selection Protocol
When --model_selection is specified:
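The general idea of nested cross-validation can be sketched with plain scikit-learn. This does not use or reproduce the ModelSelection class; the synthetic data, the parameter grid, the fold counts, and the ROC AUC scoring are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the training set (ASSUMPTION: not TRIFID data).
X, y = make_classification(n_samples=200, n_features=10, random_state=123)

# Inner loop tunes hyperparameters; outer loop estimates generalization.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)

# ASSUMPTION: a tiny illustrative grid, not the grid used by TRIFID.
param_grid = {"n_estimators": [25, 50], "min_samples_leaf": [1, 7]}

search = GridSearchCV(
    RandomForestClassifier(random_state=123),
    param_grid,
    cv=inner_cv,
    scoring="roc_auc",
)

# Each outer fold reruns the whole inner search on its own training split.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

# Refit the search on all data to obtain the model that would be saved.
search.fit(X, y)
best_model = search.best_estimator_
print(outer_scores.mean(), search.best_params_)
```

The outer scores estimate how well the whole selection procedure generalizes, while the final refit on all data yields the model that would be saved.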
Custom Model Training
When --custom is specified:
Pretrained Model Training
When --pretrained is specified:
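A minimal sketch of loading and retraining an existing model, assuming the saved model is a pickled scikit-learn estimator (suggested by the .pkl extension, but not confirmed here). The data is synthetic and the temporary file only stands in for a previously saved model.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the old and the updated training sets (ASSUMPTION).
X_old, y_old = make_classification(n_samples=100, n_features=8, random_state=0)
X_new, y_new = make_classification(n_samples=120, n_features=8, random_state=1)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "selected_model.pkl"

    # Create a stand-in for a previously trained model on disk.
    previous = RandomForestClassifier(n_estimators=50, random_state=123)
    previous.fit(X_old, y_old)
    with open(model_path, "wb") as handle:
        pickle.dump(previous, handle)

    # Load the existing model back from the pickle file.
    with open(model_path, "rb") as handle:
        model = pickle.load(handle)

# The loaded estimator carries its hyperparameters with it.
params_before = model.get_params()

# Calling fit again rebuilds the forest on the new data with those
# same hyperparameters; the previously fitted trees are discarded.
model.fit(X_new, y_new)
print(model.get_params() == params_before)
```

Only load pickle files from trusted sources, since unpickling can execute arbitrary code.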
Default Hyperparameters
The default custom model uses the following hyperparameters:
- n_estimators: 400
- class_weight: None
- max_features: 7
- min_samples_leaf: 7
- random_state: Specified by --seed (default: 123)
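For illustration, the hyperparameters listed above map directly onto scikit-learn's RandomForestClassifier constructor. The sketch below fits such a model on synthetic data; the dataset and its 12 features are assumptions, picked so that max_features=7 does not exceed the number of features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

seed = 123  # value of --seed when left at its default

# Synthetic stand-in for the training features and labels (ASSUMPTION).
X, y = make_classification(n_samples=300, n_features=12, random_state=seed)

model = RandomForestClassifier(
    n_estimators=400,
    class_weight=None,
    max_features=7,
    min_samples_leaf=7,
    random_state=seed,
)
model.fit(X, y)

# Probability of the positive class (label 1) for each sample.
probabilities = model.predict_proba(X)[:, 1]
print(probabilities[:5])
```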
Input Data
Training Set Format
The training set should be a TSV file with:
- Feature columns as specified in the features YAML
- A state column containing:
  - “F” entries (labeled as 1, functional)
  - “U” entries (labeled as 0, unfunctional)
- Additional metadata columns: added, comment
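The label encoding described above can be sketched with pandas. The feature column names (feature_a, feature_b) and all row values are hypothetical; only the state, added, and comment columns and the F/U mapping come from this page.

```python
import io

import pandas as pd

# Tiny in-memory TSV standing in for the training set file (ASSUMPTION:
# feature names and values are invented for illustration).
tsv_text = (
    "feature_a\tfeature_b\tstate\tadded\tcomment\n"
    "0.91\t1\tF\tv1\tcurated\n"
    "0.12\t0\tU\tv1\tcurated\n"
    "0.77\t1\tF\tv2\t\n"
)

training_set = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Encode the state column: "F" (functional) -> 1, "U" (unfunctional) -> 0.
training_set["label"] = training_set["state"].map({"F": 1, "U": 0})

# In practice the feature list would come from the features YAML file.
feature_columns = ["feature_a", "feature_b"]
X = training_set[feature_columns]
y = training_set["label"]
print(y.tolist())
```

Any state value other than "F" or "U" would map to NaN with this approach, which makes unexpected labels easy to detect before training.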
TRIFID Database
Located at data/genomes/GRCh38/g27/trifid_db.tsv.gz, the database contains:
- All feature values for transcript isoforms
- Identifier columns
- Annotation metadata
Output
The training module generates:
- selected_model.pkl: Serialized trained model
- training_set_final.g27.tsv.gz: Processed training set
- model_selection_*.tsv.gz: Model selection results (if using --model_selection)
- model_selection_*.log: Training log file
Example Workflows
Train with Model Selection
Train Custom Model
Retrain Pretrained Model
Related Classes
See the select module for details on:
- Classifier: Model training and evaluation wrapper
- ModelSelection: Nested cross-validation model selection
- Splitter: Train/test splitting utilities