Random Forest Architecture
TRIFID employs a Random Forest Classifier as its core machine learning model. This ensemble learning method combines multiple decision trees to produce robust, accurate predictions of transcript isoform functionality.

Model Configuration
The production TRIFID model uses the hyperparameters broken down below. They were selected through extensive nested cross-validation experiments to optimize performance on the training dataset.
Hyperparameter Breakdown
n_estimators (400)
The number of decision trees in the forest.
- Why 400? More trees generally improve performance and stability, with diminishing returns beyond 300-500 trees
- Trade-off: Increased training time and memory usage vs. improved accuracy and reduced variance
- Each tree votes on the final prediction, making the ensemble robust to individual tree errors
max_features (7)
Number of features randomly sampled as candidates at each split point.
- Why 7? With 45+ total features, sampling 7 (~15-20%) provides good diversity while maintaining predictive power
- Effect: Introduces decorrelation between trees, reducing overfitting
- Rule of thumb: Often set to √(total_features) for classification (√45 ≈ 6.7, close to the chosen 7), then adjusted through hyperparameter tuning
min_samples_leaf (7)
Minimum number of samples required to be at a leaf node.
- Why 7? Prevents overfitting by ensuring leaves represent meaningful patterns
- Effect: Smooths predictions and improves generalization
- Trade-off: Higher values = simpler trees (may underfit), lower values = complex trees (may overfit)
class_weight (None)
Equal weighting for functional and non-functional classes.
- Why None? The training set is balanced, eliminating the need for class weighting
- Alternative: class_weight='balanced' automatically adjusts weights inversely proportional to class frequencies
random_state (123)
Seed for random number generation.
- Purpose: Ensures reproducible results across runs
- Impact: Controls bootstrap sampling and feature selection randomness
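Taken together, the configuration above maps directly onto scikit-learn. A minimal sketch, assuming a scikit-learn backend (the n_jobs setting is illustrative, not part of the table above):

```python
from sklearn.ensemble import RandomForestClassifier

# Production-style TRIFID hyperparameters as documented on this page.
clf = RandomForestClassifier(
    n_estimators=400,    # ~400 trees: gains plateau after ~300-500
    max_features=7,      # 7 of 45+ features sampled per split (~15-20%)
    min_samples_leaf=7,  # each leaf must cover at least 7 samples
    class_weight=None,   # balanced training set -> no reweighting
    random_state=123,    # reproducible bootstraps and feature draws
    n_jobs=-1,           # parallelize across all available cores
)
```

Calling `clf.fit(X, y)` on a feature matrix and binary labels then trains the ensemble.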
Model Selection Protocol
TRIFID uses a Nested Cross-Validation strategy to select the optimal model architecture and hyperparameters.

Architecture Overview
What is Nested Cross-Validation?
Nested CV provides unbiased model performance estimates by:
- Outer Loop (5 folds): Evaluates final model performance
- Inner Loop (10 folds): Tunes hyperparameters
Grid Search Configuration
TRIFID explores multiple model architectures during selection.

Hyperparameter Search Space
The grid search systematically tests:
- min_samples_leaf: 10 values (5 through 14)
- This creates 10 unique model configurations
- Each tested with 10-fold inner CV = 100 model fits per outer fold
- Total: 5 outer × 100 inner = 500 model evaluations
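The bookkeeping above can be checked directly. A sketch of the search space (the parameter name matches this page; the counting is plain arithmetic):

```python
# 10 candidate values for min_samples_leaf: 5 through 14 inclusive.
param_grid = {"min_samples_leaf": list(range(5, 15))}

n_configs = len(param_grid["min_samples_leaf"])  # 10 configurations
fits_per_outer_fold = n_configs * 10             # 10-fold inner CV -> 100 fits
total_fits = 5 * fits_per_outer_fold             # 5 outer folds -> 500 fits
```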
Alternative Models Considered
The model selection framework supports multiple algorithms (though Random Forest is the default).

Model Evaluation Metrics
TRIFID uses comprehensive metrics to assess model performance.

Primary Selection Metric: MCC
Matthews Correlation Coefficient (MCC) is used as the primary metric for model selection.

Why MCC?
MCC is superior for binary classification because it is:
- Balanced: Accounts for all confusion matrix categories (TP, TN, FP, FN)
- Robust: Works well even with class imbalance
- Range: -1 (perfect disagreement) to +1 (perfect prediction), 0 = random
- Informative: Captures both precision and recall simultaneously
For TRIFID, a high MCC means accurately identifying both functional and non-functional isoforms.
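A toy sketch of how MCC combines the confusion-matrix counts, using scikit-learn's matthews_corrcoef (the labels here are illustrative, with 1 = functional, 0 = non-functional):

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]  # one FN (index 2), one FP (index 5)

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
#     = (3*3 - 1*1) / sqrt(4*4*4*4) = 8/16 = 0.5
mcc = matthews_corrcoef(y_true, y_pred)
```

Unlike accuracy, the score would drop sharply if the classifier ignored either class, even on an imbalanced dataset.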
Training Pipeline
TRIFID provides three training modes:

1. Model Selection Mode
trifid/models/train.py:79-83
2. Custom Model Mode
trifid/models/train.py:85-96
3. Pretrained Model Mode
trifid/models/train.py:98-106
Model Persistence
Trained models are serialized using Python’s pickle format:
- selected_model.pkl: Trained Random Forest classifier
- model_selection_<timestamp>.tsv.gz: Grid search results and metrics
- model_selection_<timestamp>.log: Detailed training logs
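A minimal round-trip sketch with a small stand-in model (the file name mirrors selected_model.pkl; the data and model size are illustrative):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Train a small stand-in forest on synthetic data.
rng = np.random.default_rng(123)
X = rng.normal(size=(60, 5))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=20, random_state=123).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "selected_model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)      # serialize the trained classifier
with open(path, "rb") as fh:
    restored = pickle.load(fh)  # reload it for prediction
```

As with any pickle artifact, the model should be reloaded under a compatible scikit-learn version, and only from trusted sources.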
Classifier Class
The Classifier class (see trifid/models/select.py:94) wraps the Random Forest and provides the capabilities described below.
Pipeline Integration
The Classifier can include preprocessing steps. Currently, preprocessing=None is used, since no feature scaling is needed for tree-based models.

Performance Considerations
Computational Requirements
- Training time: ~10-30 minutes on 20 cores for full nested CV
- Memory: ~4-8 GB RAM for typical datasets
- Parallelization: Uses n_jobs=-1 to leverage all available CPU cores
Optimization Tips
- Reduce n_estimators: 200-300 trees often sufficient for faster training
- Limit grid search: Focus on 2-3 key hyperparameters
- Use warm_start: Resume training from previous model state
- Sample data: Train on subset for rapid prototyping
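The warm_start tip deserves a concrete illustration, since its usage pattern is easy to get wrong. A sketch on synthetic data (sizes are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

# warm_start=True keeps the trees already grown and only adds new ones
# when n_estimators is raised before the next fit() call.
clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=123)
clf.fit(X, y)            # grows the first 50 trees
clf.n_estimators = 100
clf.fit(X, y)            # grows 50 more; the original 50 are reused
```

This lets you grow the forest incrementally and check performance as trees are added, instead of retraining from scratch at each size.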
Next Steps
Predictive Features
Explore the 45+ features that power TRIFID’s predictions
Interpretability
Understand how SHAP explains model decisions
Training Custom Models
Learn how to train TRIFID on your own data
API Reference
Complete API documentation for model classes