Random Forest Architecture

TRIFID employs a Random Forest Classifier as its core machine learning model. This ensemble learning method combines multiple decision trees to produce robust, accurate predictions for transcript isoform functionality.

Model Configuration

The production TRIFID model uses the following hyperparameters:
# From trifid/models/train.py:86-89
RandomForestClassifier(
    n_estimators=400,      # Number of trees in the forest
    class_weight=None,     # Equal weight to both classes
    max_features=7,        # Features considered at each split
    min_samples_leaf=7,    # Minimum samples required at leaf nodes
    random_state=123       # Seed for reproducibility
)
These hyperparameters were selected through extensive nested cross-validation experiments to optimize performance on the training dataset.
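As a minimal sketch of how this configuration is used, the classifier can be instantiated and fit with scikit-learn. The feature matrix and labels below are synthetic stand-ins, not TRIFID's training data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for TRIFID's 45+ feature matrix and functional labels
X, y = make_classification(n_samples=200, n_features=45, random_state=123)

clf = RandomForestClassifier(
    n_estimators=400,      # number of trees
    class_weight=None,     # balanced training set, no reweighting
    max_features=7,        # features sampled at each split
    min_samples_leaf=7,    # minimum samples per leaf
    random_state=123,      # reproducibility
)
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # probability of the positive class
```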

Hyperparameter Breakdown

n_estimators (400)

The number of decision trees in the forest.
  • Why 400? More trees generally improve performance and stability, with diminishing returns after 300-500 trees
  • Trade-off: Increased training time and memory usage vs. improved accuracy and reduced variance
  • Each tree votes on the final prediction, making the ensemble robust to individual tree errors

max_features (7)

Number of features randomly sampled as candidates at each split point.
  • Why 7? With 45+ total features, sampling 7 (~15-20%) provides good diversity while maintaining predictive power
  • Effect: Introduces decorrelation between trees, reducing overfitting
  • Rule of thumb: Often set to √(total_features) for classification, adjusted through hyperparameter tuning
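For reference, the √(total_features) heuristic applied to 45 features lands almost exactly on the chosen value:

```python
import math

n_features = 45
sqrt_rule = math.sqrt(n_features)  # ≈ 6.71
print(round(sqrt_rule))            # rounds to 7, matching max_features=7
```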

min_samples_leaf (7)

Minimum number of samples required to be at a leaf node.
  • Why 7? Prevents overfitting by ensuring leaves represent meaningful patterns
  • Effect: Smooths predictions and improves generalization
  • Trade-off: Higher values = simpler trees (may underfit), lower values = complex trees (may overfit)

class_weight (None)

Equal weighting for functional and non-functional classes.
  • Why None? The training set is balanced, eliminating need for class weighting
  • Alternative: class_weight='balanced' automatically adjusts weights inversely proportional to class frequencies

random_state (123)

Seed for random number generation.
  • Purpose: Ensures reproducible results across runs
  • Impact: Controls bootstrap sampling and feature selection randomness
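A quick illustration of why the fixed seed matters: two classifiers built with the same random_state draw identical bootstrap samples and feature subsets, so they produce identical predictions (toy data for the demo):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

a = RandomForestClassifier(n_estimators=20, random_state=123).fit(X, y)
b = RandomForestClassifier(n_estimators=20, random_state=123).fit(X, y)

# Same seed -> identical bootstrap samples, splits, and predictions
assert (a.predict(X) == b.predict(X)).all()
```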

Model Selection Protocol

TRIFID uses a Nested Cross-Validation strategy to select the optimal model architecture and hyperparameters.

Architecture Overview

# From trifid/models/select.py:209-242
class ModelSelection(Splitter):
    def __init__(
        self,
        df: list,
        features_col: list,
        target_col: str,
        random_state: int = 123,
        n_outer_splits: int = 5,   # Outer CV folds
        n_inner_splits: int = 10,  # Inner CV folds
        n_jobs: int = 20,          # Parallel processors
    )
Nested CV provides unbiased model performance estimates by:
  1. Outer Loop (5 folds): Evaluates final model performance
  2. Inner Loop (10 folds): Tunes hyperparameters
This prevents information leakage between hyperparameter tuning and performance estimation, which would occur with simple CV. Key insight: the inner loop finds the best hyperparameters, while the outer loop tests how well those hyperparameters generalize.

Grid Search Configuration

TRIFID explores multiple model architectures during selection:
# From trifid/models/select.py:429-439
"Random Forest": {
    "model": RandomForestClassifier(
        n_estimators=400, 
        random_state=self.random_state, 
        n_jobs=-1
    ),
    "grid1": [{
        "min_samples_leaf": list(range(5, 15)),  # Tests 5-14
        # Additional parameters can be uncommented:
        # "max_features": list(range(5, 10)),
        # "class_weight": [None, 'balanced']
    }]
}

Hyperparameter Search Space

The grid search systematically tests:
  • min_samples_leaf: 10 values (5 through 14)
  • This creates 10 unique model configurations
  • Each tested with 10-fold inner CV = 100 model fits per outer fold
  • Total: 5 outer × 100 inner = 500 model evaluations
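The fit count above can be verified with sklearn's ParameterGrid (the numbers are the point here; the real grid is defined in select.py):

```python
from sklearn.model_selection import ParameterGrid

grid = {"min_samples_leaf": list(range(5, 15))}  # 10 values: 5..14
n_configs = len(ParameterGrid(grid))

n_inner, n_outer = 10, 5
fits_per_outer_fold = n_configs * n_inner   # 100
total_fits = fits_per_outer_fold * n_outer  # 500
print(n_configs, fits_per_outer_fold, total_fits)
```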

Alternative Models Considered

The model selection framework supports multiple algorithms (though Random Forest is default):
# From trifid/models/select.py:403-456
# Available but commented out:
# - Decision Tree
# - AdaBoost
# - Extremely Randomized Trees
# - Gradient Boosting Machine
# - K-Nearest Neighbors
# - Logistic Regression
# - Support Vector Machine
# - XGBoost
You can enable alternative models by uncommenting them in trifid/models/select.py:396-467. Each requires its own hyperparameter grid.

Model Evaluation Metrics

TRIFID uses comprehensive metrics to assess model performance:
# From trifid/models/select.py:367-378
scores = {
    "Accuracy": accuracy_score(target, predictions),
    "AUC": roc_auc_score(target, probs),
    "Average Precision Score": average_precision_score(target, probs),
    "Balanced Accuracy": balanced_accuracy_score(target, predictions),
    "F1 Score": f1_score(target, predictions),
    "Log Loss": -1 * (log_loss(target, probs)),
    "MCC": matthews_corrcoef(target, predictions),
    "Precision": precision_score(target, predictions),
    "Recall": recall_score(target, predictions),
}
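A few of these scikit-learn metrics can be reproduced on toy labels and probabilities. The values below are illustrative only, not TRIFID outputs:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, roc_auc_score, f1_score, log_loss, matthews_corrcoef,
)

target = np.array([1, 0, 1, 1, 0, 0, 1, 0])
probs = np.array([0.9, 0.2, 0.8, 0.4, 0.4, 0.1, 0.7, 0.3])
predictions = (probs >= 0.5).astype(int)  # one misclassified positive

scores = {
    "Accuracy": accuracy_score(target, predictions),
    "AUC": roc_auc_score(target, probs),
    "F1 Score": f1_score(target, predictions),
    "Log Loss": -1 * log_loss(target, probs),  # negated so higher is better
    "MCC": matthews_corrcoef(target, predictions),
}
print(scores)
```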

Primary Selection Metric: MCC

Matthews Correlation Coefficient (MCC) is used as the primary metric for model selection:
# From trifid/models/select.py:243
def get_best_model(self, outdir: str = None, selection_metric="MCC") -> str:
MCC is superior for binary classification because it:
  • Balanced: Accounts for all confusion matrix categories (TP, TN, FP, FN)
  • Robust: Works well even with class imbalance
  • Range: -1 (perfect disagreement) to +1 (perfect prediction), 0 = random
  • Informative: Captures both precision and recall simultaneously
Formula:
MCC = (TP×TN - FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
For TRIFID, high MCC means accurately identifying both functional and non-functional isoforms.
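A worked example of the formula, assuming a hypothetical confusion matrix (TP=90, TN=85, FP=15, FN=10):

```python
import math

tp, tn, fp, fn = 90, 85, 15, 10  # hypothetical counts

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
print(round(mcc, 3))  # ≈ 0.751
```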

Training Pipeline

TRIFID provides three training modes:

1. Model Selection Mode

python -m trifid.models.train \
    --model_selection \
    --features config/features.yaml
Performs full nested CV, selects best hyperparameters, trains final model. Code reference: trifid/models/train.py:79-83

2. Custom Model Mode

python -m trifid.models.train \
    --custom \
    --features config/features.yaml
Trains a model with predefined hyperparameters (skips grid search). Code reference: trifid/models/train.py:85-96

3. Pretrained Model Mode

python -m trifid.models.train \
    --pretrained \
    --features config/features.yaml
Loads an existing model and retrains on new data. Code reference: trifid/models/train.py:98-106
All training modes require a labeled training set at data/model/training_set_initial.g27.tsv.gz with functional/non-functional labels.

Model Persistence

Trained models are serialized using Python’s pickle format:
# From trifid/models/select.py:326-328
def save_model(self, outdir: str):
    with open(os.path.join(outdir, "selected_model.pkl"), "wb") as model_filepath:
        pickle.dump(self.model, model_filepath)
Saved artifacts:
  • selected_model.pkl: Trained Random Forest classifier
  • model_selection_<timestamp>.tsv.gz: Grid search results and metrics
  • model_selection_<timestamp>.log: Detailed training logs
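Loading a saved model is the mirror of save_model. The sketch below does the pickle round-trip in memory so it is self-contained; in practice you would open selected_model.pkl from disk instead:

```python
import io
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small stand-in model
X, y = make_classification(n_samples=50, n_features=10, random_state=123)
model = RandomForestClassifier(n_estimators=10, random_state=123).fit(X, y)

# Round-trip through pickle, as save_model does (in-memory for the demo)
buf = io.BytesIO()
pickle.dump(model, buf)
buf.seek(0)
restored = pickle.load(buf)

# The restored classifier behaves identically to the original
assert (restored.predict(X) == model.predict(X)).all()
```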

Classifier Class

The Classifier class (see trifid/models/select.py:94) wraps the Random Forest and provides:
class Classifier(Splitter):
    # Key methods:
    def fit()                    # Train the model
    def evaluate()               # Calculate performance metrics
    def classification_report()  # Detailed classification report
    def confusion_matrix()       # TP, TN, FP, FN breakdown
    def cross_validate()         # K-fold cross-validation
    def make_prediction()        # Generate predictions
    def save_model()             # Serialize model to disk

Pipeline Integration

The Classifier can include preprocessing steps:
self.pipeline = Pipeline([
    ('preprocessing', self.preprocessing),  # Optional StandardScaler, etc.
    ('model', self.model)                   # Random Forest
])
Currently, preprocessing is None: tree-based models are insensitive to feature scaling, so no scaler is needed.
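As a sketch of this design, scikit-learn's Pipeline accepts the string 'passthrough' for a skipped step; swapping in a scaler would only matter for scale-sensitive models such as logistic regression or SVM (toy data below):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=10, random_state=123)

# Tree-based models are scale-invariant, so the preprocessing slot is a no-op;
# replace "passthrough" with e.g. StandardScaler() for a linear model
pipe = Pipeline([
    ("preprocessing", "passthrough"),
    ("model", RandomForestClassifier(n_estimators=20, random_state=123)),
])
pipe.fit(X, y)
```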

Performance Considerations

Computational Requirements

  • Training time: ~10-30 minutes on 20 cores for full nested CV
  • Memory: ~4-8 GB RAM for typical datasets
  • Parallelization: Uses n_jobs=-1 to leverage all available CPU cores

Optimization Tips

  1. Reduce n_estimators: 200-300 trees often sufficient for faster training
  2. Limit grid search: Focus on 2-3 key hyperparameters
  3. Use warm_start: Resume training from previous model state
  4. Sample data: Train on subset for rapid prototyping
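The warm_start tip can be sketched as follows: grow the forest incrementally instead of retraining from scratch (toy data; not TRIFID's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=123)

clf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=123)
clf.fit(X, y)             # fits the first 100 trees

clf.n_estimators = 200    # request 100 more trees
clf.fit(X, y)             # fits only the new trees, keeping the old ones

print(len(clf.estimators_))  # 200
```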
For production use, the pretrained TRIFID model (trifid.v_1_0_4.pkl) is optimized and ready for immediate predictions without retraining.

Next Steps

Predictive Features

Explore the 45+ features that power TRIFID’s predictions

Interpretability

Understand how SHAP explains model decisions

Training Custom Models

Learn how to train TRIFID on your own data

API Reference

Complete API documentation for model classes