Overview

The select module provides classes for model selection, training, and evaluation. It includes nested cross-validation for hyperparameter tuning, support for multiple classifiers, and comprehensive evaluation metrics.

Classes

Splitter

Base class providing train/test splitting functionality.

Constructor

from trifid.models.select import Splitter

splitter = Splitter(
    df=training_dataframe,
    features_col=feature_columns,
    target_col="label",
    random_state=123,
    test_size=0.25,
    split_by_gene=False
)

Parameters

df
pandas.DataFrame
required
Training dataset as pandas DataFrame
features_col
list
required
List of feature column names to use as independent variables
target_col
string
required
Name of target column to use as dependent variable
random_state
integer
required
Random seed for reproducibility
test_size
float
default:"0.25"
Proportion of dataset to use for testing (0.0 to 1.0)
split_by_gene
boolean
default:"False"
If True, ensures transcripts from the same gene are in the same split

Attributes

train_features
pandas.DataFrame
Training feature matrix
test_features
pandas.DataFrame
Test feature matrix
train_target
pandas.Series
Training target values
test_target
pandas.Series
Test target values
training_set
pandas.DataFrame
Complete training set with all columns
test_set
pandas.DataFrame
Complete test set with all columns
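The split behind these attributes can be reproduced with plain scikit-learn. The sketch below uses `train_test_split` on a toy DataFrame; stratification on the target is an assumption here, not a documented detail of Splitter's internals:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset standing in for a real training DataFrame
df = pd.DataFrame({
    "feat_a": range(20),
    "feat_b": [x * 0.5 for x in range(20)],
    "label": [0, 1] * 10,
})
features_col = ["feat_a", "feat_b"]

# Rough equivalent of Splitter(df, features_col, "label",
#                              random_state=123, test_size=0.25)
train_features, test_features, train_target, test_target = train_test_split(
    df[features_col],
    df["label"],
    test_size=0.25,
    random_state=123,
    stratify=df["label"],  # assumption: the split is stratified on the target
)

print(len(train_features), len(test_features))  # 15 5
```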

Classifier

Model training and evaluation wrapper that extends Splitter.

Constructor

from sklearn.ensemble import RandomForestClassifier
from trifid.models.select import Classifier

model = RandomForestClassifier(
    n_estimators=400,
    max_features=7,
    min_samples_leaf=7,
    random_state=123
)

classifier = Classifier(
    model=model,
    df=training_dataframe,
    features_col=feature_columns,
    target_col="label",
    random_state=123,
    test_size=0.25,
    preprocessing=None
)

Parameters

model
object
required
Scikit-learn model instance (e.g., RandomForestClassifier, GradientBoostingClassifier)
df
pandas.DataFrame
required
Training dataset as pandas DataFrame
features_col
list
required
List of feature column names to use as independent variables
target_col
string
required
Name of target column to use as dependent variable
random_state
integer
default:"123"
Random seed for reproducibility
test_size
float
default:"0.25"
Proportion of dataset to use for testing (0.0 to 1.0)
preprocessing
object
default:"None"
Optional preprocessing step (e.g., StandardScaler) to add to pipeline

Properties

evaluate
Returns comprehensive evaluation metrics on the test set.
scores = classifier.evaluate
print(scores)
Returns: DataFrame with metrics:
  • Accuracy
  • AUC (Area Under ROC Curve)
  • Average Precision Score
  • Balanced Accuracy
  • F1 Score
  • Log Loss (negated)
  • MCC (Matthews Correlation Coefficient)
  • Precision
  • Recall
classification_report
Returns scikit-learn classification report.
report = classifier.classification_report
print(report)
confusion_matrix
Returns confusion matrix as DataFrame.
cm = classifier.confusion_matrix
print(cm)
Returns: DataFrame with columns: TN, FP, FN, TP
cross_validate
Performs stratified k-fold cross-validation.
cv_results = classifier.cross_validate(n_splits=5)
print(cv_results)
Returns: DataFrame with mean and standard deviation for each metric

Methods

make_prediction()
Generate predictions for new samples.
# Class predictions
predictions = classifier.make_prediction(
    samples=new_features,
    probability=False
)

# Probability predictions
probs = classifier.make_prediction(
    samples=new_features,
    probability=True
)
samples
pandas.DataFrame or numpy.ndarray
required
Feature matrix for new samples
probability
boolean
default:"False"
If True, returns probabilities; if False, returns class labels
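The two prediction modes map onto scikit-learn's `predict` and `predict_proba`; a minimal sketch on a fitted model (not the make_prediction implementation itself):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=80, n_features=4, random_state=123)
model = RandomForestClassifier(n_estimators=25, random_state=123).fit(X, y)

labels = model.predict(X[:3])        # probability=False -> class labels
probs = model.predict_proba(X[:3])   # probability=True  -> per-class probabilities

print(labels.shape, probs.shape)  # (3,) (3, 2)
```

Each row of `probs` sums to 1, one column per class.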
save_model()
Save the trained model to disk.
classifier.save_model(
    outdir="models",
    name="custom_model.pkl"
)
outdir
string
required
Directory path to save model
name
string
default:"custom_model.pkl"
Filename for saved model
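A `.pkl` file of this kind can be round-tripped with the standard library's `pickle`; whether trifid uses `pickle` or `joblib` internally is an assumption, so treat this as a generic sketch:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50, n_features=4, random_state=123)
model = RandomForestClassifier(n_estimators=10, random_state=123).fit(X, y)

# Serialize the fitted model to <outdir>/custom_model.pkl
outdir = tempfile.mkdtemp()
path = os.path.join(outdir, "custom_model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Reload and confirm the round trip preserves predictions
with open(path, "rb") as fh:
    reloaded = pickle.load(fh)
assert (reloaded.predict(X) == model.predict(X)).all()
print("saved to", path)
```

Only unpickle model files from sources you trust; `pickle.load` can execute arbitrary code.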

ModelSelection

Nested cross-validation for automated hyperparameter tuning and model selection.

Constructor

from trifid.models.select import ModelSelection

model_selection = ModelSelection(
    df=training_dataframe,
    features_col=feature_columns,
    target_col="label",
    random_state=123,
    n_outer_splits=5,
    n_inner_splits=10,
    n_jobs=20,
    save=False,
    filepath=None
)

Parameters

df
pandas.DataFrame
required
Training dataset as pandas DataFrame
features_col
list
required
List of feature column names to use as independent variables
target_col
string
required
Name of target column to use as dependent variable
random_state
integer
default:"123"
Random seed for reproducibility
n_outer_splits
integer
default:"5"
Number of folds in outer cross-validation loop
n_inner_splits
integer
default:"10"
Number of folds in inner cross-validation loop (for GridSearchCV)
n_jobs
integer
default:"20"
Number of parallel jobs for GridSearchCV
save
boolean
default:"False"
Whether to save model selection results
filepath
string
default:"None"
Path to save results (if save=True)

Methods

get_best_model()
Performs nested CV and returns the best model.
best_model = model_selection.get_best_model(
    outdir="models",
    selection_metric="MCC"
)
outdir
string
default:"None"
Directory to save the model and results. If None, nothing is saved to disk.
selection_metric
string
default:"MCC"
Metric to optimize during model selection. Options:
  • “MCC” (Matthews Correlation Coefficient)
  • “AUC”
  • “F1 Score”
  • “Balanced Accuracy”
  • “Accuracy”
  • “Precision”
  • “Recall”
Returns: Trained scikit-learn model object
Process:
  1. For each model configuration (Random Forest, Decision Tree, etc.)
  2. Outer loop: Split data into train/validation
  3. Inner loop: GridSearchCV for hyperparameter tuning
  4. Evaluate best hyperparameters on validation set
  5. Select model with best average validation performance
  6. Retrain on full training set
save_model()
Save the selected model.
model_selection.save_model(outdir="models")
save_results()
Save model selection results to compressed TSV.
model_selection.save_results(outdir="models")

Supported Models

The ModelSelection class includes hyperparameter grids for:

Random Forest (Default)

RandomForestClassifier(
    n_estimators=400,
    random_state=123,
    n_jobs=-1
)
Tuned hyperparameters:
  • min_samples_leaf: [5, 6, 7, …, 14]

Decision Tree

DecisionTreeClassifier(random_state=123)
Tuned hyperparameters:
  • max_depth: [1, 2, 3, …, 9, None]
  • criterion: [“gini”, “entropy”]

Additional Models (Commented)

The module includes commented configurations for:
  • AdaBoost
  • Extremely Randomized Trees
  • Gradient Boosting Machine
  • K-Nearest Neighbors
  • Logistic Regression
  • Support Vector Machine
  • XGBoost
Uncomment and configure as needed.

Evaluation Metrics

All metrics are computed on the test/validation set:
Accuracy
float
(TP + TN) / (TP + TN + FP + FN)
AUC
float
Area under the ROC curve
Average Precision Score
float
Area under the precision-recall curve
Balanced Accuracy
float
Average of recall for each class
F1 Score
float
Harmonic mean of precision and recall
Log Loss
float
Negative log-likelihood, sign-flipped so that higher values are better, consistent with the other metrics
MCC
float
Matthews Correlation Coefficient (-1 to 1)
Precision
float
TP / (TP + FP)
Recall
float
TP / (TP + FN), also called Sensitivity
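These metrics come from `sklearn.metrics`; a small worked example on a toy confusion (TP=3, FN=1, TN=3, FP=1) shows how the formulas above play out:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]  # TP=3, FN=1, TN=3, FP=1

print(accuracy_score(y_true, y_pred))            # (3+3)/8 = 0.75
print(precision_score(y_true, y_pred))           # 3/(3+1) = 0.75
print(recall_score(y_true, y_pred))              # 3/(3+1) = 0.75
print(f1_score(y_true, y_pred))                  # 0.75
print(balanced_accuracy_score(y_true, y_pred))   # (0.75 + 0.75)/2 = 0.75
print(round(matthews_corrcoef(y_true, y_pred), 2))  # (3*3 - 1*1)/16 = 0.5
```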

Complete Example

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from trifid.models.select import Classifier, ModelSelection

# Load training data
df_training = pd.read_csv("data/training_set.tsv", sep="\t")
feature_cols = [col for col in df_training.columns 
                if col not in ["label", "gene_id", "transcript_id"]]

# Option 1: Train a custom model
model = RandomForestClassifier(
    n_estimators=400,
    max_features=7,
    min_samples_leaf=7,
    random_state=123
)

classifier = Classifier(
    model=model,
    df=df_training,
    features_col=feature_cols,
    target_col="label",
    random_state=123,
    test_size=0.25
)

# Evaluate
print("Test Set Evaluation:")
print(classifier.evaluate)

print("\nCross-Validation:")
print(classifier.cross_validate(n_splits=5))

print("\nConfusion Matrix:")
print(classifier.confusion_matrix)

# Save model
classifier.save_model(outdir="models", name="custom_rf.pkl")

# Option 2: Automated model selection
model_selection = ModelSelection(
    df=df_training,
    features_col=feature_cols,
    target_col="label",
    random_state=123,
    n_outer_splits=5,
    n_inner_splits=10,
    n_jobs=20
)

best_model = model_selection.get_best_model(
    outdir="models",
    selection_metric="MCC"
)

print("\nBest Model Selected:")
print(best_model)

Gene-Based Splitting

To prevent data leakage when transcripts from the same gene are correlated:
splitter = Splitter(
    df=df_training,
    features_col=feature_cols,
    target_col="label",
    random_state=123,
    test_size=0.25,
    split_by_gene=True  # Ensures no gene appears in both train and test
)
This ensures that:
  • All transcripts from a gene are in either training or test set
  • No gene leakage between splits
  • More realistic evaluation of generalization
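The same guarantee can be reproduced with scikit-learn's `GroupShuffleSplit`, using the gene ID as the group key. A sketch of the idea, not the Splitter implementation:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "gene_id": ["g1", "g1", "g2", "g2", "g3", "g3", "g4", "g4"],
    "feat": range(8),
    "label": [0, 1] * 4,
})

# One grouped split: every transcript of a gene lands on the same side
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=123)
train_idx, test_idx = next(gss.split(df, groups=df["gene_id"]))

train_genes = set(df.loc[train_idx, "gene_id"])
test_genes = set(df.loc[test_idx, "gene_id"])
print(train_genes & test_genes)  # set() -> no gene appears in both splits
```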

Best Practices

  1. Use MCC for imbalanced datasets: More robust than accuracy
  2. Nested CV for small datasets: Provides unbiased performance estimates
  3. Gene-based splitting: When transcripts are correlated within genes
  4. Save models and results: For reproducibility and future use
  5. Multiple metrics: Don’t rely on a single metric

Output Files

When using ModelSelection.get_best_model(outdir="models"):
  • selected_model.pkl: Serialized best model
  • model_selection_TIMESTAMP.tsv.gz: Detailed results for all models and folds
  • model_selection_TIMESTAMP.log: Training log with nested CV progress

Related Modules

  • train: Model training workflows
  • predict: Generate predictions
  • interpret: Model interpretation and feature importance