Overview
The select module provides classes for model selection, training, and evaluation. It includes nested cross-validation for hyperparameter tuning, multiple classifier support, and comprehensive evaluation metrics.
Classes
Splitter
Base class providing train/test splitting functionality.
Constructor
Parameters
- Training dataset as pandas DataFrame
- List of feature column names to use as independent variables
- Name of target column to use as dependent variable
- Random seed for reproducibility
- Proportion of dataset to use for testing (0.0 to 1.0)
- If True, ensures transcripts from the same gene are in the same split
Attributes
- Training feature matrix
- Test feature matrix
- Training target values
- Test target values
- Complete training set with all columns
- Complete test set with all columns
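The split behind these attributes can be sketched with scikit-learn's train_test_split; the column names and test proportion below are illustrative assumptions, not the class's actual internals.

```python
# Illustrative sketch of the split a Splitter performs, using scikit-learn's
# train_test_split directly; column names and test_size are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feat_a": range(20),
    "feat_b": [x * 0.5 for x in range(20)],
    "label": [0, 1] * 10,
})
features = ["feat_a", "feat_b"]
target = "label"

# Stratified split with a fixed seed for reproducibility
train_df, test_df = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df[target]
)
X_train, X_test = train_df[features], test_df[features]
y_train, y_test = train_df[target], test_df[target]
print(len(X_train), len(X_test))  # 15 5
```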
Classifier
Model training and evaluation wrapper that extends Splitter.
Constructor
Parameters
- Scikit-learn model instance (e.g., RandomForestClassifier, GradientBoostingClassifier)
- Training dataset as pandas DataFrame
- List of feature column names to use as independent variables
- Name of target column to use as dependent variable
- Random seed for reproducibility
- Proportion of dataset to use for testing (0.0 to 1.0)
- Optional preprocessing step (e.g., StandardScaler) to add to pipeline
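A minimal sketch of the kind of pipeline the constructor builds when a preprocessing step is supplied, written directly with scikit-learn (the step names are assumptions, not the module's internals).

```python
# Sketch: chaining an optional preprocessing step with the model,
# as a Classifier-style wrapper might do internally.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),                      # optional preprocessing
    ("model", RandomForestClassifier(random_state=42)),  # supplied estimator
])
pipe.fit(X, y)
print(pipe.score(X, y) > 0.8)  # True
```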
Properties
evaluate
Returns comprehensive evaluation metrics on the test set:
- Accuracy
- AUC (Area Under ROC Curve)
- Average Precision Score
- Balanced Accuracy
- F1 Score
- Log Loss (negated)
- MCC (Matthews Correlation Coefficient)
- Precision
- Recall
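The metrics above can all be computed with scikit-learn; this is a sketch of what evaluate reports, not the module's exact implementation (the toy labels are illustrative).

```python
# Computing the evaluate metrics with scikit-learn on a toy example.
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, f1_score, log_loss,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]  # predicted P(class 1)

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "AUC": roc_auc_score(y_true, y_prob),
    "Average Precision": average_precision_score(y_true, y_prob),
    "Balanced Accuracy": balanced_accuracy_score(y_true, y_pred),
    "F1 Score": f1_score(y_true, y_pred),
    "Log Loss": -log_loss(y_true, y_prob),  # negated so higher is better
    "MCC": matthews_corrcoef(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
}
print(round(metrics["Accuracy"], 3))  # → 0.667
```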
classification_report
Returns scikit-learn classification report.
confusion_matrix
Returns confusion matrix as DataFrame.
cross_validate
Performs stratified k-fold cross-validation.
Methods
make_prediction()
Generate predictions for new samples.
Parameters
- Feature matrix for new samples
- If True, returns probabilities; if False, returns class labels
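The labels-vs-probabilities toggle presumably maps onto scikit-learn's predict and predict_proba; a sketch with a fitted estimator (the data here is synthetic and illustrative):

```python
# Class labels vs. class probabilities for new samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=80, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

labels = clf.predict(X[:3])       # hard class labels
probs = clf.predict_proba(X[:3])  # one probability per class; rows sum to 1
print(labels.shape, probs.shape)  # (3,) (3, 2)
```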
save_model()
Save the trained model to disk.
Parameters
- Directory path to save model
- Filename for saved model
ModelSelection
Nested cross-validation for automated hyperparameter tuning and model selection.
Constructor
Parameters
- Training dataset as pandas DataFrame
- List of feature column names to use as independent variables
- Name of target column to use as dependent variable
- Random seed for reproducibility
- Number of folds in outer cross-validation loop
- Number of folds in inner cross-validation loop (for GridSearchCV)
- Number of parallel jobs for GridSearchCV
- Whether to save model selection results
- Path to save results (if save=True)
Methods
get_best_model()
Performs nested CV and returns the best model.
Parameters
- Directory to save model and results. If None, doesn't save.
- Metric to optimize during model selection. Options:
- “MCC” (Matthews Correlation Coefficient)
- “AUC”
- “F1 Score”
- “Balanced Accuracy”
- “Accuracy”
- “Precision”
- “Recall”
Nested CV procedure:
- For each model configuration (Random Forest, Decision Tree, etc.):
  - Outer loop: split data into train/validation folds
  - Inner loop: GridSearchCV for hyperparameter tuning
  - Evaluate the best hyperparameters on the validation fold
- Select the model with the best average validation performance
- Retrain it on the full training set
save_model()
Save the selected model.
save_results()
Save model selection results to compressed TSV.
Supported Models
The ModelSelection class includes hyperparameter grids for:
Random Forest (Default)
min_samples_leaf: [5, 6, 7, …, 14]
Decision Tree
max_depth: [1, 2, 3, …, 9, None]
criterion: ["gini", "entropy"]
Additional Models (Commented)
The module includes commented configurations for:
- AdaBoost
- Extremely Randomized Trees
- Gradient Boosting Machine
- K-Nearest Neighbors
- Logistic Regression
- Support Vector Machine
- XGBoost
Evaluation Metrics
All metrics are computed on the test/validation set:
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- AUC: area under the ROC curve
- Average Precision: area under the precision-recall curve
- Balanced Accuracy: average of recall for each class
- F1 Score: harmonic mean of precision and recall
- Log Loss: negative log-likelihood, negated so that higher is better
- MCC: Matthews Correlation Coefficient (-1 to 1)
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN), also called Sensitivity
Complete Example
Gene-Based Splitting
For preventing data leakage when transcripts from the same gene are correlated:
- All transcripts from a gene are in either the training or the test set
- No gene leakage between splits
- More realistic evaluation of generalization
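Group-aware splitting of this kind can be sketched with scikit-learn's GroupShuffleSplit, using the gene ID as the group key (column names below are assumptions):

```python
# Gene-aware split: every transcript of a gene lands on one side only.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "transcript": [f"tx{i}" for i in range(8)],
    "gene": ["g1", "g1", "g2", "g2", "g3", "g3", "g4", "g4"],
    "label": [0, 0, 1, 1, 0, 1, 1, 0],
})

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["gene"]))

train_genes = set(df.loc[train_idx, "gene"])
test_genes = set(df.loc[test_idx, "gene"])
print(train_genes & test_genes)  # set() -> no gene appears in both splits
```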
Best Practices
- Use MCC for imbalanced datasets: More robust than accuracy
- Nested CV for small datasets: Provides unbiased performance estimates
- Gene-based splitting: When transcripts are correlated within genes
- Save models and results: For reproducibility and future use
- Multiple metrics: Don’t rely on a single metric
Output Files
When using ModelSelection.get_best_model(outdir="models"):
- selected_model.pkl: Serialized best model
- model_selection_TIMESTAMP.tsv.gz: Detailed results for all models and folds
- model_selection_TIMESTAMP.log: Training log with nested CV progress