Overview
The select module provides classes for model selection, training, and evaluation. It includes nested cross-validation for hyperparameter tuning, multiple classifier support, and comprehensive evaluation metrics.
Classes
Splitter
Base class providing train/test splitting functionality.
Constructor
Parameters
- Training dataset as pandas DataFrame
- List of feature column names to use as independent variables
- Name of target column to use as dependent variable
- Random seed for reproducibility
- Proportion of dataset to use for testing (0.0 to 1.0)
- If True, ensures transcripts from the same gene are in the same split
Attributes
- Training feature matrix
- Test feature matrix
- Training target values
- Test target values
- Complete training set with all columns
- Complete test set with all columns
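The split behind these attributes can be sketched with scikit-learn's train_test_split; the column names and test proportion below are illustrative assumptions, not the class's actual internals.

```python
# Illustrative sketch of the split a Splitter performs, using scikit-learn's
# train_test_split directly; column names and test_size are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feat_a": range(20),
    "feat_b": [x * 0.5 for x in range(20)],
    "label": [0, 1] * 10,
})
features = ["feat_a", "feat_b"]
target = "label"

# Stratified split with a fixed seed for reproducibility
train_df, test_df = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df[target]
)
X_train, X_test = train_df[features], test_df[features]
y_train, y_test = train_df[target], test_df[target]
print(len(X_train), len(X_test))  # 15 5
```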
Classifier
Model training and evaluation wrapper that extends Splitter.
Constructor
Parameters
- Scikit-learn model instance (e.g., RandomForestClassifier, GradientBoostingClassifier)
- Training dataset as pandas DataFrame
- List of feature column names to use as independent variables
- Name of target column to use as dependent variable
- Random seed for reproducibility
- Proportion of dataset to use for testing (0.0 to 1.0)
- Optional preprocessing step (e.g., StandardScaler) to add to pipeline
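A minimal sketch of the kind of pipeline the constructor builds when a preprocessing step is supplied, written directly with scikit-learn (the step names are assumptions, not the module's internals).

```python
# Sketch: chaining an optional preprocessing step with the model,
# as a Classifier-style wrapper might do internally.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),                      # optional preprocessing
    ("model", RandomForestClassifier(random_state=42)),  # supplied estimator
])
pipe.fit(X, y)
print(pipe.score(X, y) > 0.8)  # True
```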
Properties
evaluate
Returns comprehensive evaluation metrics on the test set:
- Accuracy
- AUC (Area Under ROC Curve)
- Average Precision Score
- Balanced Accuracy
- F1 Score
- Log Loss (negated)
- MCC (Matthews Correlation Coefficient)
- Precision
- Recall
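The metrics above can all be computed with scikit-learn; this is a sketch of what evaluate reports, not the module's exact implementation (the toy labels are illustrative).

```python
# Computing the evaluate metrics with scikit-learn on a toy example.
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, f1_score, log_loss,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]  # predicted P(class 1)

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "AUC": roc_auc_score(y_true, y_prob),
    "Average Precision": average_precision_score(y_true, y_prob),
    "Balanced Accuracy": balanced_accuracy_score(y_true, y_pred),
    "F1 Score": f1_score(y_true, y_pred),
    "Log Loss": -log_loss(y_true, y_prob),  # negated so higher is better
    "MCC": matthews_corrcoef(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
}
print(round(metrics["Accuracy"], 3))  # → 0.667
```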
classification_report
Returns scikit-learn classification report.
confusion_matrix
Returns confusion matrix as DataFrame.
cross_validate
Performs stratified k-fold cross-validation.
Methods
make_prediction()
Generate predictions for new samples.
Parameters
- Feature matrix for new samples
- If True, returns probabilities; if False, returns class labels
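The labels-vs-probabilities toggle presumably maps onto scikit-learn's predict and predict_proba; a sketch with a fitted estimator (the data here is synthetic and illustrative):

```python
# Class labels vs. class probabilities for new samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=80, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

labels = clf.predict(X[:3])       # hard class labels
probs = clf.predict_proba(X[:3])  # one probability per class; rows sum to 1
print(labels.shape, probs.shape)  # (3,) (3, 2)
```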
save_model()
Save the trained model to disk.
Parameters
- Directory path to save model
- Filename for saved model
ModelSelection
Nested cross-validation for automated hyperparameter tuning and model selection.
Constructor
Parameters
- Training dataset as pandas DataFrame
- List of feature column names to use as independent variables
- Name of target column to use as dependent variable
- Random seed for reproducibility
- Number of folds in outer cross-validation loop
- Number of folds in inner cross-validation loop (for GridSearchCV)
- Number of parallel jobs for GridSearchCV
- Whether to save model selection results
- Path to save results (if save=True)
Methods
get_best_model()
Performs nested CV and returns the best model.
Parameters
- Directory to save model and results. If None, doesn't save.
- Metric to optimize during model selection. Options:
- “MCC” (Matthews Correlation Coefficient)
- “AUC”
- “F1 Score”
- “Balanced Accuracy”
- “Accuracy”
- “Precision”
- “Recall”
Nested CV procedure:
- For each model configuration (Random Forest, Decision Tree, etc.):
  - Outer loop: split data into train/validation folds
  - Inner loop: GridSearchCV for hyperparameter tuning
  - Evaluate the best hyperparameters on the validation fold
- Select the model with the best average validation performance
- Retrain it on the full training set
save_model()
Save the selected model.
save_results()
Save model selection results to compressed TSV.
Supported Models
The ModelSelection class includes hyperparameter grids for:
Random Forest (Default)
min_samples_leaf: [5, 6, 7, …, 14]
Decision Tree
max_depth: [1, 2, 3, …, 9, None]
criterion: ["gini", "entropy"]
Additional Models (Commented)
The module includes commented configurations for:
- AdaBoost
- Extremely Randomized Trees
- Gradient Boosting Machine
- K-Nearest Neighbors
- Logistic Regression
- Support Vector Machine
- XGBoost
Evaluation Metrics
All metrics are computed on the test/validation set:
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- AUC: area under the ROC curve
- Average Precision: area under the precision-recall curve
- Balanced Accuracy: average of recall for each class
- F1 Score: harmonic mean of precision and recall
- Log Loss: negative log-likelihood, negated so that higher is better
- MCC: Matthews Correlation Coefficient (-1 to 1)
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN), also called Sensitivity
Complete Example
Gene-Based Splitting
For preventing data leakage when transcripts from the same gene are correlated:
- All transcripts from a gene are in either the training or the test set
- No gene leakage between splits
- More realistic evaluation of generalization
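Group-aware splitting of this kind can be sketched with scikit-learn's GroupShuffleSplit, using the gene ID as the group key (column names below are assumptions):

```python
# Gene-aware split: every transcript of a gene lands on one side only.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "transcript": [f"tx{i}" for i in range(8)],
    "gene": ["g1", "g1", "g2", "g2", "g3", "g3", "g4", "g4"],
    "label": [0, 0, 1, 1, 0, 1, 1, 0],
})

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=df["gene"]))

train_genes = set(df.loc[train_idx, "gene"])
test_genes = set(df.loc[test_idx, "gene"])
print(train_genes & test_genes)  # set() -> no gene appears in both splits
```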
Best Practices
- Use MCC for imbalanced datasets: More robust than accuracy
- Nested CV for small datasets: Provides unbiased performance estimates
- Gene-based splitting: When transcripts are correlated within genes
- Save models and results: For reproducibility and future use
- Multiple metrics: Don’t rely on a single metric
Output Files
When using ModelSelection.get_best_model(outdir="models"):
- selected_model.pkl: Serialized best model
- model_selection_TIMESTAMP.tsv.gz: Detailed results for all models and folds
- model_selection_TIMESTAMP.log: Training log with nested CV progress