Overview
The predict module generates TRIFID scores for transcript isoforms using a trained model. It supports multiple genome assemblies and releases, and it automatically preprocesses features and fills missing values.
Command-Line Usage
Command-Line Arguments
Genome assembly version. Supported assemblies:
- Human: GRCh38, GRCh37
- Mouse: GRCm39, GRCm38
- Rat: Rnor_6.0
- Zebrafish: GRCz11
- Pig: Sscrofa11.1
- Chimpanzee: Pan_tro_3.0
- Chicken: GRCg6a
- Cow: ARS-UCD1.2
- Fly: BDGP6
- Worm: WBcel235
Path to configuration file containing genome and annotation settings
Path to features selected description YAML file
Path to pretrained model pickle file
Genome release version (e.g., g27, r110, e104)
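A script that wraps the predict module may want to reject an unsupported assembly before loading any data. The following is a minimal sketch of such a check. The assembly names are taken from the list above, while the species keys and the `species_for_assembly` helper are illustrative assumptions, not part of TRIFID.

```python
# Assembly names copied from the supported-assemblies list; the species keys
# and the helper name are illustrative assumptions, not TRIFID identifiers.
SUPPORTED_ASSEMBLIES = {
    "human": ["GRCh38", "GRCh37"],
    "mouse": ["GRCm39", "GRCm38"],
    "rat": ["Rnor_6.0"],
    "zebrafish": ["GRCz11"],
    "pig": ["Sscrofa11.1"],
    "chimpanzee": ["Pan_tro_3.0"],
    "chicken": ["GRCg6a"],
    "cow": ["ARS-UCD1.2"],
    "fly": ["BDGP6"],
    "worm": ["WBcel235"],
}


def species_for_assembly(assembly: str) -> str:
    """Return the species for a supported assembly, or raise ValueError."""
    for species, assemblies in SUPPORTED_ASSEMBLIES.items():
        if assembly in assemblies:
            return species
    raise ValueError(f"Unsupported assembly: {assembly!r}")


print(species_for_assembly("GRCm39"))  # mouse
```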
Core Function
make_predictions()
Generates TRIFID predictions for transcript isoforms.
Parameters
List of feature column names to use for prediction
List of identifier column names (e.g., gene_id, transcript_id)
Configuration dictionary containing:
- annotation.genome_version: Genome assembly version
- annotation.db: Database directory name
Trained scikit-learn model object loaded from pickle file
Genome assembly version identifier
Genome release identifier
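To illustrate the shape of the configuration dictionary described above, here is a small sketch. Only the key names `annotation.genome_version` and `annotation.db` come from this page. The values, and the `get_setting` helper that resolves dotted keys, are hypothetical.

```python
from typing import Any

# Hypothetical configuration; only the two key names come from the
# parameter description, the values are placeholders.
config = {
    "annotation": {
        "genome_version": "GRCh38",    # one of the supported assemblies
        "db": "example_db_directory",  # placeholder directory name
    }
}


def get_setting(cfg: dict, dotted_key: str) -> Any:
    """Resolve a dotted key such as 'annotation.db' in a nested dict."""
    value: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            raise KeyError(f"Missing configuration key: {dotted_key}")
        value = value[part]
    return value


print(get_setting(config, "annotation.genome_version"))  # GRCh38
```

Failing early on a missing key makes a malformed configuration file easier to diagnose than an error raised halfway through prediction.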
Returns
Writes predictions to trifid_predictions.tsv.gz with the following columns:
Ensembl gene identifier
Gene symbol/name
Ensembl transcript identifier
Ensembl protein identifier
Annotation flags
CCDS identifier (if available)
APPRIS annotation
Annotation type
Transcript length
Raw TRIFID score (probability from 0-1)
Normalized TRIFID score within gene
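The difference between the raw and the within-gene normalized score can be shown with a short sketch. This page does not state the normalization formula, so the example assumes one plausible choice: each transcript's raw score divided by the highest raw score among transcripts of the same gene. The `score` and `norm_score` keys are hypothetical names used only for illustration.

```python
from collections import defaultdict


def normalize_within_gene(rows):
    """Add 'norm_score' = score / highest score among the gene's transcripts.

    Assumed formula for illustration; the actual TRIFID normalization
    may differ.
    """
    max_by_gene = defaultdict(float)
    for row in rows:
        gene = row["gene_id"]
        max_by_gene[gene] = max(max_by_gene[gene], row["score"])

    normalized = []
    for row in rows:
        top = max_by_gene[row["gene_id"]]
        norm = row["score"] / top if top > 0 else 0.0
        normalized.append({**row, "norm_score": norm})
    return normalized


# Made-up identifiers and scores, not real predictions.
rows = [
    {"gene_id": "G1", "transcript_id": "T1", "score": 0.8},
    {"gene_id": "G1", "transcript_id": "T2", "score": 0.4},
    {"gene_id": "G2", "transcript_id": "T3", "score": 0.5},
]
for row in normalize_within_gene(rows):
    print(row["transcript_id"], row["norm_score"])
```

Under this assumption the top-scoring isoform of every gene receives 1.0, which makes scores comparable across genes whose raw probabilities sit in different ranges.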
Assembly-Specific Preprocessing
The function automatically handles missing values and feature differences across assemblies:
Human Assemblies
GRCh38 (Gencode):
- No special preprocessing required
- All features available
- Fills missing values with -1
- Fills RefSeq-unavailable features with -1
- Handles CCDS binary encoding
- Fills corsair features with 0
- Fills RNA2sj features with -1 (GRCh37 only)
Model Organism Assemblies
Mouse (GRCm39, GRCm38):
- Fills RNA2sj features with -1
- Removes CCDS feature
- Loads ACDS (Alternative CCDS) data
- Fills RNA2sj features with -1
- Fills all missing values with -1
- Fills all missing values with -1 or 0
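The fill rules above follow a common pattern: a missing feature receives a sentinel value, and the sentinel depends on the feature family. The sketch below shows that pattern only. It is not the module's actual preprocessing code, and the feature names and the `fill_missing` helper are hypothetical.

```python
def fill_missing(rows, features, prefix_defaults, fallback=-1):
    """Return copies of rows with missing feature values filled.

    A value is treated as missing when it is None or the key is absent.
    Features whose name starts with a key of prefix_defaults get that
    value; every other missing feature gets the fallback.
    """
    filled = []
    for row in rows:
        new_row = dict(row)
        for feature in features:
            if new_row.get(feature) is None:
                default = fallback
                for prefix, value in prefix_defaults.items():
                    if feature.startswith(prefix):
                        default = value
                        break
                new_row[feature] = default
        filled.append(new_row)
    return filled


# Hypothetical feature names chosen to mirror the rules above.
features = ["corsair_alt", "RNA2sj", "length"]
rows = [{"transcript_id": "T1", "corsair_alt": None, "length": 120}]
print(fill_missing(rows, features, prefix_defaults={"corsair": 0}))
```

Keeping the rules in a dictionary makes it straightforward to define one rule set per assembly instead of branching on the assembly name throughout the code.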
Example Workflows
Basic Prediction
Predict for Multiple Assemblies
Custom Model Path
Input Requirements
TRIFID Database
The input database (trifid_db.tsv.gz) must contain:
- Identifier columns: gene_id, gene_name, transcript_id, translation_id, ccdsid
- Feature columns: All features specified in features YAML
- Annotation columns: flags, appris, ann_type
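Before running predictions it can help to confirm that the database contains every required column. The following sketch reads only the header of a gzipped TSV file and reports what is missing. The identifier and annotation column names are taken from the list above. The function names and the example feature names are assumptions made for illustration.

```python
import csv
import gzip

# Column names as listed in the input requirements.
ID_COLUMNS = ["gene_id", "gene_name", "transcript_id", "translation_id", "ccdsid"]
ANNOTATION_COLUMNS = ["flags", "appris", "ann_type"]


def read_header(path):
    """Return the column names from the first line of a gzipped TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))


def missing_columns(header, feature_names):
    """List required columns (identifiers, annotations, features) absent from header."""
    required = ID_COLUMNS + ANNOTATION_COLUMNS + list(feature_names)
    present = set(header)
    return [name for name in required if name not in present]


# Example with an in-memory header and hypothetical feature names.
header = ID_COLUMNS + ANNOTATION_COLUMNS + ["length"]
print(missing_columns(header, ["length", "RNA2sj"]))  # ['RNA2sj']
```

In practice the header would come from `read_header("trifid_db.tsv.gz")`, and the feature names from the features YAML file.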
Configuration File
The config YAML must specify:
Features File
The features YAML must include:
Output Format
Predictions are saved to:
Performance Considerations
- The function processes the entire TRIFID database for the specified assembly
- Memory usage depends on genome size (human: ~200K transcripts)
- Prediction time: ~1-10 seconds for human genome with RandomForest model
- Output file size: ~5-15 MB compressed