Once you have a trained model and prepared data, you can generate TRIFID scores for all transcripts in your genome. This guide covers the prediction workflow and score interpretation.
Overview
The prediction process:
Load a trained TRIFID model
Prepare your feature database
Handle missing values per genome assembly
Generate predictions and confidence scores
Export results for downstream analysis
Quick Start
Generate predictions for the human genome (GRCh38):
python -m trifid.models.predict \
--config config/config.yaml \
--features config/features.yaml \
--model models/trifid.v_1_0_4.pkl \
--assembly GRCh38 \
--release g27
This creates trifid_predictions.tsv.gz in your data directory.
Loading a Trained Model
TRIFID supports multiple ways to load models.
Using a Saved Model File
The prediction script loads serialized models:
# From trifid/models/predict.py:176
import pickle
model = pickle.load(open(args.model, "rb"))
Available Models
Default TRIFID models:
trifid.v_1_0_4.pkl: Latest version trained on human data
trifid.v_1_0_0.pkl: Original published model
Your custom models:
models/custom_model.pkl: From --custom training
models/selected_model.pkl: From --model_selection
TRIFID models are scikit-learn RandomForestClassifier objects saved with pickle.
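Because the models are plain pickle files, it can help to wrap loading in a small helper that closes the file handle and confirms the object can produce probabilities. The following is a minimal sketch; `load_model` is a hypothetical helper name and is not part of the TRIFID code base.

```python
import pickle


def load_model(path):
    """Load a pickled classifier and check that it can produce probabilities.

    Hypothetical helper for illustration; not part of TRIFID itself.
    """
    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    with open(path, "rb") as fh:
        model = pickle.load(fh)
    # TRIFID scores come from predict_proba, so reject objects without it.
    if not callable(getattr(model, "predict_proba", None)):
        raise TypeError(f"{path} does not contain a probabilistic classifier")
    return model
```

Usage would look like `model = load_model("models/trifid.v_1_0_4.pkl")`, after which the model can be passed to the prediction functions as usual.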
Preparing Features
The prediction script needs to know which features your model expects.
Loading Feature Configuration
# From trifid/models/predict.py:174-178
config = parse_yaml(args.config)
df_features = pd.DataFrame(parse_yaml(args.features))
# Extract feature names (exclude identifiers)
features = df_features[
    ~df_features["category"].str.contains("Identifier")
]["feature"].values
# Extract identifier columns
ids = df_features[
    df_features["category"].str.contains("Identifier")
]["feature"].values
Required vs Optional Features
Always required:
Transcript identifiers (transcript_id, gene_id, gene_name)
Structural features (length)
At least one scoring feature
Commonly used:
norm_spade: APPRIS domain integrity
norm_RNA2sj_cds: Splice junction support
pfam_score: Pfam domain conservation
length_delta_score: Length relative to reference
The features used for prediction must exactly match those used during training.
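One way to enforce this requirement before scoring is to compare the feature list against what the fitted model recorded. The sketch below assumes a scikit-learn estimator that exposes `feature_names_in_` (set by scikit-learn 1.0 and later when the model was fitted on a DataFrame) or `n_features_in_`; `check_feature_match` is a hypothetical helper, not a TRIFID function.

```python
def check_feature_match(model, features):
    """Raise ValueError if `features` differs from what the model was fitted on.

    Hypothetical helper; relies on attributes that scikit-learn sets at fit time.
    """
    features = [str(f) for f in features]
    # feature_names_in_ only exists if the model was fitted on a DataFrame.
    expected = getattr(model, "feature_names_in_", None)
    if expected is not None:
        expected = [str(f) for f in expected]
        if expected != features:
            missing = [f for f in expected if f not in features]
            extra = [f for f in features if f not in expected]
            raise ValueError(
                f"feature mismatch: missing={missing}, unexpected={extra} "
                "(if both are empty, the order differs)"
            )
        return
    # Fall back to comparing the feature count when names are unavailable.
    n_expected = getattr(model, "n_features_in_", None)
    if n_expected is not None and n_expected != len(features):
        raise ValueError(
            f"model expects {n_expected} features, got {len(features)}"
        )
```

The check is order-sensitive on purpose: a random forest receives columns by position, so the same names in a different order would silently produce wrong scores.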
Handling Missing Values
Different genome assemblies have different data availability, so TRIFID applies assembly-specific rules when filling in missing values.
Assembly-Specific Strategies
The prediction script implements specialized handling per assembly. The cases it distinguishes are human GRCh38 (GENCODE), human GRCh37 (GENCODE), mouse GRCm39, and other organisms. For human GRCh38 with a GENCODE release:
# From trifid/models/predict.py:43-45
if assembly == "GRCh38" and release.startswith("g"):
    pass  # No special handling needed - all features available
Supported Assemblies
TRIFID has built-in support for:
| Organism | Assembly | Annotation | Notes |
|----------|----------|------------|-------|
| Human | GRCh38 | GENCODE | Full feature support |
| Human | GRCh37 | GENCODE | Legacy support |
| Human | GRCh38 | RefSeq | Requires feature adjustments |
| Mouse | GRCm39 | GENCODE | No RNA2sj for CDS |
| Mouse | GRCm38 | GENCODE | Legacy support |
| Rat | Rnor_6.0 | Ensembl | Uses ACDS instead of CCDS |
| Zebrafish | GRCz11 | Ensembl | Limited features |
| Pig | Sscrofa11.1 | Ensembl | Limited features |
| Chimp | Pan_tro_3.0 | Ensembl | Limited features |
| Chicken | GRCg6a | Ensembl | Limited features |
| Cow | ARS-UCD1.2 | Ensembl | Limited features |
| Fly | BDGP6 | Ensembl | Limited features |
| Worm | WBcel235 | Ensembl | All features set to 0 if missing |
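To make the idea of assembly-dependent filling concrete, here is a minimal sketch of one possible strategy: adding any feature column the database lacks and replacing missing values with a constant. The choice of 0 mirrors the note for worm in the list of supported assemblies; the actual per-assembly rules in `trifid/models/predict.py` may differ, and `fill_unavailable_features` is a hypothetical name.

```python
import pandas as pd


def fill_unavailable_features(df, features, fill_value=0.0):
    """Return a copy of df in which every feature column exists and has no NaN.

    Hypothetical illustration; not TRIFID's actual imputation logic.
    """
    out = df.copy()
    for name in features:
        if name not in out.columns:
            # Feature was never computed for this assembly: add a constant column.
            out[name] = fill_value
        else:
            # Feature exists but has gaps: fill only the missing entries.
            out[name] = out[name].fillna(fill_value)
    return out


# Toy database with one gap and one feature that is absent altogether.
db = pd.DataFrame({"transcript_id": ["T1", "T2"], "length": [300.0, None]})
filled = fill_unavailable_features(db, ["length", "norm_RNA2sj_cds"])
print(filled)
```

Returning a copy keeps the original database untouched, which makes it easier to compare scores obtained under different fill strategies.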
Generating Predictions
The core prediction function generates scores for all transcripts.
Main Prediction Function
# From trifid/models/predict.py:32-40
def make_predictions(
    features: list,
    ids: list,
    config: dict,
    model: object,
    assembly: str,
    release: str,
):
    # Load TRIFID database
    data_dir = os.path.join(
        "data", "genomes",
        config["annotation"]["genome_version"],
        config["annotation"]["db"],
    )
    df = pd.read_csv(
        os.path.join(data_dir, "trifid_db.tsv.gz"),
        sep="\t", compression="gzip",
    )
Score Calculation
TRIFID generates both raw and normalized scores:
# From trifid/utils/utils.py:267-296
def generate_trifid_metrics(
    df: pd.DataFrame,
    features: pd.DataFrame,
    model: object,
    nmax_norm_median: bool = False,
) -> pd.DataFrame:
    # Generate probability scores
    df["trifid_score"] = model.predict_proba(features)[:, 1]
    # Normalize scores per gene
    df["norm_trifid_score"] = df.groupby("gene_id")["trifid_score"].transform(
        lambda x: 0 if (x == 0).all()
        else (1 if ((len(set(x)) == 1) & (x >= 0.5).all())
              else x / max(0.5, x.max()))
    )
    # Round to 4 decimal places
    df = df.round({"trifid_score": 4, "norm_trifid_score": 4})
    return df
Raw prediction
The model outputs the probability that a transcript is functional, on a 0-1 scale.
Gene-level normalization
Scores are normalized within each gene to highlight the most functional isoform.
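Edge case handling
All scores in the gene are 0 → Keep as 0
All isoforms share the same score and it is at least 0.5 → Set to 1
Otherwise → Divide each score by max(0.5, highest score in the gene), so a gene whose best isoform scores below 0.5 never reaches 1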
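The three cases are easier to follow on concrete numbers. The sketch below re-implements the per-gene rule from `generate_trifid_metrics` in plain Python for a single gene; `normalize_gene_scores` is a hypothetical name used only for this illustration.

```python
def normalize_gene_scores(scores):
    """Apply the per-gene normalization rule to one gene's raw scores."""
    # Case 1: every isoform scores 0, so there is nothing to scale.
    if all(s == 0 for s in scores):
        return [0.0 for _ in scores]
    # Case 2: all isoforms share one score and that score is at least 0.5.
    if len(set(scores)) == 1 and all(s >= 0.5 for s in scores):
        return [1.0 for _ in scores]
    # Case 3: divide by the gene maximum, but never by less than 0.5.
    denominator = max(0.5, max(scores))
    return [round(s / denominator, 4) for s in scores]


print(normalize_gene_scores([0.9234, 0.2341]))  # scaled by the gene maximum
print(normalize_gene_scores([0.2, 0.1]))        # scaled by the 0.5 floor
print(normalize_gene_scores([0.7, 0.7]))        # identical high scores
```

The 0.5 floor in the third case is what prevents a gene with uniformly weak isoforms from having one of them promoted to a normalized score of 1.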
Predictions are saved as a compressed TSV file.
Output Columns
The final predictions file contains:
# From trifid/models/predict.py:139-151
labels = [
    "gene_id",
    "gene_name",
    "transcript_id",
    "translation_id",
    "flags",
    "ccdsid",
    "appris",
    "ann_type",
    "length",
    "trifid_score",
    "norm_trifid_score",
]
Example Output
gene_id gene_name transcript_id trifid_score norm_trifid_score
ENSG00000139618 BRCA2 ENST00000380152 0.9234 1.0000
ENSG00000139618 BRCA2 ENST00000544455 0.2341 0.2535
ENSG00000139618 BRCA2 ENST00000614259 0.7823 0.8473
ENSG00000141510 TP53 ENST00000269305 0.8912 1.0000
ENSG00000141510 TP53 ENST00000420246 0.3421 0.3839
File Location
# From trifid/models/predict.py:152-154
df_predictions[labels].to_csv(
    os.path.join(data_dir, "trifid_predictions.tsv.gz"),
    index=None, sep="\t", compression="gzip"
)
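Default path: data/genomes/{assembly}/{release}/trifid_predictions.tsv.gz, where the two directory names are read from the genome_version and db keys in the annotation section of the config file.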
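The excerpt of `make_predictions` shown earlier builds its data directory from two keys in the config, so the output location can be derived the same way. This is a minimal sketch; `predictions_path` and its `root` parameter are hypothetical and only mirror the `os.path.join` calls quoted in this guide.

```python
import os


def predictions_path(config, root="data"):
    """Build the predictions path from the annotation section of the config.

    Hypothetical helper mirroring the path logic quoted from predict.py.
    """
    annotation = config["annotation"]
    return os.path.join(
        root, "genomes",
        annotation["genome_version"],
        annotation["db"],
        "trifid_predictions.tsv.gz",
    )


# Example config fragment with the two keys the path depends on.
example_config = {"annotation": {"genome_version": "GRCh38", "db": "g27"}}
print(predictions_path(example_config))
```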
Score Interpretation
TRIFID Score (Raw)
Range: 0.0 - 1.0
Interpretation:
> 0.8 : High confidence functional
0.5 - 0.8 : Likely functional
0.2 - 0.5 : Uncertain/context-dependent
< 0.2 : Likely non-functional
Use cases:
Comparing isoforms across different genes
Setting genome-wide functional cutoffs
Prioritizing isoforms for experimental validation
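The raw-score bands above can be turned into a small labeling function. The sketch below assumes that a score exactly on a boundary falls into the lower band at 0.8 and into the higher band at 0.5 and 0.2; the source does not say which side boundary values belong to, and `interpret_raw_score` is a hypothetical name.

```python
def interpret_raw_score(score):
    """Map a raw TRIFID score to the interpretation bands of this guide."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score}")
    if score > 0.8:
        return "High confidence functional"
    if score >= 0.5:  # assumption: 0.5 and 0.8 count as "likely functional"
        return "Likely functional"
    if score >= 0.2:  # assumption: 0.2 counts as "uncertain"
        return "Uncertain/context-dependent"
    return "Likely non-functional"


for value in (0.9234, 0.7823, 0.3421, 0.05):
    print(value, interpret_raw_score(value))
```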
Normalized TRIFID Score (Gene-Relative)
Range: 0.0 - 1.0
Interpretation:
1.0 : Most functional isoform of the gene
0.5 - 0.99 : Partially functional isoform
< 0.5 : Less functional relative to gene’s main isoform
Use cases:
Identifying the principal functional isoform per gene
Quantifying isoform switching effects
Gene-centric functional analyses
For most analyses, use norm_trifid_score to compare isoforms within a gene, and trifid_score to compare across genes.
Batch Predictions
For multiple genomes or conditions:
Scripted Workflow
#!/bin/bash
# predict_all.sh
GENOMES=("GRCh38" "GRCh37" "GRCm39")
RELEASES=("g27" "g27" "gM25")
# Iterate over array indices so each genome is paired with its release
for i in "${!GENOMES[@]}"; do
    echo "Processing ${GENOMES[$i]} ${RELEASES[$i]}..."
    python -m trifid.models.predict \
        --config config/config.yaml \
        --features config/features.yaml \
        --model models/trifid.v_1_0_4.pkl \
        --assembly "${GENOMES[$i]}" \
        --release "${RELEASES[$i]}"
    echo "Done: ${GENOMES[$i]}"
done
Python Script
# batch_predict.py
import pickle

import pandas as pd

from trifid.models.predict import make_predictions
from trifid.utils.utils import parse_yaml

# Configuration
config = parse_yaml('config/config.yaml')
df_features = pd.DataFrame(parse_yaml('config/features.yaml'))
with open('models/trifid.v_1_0_4.pkl', 'rb') as fh:
    model = pickle.load(fh)

# Split the feature table into model inputs and identifier columns
is_identifier = df_features["category"].str.contains("Identifier")
features = df_features[~is_identifier]["feature"].values
ids = df_features[is_identifier]["feature"].values

# Run predictions
# Note: make_predictions reads its input from the directory named in
# config["annotation"], so make sure the config matches each genome.
genomes = [
    ("GRCh38", "g27"),
    ("GRCm39", "gM25"),
]
for assembly, release in genomes:
    print(f"Predicting {assembly} {release}...")
    make_predictions(features, ids, config, model, assembly, release)
    print(f"Complete: {assembly}")
Integrating with Workflows
Loading Predictions in Python
import pandas as pd
# Load predictions
df = pd.read_csv(
    'data/genomes/GRCh38/g27/trifid_predictions.tsv.gz',
    sep='\t',
    compression='gzip'
)
# Filter functional isoforms
functional = df[df['trifid_score'] > 0.8]
# Get top isoform per gene
top_isoforms = df.loc[
    df.groupby('gene_id')['norm_trifid_score'].idxmax()
]
print(f"Total transcripts: {len(df)}")
print(f"Functional (>0.8): {len(functional)}")
print(f"Top isoforms: {len(top_isoforms)}")
Downstream Analysis
# Compare TRIFID scores with APPRIS annotations
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Label each transcript by whether APPRIS tags it as PRINCIPAL
is_principal = df['appris'].str.contains('PRINCIPAL', na=False)
df_plot = df.assign(
    appris_class=np.where(is_principal, 'PRINCIPAL', 'ALTERNATIVE')
)
plt.figure(figsize=(10, 6))
# Long-form data with x/y avoids the version-dependent `labels` keyword
sns.boxplot(
    data=df_plot, x='appris_class', y='trifid_score',
    order=['PRINCIPAL', 'ALTERNATIVE']
)
plt.xlabel('APPRIS annotation')
plt.ylabel('TRIFID Score')
plt.title('TRIFID Scores by APPRIS Annotation')
plt.show()
Export for IGV
Create a BED file colored by TRIFID score:
# Load GTF for coordinates
from trifid.data.loaders import load_annotation
df_gtf = load_annotation(
    'data/genomes/GRCh38/g27/gencode.v27.annotation.gtf.gz',
    db='g'
)
# Merge with predictions
df_viz = pd.merge(
    df_gtf[['transcript_id', 'seqname', 'start', 'end', 'strand']],
    df[['transcript_id', 'trifid_score']],
    on='transcript_id'
)
# GTF coordinates are 1-based and inclusive; BED starts are 0-based
df_viz['bed_start'] = df_viz['start'] - 1
# The BED score column expects an integer between 0 and 1000
df_viz['bed_score'] = (df_viz['trifid_score'] * 1000).round().astype(int)
# Convert score to RGB color (0=red, 1=green)
df_viz['color'] = df_viz['trifid_score'].apply(
    lambda x: f"{int((1 - x) * 255)},{int(x * 255)},0"
)
# Export BED9: chrom, start, end, name, score, strand,
# thickStart, thickEnd, itemRgb
df_viz[['seqname', 'bed_start', 'end', 'transcript_id',
        'bed_score', 'strand', 'bed_start', 'end', 'color']].to_csv(
    'trifid_scores.bed',
    sep='\t',
    header=False,
    index=False
)
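The color mapping used for the BED export is a linear blend from red to green. As a standalone sketch, with clamping added as an assumption so that out-of-range inputs cannot produce invalid color components (`score_to_rgb` is a hypothetical name):

```python
def score_to_rgb(score):
    """Convert a score in [0, 1] to an 'R,G,B' string (0 = red, 1 = green)."""
    # Clamp first so that values outside [0, 1] still give valid colors.
    score = min(1.0, max(0.0, float(score)))
    red = int((1 - score) * 255)
    green = int(score * 255)
    return f"{red},{green},0"


for value in (0.0, 0.5, 1.0):
    print(value, score_to_rgb(value))
```

Note that genome browsers typically only honor the color column when the track line enables it, for example with `itemRgb="On"` at the top of the BED file.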
Speed Optimization
For large genomes:
Predictions are fast (seconds to minutes)
Bottleneck is usually I/O, not computation
Use SSD storage for data files
Timing examples:
Human genome (~200k transcripts): ~2-5 minutes
Mouse genome (~150k transcripts): ~1-3 minutes
Memory Usage
Typical requirements:
Human genome: ~2-4 GB RAM
Mouse genome: ~1-2 GB RAM
If you encounter memory errors:
Process chromosomes separately
Use chunked reading with pandas
Reduce number of features loaded
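The second and third suggestions can be combined: read only the needed columns, and score the table one chunk at a time. The following is a minimal sketch under the assumption that the database is a tab-separated file with a `transcript_id` column; `predict_in_chunks` is a hypothetical helper, not part of TRIFID.

```python
import pandas as pd


def predict_in_chunks(path, model, features, id_column="transcript_id",
                      chunksize=50_000):
    """Score a large tab-separated table piece by piece and return raw scores.

    Hypothetical helper; gzip compression is inferred from a .gz extension.
    """
    features = list(features)
    parts = []
    # usecols limits memory to the columns that are actually needed.
    reader = pd.read_csv(
        path, sep="\t", usecols=[id_column, *features], chunksize=chunksize
    )
    for chunk in reader:
        # Column 1 of predict_proba is the probability of the positive class.
        scores = model.predict_proba(chunk[features])[:, 1]
        parts.append(
            pd.DataFrame(
                {id_column: chunk[id_column].values, "trifid_score": scores}
            )
        )
    return pd.concat(parts, ignore_index=True)
```

Only the raw score can be computed chunk by chunk. The gene-level normalization needs every isoform of a gene at once, so it should be applied afterwards on the combined result.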
Troubleshooting
Model Loading Errors
Error: pickle.UnpicklingError
Cause: Model file corrupted or from incompatible scikit-learn version
Solution:
# Check scikit-learn version
python -c "import sklearn; print(sklearn.__version__)"
# Retrain model with current version
python -m trifid.models.train --features config/features.yaml --custom
Feature Mismatch Errors
Error: KeyError: 'norm_RNA2sj_cds'
Cause: Feature in model but not in database
Solution:
Check features.yaml matches training configuration
Verify all preprocessing steps completed
Use --custom mode with matching features
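Before rerunning the whole pipeline, it is often quicker to list exactly which configured features are absent from the database. A minimal sketch, with `missing_features` as a hypothetical helper and made-up column lists for the demonstration:

```python
def missing_features(db_columns, features):
    """Return the configured features that are absent from the database columns."""
    available = set(db_columns)
    # Preserve the configured order so the report matches features.yaml.
    return [name for name in features if name not in available]


# Illustrative column names only; real databases contain many more columns.
db_columns = ["transcript_id", "gene_id", "length", "norm_spade", "pfam_score"]
wanted = ["length", "norm_spade", "norm_RNA2sj_cds", "pfam_score"]
print(missing_features(db_columns, wanted))
```

With a loaded database, the call would be `missing_features(df.columns, features)`; an empty list means the KeyError has another cause.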
Missing Predictions
Problem: Some transcripts have NaN scores
Cause: Missing values in required features
Solution:
# Check for missing values
import pandas as pd
df = pd.read_csv('data/genomes/GRCh38/g27/trifid_db.tsv.gz', sep='\t')
print(df[features].isnull().sum())
# Option 1: impute missing values with a sentinel
df[features] = df[features].fillna(-1)
# Option 2: drop incomplete rows instead (use in place of option 1)
# df = df.dropna(subset=features)
Score Distribution Issues
Problem: All scores near 0.5
Cause: Model not confident, possibly due to:
Poor training
Feature distribution shift
Missing key features
Solution:
Retrain with more data
Check feature distributions match training data
Use model interpretation (see next guide)
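To decide whether the score distribution is actually collapsed around 0.5, it helps to quantify it rather than eyeball it. The sketch below computes the share of scores within a tolerance of 0.5; the tolerance of 0.1 and the name `fraction_near_half` are assumptions for illustration, not TRIFID conventions.

```python
def fraction_near_half(scores, tolerance=0.1):
    """Return the fraction of scores within `tolerance` of 0.5."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores given")
    near = sum(1 for s in scores if abs(s - 0.5) <= tolerance)
    return near / len(scores)


print(fraction_near_half([0.45, 0.5, 0.55, 0.9]))  # three of four are near 0.5
print(fraction_near_half([0.05, 0.95]))            # none are near 0.5
```

On real output this could be called as `fraction_near_half(df["trifid_score"])`; a value close to 1 supports the diagnosis above, while a low value suggests the problem lies elsewhere.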
Next Steps
Interpret Results: Understand TRIFID scores with SHAP explanations
API Reference: Detailed API documentation for predictions