Prerequisites
Before starting, ensure you have:
Installed TRIFID (see Installation)
Downloaded pre-computed predictions for your genome of interest
Python 3.7+ with pandas installed
This guide uses GENCODE 27 (GRCh38) human predictions as an example. The same workflow applies to other genomes.
Loading Predictions
The fastest way to get started is loading pre-computed TRIFID predictions.
Load Predictions for All Genes
import pandas as pd
# Load TRIFID predictions (tab-separated, gzip-compressed)
predictions = pd.read_csv(
    'data/genomes/GRCh38/g27/trifid_predictions.tsv.gz',
    compression='gzip',
    sep='\t'
)
# View first few rows
predictions.head()
Understanding the Output
The predictions file contains the following key columns:
Column             Description
gene_id            Ensembl gene identifier
gene_name          Gene symbol (e.g., FGFR1)
transcript_id      Ensembl transcript identifier
translation_id     Ensembl protein identifier
appris             APPRIS annotation label
length             Protein length (amino acids)
trifid_score       Raw TRIFID score (0-1)
norm_trifid_score  Gene-normalized score (0-1)
The norm_trifid_score is particularly useful for identifying the principal isoform within each gene.
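One way to use norm_trifid_score for this purpose is to keep the top-scoring row per gene. The sketch below assumes the column names listed above. The small DataFrame is illustration data only: the FGFR1 values come from this guide's example, while GENE_X, TX_A, and TX_B are hypothetical placeholders.

```python
import pandas as pd

# Toy stand-in for the predictions table (illustrative values only).
toy = pd.DataFrame({
    "gene_name": ["FGFR1", "FGFR1", "FGFR1", "GENE_X", "GENE_X"],
    "transcript_id": ["ENST00000447712", "ENST00000356207",
                      "ENST00000397103", "TX_A", "TX_B"],
    "norm_trifid_score": [0.99, 0.69, 0.08, 0.35, 1.00],
})

# idxmax returns, per gene, the index label of the highest-scoring row.
top_idx = toy.groupby("gene_name")["norm_trifid_score"].idxmax()
top_isoforms = toy.loc[top_idx].reset_index(drop=True)
print(top_isoforms)
```

With the real table, the same two lines can be applied to predictions directly. Ties within a gene resolve to the first row encountered.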
Example 1: Analyze a Single Gene
Let’s examine the isoforms of FGFR1 (Fibroblast Growth Factor Receptor 1):
# Select gene of interest
gene_name = 'FGFR1'
# Filter predictions and sort isoforms from highest to lowest score
fgfr1_isoforms = predictions.loc[
    predictions['gene_name'] == gene_name,
    ['transcript_id', 'gene_name', 'appris', 'length', 'trifid_score', 'norm_trifid_score']
].sort_values('trifid_score', ascending=False)
print(fgfr1_isoforms)
Expected Output
transcript_id    gene_name  appris       length  trifid_score  norm_trifid_score
ENST00000447712  FGFR1      PRINCIPAL:3  822     0.87          0.99
ENST00000356207  FGFR1      MINOR        733     0.60          0.69
ENST00000397103  FGFR1      MINOR        733     0.01          0.08
ENST00000619564  FGFR1      MINOR        228     0.00          0.01
Interpretation
Identify Principal Isoform
ENST00000447712 has the highest TRIFID score (0.87) and normalized score (0.99), which is consistent with its APPRIS PRINCIPAL:3 label and marks it as the most likely functionally important isoform.
Secondary Isoforms
ENST00000356207 has moderate scores (0.60/0.69), suggesting potential functional relevance in specific contexts.
Low-Scoring Isoforms
The other isoforms have very low scores (below 0.1), likely representing transcriptional noise or non-functional variants.
TRIFID scores align well with APPRIS annotations. High-scoring isoforms typically correspond to PRINCIPAL annotations.
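The agreement between the two annotations can be tabulated directly. The sketch below uses only the four FGFR1 rows shown in the expected output. It rests on two assumptions: that labels beginning with "PRINCIPAL" mark principal isoforms, and that 0.7 (the cutoff used in this guide's later examples) is a reasonable high-score threshold.

```python
import pandas as pd

# The four FGFR1 rows from the expected output above.
fgfr1 = pd.DataFrame({
    "transcript_id": ["ENST00000447712", "ENST00000356207",
                      "ENST00000397103", "ENST00000619564"],
    "appris": ["PRINCIPAL:3", "MINOR", "MINOR", "MINOR"],
    "trifid_score": [0.87, 0.60, 0.01, 0.00],
})

# Assumption: labels starting with "PRINCIPAL" mark principal isoforms.
is_principal = fgfr1["appris"].str.startswith("PRINCIPAL")
# Assumption: 0.7 is the high-score cutoff.
is_high = fgfr1["trifid_score"] >= 0.7

# Rows: APPRIS principal or not; columns: high TRIFID score or not.
agreement = pd.crosstab(is_principal, is_high,
                        rownames=["appris_principal"],
                        colnames=["high_trifid"])
print(agreement)
```

Off-diagonal cells in the resulting table are the isoforms where the two annotations disagree, which are often the most interesting ones to inspect.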
Example 2: Genome-Wide Analysis
# Define high-confidence threshold
HIGH_CONFIDENCE_THRESHOLD = 0.7
# Count high-scoring isoforms per gene
high_confidence = predictions[predictions['trifid_score'] >= HIGH_CONFIDENCE_THRESHOLD]
isoforms_per_gene = high_confidence.groupby('gene_name').size()
print(f"Genes with 1 high-confidence isoform: {(isoforms_per_gene == 1).sum()}")
print(f"Genes with 2+ high-confidence isoforms: {(isoforms_per_gene > 1).sum()}")
Compare TRIFID with APPRIS
import matplotlib.pyplot as plt
import seaborn as sns
# Summary statistics of TRIFID scores per APPRIS annotation
appris_comparison = predictions.groupby('appris')['trifid_score'].describe()
print(appris_comparison)
# Visualize the score distribution for each APPRIS label
sns.boxplot(data=predictions, x='appris', y='trifid_score')
plt.xticks(rotation=45)
plt.title('TRIFID Scores by APPRIS Annotation')
plt.tight_layout()
plt.show()
Example 3: Load with Full Feature Matrix
For advanced analysis, load the complete feature database:
# Load full database with all features
df_full = pd.read_csv(
    'data/genomes/GRCh38/g27/trifid_db.tsv.gz',
    compression='gzip',
    sep='\t'
)
# View available features (columns) and isoforms (rows)
print(f"Total features: {df_full.shape[1]}")
print(f"Total isoforms: {df_full.shape[0]}")
# Examine specific gene with all features
fgfr1_full = df_full[df_full['gene_name'] == 'FGFR1']
print(fgfr1_full[['transcript_id', 'firestar', 'corsair', 'spade', 'RNA2sj', 'pfam_score']].head())
Key Features to Explore
firestar: Functional residue conservation
matador3d: 3D structure conservation
corsair: Cross-species conservation
corsair_alt: Alternative exon conservation
PhyloCSF_Psi: Evolutionary coding score
RNA2sj: Splice junction coverage score
RNA2sj_cds: CDS-specific junction coverage
norm_RNA2sj: Gene-normalized expression
pfam_score: Pfam domain preservation
pfam_domains_impact_score: Domain integrity
perc_Damaged_State: Damaged domain percentage
perc_Lost_State: Lost domain percentage
tsl: Transcript support level (1-5)
CCDS: CCDS annotation status
basic: GENCODE basic set membership
nonsense_mediated_decay: NMD prediction
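Before relying on any of these features, it may help to check which ones are present in a given table and how many values are missing. The helper below, summarize_features, is a hypothetical name introduced for this sketch and is not part of TRIFID. The toy DataFrame stands in for df_full and holds illustrative values only.

```python
import pandas as pd

def summarize_features(df, feature_names):
    """Return, per feature, whether it is present and its NaN fraction."""
    rows = []
    for name in feature_names:
        present = name in df.columns
        # Fraction of missing values; NaN when the column is absent.
        frac_missing = float(df[name].isna().mean()) if present else float("nan")
        rows.append({"feature": name, "present": present,
                     "frac_missing": frac_missing})
    return pd.DataFrame(rows)

# Toy frame standing in for df_full (illustrative values only).
toy_full = pd.DataFrame({
    "firestar": [0.5, None, 0.2, 0.9],
    "corsair": [1.0, 0.8, 0.1, 0.0],
})

summary = summarize_features(toy_full, ["firestar", "corsair", "matador3d"])
print(summary)
```

Calling summarize_features(df_full, [...]) with the feature names listed above would give the same report for the real database.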
Example 4: Using TRIFID Modules
TRIFID includes specialized modules for computing features from raw data.
Load Pre-computed Data with TRIFID Loaders
from trifid.data.loaders import (
    load_appris,
    load_qsplice,
    load_qpfam,
    load_sequences
)
# Load APPRIS annotations
df_appris = load_appris('data/external/appris/GRCh38/g27/appris_data.appris.txt')
# Load QSplice scores
df_qsplice = load_qsplice('data/external/qsplice/GRCh38/g27/qsplice.emtab2836.g27.tsv.gz')
# Load Pfam effects
df_qpfam = load_qpfam('data/external/pfam_effects/GRCh38/g27/qpfam.tsv.gz')
# Load protein sequences
df_sequences = load_sequences('data/external/appris/GRCh38/g27/appris_data.transl.fa.gz')
print(f"APPRIS isoforms: {len(df_appris)}")
print(f"QSplice scores: {len(df_qsplice)}")
print(f"QPfam scores: {len(df_qpfam)}")
Compute QSplice Scores from RNA-seq Data
This requires STAR RNA-seq alignment outputs (SJ.out.tab files).
python -m trifid.preprocessing.qsplice \
    --gff data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz \
    --outdir data/external/qsplice/GRCh38/g27 \
    --samples out/E-MTAB-2836/GRCh38/STAR/g27 \
    --version g
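To sanity-check the STAR inputs before running the module, an SJ.out.tab file can be inspected with pandas. This sketch is not how the qsplice module itself reads the files. The column labels are names chosen here for the nine tab-separated fields STAR writes (chromosome, intron start, intron end, strand, intron motif, annotation flag, uniquely mapped reads, multi-mapped reads, maximum overhang), and the two junction rows are made-up sample data.

```python
import io
import pandas as pd

# This sketch's own labels for the nine SJ.out.tab fields.
SJ_COLUMNS = ["chrom", "intron_start", "intron_end", "strand", "motif",
              "annotated", "unique_reads", "multi_reads", "max_overhang"]

# Two made-up junction rows standing in for a real SJ.out.tab file.
sample = io.StringIO(
    "chr8\t38411139\t38413457\t2\t2\t1\t154\t3\t48\n"
    "chr8\t38414790\t38415901\t2\t2\t1\t0\t1\t12\n"
)

# SJ.out.tab has no header row, so column names are supplied explicitly.
junctions = pd.read_csv(sample, sep="\t", header=None, names=SJ_COLUMNS)

# Keep junctions supported by at least one uniquely mapped read.
supported = junctions[junctions["unique_reads"] > 0]
print(supported)
```

For a real file, replace the StringIO object with the path to the SJ.out.tab output of a sample.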
Compute Pfam Effects
python -m trifid.preprocessing.pfam_effects \
    --appris data/external/appris/GRCh38/g27/appris_data.appris.txt \
    --jobs 10 \
    --seqs data/external/appris/GRCh38/g27/appris_data.transl.fa.gz \
    --spade data/external/appris/GRCh38/g27/appris_method.spade.gtf.gz \
    --outdir data/external/pfam_effects/GRCh38/g27
Example 5: Filter by Criteria
Find genes with multiple functional isoforms:
# High confidence threshold
threshold = 0.7
# Find genes with multiple high-scoring isoforms
multi_functional = predictions[
    predictions['trifid_score'] >= threshold
].groupby('gene_name').filter(lambda x: len(x) >= 2)
# Show the first five such genes with their high-scoring isoforms
for gene in multi_functional['gene_name'].unique()[:5]:
    gene_data = multi_functional[
        multi_functional['gene_name'] == gene
    ][['transcript_id', 'trifid_score', 'appris', 'length']]
    print(f"\n{gene}:")
    print(gene_data.to_string(index=False))
Working with Specific Transcripts
Retrieve TRIFID score for a specific transcript:
# Query by transcript ID (without version suffix)
transcript_id = 'ENST00000447712'
result = predictions[
    predictions['transcript_id'] == transcript_id
][['gene_name', 'transcript_id', 'trifid_score', 'norm_trifid_score', 'appris']]
if not result.empty:
    print(result.to_string(index=False))
else:
    print(f"Transcript {transcript_id} not found")
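The same lookup extends to several transcripts at once. The sketch below uses isin and reports any IDs that were not found. The toy table reuses two FGFR1 rows from this guide, and "ENST_NOT_PRESENT" is a deliberately absent placeholder ID.

```python
import pandas as pd

# Toy stand-in for the predictions table (scores from the FGFR1 example).
toy = pd.DataFrame({
    "transcript_id": ["ENST00000447712", "ENST00000356207"],
    "trifid_score": [0.87, 0.60],
})

# "ENST_NOT_PRESENT" is a placeholder that is intentionally missing.
wanted = ["ENST00000447712", "ENST_NOT_PRESENT"]

# Rows whose transcript ID is in the requested list.
found = toy[toy["transcript_id"].isin(wanted)]
# Requested IDs that matched no row.
missing = sorted(set(wanted) - set(found["transcript_id"]))

print(found.to_string(index=False))
print(f"Not found: {missing}")
```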
Exporting Results
Save filtered results for downstream analysis:
# Export high-confidence isoforms as an uncompressed TSV
high_confidence = predictions[predictions['trifid_score'] >= 0.7]
high_confidence.to_csv(
    'high_confidence_isoforms.tsv',
    sep='\t',
    index=False
)
print(f"Exported {len(high_confidence)} high-confidence isoforms")
Interactive Analysis
For interactive exploration, use the Jupyter notebook tutorial:
# Start Jupyter Lab
jupyter lab
# Open the tutorial notebook
# notebooks/01.tutorial.ipynb
The tutorial covers:
Loading and exploring data
Model training and evaluation
Feature importance analysis
SHAP value interpretation
Generating publication figures
Tutorial Notebook: View the complete tutorial on GitHub
Common Pitfalls
Avoid these common mistakes:
Transcript ID versions: TRIFID uses IDs without version numbers (e.g., ENST00000447712, not ENST00000447712.2).
Missing data: Some features may contain NaN values for specific assemblies; check the documentation for which features are affected.
Score interpretation: Don’t use raw scores to compare across genes; use normalized scores.
File compression: pandas infers gzip compression from a .gz extension by default; pass compression='gzip' explicitly when the file name lacks that extension.
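For the transcript ID pitfall, versioned IDs from other sources can be normalized before matching against TRIFID. This is a minimal sketch that assumes the version is a trailing ".<digits>" suffix. The version numbers on the second and third IDs are made up for illustration, and IDs with other suffixes are not handled.

```python
import pandas as pd

# Mixed versioned and unversioned IDs (version numbers are illustrative).
ids = pd.Series(["ENST00000447712.2", "ENST00000356207", "ENST00000397103.10"])

# Remove a trailing ".<digits>" suffix; unversioned IDs pass through unchanged.
unversioned = ids.str.replace(r"\.\d+$", "", regex=True)
print(unversioned.tolist())
```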
Next Steps
Advanced Usage: Learn about model training, feature engineering, and custom predictions
API Reference: Explore the complete TRIFID Python API
TRIFID Modules: Deep dive into QSplice, Pfam effects, and fragment labeling
Use Cases: See real-world applications and case studies
Get Help
Need assistance?