TRIFID End-to-End Tutorial
This comprehensive tutorial walks through the complete TRIFID workflow, from data preparation to making predictions and interpreting results.
Overview
You’ll learn how to:
- Set up the TRIFID environment
- Prepare input data and features
- Train a TRIFID model
- Make predictions on new genome annotations
- Interpret results with SHAP
- Analyze predictions across the genome
This tutorial uses GENCODE 27 (human) as an example, but the workflow applies to any well-annotated genome.
Prerequisites
Installation
# Clone the repository
git clone [email protected]:fpozoc/trifid.git
cd trifid
# Create conda environment
mamba env create -f environment.yml
conda activate trifid
# Install pre-commit hooks
pre-commit install
# Install package in development mode
pip install -e ".[dev]"
Verify Installation
# Run tests
pytest -v
# Check imports
python -c "import trifid; print(trifid.__version__)"
1. Data Preparation
1.1 Download Source Data
cd trifid
# Create directories
mkdir -p data/external/genome_annotation/GRCh38/g27
mkdir -p data/external/appris/GRCh38/g27
# Download GENCODE annotations
curl ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz \
-o data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gtf.gz
curl ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.annotation.gff3.gz \
-o data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz
# Download APPRIS annotations
curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.principal.txt \
-o data/external/appris/GRCh38/g27/appris_data.principal.txt
curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.appris.txt \
-o data/external/appris/GRCh38/g27/appris_data.appris.txt
curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.transl.fa.gz \
-o data/external/appris/GRCh38/g27/appris_data.transl.fa.gz
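Before moving on, it can help to sanity-check that the GTF downloaded intact. A minimal sketch of such a check (the helper below is not part of TRIFID; the demo writes a tiny stand-in file — point `path` at `data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gtf.gz` instead):

```python
import gzip
from collections import Counter

def count_gtf_features(path):
    """Count feature types (gene, transcript, exon, ...) in a gzipped GTF."""
    counts = Counter()
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header comments
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 9:          # standard 9-column GTF record
                counts[fields[2]] += 1    # column 3 holds the feature type
    return counts

# Tiny stand-in file for demonstration only
demo = (
    "##description: demo\n"
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1";\n'
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1";\n'
)
with gzip.open("demo.gtf.gz", "wt") as fh:
    fh.write(demo)
print(count_gtf_features("demo.gtf.gz"))
```

A healthy GENCODE GTF should report on the order of tens of thousands of `gene` records and many more `exon` records.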
1.2 Generate QSplice Scores
QSplice quantifies RNA-seq splice junction coverage:
python -m trifid.preprocessing.qsplice \
--gff data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz \
--outdir data/external/qsplice/GRCh38/g27 \
--samples out/E-MTAB-2836/GRCh38/STAR/g27 \
--version g
Output:
sj_maxp.emtab2836.mapped.tsv.gz: Splice junction scores
qsplice.emtab2836.g27.tsv.gz: Transcript-level scores
Pre-computed QSplice scores are available in the data repository, so this step can be skipped.
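A quick way to inspect the transcript-level output is to load it with pandas. The column names below (`transcript_id`, `qsplice_score`) and the 0.5 threshold are illustrative assumptions, not TRIFID's actual schema — check the header of `qsplice.emtab2836.g27.tsv.gz`:

```python
import pandas as pd

# Stand-in for:
# pd.read_csv("data/external/qsplice/GRCh38/g27/qsplice.emtab2836.g27.tsv.gz", sep="\t")
df_qsplice = pd.DataFrame({
    "transcript_id": ["ENST01", "ENST02", "ENST03"],
    "qsplice_score": [0.95, 0.10, 0.55],  # hypothetical column name
})

# Fraction of transcripts with well-supported splice junctions
supported = (df_qsplice["qsplice_score"] >= 0.5).mean()
print(f"{supported:.0%} of transcripts pass the 0.5 coverage threshold")
```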
1.3 Generate Pfam Effects
Calculate domain integrity scores:
python -m trifid.preprocessing.pfam_effects \
--appris data/external/appris/GRCh38/g27/appris_data.appris.txt \
--jobs 10 \
--seqs data/external/appris/GRCh38/g27/appris_data.transl.fa.gz \
--spade data/external/appris/GRCh38/g27/appris_method.spade.gtf.gz \
--outdir data/external/pfam_effects/GRCh38/g27
Output:
qpfam.tsv.gz: Domain conservation scores
1.4 Label Fragments
Identify redundant and fragmented isoforms:
python -m trifid.preprocessing.label_fragments \
--gtf data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gtf.gz \
--seqs data/external/appris/GRCh38/g27/appris_data.transl.fa.gz \
--principals data/external/appris/GRCh38/g27/appris_data.principal.txt \
--outdir data/external/label_fragments/GRCh38/g27
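The intuition behind this step — an isoform whose translation is an exact substring of the principal isoform adds no independent coding information — can be sketched in a few lines. This is a deliberate simplification for illustration, not the actual `label_fragments` implementation:

```python
def label_fragments(principal_seq, isoform_seqs):
    """Flag isoforms whose protein sequence is contained in the principal one."""
    labels = {}
    for tid, seq in isoform_seqs.items():
        if seq == principal_seq:
            labels[tid] = "duplicate"   # identical translation
        elif seq in principal_seq:
            labels[tid] = "fragment"    # exact substring of the principal
        else:
            labels[tid] = "distinct"    # potentially novel coding sequence
    return labels

principal = "MSTAGKVIKCKAAVLWEE"
isoforms = {
    "ENST_A": "MSTAGKVIKCKAAVLWEE",  # same as principal
    "ENST_B": "GKVIKCKAAV",          # internal fragment
    "ENST_C": "MSTAGKVIKCKAAVLWQQ",  # diverged C-terminus
}
print(label_fragments(principal, isoforms))
```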
2. Feature Engineering
2.1 Load Configuration
import os
import pandas as pd
from trifid.utils.utils import parse_yaml, create_dir
from trifid.data.feature_engineering import build_features, load_data
# Set paths
TRIFID_DIR = os.path.expanduser('~/trifid')
CONFIG_PATH = os.path.join(TRIFID_DIR, 'config/config.yaml')
FEATURES_PATH = os.path.join(TRIFID_DIR, 'config/features.yaml')
# Load configuration
config = parse_yaml(CONFIG_PATH)
df_features_config = pd.DataFrame(parse_yaml(FEATURES_PATH))
# Extract feature names
features = df_features_config[
~df_features_config['category'].str.contains('Identifier')
]['feature'].values
ids = df_features_config[
df_features_config['category'].str.contains('Identifier')
]['feature'].values
print(f"Total features: {len(features)}")
print("Feature categories:")
print(df_features_config.groupby('category')['feature'].count())
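The identifier/feature split above assumes `features.yaml` maps each feature to a category. The same logic can be exercised on a toy configuration (the entries here are invented for illustration):

```python
import pandas as pd

# Invented entries mimicking the assumed features.yaml structure
toy_config = [
    {"feature": "transcript_id", "category": "Identifier"},
    {"feature": "gene_name",     "category": "Identifier"},
    {"feature": "phylocsf_mean", "category": "Conservation"},
    {"feature": "pfam_score",    "category": "Structure"},
]
df_cfg = pd.DataFrame(toy_config)

is_id = df_cfg["category"].str.contains("Identifier")
features = df_cfg.loc[~is_id, "feature"].values  # model inputs
ids = df_cfg.loc[is_id, "feature"].values        # bookkeeping columns
print(list(features), list(ids))
```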
2.2 Build Feature Dataset
# Create output directory
create_dir(os.path.join(TRIFID_DIR, 'data', 'genomes', 'GRCh38', 'g27'))
# Load and process data
df_g27 = load_data(config, assembly='GRCh38', release='g27')
df_g27 = build_features(df_g27)
# Save processed features
output_path = os.path.join(
TRIFID_DIR, 'data', 'genomes', 'GRCh38', 'g27', 'trifid_db.tsv.gz'
)
df_g27[df_features_config.feature.values].drop('sequence', axis=1).to_csv(
output_path, index=False, sep='\t', compression='gzip'
)
print(f"Feature dataset saved: {output_path}")
print(f"Shape: {df_g27.shape}")
3. Model Training
3.1 Prepare Training Set
from trifid.utils.utils import generate_training_set, balanced_training_set
# Load training labels (proteomics evidence)
df_training_set_initial = pd.read_csv(
os.path.join(TRIFID_DIR, 'data', 'model', 'training_set_initial.g27.tsv.gz'),
sep='\t'
)
# Create labels
df_training_set = df_training_set_initial.copy()
df_training_set.loc[
df_training_set['state'].str.contains('F'), 'label'
] = 1
df_training_set.loc[
df_training_set['state'].str.contains('U'), 'label'
] = 0
# Filter to labeled examples
df_training_set = df_training_set.loc[
~df_training_set['label'].isnull()
]
# Keep examples from the curated additions; drop bookkeeping columns
df_training_set = df_training_set.loc[
df_training_set['added'].str.contains('v1|r|v3')
].drop(['added', 'state', 'comment'], axis=1).reset_index(drop=True)
print(f"Training set size: {df_training_set.shape}")
print("Class balance:")
print(df_training_set['label'].value_counts(normalize=True) * 100)
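The labeling rule above (proteomics state `F` → positive, `U` → negative, everything else dropped) can be isolated into a small helper. A sketch on toy data — `make_labels` is a hypothetical helper, not a TRIFID function:

```python
import pandas as pd

def make_labels(df, state_col="state"):
    """Map proteomics evidence states to binary labels; drop unlabeled rows."""
    out = df.copy()
    out["label"] = pd.NA
    out.loc[out[state_col].str.contains("F"), "label"] = 1  # peptide-confirmed
    out.loc[out[state_col].str.contains("U"), "label"] = 0  # unverified
    return out[out["label"].notna()].reset_index(drop=True)

toy = pd.DataFrame({"state": ["F", "U", "F", "?"], "x": [1, 2, 3, 4]})
labeled = make_labels(toy)
print(labeled["label"].tolist())
```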
3.2 Train Random Forest Model
from trifid.models.select import Classifier
from sklearn.ensemble import RandomForestClassifier
import pickle
# Define model
model = RandomForestClassifier(
min_samples_leaf=6,
n_estimators=400,
n_jobs=-1,
random_state=123
)
# Create classifier
classifier = Classifier(
model=model,
df=df_training_set,
features_col=df_training_set[features].columns,
target_col='label',
random_state=123
)
# Save model
model_path = os.path.join(TRIFID_DIR, 'models', 'selected_model.pkl')
classifier.save_model(outdir=os.path.join(TRIFID_DIR, 'models'))
print(f"Model saved: {model_path}")
# Get evaluation metrics
print("Model Parameters:")
print(classifier.model)
print("\nPerformance Metrics:")
print(classifier.evaluate)
print("\nConfusion Matrix:")
print(classifier.confusion_matrix)
print("\nClassification Report:")
print(classifier.classification_report)
print("\nCross-Validation Scores:")
print(classifier.cross_validate)
Expected Output:
- Accuracy: ~0.85-0.90
- Precision: ~0.83-0.88
- Recall: ~0.85-0.90
- F1-Score: ~0.84-0.89
- AUC-ROC: ~0.92-0.95
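As a reference for what these numbers mean: all of the headline metrics except AUC-ROC (which needs ranked scores) derive directly from the confusion matrix. A stdlib-only sketch with made-up counts in the ballpark of the expected performance:

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # all correct / all examples
    precision = tp / (tp + fp)                   # correct positives / predicted positives
    recall = tp / (tp + fn)                      # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for illustration only
print(metrics_from_confusion(tp=870, fp=130, fn=110, tn=890))
```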
4. Making Predictions
4.1 Generate Predictions
from trifid.utils.utils import generate_trifid_metrics
# Load feature dataset
df_g27 = pd.read_csv(
os.path.join(TRIFID_DIR, 'data', 'genomes', 'GRCh38', 'g27', 'trifid_db.tsv.gz'),
sep='\t',
compression='gzip'
)
# Separate features and identifiers
df_g27_features = df_g27[features]
df_g27_predictions = df_g27[ids]
# Load trained model
with open(os.path.join(TRIFID_DIR, 'models', 'selected_model.pkl'), 'rb') as fh:
    model = pickle.load(fh)
# Generate predictions
df_g27_predictions = generate_trifid_metrics(
df_g27_predictions,
df_g27_features,
model
)
# Select output columns
df_g27_predictions = df_g27_predictions[[
'gene_id', 'gene_name', 'transcript_id', 'translation_id',
'flags', 'ccdsid', 'appris', 'ann_type', 'length',
'trifid_score', 'norm_trifid_score'
]]
# Save predictions
output_path = os.path.join(
TRIFID_DIR, 'data', 'genomes', 'GRCh38', 'g27', 'trifid_predictions.tsv.gz'
)
df_g27_predictions.to_csv(
output_path, index=False, sep='\t', compression='gzip'
)
print(f"Predictions saved: {output_path}")
print(f"Total isoforms predicted: {len(df_g27_predictions)}")
4.2 Genome-Wide Statistics
from trifid.utils.utils import Statistics
# Calculate statistics
stats = Statistics(df_g27_predictions)
print("\nGenome-wide TRIFID Statistics (cutoff=0.5):")
print(stats.get_stats())
Example Output:
Functional Non functional Total Percentage of functional
PRINCIPAL 18543 1689 20232 91.65
ALTERNATIVE 10256 45187 55443 18.50
Total 28799 46876 75675 38.05
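A table of this shape can also be reproduced from the raw predictions with a pandas crosstab. The sketch below uses a toy frame with the two columns the `Statistics` class presumably consumes (`appris`, `trifid_score`); thresholds and values are illustrative:

```python
import pandas as pd

toy = pd.DataFrame({
    "appris": ["PRINCIPAL:1", "PRINCIPAL:2", "ALTERNATIVE:1", "ALTERNATIVE:2"],
    "trifid_score": [0.91, 0.34, 0.72, 0.05],
})

group = toy["appris"].str.split(":").str[0]  # PRINCIPAL / ALTERNATIVE
functional = (toy["trifid_score"] >= 0.5).map(
    {True: "Functional", False: "Non functional"}
)
table = pd.crosstab(group, functional, margins=True, margins_name="Total")
print(table)
```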
5. Analyzing Specific Genes
5.1 Query Gene of Interest
def analyze_gene(gene_name, predictions_df):
    """Analyze TRIFID predictions for a specific gene."""
    gene_data = predictions_df[
        predictions_df['gene_name'] == gene_name
    ].sort_values('trifid_score', ascending=False)
    print(f"\n=== {gene_name} Analysis ===")
    print(f"Total isoforms: {len(gene_data)}")
    print(f"Functional (score >= 0.5): {(gene_data['trifid_score'] >= 0.5).sum()}")
    print("\nTop scoring isoform:")
    print(f"  Transcript: {gene_data.iloc[0]['transcript_id']}")
    print(f"  Score: {gene_data.iloc[0]['trifid_score']:.3f}")
    print(f"  APPRIS: {gene_data.iloc[0]['appris']}")
    return gene_data
# Example: Analyze FGFR1
fgfr1_results = analyze_gene('FGFR1', df_g27_predictions)
print("\nFull results:")
print(fgfr1_results[['transcript_id', 'appris', 'length', 'trifid_score']])
6. Model Interpretation with SHAP
6.1 Global Feature Importance
from trifid.models.interpret import TreeInterpretation
# Create interpretation object
interpretation = TreeInterpretation(
model=model,
df=df_training_set,
features_col=df_training_set[features].columns,
target_col='label',
random_state=123,
test_size=0.25
)
# Get SHAP values
shap_values = interpretation.shap
print("\nTop 10 Most Important Features:")
print(shap_values.head(10))
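Globally, SHAP feature importance is simply the mean absolute SHAP value per feature across samples. If `interpretation.shap` returns something different in your version, the aggregation itself looks like this (the SHAP matrix and feature names below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up SHAP values: rows = samples, columns = features
shap_matrix = np.array([[ 0.30, -0.05,  0.01],
                        [-0.25,  0.10,  0.02],
                        [ 0.40, -0.02, -0.01]])
feature_names = ["corsair_alt", "pfam_score", "length"]  # illustrative names

# Mean |SHAP| per feature, largest first
importance = (pd.DataFrame(shap_matrix, columns=feature_names)
              .abs().mean()
              .sort_values(ascending=False))
print(importance)
```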
6.2 Local Explanation for Gene
# Explain all isoforms of a gene
explanation = interpretation.local_explanation(df_g27, 'FGFR1')
print("\nFGFR1 Isoform SHAP Values:")
print(explanation)
# Explain specific isoform
isoform_explanation = interpretation.local_explanation(
df_g27,
sample='ENST00000356207'
)
print("\nTop features for ENST00000356207:")
print(isoform_explanation.head(10))
7. Visualization
7.1 Score Distribution
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
# Create figure
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Plot 1: Overall score distribution
ax1 = axes[0]
ax1.hist(
df_g27_predictions['trifid_score'],
bins=50,
color='steelblue',
edgecolor='black',
alpha=0.7
)
ax1.axvline(x=0.5, color='red', linestyle='--', label='Functional threshold')
ax1.set_xlabel('TRIFID Score')
ax1.set_ylabel('Frequency')
ax1.set_title('TRIFID Score Distribution (GENCODE 27)')
ax1.legend()
# Plot 2: Principal vs Alternative
ax2 = axes[1]
principal_scores = df_g27_predictions[
df_g27_predictions['appris'].str.contains('PRINCIPAL')
]['trifid_score']
alternative_scores = df_g27_predictions[
~df_g27_predictions['appris'].str.contains('PRINCIPAL')
]['trifid_score']
ax2.hist(
[principal_scores, alternative_scores],
bins=50,
label=['PRINCIPAL', 'ALTERNATIVE'],
color=['green', 'orange'],
alpha=0.6
)
ax2.axvline(x=0.5, color='red', linestyle='--')
ax2.set_xlabel('TRIFID Score')
ax2.set_ylabel('Frequency')
ax2.set_title('Score Distribution by APPRIS Label')
ax2.legend()
plt.tight_layout()
plt.savefig('trifid_score_distribution.png', dpi=300)
plt.show()
7.2 Feature Importance Plot
# Get feature importances
importances = pd.DataFrame({
'feature': features,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
# Plot top 20 features
plt.figure(figsize=(10, 8))
plt.barh(
importances.head(20)['feature'],
importances.head(20)['importance'],
color='steelblue'
)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Top 20 Most Important Features in TRIFID')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)
plt.show()
8. Exporting Results
8.1 Summary Statistics
def export_summary(predictions_df, output_file):
    """Export summary statistics to file."""
    summary = {
        'total_genes': predictions_df['gene_name'].nunique(),
        'total_isoforms': len(predictions_df),
        'principal_isoforms': predictions_df['appris'].str.contains('PRINCIPAL').sum(),
        'alternative_isoforms': (~predictions_df['appris'].str.contains('PRINCIPAL')).sum(),
        'functional_isoforms': (predictions_df['trifid_score'] >= 0.5).sum(),
        'non_functional_isoforms': (predictions_df['trifid_score'] < 0.5).sum(),
        'mean_score': predictions_df['trifid_score'].mean(),
        'median_score': predictions_df['trifid_score'].median(),
    }
    summary_df = pd.DataFrame([summary]).T
    summary_df.columns = ['Value']
    summary_df.to_csv(output_file, sep='\t')
    print(f"Summary exported to {output_file}")
    return summary_df
summary = export_summary(
df_g27_predictions,
'trifid_summary_gencode27.tsv'
)
print(summary)
8.2 High-Confidence Functional Alternatives
# Find high-scoring alternative isoforms
functional_alternatives = df_g27_predictions[
(~df_g27_predictions['appris'].str.contains('PRINCIPAL')) &
(df_g27_predictions['trifid_score'] >= 0.7)
].sort_values('trifid_score', ascending=False)
print(f"\nHigh-confidence functional alternatives: {len(functional_alternatives)}")
print("\nTop 10:")
print(functional_alternatives.head(10)[[
'gene_name', 'transcript_id', 'trifid_score', 'appris'
]])
# Export
functional_alternatives.to_csv(
'functional_alternative_isoforms.tsv',
sep='\t',
index=False
)
9. Next Steps
Apply to New Genome Versions
# Example: Predict GENCODE 42
df_g42 = load_data(config, assembly='GRCh38', release='g42')
df_g42 = build_features(df_g42)
df_g42_predictions = generate_trifid_metrics(
df_g42[ids],
df_g42[features],
model
)
Apply to Other Species
# Example: Predict mouse GENCODE M25
df_gm25 = load_data(config, assembly='GRCm38', release='g25')
df_gm25 = build_features(df_gm25)
df_gm25_predictions = generate_trifid_metrics(
df_gm25[ids],
df_gm25[features],
model
)
Troubleshooting
Common Issues
Missing features:
# Impute missing values
df[features_with_na] = df[features_with_na].fillna(-1)
Memory issues with large genomes:
from trifid.utils.utils import reduce_mem_usage
df, na_list = reduce_mem_usage(df, verbose=True)
Model compatibility:
# Ensure feature order matches training
df_features = df[features] # Use exact feature list from training
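A defensive variant of the pattern above is to reindex the prediction frame against the training feature list: this enforces the column order the model saw at fit time and surfaces features that are missing entirely, filled here with -1 to match the imputation suggested earlier. The frames below are toy stand-ins:

```python
import pandas as pd

training_features = ["length", "pfam_score", "corsair_alt"]  # order used at fit time
df_new = pd.DataFrame({"pfam_score": [0.8], "length": [350]})  # corsair_alt absent

# Reorder to the training layout; absent columns are created and filled with -1
df_aligned = df_new.reindex(columns=training_features, fill_value=-1)
print(df_aligned.columns.tolist())
```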
Conclusion
You’ve now completed the full TRIFID workflow! You can:
- ✅ Prepare genome annotations and features
- ✅ Train TRIFID models
- ✅ Make predictions on any genome
- ✅ Interpret results with SHAP
- ✅ Analyze functional isoforms
Citation
If you use TRIFID in your research, please cite:
@article{pozo2021trifid,
title={Assessing the functional relevance of splice isoforms},
author={Pozo, Fernando and others},
journal={NAR Genomics and Bioinformatics},
volume={3},
number={2},
year={2021},
publisher={Oxford University Press}
}