TRIFID End-to-End Tutorial

This comprehensive tutorial walks through the complete TRIFID workflow, from data preparation to making predictions and interpreting results.

Overview

You’ll learn how to:
  1. Set up the TRIFID environment
  2. Prepare input data and features
  3. Train a TRIFID model
  4. Make predictions on new genome annotations
  5. Interpret results with SHAP
  6. Analyze predictions across the genome
This tutorial uses GENCODE 27 (human) as an example, but the workflow applies to any well-annotated genome.

Prerequisites

Installation

# Clone the repository
git clone [email protected]:fpozoc/trifid.git
cd trifid

# Create conda environment
mamba env create -f environment.yml
conda activate trifid

# Install pre-commit hooks
pre-commit install

# Install package in development mode
pip install -e ".[dev]"  # quotes avoid bracket globbing in zsh

Verify Installation

# Run tests
pytest -v

# Check imports
python -c "import trifid; print(trifid.__version__)"

1. Data Preparation

1.1 Download Source Data

cd trifid

# Create directories
mkdir -p data/external/genome_annotation/GRCh38/g27
mkdir -p data/external/appris/GRCh38/g27

# Download GENCODE annotations
curl ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz \
  -o data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gtf.gz

curl ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_27/gencode.v27.annotation.gff3.gz \
  -o data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz

# Download APPRIS annotations
curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.principal.txt \
  -o data/external/appris/GRCh38/g27/appris_data.principal.txt

curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.appris.txt \
  -o data/external/appris/GRCh38/g27/appris_data.appris.txt

curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.transl.fa.gz \
  -o data/external/appris/GRCh38/g27/appris_data.transl.fa.gz

1.2 Generate QSplice Scores

QSplice quantifies RNA-seq splice junction coverage:
python -m trifid.preprocessing.qsplice \
    --gff data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz \
    --outdir data/external/qsplice/GRCh38/g27 \
    --samples out/E-MTAB-2836/GRCh38/STAR/g27 \
    --version g
Output:
  • sj_maxp.emtab2836.mapped.tsv.gz: Splice junction scores
  • qsplice.emtab2836.g27.tsv.gz: Transcript-level scores
Pre-computed QSplice scores are available in the data repository if you prefer to skip this step.
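The intuition behind the transcript-level score can be sketched in a few lines (a simplified illustration only, not the actual QSplice algorithm): a transcript is only as well supported as its weakest splice junction, normalized by the best-covered junction of its gene.

```python
def qsplice_sketch(junction_reads, gene_max_reads):
    """Toy transcript score: weakest junction coverage divided by
    the gene's best-covered junction (simplified illustration)."""
    if not junction_reads or gene_max_reads == 0:
        return 0.0
    return min(junction_reads) / gene_max_reads

# A transcript whose junctions are covered by 120, 80 and 95 reads,
# in a gene whose best junction carries 200 reads:
print(qsplice_sketch([120, 80, 95], gene_max_reads=200))  # 0.4
```

The real module additionally handles multi-sample aggregation and version-specific annotation parsing, so treat this purely as a mental model.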

1.3 Generate Pfam Effects

Calculate domain integrity scores:
python -m trifid.preprocessing.pfam_effects \
    --appris data/external/appris/GRCh38/g27/appris_data.appris.txt \
    --jobs 10 \
    --seqs data/external/appris/GRCh38/g27/appris_data.transl.fa.gz \
    --spade data/external/appris/GRCh38/g27/appris_method.spade.gtf.gz \
    --outdir data/external/pfam_effects/GRCh38/g27
Output:
  • qpfam.tsv.gz: Domain conservation scores
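Conceptually, a domain-integrity score asks how many Pfam domain residues of the principal isoform survive in an alternative isoform. A minimal sketch (illustrative only; the real pfam_effects module works from the SPADE annotations and translated sequences passed above):

```python
def domain_integrity(domain_residues, isoform_residues):
    """Fraction of the principal isoform's Pfam domain residues
    retained by an alternative isoform (toy illustration)."""
    if not domain_residues:
        return 1.0  # no domains to lose
    kept = domain_residues & isoform_residues
    return len(kept) / len(domain_residues)

# A 100-residue Pfam domain spans positions 50-149 of the principal;
# the alternative isoform only retains positions 1-99:
print(domain_integrity(set(range(50, 150)), set(range(1, 100))))  # 0.5
```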

1.4 Label Fragments

Identify redundant and fragmented isoforms:
python -m trifid.preprocessing.label_fragments \
    --gtf data/external/genome_annotation/GRCh38/g27/gencode.v27.annotation.gtf.gz \
    --seqs data/external/appris/GRCh38/g27/appris_data.transl.fa.gz \
    --principals data/external/appris/GRCh38/g27/appris_data.principal.txt \
    --outdir data/external/label_fragments/GRCh38/g27
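The core redundancy test can be sketched as a substring check between translations of the same gene (a simplification of what label_fragments actually does; hypothetical sequences below):

```python
def is_fragment(candidate_seq, gene_seqs):
    """A translation is flagged as a fragment if it is a proper
    substring of another translation from the same gene (toy check)."""
    return any(candidate_seq in seq and candidate_seq != seq
               for seq in gene_seqs)

seqs = ["MKTAYIAKQR", "MKTAYIAKQRQISFVKSHFSRQ"]
print(is_fragment(seqs[0], seqs))  # True: contained in the longer one
print(is_fragment(seqs[1], seqs))  # False: the full-length translation
```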

2. Feature Engineering

2.1 Load Configuration

import os
import pandas as pd
from trifid.utils.utils import parse_yaml, create_dir
from trifid.data.feature_engineering import build_features, load_data

# Set paths
TRIFID_DIR = os.path.expanduser('~/trifid')
CONFIG_PATH = os.path.join(TRIFID_DIR, 'config/config.yaml')
FEATURES_PATH = os.path.join(TRIFID_DIR, 'config/features.yaml')

# Load configuration
config = parse_yaml(CONFIG_PATH)
df_features_config = pd.DataFrame(parse_yaml(FEATURES_PATH))

# Extract feature names
features = df_features_config[
    ~df_features_config['category'].str.contains('Identifier')
]['feature'].values

ids = df_features_config[
    df_features_config['category'].str.contains('Identifier')
]['feature'].values

print(f"Total features: {len(features)}")
print("Feature categories:")
print(df_features_config.groupby('category')['feature'].count())
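The split above relies on features.yaml being a list of records with at least `feature` and `category` keys; the identifier/feature separation can be sketched in plain Python (the feature names here are hypothetical, not the real TRIFID feature set):

```python
# Hypothetical records, mimicking parse_yaml(FEATURES_PATH) output
records = [
    {'feature': 'transcript_id', 'category': 'Identifier'},
    {'feature': 'gene_id',       'category': 'Identifier'},
    {'feature': 'qsplice_score', 'category': 'Expression'},
    {'feature': 'pfam_score',    'category': 'Structure'},
]

# Identifiers are carried through unchanged; everything else feeds the model
features = [r['feature'] for r in records if 'Identifier' not in r['category']]
ids = [r['feature'] for r in records if 'Identifier' in r['category']]

print(features)  # ['qsplice_score', 'pfam_score']
print(ids)       # ['transcript_id', 'gene_id']
```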

2.2 Build Feature Dataset

# Create output directory
create_dir(os.path.join(TRIFID_DIR, 'data', 'genomes', 'GRCh38', 'g27'))

# Load and process data
df_g27 = load_data(config, assembly='GRCh38', release='g27')
df_g27 = build_features(df_g27)

# Save processed features
output_path = os.path.join(
    TRIFID_DIR, 'data', 'genomes', 'GRCh38', 'g27', 'trifid_db.tsv.gz'
)
df_g27[df_features_config.feature.values].drop('sequence', axis=1).to_csv(
    output_path, index=None, sep='\t', compression='gzip'
)

print(f"Feature dataset saved: {output_path}")
print(f"Shape: {df_g27.shape}")

3. Model Training

3.1 Prepare Training Set

from trifid.utils.utils import generate_training_set, balanced_training_set

# Load training labels (proteomics evidence)
df_training_set_initial = pd.read_csv(
    os.path.join(TRIFID_DIR, 'data', 'model', 'training_set_initial.g27.tsv.gz'),
    sep='\t'
)

# Create labels
df_training_set = df_training_set_initial.copy()
df_training_set.loc[
    df_training_set['state'].str.contains('F'), 'label'
] = 1
df_training_set.loc[
    df_training_set['state'].str.contains('U'), 'label'
] = 0

# Filter to labeled examples
df_training_set = df_training_set.loc[
    ~df_training_set['label'].isnull()
]

# Keep entries from the curated dataset versions, drop bookkeeping columns
df_training_set = df_training_set.loc[
    df_training_set['added'].str.contains('v1|r|v3')
].drop(['added', 'state', 'comment'], axis=1).reset_index(drop=True)

print(f"Training set size: {df_training_set.shape}")
print("Class balance (%):")
print(df_training_set['label'].value_counts(normalize=True) * 100)

3.2 Train Random Forest Model

from trifid.models.select import Classifier
from sklearn.ensemble import RandomForestClassifier
import pickle

# Define model
model = RandomForestClassifier(
    min_samples_leaf=6,
    n_estimators=400,
    n_jobs=-1,
    random_state=123
)

# Create classifier
classifier = Classifier(
    model=model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col='label',
    random_state=123
)

# Save model
model_path = os.path.join(TRIFID_DIR, 'models', 'selected_model.pkl')
classifier.save_model(outdir=os.path.join(TRIFID_DIR, 'models'))
print(f"Model saved: {model_path}")

3.3 Evaluate Model Performance

# Get evaluation metrics
print("Model Parameters:")
print(classifier.model)

print("\nPerformance Metrics:")
print(classifier.evaluate)

print("\nConfusion Matrix:")
print(classifier.confusion_matrix)

print("\nClassification Report:")
print(classifier.classification_report)

print("\nCross-Validation Scores:")
print(classifier.cross_validate)
Expected Output:
  • Accuracy: ~0.85-0.90
  • Precision: ~0.83-0.88
  • Recall: ~0.85-0.90
  • F1-Score: ~0.84-0.89
  • AUC-ROC: ~0.92-0.95
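As a sanity check outside the Classifier wrapper, the same estimator can be cross-validated directly with scikit-learn. The data below is synthetic and purely illustrative; swap in `df_training_set[features]` and the `label` column to reproduce the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the TRIFID training matrix
X, y = make_classification(n_samples=500, n_features=10, random_state=123)

clf = RandomForestClassifier(min_samples_leaf=6, n_estimators=100,
                             n_jobs=-1, random_state=123)

# 5-fold cross-validated AUC-ROC
auc = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print(f"Mean AUC-ROC: {auc.mean():.3f} (+/- {auc.std():.3f})")
```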

4. Making Predictions

4.1 Predict GENCODE 27 Isoforms

from trifid.utils.utils import generate_trifid_metrics

# Load feature dataset
df_g27 = pd.read_csv(
    os.path.join(TRIFID_DIR, 'data', 'genomes', 'GRCh38', 'g27', 'trifid_db.tsv.gz'),
    sep='\t',
    compression='gzip'
)

# Separate features and identifiers
df_g27_features = df_g27[features]
df_g27_predictions = df_g27[ids]

# Load trained model
with open(os.path.join(TRIFID_DIR, 'models', 'selected_model.pkl'), 'rb') as f:
    model = pickle.load(f)

# Generate predictions
df_g27_predictions = generate_trifid_metrics(
    df_g27_predictions, 
    df_g27_features, 
    model
)

# Select output columns
df_g27_predictions = df_g27_predictions[[
    'gene_id', 'gene_name', 'transcript_id', 'translation_id', 
    'flags', 'ccdsid', 'appris', 'ann_type', 'length', 
    'trifid_score', 'norm_trifid_score'
]]

# Save predictions
output_path = os.path.join(
    TRIFID_DIR, 'data', 'genomes', 'GRCh38', 'g27', 'trifid_predictions.tsv.gz'
)
df_g27_predictions.to_csv(
    output_path, index=None, sep='\t', compression='gzip'
)

print(f"Predictions saved: {output_path}")
print(f"Total isoforms predicted: {len(df_g27_predictions)}")

4.2 Genome-Wide Statistics

from trifid.utils.utils import Statistics

# Calculate statistics
stats = Statistics(df_g27_predictions)
print("\nGenome-wide TRIFID Statistics (cutoff=0.5):")
print(stats.get_stats())
Example Output:
              Functional  Non functional  Total  Percentage of functional
PRINCIPAL         18543            1689  20232                     91.65
ALTERNATIVE       10256           45187  55443                     18.50
Total             28799           46876  75675                     38.05
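The same breakdown can be recomputed from any predictions frame with plain pandas (toy data below; the column names follow the TRIFID output shown earlier, and the exact `Statistics` implementation may differ):

```python
import pandas as pd

# Toy predictions frame mirroring the TRIFID output columns
df = pd.DataFrame({
    'appris': ['PRINCIPAL:1', 'PRINCIPAL:2', 'ALTERNATIVE:1', 'ALTERNATIVE:2'],
    'trifid_score': [0.92, 0.41, 0.73, 0.12],
})

# Collapse APPRIS labels to PRINCIPAL/ALTERNATIVE and apply the 0.5 cutoff
df['group'] = df['appris'].str.split(':').str[0]
df['functional'] = df['trifid_score'] >= 0.5

table = df.groupby('group')['functional'].agg(['sum', 'count'])
table['pct_functional'] = 100 * table['sum'] / table['count']
print(table)
```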

5. Analyzing Specific Genes

5.1 Query Gene of Interest

def analyze_gene(gene_name, predictions_df):
    """
    Analyze TRIFID predictions for a specific gene.
    """
    gene_data = predictions_df[
        predictions_df['gene_name'] == gene_name
    ].sort_values('trifid_score', ascending=False)
    
    print(f"\n=== {gene_name} Analysis ===")
    print(f"Total isoforms: {len(gene_data)}")
    print(f"Functional (score >= 0.5): {(gene_data['trifid_score'] >= 0.5).sum()}")
    print(f"\nTop scoring isoform:")
    print(f"  Transcript: {gene_data.iloc[0]['transcript_id']}")
    print(f"  Score: {gene_data.iloc[0]['trifid_score']:.3f}")
    print(f"  APPRIS: {gene_data.iloc[0]['appris']}")
    
    return gene_data

# Example: Analyze FGFR1
fgfr1_results = analyze_gene('FGFR1', df_g27_predictions)
print("\nFull results:")
print(fgfr1_results[['transcript_id', 'appris', 'length', 'trifid_score']])

6. Model Interpretation with SHAP

6.1 Global Feature Importance

from trifid.models.interpret import TreeInterpretation

# Create interpretation object
interpretation = TreeInterpretation(
    model=model,
    df=df_training_set,
    features_col=df_training_set[features].columns,
    target_col='label',
    random_state=123,
    test_size=0.25
)

# Get SHAP values
shap_values = interpretation.shap
print("\nTop 10 Most Important Features:")
print(shap_values.head(10))

6.2 Local Explanation for Gene

# Explain all isoforms of a gene
explanation = interpretation.local_explanation(df_g27, 'FGFR1')
print("\nFGFR1 Isoform SHAP Values:")
print(explanation)

6.3 Local Explanation for Single Isoform

# Explain specific isoform
isoform_explanation = interpretation.local_explanation(
    df_g27, 
    sample='ENST00000356207'
)
print("\nTop features for ENST00000356207:")
print(isoform_explanation.head(10))

7. Visualization

7.1 Score Distribution

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")

# Create figure
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Overall score distribution
ax1 = axes[0]
ax1.hist(
    df_g27_predictions['trifid_score'], 
    bins=50, 
    color='steelblue', 
    edgecolor='black', 
    alpha=0.7
)
ax1.axvline(x=0.5, color='red', linestyle='--', label='Functional threshold')
ax1.set_xlabel('TRIFID Score')
ax1.set_ylabel('Frequency')
ax1.set_title('TRIFID Score Distribution (GENCODE 27)')
ax1.legend()

# Plot 2: Principal vs Alternative
ax2 = axes[1]
principal_scores = df_g27_predictions[
    df_g27_predictions['appris'].str.contains('PRINCIPAL')
]['trifid_score']
alternative_scores = df_g27_predictions[
    ~df_g27_predictions['appris'].str.contains('PRINCIPAL')
]['trifid_score']

ax2.hist(
    [principal_scores, alternative_scores],
    bins=50,
    label=['PRINCIPAL', 'ALTERNATIVE'],
    color=['green', 'orange'],
    alpha=0.6
)
ax2.axvline(x=0.5, color='red', linestyle='--')
ax2.set_xlabel('TRIFID Score')
ax2.set_ylabel('Frequency')
ax2.set_title('Score Distribution by APPRIS Label')
ax2.legend()

plt.tight_layout()
plt.savefig('trifid_score_distribution.png', dpi=300)
plt.show()

7.2 Feature Importance Plot

# Get feature importances
importances = pd.DataFrame({
    'feature': features,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 20 features
plt.figure(figsize=(10, 8))
plt.barh(
    importances.head(20)['feature'], 
    importances.head(20)['importance'],
    color='steelblue'
)
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Top 20 Most Important Features in TRIFID')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)
plt.show()

8. Exporting Results

8.1 Summary Statistics

def export_summary(predictions_df, output_file):
    """
    Export summary statistics to file.
    """
    summary = {
        'total_genes': predictions_df['gene_name'].nunique(),
        'total_isoforms': len(predictions_df),
        'principal_isoforms': (predictions_df['appris'].str.contains('PRINCIPAL')).sum(),
        'alternative_isoforms': (~predictions_df['appris'].str.contains('PRINCIPAL')).sum(),
        'functional_isoforms': (predictions_df['trifid_score'] >= 0.5).sum(),
        'non_functional_isoforms': (predictions_df['trifid_score'] < 0.5).sum(),
        'mean_score': predictions_df['trifid_score'].mean(),
        'median_score': predictions_df['trifid_score'].median()
    }
    
    summary_df = pd.DataFrame([summary]).T
    summary_df.columns = ['Value']
    summary_df.to_csv(output_file, sep='\t')
    print(f"Summary exported to {output_file}")
    return summary_df

summary = export_summary(
    df_g27_predictions, 
    'trifid_summary_gencode27.tsv'
)
print(summary)

8.2 High-Confidence Functional Alternatives

# Find high-scoring alternative isoforms
functional_alternatives = df_g27_predictions[
    (~df_g27_predictions['appris'].str.contains('PRINCIPAL')) &
    (df_g27_predictions['trifid_score'] >= 0.7)
].sort_values('trifid_score', ascending=False)

print(f"\nHigh-confidence functional alternatives: {len(functional_alternatives)}")
print("\nTop 10:")
print(functional_alternatives.head(10)[[
    'gene_name', 'transcript_id', 'trifid_score', 'appris'
]])

# Export
functional_alternatives.to_csv(
    'functional_alternative_isoforms.tsv', 
    sep='\t', 
    index=False
)

9. Next Steps

Apply to New Genome Versions

# Example: Predict GENCODE 42
df_g42 = load_data(config, assembly='GRCh38', release='g42')
df_g42 = build_features(df_g42)
df_g42_predictions = generate_trifid_metrics(
    df_g42[ids], 
    df_g42[features], 
    model
)

Apply to Other Species

# Example: Predict mouse GENCODE M25
df_gm25 = load_data(config, assembly='GRCm38', release='g25')
df_gm25 = build_features(df_gm25)
df_gm25_predictions = generate_trifid_metrics(
    df_gm25[ids], 
    df_gm25[features], 
    model
)

Troubleshooting

Common Issues

Missing features:
# Impute missing values; features_with_na is the subset of
# feature columns containing NaNs
df[features_with_na] = df[features_with_na].fillna(-1)
Memory issues with large genomes:
from trifid.utils.utils import reduce_mem_usage
df, na_list = reduce_mem_usage(df, verbose=True)
Model compatibility:
# Ensure feature order matches training
df_features = df[features]  # Use exact feature list from training

Conclusion

You’ve now completed the full TRIFID workflow! You can:
  • ✅ Prepare genome annotations and features
  • ✅ Train TRIFID models
  • ✅ Make predictions on any genome
  • ✅ Interpret results with SHAP
  • ✅ Analyze functional isoforms

Citation

If you use TRIFID in your research, please cite:
@article{pozo2021trifid,
  title={Assessing the functional relevance of splice isoforms},
  author={Pozo, Fernando and others},
  journal={NAR Genomics and Bioinformatics},
  volume={3},
  number={2},
  year={2021},
  publisher={Oxford University Press}
}
