Overview
The data preparation pipeline involves:
- Setting up configuration files
- Downloading and organizing genome annotations
- Running preprocessing modules (QSplice, Pfam effects)
- Building the feature database
Prerequisites
Before starting, ensure you have:
- GENCODE or Ensembl genome annotations (GTF/GFF3 format)
- APPRIS data files for your genome assembly
- Protein sequences in FASTA format
- RNA-seq data for splice junction quantification (optional)
- Pfam domain annotations (optional)
Configuration Files
TRIFID uses YAML configuration files to specify data sources and features.
Creating config.yaml
The main configuration file, config/config.yaml, defines the paths to all data sources.
Creating features.yaml
The second file, config/features.yaml, defines which features to include in the final dataset.
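Because the configuration is mostly a collection of file paths, it can be worth confirming that each one exists before running any preprocessing. The following is a minimal sketch, assuming the YAML has already been parsed into a nested dictionary. The function name and the example keys (`annotation`, `appris`) are hypothetical and do not reflect the actual TRIFID schema.

```python
from pathlib import Path


def find_missing_paths(config, prefix=""):
    """Return dotted keys of config entries whose file path does not exist.

    `config` is a (possibly nested) dict mapping names to path strings.
    Non-string values are ignored.
    """
    missing = []
    for key, value in config.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted key.
            missing.extend(find_missing_paths(value, prefix=f"{dotted}."))
        elif isinstance(value, str) and not Path(value).exists():
            missing.append(dotted)
    return missing
```

Calling `find_missing_paths({"annotation": "data/annotation.gtf.gz"})` returns an empty list when the file is present and `["annotation"]` otherwise.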
Running QSplice: Splice Junction Quantification
QSplice quantifies splice junction support from RNA-seq data, providing evidence for transcript usage.
Step 1: Prepare RNA-seq Data
If you have STAR alignment output:
Step 2: Run QSplice
Map splice junctions
The module maps RNA-seq splice junctions to annotated introns and calculates coverage scores.
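To make the mapping step concrete, here is a rough sketch of the idea, not QSplice's own code. It assumes junctions come from a STAR `SJ.out.tab` file, where columns 1 to 3 hold the chromosome and the 1-based intron start and end, and column 7 holds the number of uniquely mapping reads. The function names are hypothetical, and summarizing a transcript by its least-supported intron is an assumption made for illustration.

```python
import csv


def read_star_junctions(handle):
    """Parse STAR SJ.out.tab rows into {(chrom, start, end): unique_reads}."""
    junctions = {}
    for row in csv.reader(handle, delimiter="\t"):
        chrom, start, end = row[0], int(row[1]), int(row[2])
        junctions[(chrom, start, end)] = int(row[6])  # column 7: unique reads
    return junctions


def map_introns(introns, junctions):
    """Attach read support to each annotated intron (0 when unobserved).

    `introns` is a list of (transcript_id, chrom, start, end) tuples using
    1-based inclusive intron coordinates, matching STAR's convention.
    """
    return [
        (tid, chrom, start, end, junctions.get((chrom, start, end), 0))
        for tid, chrom, start, end in introns
    ]


def min_support_per_transcript(mapped):
    """Summarize each transcript by its least-supported intron (assumed score)."""
    scores = {}
    for tid, _chrom, _start, _end, reads in mapped:
        scores[tid] = min(reads, scores.get(tid, reads))
    return scores
```

A transcript with one intron that no read spans ends up with a score of 0 under this scheme, however well its other introns are covered.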
Output Files
QSplice generates:
- qsplice.emtab2836.tsv.gz: Final scores per transcript
- sj_maxp.emtab2836.mapped.tsv.gz: Junction-level scores
- Intermediate intron annotations
Running Pfam Effects: Domain Integrity Analysis
The Pfam effects module quantifies how alternative splicing impacts protein domain structure.
Step 1: Prepare Input Files
You need:
- APPRIS data file (appris_data.appris.txt)
- Protein sequences (appris_data.transl.fa.gz)
- SPADE domain annotations (appris_method.spade.gtf.gz)
Step 2: Run Pfam Effects
The --jobs parameter controls parallel processing. Set it based on the number of available CPU cores.
How It Works
Select reference transcripts
For each gene, the module selects the principal isoform as the reference based on:
- APPRIS PRINCIPAL annotation
- CCDS status
- Transcript support level (TSL)
- Sequence length
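One way to read the criteria above is as a strict priority order, where each later criterion only breaks ties left by the earlier ones. The sketch below illustrates that reading. It is an assumption about how the criteria combine, not the module's actual code, and the dictionary keys (`appris`, `ccds`, `tsl`, `length`) are hypothetical. It also assumes that a missing TSL ranks last and that longer sequences are preferred.

```python
def select_reference(transcripts):
    """Pick one reference transcript from a gene's list of transcript dicts."""

    def sort_key(t):
        return (
            not t["appris"].startswith("PRINCIPAL"),  # principal isoforms first
            not t["ccds"],                            # then CCDS members
            t["tsl"] if t["tsl"] is not None else 6,  # then lowest (best) TSL
            -t["length"],                             # then longest sequence
        )

    # min() returns the transcript whose key tuple sorts first.
    return min(transcripts, key=sort_key)
```

Encoding the criteria as a tuple keeps the priority order explicit and makes it easy to reorder or extend.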
Output Metrics
The module generates several scores in qpfam.tsv.gz:
- pfam_score: Overall domain integrity (0-1, higher = more intact)
- pfam_domains_impact_score: Proportion of domains affected
- perc_Damaged_State: Percentage of partially damaged domains
- perc_Lost_State: Percentage of completely lost domains
- Lost_residues_pfam: Number of residues lost in domains
Building the Complete Dataset
Once preprocessing is complete, generate the final TRIFID database.
Running make_dataset
What Happens
Load all data sources
The script loads:
- Genome annotation (trifid/data/loaders.py:210)
- APPRIS scores
- QSplice results
- Pfam effects
- PhyloCSF conservation scores
- Other optional features
Feature engineering
Applies transformations (trifid/data/feature_engineering.py:158):
- Group normalization per gene
- Delta scores (difference from reference isoform)
- Fragment correction
- One-hot encoding of categorical features
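The first two transformations can be illustrated with a short sketch. This is not the code in `trifid/data/feature_engineering.py`. It assumes that group normalization means dividing by the per-gene maximum of a non-negative feature, and that every gene has exactly one reference isoform. The row keys and the `norm_`/`delta_` prefixes are hypothetical.

```python
from collections import defaultdict


def normalize_per_gene(rows, feature):
    """Scale `feature` by the per-gene maximum so the top isoform scores 1.0."""
    max_by_gene = defaultdict(float)  # starts at 0.0; assumes values >= 0
    for row in rows:
        gene = row["gene_id"]
        max_by_gene[gene] = max(max_by_gene[gene], row[feature])
    for row in rows:
        top = max_by_gene[row["gene_id"]]
        row[f"norm_{feature}"] = row[feature] / top if top > 0 else 0.0
    return rows


def add_delta(rows, feature, reference_ids):
    """Add the difference between each isoform and its gene's reference isoform."""
    ref_value = {
        row["gene_id"]: row[feature]
        for row in rows
        if row["transcript_id"] in reference_ids
    }
    for row in rows:
        row[f"delta_{feature}"] = row[feature] - ref_value[row["gene_id"]]
    return rows
```

Under these assumptions the reference isoform always has a delta of 0, and a negative delta means the isoform scores lower than the reference.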
Output Format
The final database looks like:
Verify your database has:
- All expected transcripts from your annotation
- No missing values in required features
- Normalized scores between 0 and 1
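These three checks are easy to automate. The following is a minimal sketch that works on rows read with `csv.DictReader` from a tab-separated export of the database. The function name, the `transcript_id` column name, and the set of strings treated as missing are assumptions, so adjust them to your files.

```python
import math

MISSING = {"", "NA", "NaN", "nan"}  # assumed spellings of a missing value


def check_database(rows, expected_ids, required, score_columns):
    """Return a list of human-readable problems found in the feature table."""
    problems = []
    seen = set()
    for row in rows:
        tid = row["transcript_id"]
        seen.add(tid)
        for col in required:
            value = row.get(col)
            if value is None or str(value).strip() in MISSING:
                problems.append(f"{tid}: missing value in {col}")
        for col in score_columns:
            try:
                value = float(row[col])
            except (KeyError, TypeError, ValueError):
                continue  # absent or unparseable; left to the required check
            if not math.isnan(value) and not 0.0 <= value <= 1.0:
                problems.append(f"{tid}: {col}={value} outside [0, 1]")
    # Transcripts in the annotation that never appeared in the table.
    for tid in sorted(set(expected_ids) - seen):
        problems.append(f"{tid}: absent from database")
    return problems
```

An empty list means all three checks passed for the columns you listed.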
Data Directory Structure
Organize your data following this structure:
Troubleshooting
Missing Features
Problem: Some features are all NaN in the final database.
Solution: Check that:
- The data source file exists and is properly formatted
- Transcript IDs match between files (with/without version numbers)
- The config.yaml path is correct
Memory Errors
Problem: QSplice or Pfam effects runs out of memory.
Solution:
- Increase available RAM
- Process chromosomes separately
- Reduce the number of RNA-seq samples
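Processing chromosomes separately can be done without splitting files on disk, provided the input is sorted by chromosome. The generator below is a generic sketch of that idea for any tab-separated file whose first column is the chromosome. It is a hypothetical helper, not a TRIFID option.

```python
from itertools import groupby


def iter_by_chromosome(lines):
    """Yield (chromosome, rows) one chromosome at a time from sorted TSV lines.

    Only the rows of the current chromosome are held in memory.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby only merges consecutive rows, hence the sorted-input requirement.
    for chrom, rows in groupby(parsed, key=lambda fields: fields[0]):
        yield chrom, list(rows)
```

Passing an open file handle as `lines` streams the file, so peak memory is bounded by the largest chromosome rather than the whole dataset.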
Transcript ID Mismatches
Problem: Merge operations fail due to ID format differences.
Solution: TRIFID handles both formats:
- With version: ENST00000380152.7
- Without version: ENST00000380152
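If a merge between your own input files still comes back empty, normalizing the IDs on both sides before joining is a quick way to rule out version mismatches. This is a minimal sketch of the normalization idea, not TRIFID's own implementation, and `strip_version` is a hypothetical helper name.

```python
import re

# A trailing ".<digits>" is treated as the version suffix.
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(identifier):
    """Remove a trailing numeric version from an Ensembl-style identifier."""
    return _VERSION_SUFFIX.sub("", identifier)
```

Both `strip_version("ENST00000380152.7")` and `strip_version("ENST00000380152")` give `"ENST00000380152"`, so the two formats join cleanly once both columns are normalized.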
Next Steps
With your data prepared:
Train Models
Learn how to train custom TRIFID models
Make Predictions
Apply trained models to score isoforms