Feature Categories
TRIFID integrates 45+ features spanning multiple biological dimensions to predict isoform functionality. These features are organized into six main categories:
- Annotation: 12 features from transcript metadata
- Evolution: 8 conservation and evolutionary scores
- Structure: 14 protein structure and domain features
- Splicing: 6 alternative splicing impact features
- Expression: 2 RNA-seq junction support features
- Derived: Multiple normalized and delta scores
All features are extracted from the unified TRIFID database (`trifid_db.tsv.gz`), which integrates data from APPRIS, PhyloCSF, Pfam, RNA-seq, and genome annotations.
Annotation Features
Metadata and annotation quality indicators from GENCODE/Ensembl.
basic
- Description: Simplified, high-quality subset of GENCODE annotations
- Type: Binary (0/1)
- Source: GENCODE Basic tag
- Interpretation: 1 = transcript in the basic gene annotation set
- Species Support: GRCh38, GRCm38
CCDS
- Description: Consensus CDS protein set membership
- Type: Binary (0/1)
- Source: NCBI CCDS database
- Interpretation: 1 = transcript has a CCDS identifier (high confidence)
- Species Support: Cross-species (GRCh38, GRCm38, Rnor_6.0)
- Code: `trifid/data/feature_engineering.py:102-109`
What is CCDS?
The Consensus Coding Sequence (CCDS) project identifies protein-coding regions that are annotated consistently across multiple genome annotation sources. CCDS transcripts are therefore treated as high-confidence, experimentally supported annotations.
Transcript Support Level (tsl_1 through tsl_6)
- Description: Confidence in transcript structure based on RNA-seq evidence
- Type: One-hot encoded (6 binary features)
- Categories:
- TSL 1: All splice junctions supported by non-suspect mRNA
- TSL 2: Best supporting mRNA is suspect OR support from multiple ESTs
- TSL 3: Only support from single EST
- TSL 4: Best supporting EST is suspect
- TSL 5: No mRNA support for structure
- TSL 6: Not analyzed for support
- Source: Ensembl transcript quality tags
- Species Support: GRCh38, GRCm38
- Code: `trifid/data/feature_engineering.py:50-73`
Annotation Level (level_1, level_2, level_3)
- Description: Annotation confidence based on evidence type
- Type: One-hot encoded (3 binary features)
- Categories:
- Level 1: Verified loci (highest confidence)
- Level 2: Manual annotation (Havana/Ensembl)
- Level 3: Automated annotation
- Source: GENCODE annotation level
- Species Support: GRCh38, GRCm38
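The one-hot layout described for TSL and annotation level can be sketched as follows. This is an illustrative sketch, not the code in the TRIFID repository: the `one_hot` helper and the `tsl` and `level` input fields are hypothetical names.

```python
def one_hot(value, categories, prefix):
    """Return one binary indicator column per category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}


# Hypothetical transcript records; the field names are assumptions.
transcripts = [
    {"transcript_id": "tx1", "tsl": 1, "level": 2},
    {"transcript_id": "tx2", "tsl": 5, "level": 3},
]

encoded = []
for tx in transcripts:
    row = {"transcript_id": tx["transcript_id"]}
    row.update(one_hot(tx["tsl"], range(1, 7), "tsl"))      # tsl_1 .. tsl_6
    row.update(one_hot(tx["level"], range(1, 4), "level"))  # level_1 .. level_3
    encoded.append(row)
```

Each transcript ends up with exactly one `tsl_*` column and one `level_*` column set to 1.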
nonsense_mediated_decay
- Description: Transcript likely targeted for NMD degradation
- Type: Binary (0/1)
- Source: Ensembl biotype annotation
- Interpretation: 1 = contains premature stop codon, likely degraded
- Species Support: GRCh38, GRCm38, Rnor_6.0, GRCz11
StartEnd_NF
- Description: Start or end codon not found
- Type: Binary (0/1)
- Source: GENCODE tag
- Interpretation: 1 = 5’ or 3’ end incomplete
- Species Support: GRCh38
length
- Description: Protein length in amino acids
- Type: Continuous (integer)
- Source: Protein FASTA sequences
- Range: Typically 10-5,000+ residues
- Species Support: Cross-species
length_delta_score
- Description: Length similarity to principal/longest isoform
- Type: Continuous (0-1)
- Formula: `1 - abs((ref_length - length) / ref_length)`
- Interpretation: 1.0 = same length as reference, 0.0 = maximally different
- Code: `trifid/utils/utils.py:127-170`
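As a worked example of the length delta formula, here is a minimal sketch in plain Python. The function name mirrors the feature name, but the implementation is an assumption for illustration and is not taken from `trifid/utils/utils.py`.

```python
def length_delta_score(length, ref_length):
    """Compute 1 - |ref_length - length| / ref_length."""
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")
    return 1 - abs((ref_length - length) / ref_length)


same = length_delta_score(500, 500)  # 1.0: identical to the reference
half = length_delta_score(250, 500)  # 0.5: half the reference length
```

Note that the raw formula would drop below 0 for an isoform more than twice the reference length; how TRIFID treats that case is not stated here.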
Evolution Features
Conservation and evolutionary constraint signals.
CORSAIR (corsair)
- Description: Cross-species conservation score
- Type: Continuous (0-4)
- Method: Number of vertebrate species with full-length orthologous isoform
- Source: APPRIS database
- Interpretation: Higher = more conserved across vertebrates
- Species Support: Cross-species
norm_corsair
- Description: Gene-normalized CORSAIR score
- Type: Continuous (0-1)
- Formula: `(corsair - min_gene) / (max(4, max_gene) - min_gene)`
- Interpretation: 1.0 = highest conservation within gene
- Code: `trifid/data/feature_engineering.py:35-47`
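A minimal sketch of the gene-level normalization formula above, in plain Python. The `gene_id` field, the `normalize_corsair` helper, and the choice to return 0.0 when the denominator is zero are assumptions made for illustration; the repository code may differ.

```python
from collections import defaultdict


def normalize_corsair(rows, floor_max=4):
    """Apply (corsair - min_gene) / (max(floor_max, max_gene) - min_gene) per gene."""
    by_gene = defaultdict(list)
    for row in rows:
        by_gene[row["gene_id"]].append(row["corsair"])

    normalized = []
    for row in rows:
        scores = by_gene[row["gene_id"]]
        lo = min(scores)
        hi = max(floor_max, max(scores))
        denom = hi - lo
        # Assumption: a zero denominator maps to 0.0 rather than raising.
        norm = (row["corsair"] - lo) / denom if denom else 0.0
        normalized.append({**row, "norm_corsair": norm})
    return normalized


# Hypothetical isoforms from two genes.
rows = [
    {"gene_id": "A", "corsair": 0},
    {"gene_id": "A", "corsair": 2},
    {"gene_id": "A", "corsair": 4},
    {"gene_id": "B", "corsair": 1},
    {"gene_id": "B", "corsair": 3},
]
result = normalize_corsair(rows)
```

Because of the `max(4, max_gene)` term, a gene whose best isoform scores below 4 does not reach 1.0 under this formula: gene B tops out at about 0.67 here.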
Alt-CORSAIR (corsair_alt)
- Description: Oldest species with conserved isoform
- Type: Continuous (0-0.25)
- Scale: Bilateria (oldest) to recent vertebrates
- Source: APPRIS database
- Species Support: Cross-species
norm_corsair_alt
- Description: Gene-normalized Alt-CORSAIR
- Type: Continuous (0-1)
- Code: `trifid/data/feature_engineering.py:76-92`
PhyloCSF Features
PhyloCSF measures evolutionary signatures of protein-coding regions using comparative genomics.
ScorePerCodon
- Description: PhyloCSF score normalized per codon
- Type: Continuous (can be negative)
- Method: Log-likelihood ratio of coding vs. non-coding evolutionary model
- Interpretation: Higher = stronger coding signature
- Species Support: GRCh38, GRCm38
norm_ScorePerCodon
- Description: Gene-normalized ScorePerCodon
- Type: Continuous (0-1)
- Handling: Missing values imputed at 3rd percentile
- Code: `trifid/data/feature_engineering.py:94-114`
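The percentile imputation mentioned above could look like the following sketch. It uses `statistics.quantiles` from the Python standard library with the inclusive method. Which percentile definition TRIFID uses, and whether imputation happens before or after gene normalization, are not stated here, so treat both as assumptions. `impute_percentile` is a hypothetical helper.

```python
import statistics


def impute_percentile(values, percentile=3):
    """Replace None entries with the given percentile of the observed values."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cut_points = statistics.quantiles(observed, n=100, method="inclusive")
    fill = cut_points[percentile - 1]
    return [fill if v is None else v for v in values]


# Toy scores 0.0 .. 100.0 plus one transcript with no PhyloCSF score.
scores = [float(i) for i in range(101)] + [None]
filled = impute_percentile(scores)
```

Imputing at a low percentile places transcripts without a score near the bottom of the observed distribution instead of dropping them.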
PhyloCSF_Psi
- Description: Length-adjusted PhyloCSF score
- Type: Continuous
- Method: Adjusts for coding sequence length
- Species Support: GRCh38, GRCm38
norm_PhyloCSF_Psi
- Description: Gene-normalized PhyloCSF_Psi
- Type: Continuous (0-1)
RelBranchLength
- Description: Minimum relative branch length in phylogenetic alignment
- Type: Continuous (0-1)
- Method: Lowest exon-based score showing species coverage
- Interpretation: Higher = better alignment quality
- Species Support: GRCh38, GRCm38
norm_RelBranchLength
- Description: Gene-normalized RelBranchLength
- Type: Continuous (0-1)
Structure Features
Protein structure predictions from APPRIS methods.
firestar
- Description: Number of functional residues detected
- Type: Continuous (integer)
- Method: Identifies ligand-binding and catalytic residues
- Source: APPRIS firestar module
- Species Support: Cross-species
norm_firestar
- Description: Gene-normalized firestar score
- Type: Continuous (0-1)
- Code: `trifid/data/feature_engineering.py:35-47`
matador3d
- Description: Exon-structure mapping score
- Type: Continuous
- Method: Number of exons mapping to known 3D structures
- Source: APPRIS Matador3D module
- Species Support: Cross-species
norm_matador3d
- Description: Gene-normalized Matador3D score
- Type: Continuous (0-1)
SPADE (spade)
- Description: Pfam domain integrity score
- Type: Continuous
- Method: Sum of Pfam alignment bitscores
- Source: APPRIS SPADE module
- Interpretation: Higher = more complete domains
- Species Support: Cross-species
norm_spade
- Description: Gene-normalized SPADE score
- Type: Continuous (0-1)
spade_loss
- Description: Domain loss relative to principal isoform
- Type: Continuous (capped at 50)
- Formula: `max_gene_spade - spade`
- Code: `trifid/data/feature_engineering.py:130`
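The `spade_loss` formula can be sketched as below. The sketch assumes the cap of 50 is an upper bound applied after the subtraction; that ordering, and the `spade_loss` helper itself, are assumptions for illustration.

```python
def spade_loss(spade_scores, cap=50.0):
    """Loss of each isoform relative to the best SPADE score in the gene."""
    best = max(spade_scores)
    # Assumption: the cap is applied to the difference, not to the raw scores.
    return [min(best - score, cap) for score in spade_scores]


# Hypothetical SPADE scores for three isoforms of one gene.
losses = spade_loss([120.0, 95.5, 30.0])  # [0.0, 24.5, 50.0]
```

The isoform with the highest SPADE score in the gene always has a loss of 0.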
norm_spade_loss
- Description: Gene-normalized SPADE loss
- Type: Continuous (0-1)
- Formula: `1 - norm_spade_loss` (inverted)
- Code: `trifid/data/feature_engineering.py:132-133`
CRASH Scores (crash_p, crash_m)
- Description: Signal peptide reliability scores
- crash_p: Secretory signal sequence
- crash_m: Mitochondrial signal sequence
- Type: Continuous
- Source: APPRIS CRASH module
- Species Support: Cross-species
norm_crash_p, norm_crash_m
- Description: Gene-normalized CRASH scores
- Type: Continuous (0-1)
THUMP (thump)
- Description: Number of transmembrane helices
- Type: Integer (0-10+)
- Source: APPRIS THUMP module
- Species Support: Cross-species
norm_thump
- Description: Gene-normalized THUMP score
- Type: Continuous (0-1)
Splicing Features
Pfam domain impact from alternative splicing (QPfam analysis).
pfam_score
- Description: Percentage of residues in intact Pfam domains
- Type: Continuous (0-100)
- Source: QPfam module
- Method: Custom Pfam domain analysis pipeline
- Species Support: Cross-species
pfam_domains_impact_score
- Description: Percentage of intact Pfam domains after splicing
- Type: Continuous (0-100)
- Source: QPfam module
perc_Damaged_State
- Description: Percentage of Pfam domains damaged by splicing
- Type: Continuous (0-100)
- Method: Domains partially lost or truncated
- Source: QPfam module
perc_Lost_State
- Description: Percentage of Pfam domains completely lost
- Type: Continuous (0-100)
- Source: QPfam module
norm_Lost_residues_pfam
- Description: Normalized count of lost Pfam residues
- Type: Continuous (0-1)
- Special handling: Values < 10 residues set to 0
- Code: `trifid/data/feature_engineering.py:128`
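The special handling for small losses amounts to a simple threshold. The sketch below assumes the threshold is applied to the raw residue count before any normalization; that ordering and the `suppress_small_losses` helper are assumptions.

```python
def suppress_small_losses(lost_residues, min_residues=10):
    """Set lost-residue counts below min_residues to 0."""
    return [0 if n < min_residues else n for n in lost_residues]


# Hypothetical lost-residue counts for five isoforms.
counts = [0, 4, 9, 10, 57]
cleaned = suppress_small_losses(counts)  # [0, 0, 0, 10, 57]
```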
norm_Gain_residues_pfam
- Description: Normalized count of gained Pfam residues
- Type: Continuous (0-1)
- Source: QPfam module
What is QPfam?
QPfam is a custom module that analyzes how alternative splicing affects Pfam protein domains. It identifies:
- Intact domains: Fully preserved
- Damaged domains: Partially truncated
- Lost domains: Completely removed
- Gained domains: New domains added
Expression Features
RNA-seq splice junction support (QSplice analysis).
RNA2sj_cds
- Description: Splice junction read support ratio
- Type: Continuous (0-1+)
- Method: Min junction reads / average CDS junction reads
- Source: QSplice module analyzing GTEx RNA-seq data
- Interpretation: Higher = better junction support
- Species Support: GRCh38 only
- Code: `trifid/data/feature_engineering.py:137-155`
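One reading of the method line above is: take the least-supported CDS junction of a transcript and divide its read count by the mean read count over the gene's CDS junctions. The sketch below follows that reading. The `rna2sj_cds` helper, its inputs, and the choice to return 0.0 when there are no junctions or no reads are assumptions, not the QSplice implementation.

```python
from statistics import mean


def rna2sj_cds(transcript_junction_reads, gene_cds_junction_reads):
    """Minimum junction reads of the transcript over the gene's mean CDS junction reads."""
    if not transcript_junction_reads or not gene_cds_junction_reads:
        return 0.0  # assumption: no junctions means no junction support
    average = mean(gene_cds_junction_reads)
    if average == 0:
        return 0.0  # assumption: avoid dividing by zero when nothing is expressed
    return min(transcript_junction_reads) / average


# Hypothetical read counts: the weakest junction has 10 reads, the gene mean is 25.
score = rna2sj_cds([40, 10, 30], [40, 10, 30, 20])  # 0.4
```

Because the numerator is a minimum over one transcript and the denominator is a gene-wide mean, the ratio can exceed 1, which matches the `0-1+` range given for this feature.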
norm_RNA2sj_cds
- Description: Gene-normalized RNA2sj_cds
- Type: Continuous (0-1)
Feature Engineering
TRIFID applies several transformations to raw features:
- Group Normalization
- Delta Scoring
- Fragment Correction
- One-Hot Encoding: Categorical features (TSL, Level) are one-hot encoded.
Missing Data Handling
Missing features are handled differently depending on the assembly type (see `trifid/models/predict.py:32-136`):
- GRCh38 Ensembl: Most complete, minimal imputation
- GRCh37: Fill missing columns with -1
- RefSeq annotations: CCDS computed from ccdsid, missing features = -1
- Mouse/Rat/Other species: RNA-seq features = -1, CCDS handled separately
- Non-mammalian species: More extensive imputation with -1 or 0
Imputation Strategy
- -1: Feature unavailable for this assembly (not missing data)
- 0: Biological absence (e.g., no CCDS identifier, no junction support)
- Percentile: PhyloCSF features imputed at 3rd percentile
- Gene-level: Fragment correction uses within-gene imputation
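The distinction between -1 (feature unavailable) and 0 (biological absence) can be illustrated with a short sketch. The `fill_unavailable` helper and the example row are hypothetical; the actual per-assembly logic lives in the TRIFID prediction code and may differ.

```python
def fill_unavailable(rows, expected_features, sentinel=-1):
    """Add any feature missing from a row, using the sentinel value."""
    filled = []
    for row in rows:
        complete = dict(row)  # copy so the input rows stay untouched
        for feature in expected_features:
            complete.setdefault(feature, sentinel)
        filled.append(complete)
    return filled


# Hypothetical mouse transcript: no RNA-seq junction features for this assembly.
mouse_rows = [{"transcript_id": "tx1", "CCDS": 0, "norm_corsair": 0.8}]
expected = ["CCDS", "norm_corsair", "RNA2sj_cds", "norm_RNA2sj_cds"]
result = fill_unavailable(mouse_rows, expected)
```

Here `CCDS` stays 0 (measured and absent) while the RNA-seq columns become -1 (not measured for this assembly), so the two cases remain distinguishable downstream.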
Feature Importance
Top features contributing to TRIFID predictions (from SHAP analysis):
- norm_corsair: Cross-species conservation
- length_delta_score: Length similarity to principal
- norm_ScorePerCodon: PhyloCSF coding strength
- norm_spade: Pfam domain integrity
- CCDS: Consensus coding sequence membership
- tsl_1: Highest transcript support level
- norm_firestar: Functional residues
- pfam_score: Domain coverage
- norm_RNA2sj_cds: Junction support
- perc_Lost_State: Domain loss percentage
See the Interpretability page for detailed SHAP analysis of feature contributions.
Feature Summary Table
| Category | Count | Key Features |
|---|---|---|
| Annotation | 12 | CCDS, TSL, Level, basic, length |
| Evolution | 8 | CORSAIR, PhyloCSF (3), RelBranchLength |
| Structure | 14 | SPADE, firestar, Matador3D, CRASH, THUMP |
| Splicing | 6 | Pfam domain impacts, residue gains/losses |
| Expression | 2 | RNA-seq junction support |
| Derived | 20+ | Normalized versions of above |
| TOTAL | 45+ | |
Next Steps
- Interpretability: Learn how SHAP explains which features drive predictions
- Feature Configuration: Customize feature selection for your analysis
- Data Preparation: Prepare your own feature data
- Model Overview: Back to pipeline overview