
Overview

TRIFID integrates data from multiple external databases to generate predictive features for assessing the functionality of splice isoforms. This page describes each data source and how to access it.

Primary Annotation Sources

GENCODE

Description: The reference human gene annotation database providing comprehensive gene and transcript annotations.
Usage in TRIFID: Source of protein-coding transcript annotations, exon structures, and genomic coordinates.
Access:
File Types Used:
  • GTF annotations: gencode.vXX.annotation.gtf.gz
  • GFF3 annotations: gencode.vXX.annotation.gff3.gz
TRIFID was initially trained on GENCODE Release 27 (GRCh38.p10) but now supports multiple releases.
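
As a rough illustration of how protein-coding transcript records could be pulled from a GENCODE GTF file, the sketch below parses the attribute column and filters on `transcript_type`. It is not TRIFID's own loader: the function names are hypothetical, and the only assumptions are the standard GTF layout (nine tab-separated columns) and the GENCODE attribute name `transcript_type` (Ensembl GTF files use `transcript_biotype` instead).

```python
import gzip
import re
from typing import Dict, Iterable, Iterator

# Matches the quoted `key "value"` pairs in column 9 of a GTF line.
# Unquoted numeric attributes (e.g. `level 2`) are deliberately ignored.
_ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn the GTF attribute column into a dict.

    Repeated keys (such as `tag`) keep only their last value.
    """
    return {key: value for key, value in _ATTR_RE.findall(field)}


def protein_coding_transcripts(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per protein-coding `transcript` feature."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        if attrs.get("transcript_type") != "protein_coding":
            continue
        yield {
            "chrom": cols[0],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "gene_id": attrs.get("gene_id"),
            "transcript_id": attrs.get("transcript_id"),
        }


def read_gtf(path: str) -> Iterator[dict]:
    """Stream records from a gzip-compressed GTF file."""
    with gzip.open(path, "rt") as handle:
        yield from protein_coding_transcripts(handle)
```

Calling `read_gtf` on a local copy of the annotation file streams the records without loading the whole file into memory.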

Ensembl

Description: Genome browser and annotation database for vertebrate and model organism genomes.
Usage in TRIFID: Primary annotation source for non-human species, including rat, zebrafish, and chicken.
Access:
  • Website: ensembl.org
  • FTP Server: Available through Ensembl FTP
Species Supported: Rat, zebrafish, chicken, chimpanzee, pig, cow, macaque, fruit fly, and worm.

RefSeq (NCBI)

Description: NCBI Reference Sequence database providing curated genomic, transcript, and protein sequences.
Usage in TRIFID: Alternative annotation source for the human genome (GRCh37 and GRCh38).
Access:

Protein Structural and Functional Data

APPRIS

Description: Database annotating principal and alternative splice isoforms using protein structural and functional information.
Usage in TRIFID: Provides multiple features, including:
  • Principal isoform annotations
  • Protein structural integrity scores
  • Functionally important residue annotations
  • Cross-species conservation evidence
  • ALT-Corsair evolutionary age scores
Access:
File Types Used:
  • appris_data.principal.txt: Principal isoform labels
  • appris_data.appris.txt: APPRIS method scores
  • appris_data.transl.fa.gz: Protein sequences
  • appris_method.spade.gtf.gz: Pfam domain annotations
Download Example (GENCODE 27):
mkdir -p data/external/appris/GRCh38/g27
curl http://apprisws.bioinfo.cnio.es/pub/current_release/datafiles/homo_sapiens/e90v35/appris_data.principal.txt \
  -o data/external/appris/GRCh38/g27/appris_data.principal.txt
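
The same download can be scripted for all four APPRIS files listed above. The sketch below is a minimal illustration, not part of TRIFID: the helper names are hypothetical, and it assumes that the other three files sit in the same remote directory as `appris_data.principal.txt`, which the page does not confirm.

```python
from pathlib import Path
from typing import List, Sequence, Tuple
from urllib.request import urlretrieve

# Base URL taken from the curl example; whether the other files share
# this directory is an assumption.
APPRIS_BASE = (
    "http://apprisws.bioinfo.cnio.es/pub/current_release/"
    "datafiles/homo_sapiens/e90v35"
)
APPRIS_FILES = [
    "appris_data.principal.txt",
    "appris_data.appris.txt",
    "appris_data.transl.fa.gz",
    "appris_method.spade.gtf.gz",
]


def build_targets(
    base_url: str, dest_dir: str, filenames: Sequence[str]
) -> List[Tuple[str, Path]]:
    """Pair each remote URL with its local destination path."""
    dest = Path(dest_dir)
    return [(f"{base_url.rstrip('/')}/{name}", dest / name) for name in filenames]


def download_all(targets: Sequence[Tuple[str, Path]], overwrite: bool = False) -> None:
    """Fetch every target, skipping files that already exist locally."""
    for url, path in targets:
        path.parent.mkdir(parents=True, exist_ok=True)
        if path.exists() and not overwrite:
            continue
        urlretrieve(url, str(path))


# Example call (performs network access, so it is left commented out):
# download_all(build_targets(APPRIS_BASE, "data/external/appris/GRCh38/g27", APPRIS_FILES))
```

Keeping the URL construction separate from the download step makes it easy to inspect the planned transfers before any network request is made.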

Pfam

Description: Database of protein families and domains.
Usage in TRIFID: Used to quantify the impact of alternative splicing on protein domains. TRIFID’s Pfam effects module calculates whether domains are damaged, lost, or intact.
Access:
Pre-computed Scores:
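
To make the intact/damaged/lost distinction concrete, the sketch below classifies a domain by the fraction of its residues that an alternative isoform retains. This is an illustration only, not the Pfam effects module: the `Domain` class, the function names, and the thresholds (fully retained means intact, nothing retained means lost) are assumptions, and the page does not state TRIFID's actual criteria.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Domain:
    """A Pfam domain on the reference isoform (1-based, inclusive)."""
    name: str
    start: int
    end: int


def retained_fraction(domain: Domain, retained: Sequence[Tuple[int, int]]) -> float:
    """Fraction of domain residues covered by retained regions.

    `retained` lists non-overlapping (start, end) intervals of the
    reference isoform that are preserved in the alternative isoform.
    """
    covered = 0
    for region_start, region_end in retained:
        overlap = min(domain.end, region_end) - max(domain.start, region_start) + 1
        covered += max(0, overlap)
    return covered / (domain.end - domain.start + 1)


def classify(
    domain: Domain,
    retained: Sequence[Tuple[int, int]],
    intact_min: float = 1.0,  # hypothetical threshold
    lost_max: float = 0.0,    # hypothetical threshold
) -> str:
    """Label a domain as 'intact', 'damaged', or 'lost'."""
    fraction = retained_fraction(domain, retained)
    if fraction >= intact_min:
        return "intact"
    if fraction <= lost_max:
        return "lost"
    return "damaged"
```

For example, a domain spanning residues 10 to 19 is labelled damaged when the alternative isoform retains only residues 1 to 14 of the reference, since half of the domain is missing.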

Evolutionary Conservation Data

PhyloCSF

Description: Tool for measuring evolutionary conservation of coding sequences using codon substitution frequencies.
Usage in TRIFID: Complementary measure of evolutionary conservation across species.
Access:

ALT-Corsair

Description: APPRIS module that quantifies the evolutionary age of orthologs by identifying the last common ancestor of the most distant ortholog.
Usage in TRIFID: Reports evolutionary age scores representing conservation across species.
Access:

Expression Data

E-MTAB-2836 (RNA-seq)

Description: Large-scale RNA-seq expression dataset from 122 human individuals across 32 tissues.
Usage in TRIFID: Source data for the QSplice module, which quantifies splice junction coverage.
Access:
Pre-computed Scores:
Processing RNA-seq data requires the APPRIS RNA-seq pipeline.

Training Data

Proteomics Evidence (Kim et al., 2014)

Description: Large-scale proteomics study providing experimental evidence for protein isoforms.
Usage in TRIFID: Gold standard training set for the machine learning model.
Citation: Kim et al., 2014, Nature
Pre-computed Training Set:

Pre-computed TRIFID Module Outputs

For convenience, pre-computed outputs from TRIFID modules are available:
| Module | Description | Data Link |
| --- | --- | --- |
| QSplice | Splice junction coverage scores | Download |
| Pfam Effects | Domain impact scores | Download |
| Fragment Labelling | Redundancy labels | Download |
| PhyloCSF | Conservation scores | Download |
| ALT-Corsair | Evolutionary age scores | Download |

Centralized Data Repository

All source data files needed to reproduce the TRIFID analysis are available in a centralized shared folder: Google Drive - TRIFID Source Data

Feature Documentation

Detailed descriptions of all 47 predictive features are available in the Feature Documentation PDF.
