Overview
TRIFID integrates data from multiple external databases to generate predictive features for assessing the functionality of splice isoforms. This page describes each data source and how to access it.
Primary Annotation Sources
GENCODE
Description: The reference human gene annotation database providing comprehensive gene and transcript annotations.
Usage in TRIFID: Source of protein-coding transcript annotations, exon structures, and genomic coordinates.
Access:
- Website: gencodegenes.org
- FTP Server: ftp.ebi.ac.uk/pub/databases/gencode
- Statistics: GENCODE annotation statistics
- GTF annotations: gencode.vXX.annotation.gtf.gz
- GFF3 annotations: gencode.vXX.annotation.gff3.gz
TRIFID was initially trained on GENCODE Release 27 (GRCh38.p10) but now supports multiple releases.
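As an illustration of how the GTF annotations can supply protein-coding transcripts and their genomic coordinates, the following is a minimal sketch and not TRIFID's own loader. The function names `protein_coding_transcripts` and `read_gencode_gtf` are hypothetical. The sketch assumes the standard nine-column GTF layout and the `transcript_type` attribute used in GENCODE GTF files (Ensembl GTF files name this attribute `transcript_biotype`).

```python
import gzip
import re
from typing import Dict, Iterable, Iterator, Union

# Matches quoted GTF attributes such as: transcript_type "protein_coding";
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)"')

Record = Dict[str, Union[str, int]]


def protein_coding_transcripts(lines: Iterable[str]) -> Iterator[Record]:
    """Yield coordinates and IDs for protein-coding transcript rows of a GTF."""
    for line in lines:
        if line.startswith("#"):  # header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue
        # Repeated keys (e.g. `tag`) keep only their last value here.
        attrs = dict(ATTRIBUTE_RE.findall(cols[8]))
        if attrs.get("transcript_type") != "protein_coding":
            continue
        yield {
            "chrom": cols[0],
            "start": int(cols[3]),  # GTF coordinates are 1-based, inclusive
            "end": int(cols[4]),
            "strand": cols[6],
            "gene_id": attrs.get("gene_id", ""),
            "transcript_id": attrs.get("transcript_id", ""),
        }


def read_gencode_gtf(path: str) -> Iterator[Record]:
    """Stream records from a gzipped GTF, e.g. gencode.v27.annotation.gtf.gz."""
    with gzip.open(path, "rt") as handle:
        yield from protein_coding_transcripts(handle)
```

Calling `read_gencode_gtf` on a downloaded annotation file streams one record per protein-coding transcript without loading the whole file into memory.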
Ensembl
Description: Genome browser and annotation database for vertebrate and model organism genomes.
Usage in TRIFID: Primary annotation source for non-human species including rat, zebrafish, chicken, and other vertebrates.
Access:
- Website: ensembl.org
- FTP Server: Available through Ensembl FTP
RefSeq (NCBI)
Description: NCBI Reference Sequence database providing curated genomic, transcript, and protein sequences.
Usage in TRIFID: Alternative annotation source for the human genome (GRCh37 and GRCh38).
Access:
- Website: ncbi.nlm.nih.gov/refseq
Protein Structural and Functional Data
APPRIS
Description: Database annotating principal and alternative splice isoforms using protein structural and functional information.
Usage in TRIFID: Provides multiple features including:
- Principal isoform annotations
- Protein structural integrity scores
- Functionally important residue annotations
- Cross-species conservation evidence
- ALT-Corsair evolutionary age scores
Access:
- Website: appris.bioinfo.cnio.es
- HTTP Server: appris.bioinfo.cnio.es
- Methods: APPRIS methods documentation
- appris_data.principal.txt: Principal isoform labels
- appris_data.appris.txt: APPRIS method scores
- appris_data.transl.fa.gz: Protein sequences
- appris_method.spade.gtf.gz: Pfam domain annotations
Pfam
Description: Database of protein families and domains.
Usage in TRIFID: Used to quantify the impact of alternative splicing on protein domains. TRIFID’s Pfam effects module calculates whether domains are damaged, lost, or intact.
Access:
- Website: pfam.xfam.org
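The three domain states can be illustrated with a small sketch. This is not the Pfam effects module itself: the function `classify_domain`, its inputs, and the default threshold are assumptions chosen for illustration. It labels a domain by the fraction of its residues that an alternative isoform retains.

```python
def classify_domain(
    domain_length: int,
    residues_retained: int,
    intact_threshold: float = 1.0,  # assumed cutoff, not TRIFID's value
) -> str:
    """Label a Pfam domain as 'intact', 'damaged', or 'lost' in an isoform."""
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    if not 0 <= residues_retained <= domain_length:
        raise ValueError("residues_retained must lie in [0, domain_length]")

    fraction = residues_retained / domain_length
    if fraction == 0:
        return "lost"      # no residue of the domain survives
    if fraction >= intact_threshold:
        return "intact"    # the domain is retained at or above the cutoff
    return "damaged"       # only part of the domain survives


print(classify_domain(120, 120))  # intact
print(classify_domain(120, 45))   # damaged
print(classify_domain(120, 0))    # lost
```

Lowering `intact_threshold` would tolerate small truncations at the domain edges; which cutoff is appropriate is a design choice that this sketch leaves open.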
Evolutionary Conservation Data
PhyloCSF
Description: Tool for measuring evolutionary conservation of coding sequences using codon substitution frequencies.
Usage in TRIFID: Complementary measure of evolutionary conservation across species.
Access:
- GitHub: github.com/mlin/PhyloCSF
- Pre-computed scores: Available for multiple genome versions
ALT-Corsair
Description: APPRIS module that quantifies the evolutionary age of orthologs by identifying the last common ancestor of the most distant ortholog.
Usage in TRIFID: Reports evolutionary age scores representing conservation across species.
Access:
- Available through APPRIS webserver
- Pre-computed scores: ALT-Corsair dataset
- Based on: Corsair method
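The idea of dating an isoform by its most distant ortholog can be sketched as follows. This is an illustration only, not the ALT-Corsair scoring scheme: the function `most_distant_ortholog` is hypothetical, and the divergence values are supplied by the caller rather than taken from any APPRIS resource.

```python
from typing import Iterable, Mapping, Optional, Tuple


def most_distant_ortholog(
    ortholog_species: Iterable[str],
    divergence: Mapping[str, float],
) -> Optional[Tuple[str, float]]:
    """Return the species with the largest divergence from the reference.

    `ortholog_species` lists the species in which an ortholog of the isoform
    was found. `divergence` maps a species to its divergence from the
    reference species, in any consistent unit chosen by the caller. Species
    missing from `divergence` are ignored.
    """
    candidates = [
        (species, divergence[species])
        for species in ortholog_species
        if species in divergence
    ]
    if not candidates:
        return None  # no ortholog with a known divergence
    return max(candidates, key=lambda item: item[1])
```

The divergence of the returned species stands in for the age of the last common ancestor shared with the reference, so an isoform conserved in more distant species receives an older age.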
Expression Data
E-MTAB-2836 (RNA-seq)
Description: Large-scale RNA-seq expression dataset from 122 human individuals across 32 tissues.
Usage in TRIFID: Source data for the QSplice module, which quantifies splice junction coverage.
Access:
- ArrayExpress: E-MTAB-2836
Processing RNA-seq data requires the APPRIS RNA-seq pipeline.
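To make "splice junction coverage" concrete, here is a minimal sketch under stated assumptions; it is neither QSplice nor the APPRIS RNA-seq pipeline. The functions `transcript_junctions` and `min_junction_coverage` are hypothetical, junction read counts are assumed to be available as a dictionary keyed by intron coordinates, and summarising a transcript by its least-covered junction is an illustrative choice.

```python
from typing import List, Mapping, Optional, Sequence, Tuple

# (chromosome, intron start, intron end), 1-based inclusive coordinates
Junction = Tuple[str, int, int]


def transcript_junctions(
    chrom: str, exons: Sequence[Tuple[int, int]]
) -> List[Junction]:
    """Derive the splice junctions (introns) between consecutive exons."""
    ordered = sorted(exons)
    return [
        (chrom, upstream_end + 1, downstream_start - 1)
        for (_, upstream_end), (downstream_start, _) in zip(ordered, ordered[1:])
    ]


def min_junction_coverage(
    junctions: Sequence[Junction], read_counts: Mapping[Junction, int]
) -> Optional[int]:
    """Return the read count of the least-covered junction.

    Junctions absent from `read_counts` count as zero reads. Single-exon
    transcripts have no junctions, so the result is None.
    """
    if not junctions:
        return None
    return min(read_counts.get(junction, 0) for junction in junctions)


exons = [(100, 200), (300, 400), (500, 600)]
junctions = transcript_junctions("chr1", exons)
counts = {("chr1", 201, 299): 12, ("chr1", 401, 499): 3}
print(min_junction_coverage(junctions, counts))  # 3
```

Taking the minimum means a transcript is only as well supported as its weakest junction, which penalises isoforms whose distinguishing junction has little or no read support.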
Training Data
Proteomics Evidence (Kim et al., 2014)
Description: Large-scale proteomics study providing experimental evidence for protein isoforms.
Usage in TRIFID: Gold standard training set for the machine learning model.
Citation: Kim et al., 2014, Nature
Pre-computed Training Set:
- TRIFID training set (GENCODE 27)