Overview
The label_fragments module labels genome isoforms based on their structural characteristics, identifying duplications, fragments, readthrough transcripts (RT), and nonsense-mediated decay (NMD) candidates. It uses GTF annotations, protein sequences, and APPRIS principal isoform data to classify transcripts.
Usage
Command-Line Arguments
Path to GTF annotation file (GENCODE or Ensembl). Can be gzip compressed.
Protein sequences file in FASTA format from APPRIS/GENCODE (gzip compressed allowed).
Path to APPRIS principal isoforms file listing reference transcripts per gene.
Output directory where results and intermediate files will be stored.
Optional path to existing TRIFID scores file (from previous version). Used to sort APPRIS annotations before labeling.
If set (--rm), removes intermediate files after processing.
Core Functions
generate_annotations
- protein_coding
- nonsense_mediated_decay
- non_stop_decay
- polymorphic_pseudogene
- IG (immunoglobulin)
- TR (T-cell receptor)
- readthrough - Readthrough transcripts
- NF - Not found (Start NF or End NF)
- transcript_support_level - TSL annotation
Path to GTF annotation file
pd.DataFrame - GTF annotations with columns:
- gene_name - Gene symbol
- gene_id - Gene identifier
- transcript_id - Transcript identifier
- gene_type - Gene type (filtered to protein_coding)
- transcript_type - Transcript type/biotype
- readthrough - Readthrough tag if present
- NF - Not found tag (Start NF or End NF)
- transcript_support_level - TSL value
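To illustrate where these columns come from, here is a minimal standard-library sketch that pulls them out of a single GTF transcript line. It is not the module's implementation (which relies on gtfparse). The helper names parse_gtf_attributes and annotation_row are hypothetical, and matching NF tags by the "_NF" suffix is an assumption.

```python
import re

# Matches quoted GTF attribute pairs such as: gene_id "ENSG00000186092.6";
_ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_attributes(attr_field):
    """Return (attributes, tags) parsed from the ninth column of a GTF line.

    The `tag` key can occur several times per record, so tags go in a list.
    """
    attrs, tags = {}, []
    for key, value in _ATTR_RE.findall(attr_field):
        if key == "tag":
            tags.append(value)
        else:
            attrs[key] = value
    return attrs, tags


def annotation_row(gtf_line):
    """Build one annotation record from a GTF transcript line (hypothetical helper)."""
    fields = gtf_line.rstrip("\n").split("\t")
    attrs, tags = parse_gtf_attributes(fields[8])
    # Assumption: any tag ending in "_NF" marks an incomplete start or end.
    nf_tags = [tag for tag in tags if tag.endswith("_NF")]
    return {
        "gene_name": attrs.get("gene_name"),
        "gene_id": attrs.get("gene_id"),
        "transcript_id": attrs.get("transcript_id"),
        "gene_type": attrs.get("gene_type"),
        "transcript_type": attrs.get("transcript_type"),
        "readthrough": "readthrough_transcript" if "readthrough_transcript" in tags else None,
        "NF": ",".join(nf_tags) or None,
        "transcript_support_level": attrs.get("transcript_support_level"),
    }
```

Filtering to protein-coding genes would then amount to keeping the rows whose gene_type equals protein_coding.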
generate_sequences
Path to protein FASTA file
pd.DataFrame - Protein sequences with columns:
- gene_id - Gene identifier
- transcript_id - Transcript identifier
- sequence - Protein sequence
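As a rough sketch of how a protein FASTA file can be turned into these three columns, the example below reads FASTA records and splits each header on "|". The header layout (protein ID, then transcript ID, then gene ID) is an assumption modelled on GENCODE translation files; the layout actually expected by generate_sequences may differ, and both helper names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequence_rows(lines):
    """Map pipe-delimited headers to gene_id / transcript_id / sequence records.

    Assumed header layout: protein_id|transcript_id|gene_id|...
    """
    rows = []
    for header, sequence in read_fasta(lines):
        parts = header.split("|")
        rows.append(
            {"gene_id": parts[2], "transcript_id": parts[1], "sequence": sequence}
        )
    return rows
```

For a gzip-compressed input, the same functions can be fed from a handle opened with gzip.open(path, "rt").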
get_seqlen
Directory containing intermediate annotation files
str - Command line string to execute Perl script
get_NR_list
Directory containing processed annotation files
str - Command line string to execute Perl script
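Both get_seqlen and get_NR_list return a command string rather than running anything themselves. The sketch below shows one way such a string could be assembled and executed. The script name and arguments in the usage note are placeholders, not the module's actual Perl scripts, and build_perl_command and run_command are hypothetical helpers.

```python
import shlex
import subprocess


def build_perl_command(script, *args):
    """Return a shell-safe command string of the form 'perl <script> <args...>'."""
    quoted = [shlex.quote(str(part)) for part in (script, *args)]
    return " ".join(["perl", *quoted])


def run_command(command):
    """Execute the command string through the shell; raise if it exits non-zero."""
    subprocess.run(command, shell=True, check=True)
```

For example, build_perl_command("some_script.pl", "/path/to/out dir") quotes the second argument so that the space in the directory name survives the shell.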
Output Files
The module generates several output files in the specified output directory:
gencode.qduplications.tsv.gz
Main output file containing duplication and fragment labels for all transcripts. This file is used in TRIFID scoring to filter or adjust scores for fragmented/duplicated isoforms. Expected columns:
- gene_id - Gene identifier
- transcript_id - Transcript identifier
- sequence - Protein sequence
- length - Sequence length
- label - Classification label (Fragment, Duplication, RT, NMD, etc.)
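A quick way to inspect this file is to tally the labels it contains. The sketch below uses only the standard library and assumes the file starts with a header row naming the columns listed above; if the real file has no header, the column names would need to be passed to the reader explicitly. count_labels is a hypothetical helper.

```python
import csv
import gzip
from collections import Counter


def count_labels(path):
    """Count transcripts per label in a gzip-compressed, tab-separated file.

    Assumes the first row is a header that includes a 'label' column.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["label"] for row in reader)
```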
appris.pc_sequences.tsv
Protein-coding sequences extracted from FASTA (removed if --rm is set).
gencode.pc_annotations.tsv
Protein-coding annotations from GTF (removed if --rm is set).
gencode.pc_annotations.out.tsv
Annotations with sequence lengths added (removed if --rm is set).
appris.principals.tsv
Filtered APPRIS principal isoforms (removed if --rm is set).
Transcript Classification Labels
The module identifies several types of potentially problematic isoforms:
Fragment
Transcripts that are significantly shorter than other isoforms of the same gene, potentially representing incomplete or truncated proteins.
Duplication
Transcripts with sequences that duplicate or closely match sequences from other transcripts, possibly representing annotation artifacts.
Readthrough (RT)
Transcripts that span multiple genes due to readthrough transcription, tagged in GTF as readthrough_transcript.
NMD (Nonsense-Mediated Decay)
Transcripts annotated as nonsense_mediated_decay in the GTF, predicted to be degraded by the NMD pathway.
NF (Not Found)
Transcripts with incomplete 5’ or 3’ ends, tagged as start_NF or end_NF in GTF.
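To make the Fragment and Duplication labels more concrete, here is a deliberately simplified sketch. It is not the algorithm used by the module's Perl scripts: it substitutes exact string comparison for their sequence-similarity logic, calling a transcript a Duplication when an earlier isoform of the same gene has the identical sequence, and a Fragment when its sequence is a proper substring of another isoform's sequence. The function name is hypothetical.

```python
from collections import defaultdict


def label_gene_isoforms(rows):
    """Label isoforms within each gene by exact sequence comparison.

    rows: iterable of dicts with gene_id, transcript_id and sequence keys.
    Returns {transcript_id: label}; transcripts without a label are omitted.
    """
    by_gene = defaultdict(list)
    for row in rows:
        by_gene[row["gene_id"]].append(row)

    labels = {}
    for isoforms in by_gene.values():
        seen = {}  # sequence -> first transcript (in input order) carrying it
        for row in isoforms:
            sequence, transcript_id = row["sequence"], row["transcript_id"]
            if sequence in seen:
                labels[transcript_id] = "Duplication"
            else:
                seen[sequence] = transcript_id
        # A unique sequence fully contained in a longer one is called a fragment.
        for sequence, transcript_id in seen.items():
            if any(sequence != other and sequence in other for other in seen):
                labels[transcript_id] = "Fragment"
    return labels
```

In this sketch the first transcript in input order keeps its sequence unlabeled and later identical copies are marked, so the ordering of the input determines which copy is treated as the reference.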
Workflow
The module follows this processing pipeline:
- Generate sequences - Extract protein sequences from FASTA
- Generate annotations - Extract relevant annotations from GTF
- Filter principals - Extract APPRIS principal isoforms
- Calculate sequence lengths - Run Perl script to add length information
- Sort by TRIFID (optional) - If existing TRIFID scores provided, use them for sorting
- Identify duplications - Run Perl script to label fragments and duplications
- Compress output - Gzip final results file
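The final compression step can be pictured with a few lines of standard-library Python. This is only a sketch under the assumption that the results are first written as a plain TSV and then gzipped; gzip_file is a hypothetical helper, and the module may well call an external tool instead.

```python
import gzip
import os
import shutil


def gzip_file(src, remove_source=False):
    """Compress src to src + '.gz' and return the path of the compressed file."""
    dst = src + ".gz"
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    if remove_source:
        os.remove(src)
    return dst
```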
Dependencies
- Perl - Required for running sequence length and NR list scripts
- gtfparse - Python library for parsing GTF files
- APPRIS data - Principal isoform annotations
Integration with TRIFID
The duplication labels from this module are used in TRIFID to:
- Filter out fragmented isoforms from scoring
- Adjust scores for duplicated sequences
- Identify potentially unreliable isoforms
- Prioritize high-quality, complete protein-coding transcripts
Notes
- The module keeps only protein-coding genes
- Transcript identifiers are normalized (version numbers removed for GENCODE)
- Readthrough transcripts and NF-tagged transcripts are flagged during annotation
- The Perl scripts perform the actual duplication detection using sequence similarity
- If TRIFID scores are provided, they are used to prioritize transcripts before labeling
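The identifier normalization mentioned in the notes can be sketched as follows. This assumes that "version number" means a trailing ".<digits>" suffix, as in GENCODE identifiers; strip_version is a hypothetical name, and the module's own handling of special cases may differ.

```python
def strip_version(identifier):
    """Drop a trailing '.<digits>' version suffix from an identifier.

    Identifiers without a purely numeric suffix after the last dot
    (for example ones ending in '_PAR_Y') are returned unchanged.
    """
    base, sep, version = identifier.rpartition(".")
    return base if sep and version.isdigit() else identifier
```

For example, strip_version("ENST00000641515.2") gives "ENST00000641515", while an identifier with no version is returned as is.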