Overview
Fragment Labeling is a TRIFID preprocessing module that identifies redundant, incomplete, and duplicated protein sequences in genome annotations. This module is essential for cleaning up the isoform dataset before machine learning model training. By tagging transcripts as fragments, duplications, or complete alternatives, this module ensures that TRIFID scores are not artificially inflated or deflated by annotation artifacts.
Why Fragment Labeling Matters
Genome annotations often contain:
- Incomplete sequences - CDS start/end not found (NF = “Not Found”)
- Fragments - Truncated versions of longer isoforms from the same gene
- Duplications - Identical protein sequences shared across multiple genes
- Readthrough transcripts - Fusions of adjacent genes
These artifacts can:
- Confound functional predictions
- Create biased training datasets
- Misrepresent evolutionary conservation
- Complicate interpretation of isoform importance
Command-Line Usage
Parameters
| Parameter | Flag | Description |
|---|---|---|
| `--gtf` | `-g` | GENCODE/Ensembl GTF annotation file (gzipped) |
| `--seqs` | `-s` | Protein sequences in FASTA format (gzipped) |
| `--principals` | `-p` | APPRIS principal isoforms file |
| `--outdir` | `-o` | Output directory for results |
| `--trifid` | `-t` | Optional: TRIFID predictions to sort APPRIS by score |
| `--rm` | `-r` | Remove intermediate files after completion |
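The parameter table maps onto a standard `argparse` parser. The following is a minimal sketch, not the module's actual parser: the flag names come from the table, while the `build_parser` helper, the choice of which flags are required, and the treatment of `--rm` as a boolean switch are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed in the table."""
    parser = argparse.ArgumentParser(description="Label fragments and duplications")
    parser.add_argument("-g", "--gtf", required=True, help="GENCODE/Ensembl GTF (gzipped)")
    parser.add_argument("-s", "--seqs", required=True, help="Protein FASTA (gzipped)")
    parser.add_argument("-p", "--principals", required=True, help="APPRIS principal isoforms file")
    parser.add_argument("-o", "--outdir", required=True, help="Output directory for results")
    # Assumption: --trifid is optional and defaults to "not provided".
    parser.add_argument("-t", "--trifid", default=None, help="Optional TRIFID predictions")
    # Assumption: --rm is a switch that takes no value.
    parser.add_argument("-r", "--rm", action="store_true", help="Remove intermediate files")
    return parser


args = build_parser().parse_args(
    [
        "-g", "gencode.v27.annotation.gtf.gz",
        "-s", "appris_data.transl.fa.gz",
        "-p", "appris_data.principal.txt",
        "-o", "out",
        "--rm",
    ]
)
print(args.gtf, args.trifid, args.rm)
```

The file names passed here are the ones listed under Input Files; the output directory `out` is a placeholder.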
Input Files
Required Inputs
- GTF annotation (`gencode.v27.annotation.gtf.gz`)
  - Transcript-level annotations with tags:
    - `readthrough_transcript`
    - `cds_start_NF` / `cds_end_NF`
  - Transcript types (protein_coding, NMD, etc.)
- Protein sequences (`appris_data.transl.fa.gz`)
  - FASTA format with gene and transcript IDs
  - Used for sequence comparison and length calculation
- APPRIS principals (`appris_data.principal.txt`)
  - List of principal isoforms (one per gene)
  - Used to prioritize reference sequences
Optional Input
- TRIFID predictions (with `--trifid`)
  - Allows sorting isoforms by predicted functional score
  - Improves reference selection when APPRIS is ambiguous
Output Files
Main Output: gencode.qduplications.tsv.gz
One row per transcript with redundancy labels.
Expected columns:
| Column | Description |
|---|---|
| `gene_id` | Gene identifier |
| `transcript_id` | Transcript identifier |
| `sequence` | Protein sequence (amino acids) |
| `length` | Protein length in amino acids |
| `ann_label` | Annotation-derived label |
| `duplication_label` | Final redundancy classification |
| `appris` | APPRIS principal/alternative status |
Duplication Labels
The module assigns one of these labels:
| Label | Description |
|---|---|
| `Principal` | APPRIS principal isoform (one per gene) |
| `Alternative` | Complete alternative isoform |
| `Redundant Principal` | Principal that is incomplete (CDS start/end NF) or a fragment of a longer isoform |
| `Redundant Alternative` | Alternative that is incomplete (CDS start/end NF) or a fragment of a longer isoform |
| `Principal Duplication` | Principal with sequence duplicated across genes |
| `Alternative Duplication` | Alternative with sequence duplicated across genes |
Annotation Labels
Extracted from GTF tags:
| ann_label | Meaning |
|---|---|
| `readthrough` | Readthrough transcript (gene fusion) |
| `Start NF` | CDS start not found |
| `End NF` | CDS end not found |
| `protein_coding` | Standard protein-coding transcript |
| `nonsense_mediated_decay` | NMD target |
| `IG_*` / `TR_*` | Immunoglobulin/T-cell receptor genes |
Intermediate Files (removed with --rm)
- `appris.pc_sequences.tsv` - Protein sequences formatted for Perl processing
- `gencode.pc_annotations.tsv` - Filtered protein-coding annotations
- `gencode.pc_annotations.out.tsv` - Post-Perl processing annotations
- `appris.principals.tsv` - Filtered principal isoforms
How Fragment Labeling Works
Step 1: Load and Filter Annotations
From the GTF file:
- Extract transcript-level features
- Filter for protein-coding gene types
- Identify transcripts with:
  - `readthrough_transcript` tag
  - `cds_start_NF` or `cds_end_NF` tags
  - Transcript support levels (TSL)
  - Transcript types (protein_coding, NMD, non_stop_decay, etc.)
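As an illustration of how the GTF tags could be turned into an annotation label, here is a minimal sketch that works on the attribute column of a single transcript line. It uses only the standard library instead of `gtfparse`, and the `annotation_label` helper and the precedence order (readthrough, then start NF, then end NF, then transcript type) are assumptions rather than the module's documented behavior.

```python
import re

# Matches GTF attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def annotation_label(attributes: str) -> str:
    """Derive an ann_label from one GTF attribute string (9th column)."""
    tags = set()
    fields = {}
    for key, value in ATTR_RE.findall(attributes):
        if key == "tag":
            tags.add(value)  # a transcript can carry several tags
        else:
            fields[key] = value
    # Assumed precedence: tags override the plain transcript type.
    if "readthrough_transcript" in tags:
        return "readthrough"
    if "cds_start_NF" in tags:
        return "Start NF"
    if "cds_end_NF" in tags:
        return "End NF"
    return fields.get("transcript_type", "unknown")


complete = 'gene_id "G1"; transcript_id "T1"; transcript_type "protein_coding"; tag "basic";'
truncated = 'gene_id "G1"; transcript_id "T2"; transcript_type "protein_coding"; tag "cds_end_NF";'
print(annotation_label(complete), "|", annotation_label(truncated))
```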
Step 2: Load Protein Sequences
From the FASTA file:
- Parse gene_id, transcript_id, and sequence
- Calculate protein length
- Remove version suffixes if needed (e.g., ENST00000456328.2 → ENST00000456328)
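The parsing steps above can be sketched with the standard library alone. The header layout assumed here (transcript ID first, gene ID second, separated by `|`) is an assumption about the FASTA file, and `read_fasta` and `strip_version` are hypothetical helper names, not functions from the TRIFID source.

```python
import io
import re
from typing import Iterator, List, TextIO, Tuple

VERSION_RE = re.compile(r"\.\d+$")


def strip_version(identifier: str) -> str:
    """ENST00000456328.2 -> ENST00000456328; unversioned IDs pass through."""
    return VERSION_RE.sub("", identifier)


def _record(header: str, chunks: List[str]) -> Tuple[str, str, str, int]:
    # Assumed header layout: transcript_id|gene_id|...
    transcript_id, gene_id = header.split("|")[:2]
    sequence = "".join(chunks)
    return strip_version(gene_id), strip_version(transcript_id), sequence, len(sequence)


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str, int]]:
    """Yield (gene_id, transcript_id, sequence, length) for each record."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


demo = io.StringIO(">ENST00000456328.2|ENSG00000223972.5\nMKT\nAYI\n")
records = list(read_fasta(demo))
print(records)
```

For a gzipped file, the same function can be fed a text handle opened with `gzip.open(path, "rt")`.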
Step 3: Identify APPRIS Principals
Filter the principals file to retain only isoforms labeled `PRINCIPAL`.
Step 4: Calculate Sequence Lengths (Perl)
Call `get_seqlen.pl` to:
- Compute amino acid lengths for all transcripts
- Identify incomplete sequences (NF tags)
Step 5: Identify Non-Redundant (NR) Set (Perl)
Call `get_NR_list.pl` to:
- Compare all isoforms within each gene
- Detect fragments (shorter isoforms that are substrings of longer ones)
- Detect duplications (identical sequences across genes)
- Assign duplication labels based on sequence identity
Step 6: Sort by TRIFID Score (Optional)
If `--trifid` is provided:
- Merge with TRIFID predictions
- Sort isoforms by `trifid_score` within each gene
- Use this ordering to prioritize reference selection
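The per-gene ordering can be sketched as follows. This is an illustration only: the `sort_by_trifid` helper, the descending sort direction, and the rule that isoforms without a score sort last are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sort_by_trifid(
    isoforms: Iterable[Tuple[str, str]], scores: Dict[str, float]
) -> Dict[str, List[str]]:
    """Group (gene_id, transcript_id) pairs by gene and order each group
    by descending trifid_score. Transcripts without a score sort last."""
    by_gene = defaultdict(list)
    for gene_id, transcript_id in isoforms:
        by_gene[gene_id].append(transcript_id)
    return {
        gene_id: sorted(
            transcript_ids,
            key=lambda tid: scores.get(tid, float("-inf")),
            reverse=True,
        )
        for gene_id, transcript_ids in by_gene.items()
    }


isoforms = [("G1", "T1"), ("G1", "T2"), ("G1", "T3"), ("G2", "T4")]
scores = {"T1": 0.2, "T2": 0.9, "T4": 0.5}  # T3 has no prediction
ordering = sort_by_trifid(isoforms, scores)
print(ordering)
```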
Fragmentation Detection Algorithm
The Perl script `get_NR_list.pl` implements these rules:
Within-Gene Comparison
For each gene:
- Sort isoforms by length (descending)
- Compare each shorter isoform to longer ones
- If sequence is a perfect substring, label as fragment:
  - If APPRIS principal → `Redundant Principal`
  - If alternative → `Redundant Alternative`
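The within-gene rule can be illustrated in Python. This is a sketch of the rule as described, not a port of `get_NR_list.pl`: the `label_within_gene` helper and the example sequences are made up, and the choice to compare only against strictly longer isoforms (so equal-length identical sequences are not flagged here) is an assumption.

```python
from typing import Dict, List, Tuple


def label_within_gene(isoforms: List[Tuple[str, str, bool]]) -> Dict[str, str]:
    """isoforms: (transcript_id, sequence, is_principal) for one gene.
    Returns {transcript_id: label}."""
    # Longest first, so every candidate is compared only to earlier entries.
    ordered = sorted(isoforms, key=lambda iso: len(iso[1]), reverse=True)
    labels = {}
    for index, (transcript_id, sequence, is_principal) in enumerate(ordered):
        role = "Principal" if is_principal else "Alternative"
        longer = [
            other for _, other, _ in ordered[:index] if len(other) > len(sequence)
        ]
        # A fragment is a perfect substring of some longer isoform.
        if any(sequence in other for other in longer):
            labels[transcript_id] = f"Redundant {role}"
        else:
            labels[transcript_id] = role
    return labels


gene = [
    ("ENST001", "MKTAYIAKQRQISFVK", True),
    ("ENST002", "MKTAYIAKQRGGSFVKLL", False),
    ("ENST003", "AYIAKQR", False),
]
labels = label_within_gene(gene)
print(labels)
```

In this toy gene, only `ENST003` is contained in a longer isoform, so it is the only one labeled `Redundant Alternative`.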
Cross-Gene Comparison
Across all genes:
- Compare all sequences pairwise
- If identical sequences found:
  - Label as `Principal Duplication` or `Alternative Duplication`
  - Retain only one copy (preferring higher APPRIS/TRIFID score)
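The cross-gene rule can be sketched in Python as well. Note the swap in technique: instead of the pairwise comparison described above, this sketch groups isoforms by sequence in a dictionary, which finds the same exact duplicates in roughly linear time. The `label_cross_gene` helper is hypothetical, and it assumes the input is already ordered so that the preferred copy of each sequence comes first.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def label_cross_gene(isoforms: List[Tuple[str, str, str, str]]) -> Dict[str, str]:
    """isoforms: (gene_id, transcript_id, sequence, role), where role is
    "Principal" or "Alternative" and preferred copies are listed first.
    Returns {transcript_id: label}."""
    groups = defaultdict(list)
    for gene_id, transcript_id, sequence, role in isoforms:
        groups[sequence].append((gene_id, transcript_id, role))
    labels = {}
    for members in groups.values():
        kept_gene = members[0][0]  # gene of the preferred copy
        for gene_id, transcript_id, role in members:
            if gene_id == kept_gene:
                labels[transcript_id] = role
            else:
                labels[transcript_id] = f"{role} Duplication"
    return labels


shared = "MKTAYIAKQR"
dup_labels = label_cross_gene(
    [
        ("GeneA", "ENST100", shared, "Principal"),
        ("GeneB", "ENST200", shared, "Principal"),
        ("GeneB", "ENST201", "MSSPLLW", "Alternative"),
    ]
)
print(dup_labels)
```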
Incomplete Sequence Handling
Isoforms with NF tags:
- Automatically labeled as `Redundant [Principal|Alternative]`
- Excluded from training set (but retained in predictions)
Example Use Cases
Case 1: Fragment Detection
Gene EXMP1 has three isoforms:
- ENST001 → `Principal`
- ENST002 → `Alternative`
- ENST003 → `Redundant Alternative` (fragment of ENST001/002)
Case 2: Duplication Detection
Gene A and Gene B share identical sequences:
- ENST100 → `Principal` (higher APPRIS score)
- ENST200 → `Principal Duplication` (flagged as redundant)
Case 3: Incomplete Sequences
Gene EXMP2 has NF-tagged isoforms:
- ENST301 → `Redundant Principal` (incomplete CDS)
- ENST302 → `Alternative` (complete)
Integration with TRIFID
Fragment labels are used for:
Training Set Filtering
Remove from training:
- All `Redundant *` isoforms
- All `* Duplication` isoforms
Keep for training:
- `Principal`
- `Alternative`
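A minimal sketch of this filter, assuming each transcript is a dictionary with the `duplication_label` column from the main output file. The `training_subset` helper and the example rows are hypothetical.

```python
from typing import Dict, List

# Labels that survive the filter; every "Redundant *" and "* Duplication"
# label is dropped because it is not in this set.
KEEP_FOR_TRAINING = {"Principal", "Alternative"}


def training_subset(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only rows whose duplication_label marks a complete, unique isoform."""
    return [row for row in rows if row["duplication_label"] in KEEP_FOR_TRAINING]


rows = [
    {"transcript_id": "ENST001", "duplication_label": "Principal"},
    {"transcript_id": "ENST003", "duplication_label": "Redundant Alternative"},
    {"transcript_id": "ENST200", "duplication_label": "Principal Duplication"},
    {"transcript_id": "ENST002", "duplication_label": "Alternative"},
]
kept = [row["transcript_id"] for row in training_subset(rows)]
print(kept)
```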
Score Correction
Adjust predictions for incomplete sequences:
- Fragments receive penalty in final scoring
- NF-tagged isoforms marked for manual review
Interpretation
- Users can filter predictions to exclude redundant isoforms
- Annotations can indicate which isoforms are likely artifacts
Pre-computed Data
For GENCODE 27:
- `gencode.qduplications.tsv.gz` - Fragment labels
Technical Notes
Performance
- Processing ~200,000 transcripts: ~10-30 minutes
- Memory usage: ~1-2 GB
- Bottleneck: Cross-gene sequence comparison (O(n²) for duplicates)
GTF Parsing
Uses the `gtfparse` library:
- Automatically extracts transcript-level features
- Handles GENCODE/Ensembl GTF format differences
- Supports gzipped inputs
Transcript Type Filtering
Included transcript types:
- `protein_coding`
- `nonsense_mediated_decay`
- `non_stop_decay`
- `polymorphic_pseudogene`
- `IG_*` (immunoglobulin genes)
- `TR_*` (T-cell receptor genes)
Excluded transcript types:
- Pseudogenes
- Processed transcripts
- Retained introns
- lncRNAs
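The inclusion rule amounts to a set lookup plus a prefix match. The sketch below reads `IG_*` and `TR_*` literally as prefix wildcards; whether the real filter treats particular `IG_`/`TR_` subtypes differently is not stated here, so that reading is an assumption, as is the `is_included` helper name.

```python
INCLUDED_TYPES = {
    "protein_coding",
    "nonsense_mediated_decay",
    "non_stop_decay",
    "polymorphic_pseudogene",
}
# Assumption: IG_* and TR_* are plain prefix wildcards.
INCLUDED_PREFIXES = ("IG_", "TR_")


def is_included(transcript_type: str) -> bool:
    """True if a transcript type passes the filter described above."""
    return transcript_type in INCLUDED_TYPES or transcript_type.startswith(
        INCLUDED_PREFIXES
    )


for name in ("protein_coding", "IG_V_gene", "retained_intron", "lncRNA"):
    print(name, is_included(name))
```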
Version Number Handling
Ensembl/GENCODE IDs may have version suffixes (e.g., `.2`):
- Automatically removed for consistency
- Ensures matching across APPRIS/GENCODE datasets
Dependencies
- Perl 5 - For fragment detection scripts (`get_NR_list.pl`, `get_seqlen.pl`)
- Python packages: pandas, gtfparse
Source Code
Implementation: `trifid/preprocessing/label_fragments.py`
Key functions:
- `generate_annotations()` - Extract GTF metadata
- `generate_sequences()` - Parse protein FASTA
- `get_seqlen()` - Call Perl script for length calculation
- `get_NR_list()` - Call Perl script for redundancy detection
Perl scripts:
- `trifid/utils/get_seqlen.pl` - Calculate sequence lengths
- `trifid/utils/get_NR_list.pl` - Detect fragments and duplications
Frequently Asked Questions
Why are some principal isoforms labeled as redundant?
If an APPRIS principal isoform has `cds_start_NF` or `cds_end_NF` tags, it’s incomplete and flagged as `Redundant Principal`. This prevents incomplete sequences from being used as training references.
What happens to readthrough transcripts?
Readthrough transcripts (gene fusions) are tagged with `readthrough` in the annotation label but not automatically excluded. These may represent genuine functional transcripts.
How are ties handled in duplication detection?
When multiple isoforms have identical sequences, the copy to retain is chosen by applying these criteria in order:
- APPRIS label (Principal > Alternative)
- TRIFID score (if provided via `--trifid`)
- Transcript ID (lexicographic order)
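A tie-break like this is conveniently expressed as a tuple sort key. The sketch below is illustrative only: the `tie_break_key` helper and the dictionary field names are hypothetical, and it assumes that a higher TRIFID score is preferred and that a missing score ranks below any real score.

```python
from typing import Dict, Tuple

APPRIS_RANK = {"Principal": 0, "Alternative": 1}


def tie_break_key(isoform: Dict) -> Tuple[int, float, str]:
    """Smaller key = preferred copy among identical sequences."""
    score = isoform.get("trifid_score")
    if score is None:
        score = float("-inf")  # assumption: unscored isoforms rank last
    # Negate the score so that higher scores sort first.
    return (APPRIS_RANK[isoform["appris"]], -score, isoform["transcript_id"])


candidates = [
    {"transcript_id": "ENST300", "appris": "Alternative", "trifid_score": 0.9},
    {"transcript_id": "ENST200", "appris": "Principal", "trifid_score": 0.4},
    {"transcript_id": "ENST100", "appris": "Principal", "trifid_score": 0.4},
]
retained = min(candidates, key=tie_break_key)
print(retained["transcript_id"])
```

Here the two principals tie on APPRIS label and score, so the lexicographically smaller transcript ID decides.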
Can I use this module for non-human species?
Yes, as long as:
- GTF follows GENCODE/Ensembl format
- APPRIS annotations are available
- The species has Pfam domain annotations
Next Steps
After running Fragment Labeling:
- Review the output: Check duplication counts and fragment percentages
- Filter for training: Remove redundant isoforms from the training set
- Run TRIFID: Use the cleaned dataset for model training
- Interpret predictions: Consider fragment labels when interpreting scores
Related Concepts
- APPRIS: Principal isoform annotation database
- GENCODE: Comprehensive human genome annotation
- TSL: Transcript Support Level (experimental validation)
- CCDS: Consensus Coding Sequence project
- NMD: Nonsense-Mediated Decay pathway