Overview
The pfam_effects module quantifies the effects of alternative splicing on Pfam protein domains. It performs multiple sequence alignment (MSA) between reference and alternative isoforms, analyzes domain integrity, and generates quantitative scores representing structural impact.
Usage
Command-Line Arguments
Path to APPRIS scores data file. Used for reference transcript selection and annotation.
Protein sequences file in FASTA format (gzip compressed allowed). Contains translated protein sequences for all isoforms.
Path to SPADE file (GTF or TSV format) containing Pfam domain annotations per transcript.
Output directory where results and intermediate files will be stored.
Number of CPUs to use for parallel processing of MSA alignments.
If set, removes intermediate files (muscle alignments, spade files, etc.) after processing.
Core Functions
annotation_reference
- Protein coding label
- Best SPADE (APPRIS score)
- TSL 1 (Transcript Support Level 1)
- CCDS identifier
- Highest number of residues
- Lowest CCDS number
- APPRIS tag
Path to APPRIS data file
Path to protein FASTA sequences
Output directory for saving reference files
Whether to save intermediate reference files
pd.DataFrame - Reference annotations with columns: gene_id, transcript_id, pfam_effects_msa, appris, sequence
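The hierarchical selection can be illustrated with a multi-key sort. This is a minimal sketch, not the module's implementation: it assumes the criteria listed above are applied in the order given, and every column name other than gene_id and transcript_id (protein_coding, spade_score, tsl1, has_ccds, n_residues, ccds_number, appris_principal) is a hypothetical stand-in for the corresponding APPRIS field.

```python
import pandas as pd

# Toy candidates; all criterion columns are hypothetical stand-ins.
candidates = pd.DataFrame({
    "gene_id":          ["G1", "G1", "G1", "G2", "G2"],
    "transcript_id":    ["T1", "T2", "T3", "T4", "T5"],
    "protein_coding":   [True, True, True, False, True],
    "spade_score":      [50.0, 50.0, 40.0, 90.0, 10.0],
    "tsl1":             [True, True, True, True, False],
    "has_ccds":         [True, True, True, False, False],
    "n_residues":       [300, 300, 320, 500, 120],
    "ccds_number":      [10, 5, 7, 0, 0],
    "appris_principal": [True, False, False, False, False],
})


def pick_reference(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one transcript per gene, breaking ties in priority order."""
    ranked = df.sort_values(
        by=["gene_id", "protein_coding", "spade_score", "tsl1",
            "has_ccds", "n_residues", "ccds_number", "appris_principal"],
        # Higher is better for every key except ccds_number (lower wins).
        ascending=[True, False, False, False, False, False, True, False],
    )
    # After sorting, the first row of each gene is its best candidate.
    return ranked.drop_duplicates("gene_id", keep="first").reset_index(drop=True)


refs = pick_reference(candidates)
print(refs[["gene_id", "transcript_id"]])
```

In this toy data T1 and T2 tie on every criterion up to the CCDS number, so the lower CCDS number decides before the APPRIS tag is consulted.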
make_spade
Path to APPRIS SPADE file (.gtf.gz or .txt format)
Directory to store individual Pfam domain files (creates spade3/ subdirectory)
pd.DataFrame - SPADE data with columns: transcript_id, hmm_name, pep_start, pep_end, feature, score
mp_msa
Path to APPRIS data file
Path to protein FASTA sequences
Output directory for MSA results (creates muscle/ subdirectory)
Number of CPUs for parallel processing
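As a rough illustration of running per-gene alignments in parallel, the sketch below fans MUSCLE commands out over a worker pool. It is an assumption-laden example, not the module's code: muscle_command and run_alignments are hypothetical helpers, the per-gene FASTA layout is assumed, and the flags shown (-in, -out, -clw) are those of MUSCLE v3; other MUSCLE versions use a different command line.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def muscle_command(fasta: Path, out_dir: Path) -> list:
    """Build a MUSCLE v3 command writing a ClustalW-format alignment."""
    # Assumption: MUSCLE v3 flags; adjust for the installed version.
    out_file = out_dir / f"{fasta.stem}.aln"
    return ["muscle", "-in", str(fasta), "-out", str(out_file), "-clw"]


def run_alignments(fastas, out_dir: Path, ncpus: int = 2, runner=subprocess.run):
    """Run one alignment per FASTA file, ncpus at a time.

    Threads are enough here because each worker only waits on an
    external process. `runner` is injectable so the scheduling can be
    exercised without MUSCLE installed.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = [muscle_command(Path(f), out_dir) for f in fastas]
    with ThreadPoolExecutor(max_workers=ncpus) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda cmd: runner(cmd, check=True), commands))
    return commands
```

Calling run_alignments(["G1.fa", "G2.fa"], Path("out/muscle"), ncpus=4) would launch up to four MUSCLE processes at once, provided the muscle binary is in PATH.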
analyse_transcripts
Directory containing muscle alignments and spade files
Path for Pfam_effects.tsv output file
str - Command line string to execute Perl script
load_pfam
- Separates alternative and reference transcripts
- Converts pfam_score to 0-1 scale (1 = least affected, 0 = totally damaged)
- Counts event types (Deletion, Insertion, Substitution, etc.)
- Counts domain states (Lost, Damaged, Intact)
- Calculates residue loss/gain metrics
Path to Pfam_effects.tsv file generated by analyse_transcripts
pd.DataFrame - Processed Pfam effects with aggregated scores per transcript
Output columns include:
- Event type counts: {EventType}_Event_type (e.g., Deletion_Event_type, Insertion_Event_type)
- State counts: {State}_State (Lost_State, Damaged_State, Intact_State)
- n_events - Number of Pfam domain events per transcript
- pfam_score - Minimum integrity score (0-1)
- Lost_residues_total - Total residues lost/swapped
- Gain_residues_total - Total residues gained (insertions)
- Lost_residues_pfam - Residues lost within Pfam domains
- Gain_residues_pfam - Residues gained within Pfam domains
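The per-transcript aggregation described above can be sketched with pandas. This is an illustrative example under stated assumptions: the input column names event_type and state are hypothetical (the layout of Pfam_effects.tsv is not documented here), and aggregate_events is not a function of the module. Only the output column naming follows the list above.

```python
import pandas as pd

# Toy per-event table; "event_type" and "state" are assumed column names.
events = pd.DataFrame({
    "transcript_id": ["T1", "T1", "T2"],
    "event_type":    ["Deletion", "Insertion", "Deletion"],
    "state":         ["Damaged", "Intact", "Lost"],
    "pfam_score":    [0.4, 1.0, 0.0],
})


def aggregate_events(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse one row per event into one row per transcript."""
    # Count how often each event type and each state occurs per transcript.
    event_counts = pd.crosstab(df["transcript_id"], df["event_type"]).add_suffix("_Event_type")
    state_counts = pd.crosstab(df["transcript_id"], df["state"]).add_suffix("_State")
    # Number of events and the minimum (worst) integrity score.
    summary = df.groupby("transcript_id").agg(
        n_events=("pfam_score", "count"),
        pfam_score=("pfam_score", "min"),
    )
    merged = pd.concat([event_counts, state_counts, summary], axis=1)
    return merged.rename_axis("transcript_id").reset_index()


per_transcript = aggregate_events(events)
print(per_transcript)
```

Taking the minimum score means a transcript is judged by its most affected domain, which matches the "Minimum integrity score" description of pfam_score.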
qpfam_effects
Reference annotations from annotation_reference function
SPADE domain annotations from make_spade function
Path to Pfam_effects.tsv file
pd.DataFrame - Final quantification with columns:
- gene_id - Gene identifier
- transcript_id - Transcript identifier
- pfam_score - Domain integrity score (0-1, higher is better)
- pfam_domains_impact_score - Overall domain impact score
- perc_Damaged_State - Percentage of domains damaged
- perc_Lost_State - Percentage of domains lost
- Lost_residues_pfam - Number of residues lost in Pfam domains
- Gain_residues_pfam - Number of residues gained in Pfam domains
- pfam_effects_msa - Whether transcript is “Reference” or “Transcript”
- appris - APPRIS annotation tag
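One plausible way to derive the perc_* columns from the state counts is shown below. This is a sketch under two assumptions that the text does not confirm: the denominator is n_events, and the result is expressed on a 0-100 scale. The helper add_state_percentages is hypothetical.

```python
import pandas as pd

# Toy state counts per transcript, using the load_pfam column names.
counts = pd.DataFrame({
    "transcript_id": ["T1", "T2"],
    "n_events":      [4, 2],
    "Damaged_State": [1, 0],
    "Lost_State":    [1, 2],
})


def add_state_percentages(df: pd.DataFrame) -> pd.DataFrame:
    """Add perc_Damaged_State and perc_Lost_State columns."""
    out = df.copy()
    for state in ("Damaged_State", "Lost_State"):
        # Assumption: share of events in this state, as a percentage.
        out[f"perc_{state}"] = 100 * out[state] / out["n_events"]
    return out


with_perc = add_state_percentages(counts)
print(with_perc[["transcript_id", "perc_Damaged_State", "perc_Lost_State"]])
```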
Output Files
The module generates several output files in the specified output directory:
qpfam.tsv.gz
Main output file containing quantitative Pfam domain scores per transcript. This is the primary result used in TRIFID scoring.
Pfam_effects.tsv
Raw Pfam effects data before aggregation (removed if --rm is set).
spade_references.tsv
List of reference transcripts selected per gene (removed if --rm is set).
pc_translations.tsv
Protein translations for all transcripts (removed if --rm is set).
muscle/ directory
Contains ClustalW format MSA alignments for all gene isoforms (removed if --rm is set).
spade3/ directory
Individual Pfam domain files per transcript (removed if --rm is set).
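For downstream use, the gzip-compressed TSV can be read directly with pandas, which infers the compression from the file extension. The example below is self-contained: it first writes a small stand-in file with made-up values to a temporary directory, so the rows shown are illustrative and not real module output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for a real qpfam.tsv.gz; the values are invented for the demo.
demo = pd.DataFrame({
    "gene_id":       ["G1", "G1", "G2"],
    "transcript_id": ["T1", "T2", "T3"],
    "pfam_score":    [1.0, 0.0, 0.35],
})
path = Path(tempfile.mkdtemp()) / "qpfam.tsv.gz"
demo.to_csv(path, sep="\t", index=False)  # gzip inferred from ".gz"

# Reading back: compression is inferred from the extension as well.
qpfam = pd.read_csv(path, sep="\t")

# Transcripts with the most affected domains have the lowest pfam_score.
worst = qpfam.sort_values("pfam_score").head(2)
print(worst)
```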
Pfam Domain States
The module classifies Pfam domain effects into three states:
Lost
The entire Pfam domain is absent in the alternative isoform.
Damaged
The Pfam domain is present but has structural changes (deletions, insertions, or substitutions).
Intact
The Pfam domain is completely preserved in the alternative isoform.
Event Types
The module identifies several types of splicing events affecting domains:
- Deletion - Residues removed from domain
- Insertion - Residues added to domain
- Substitution - Residues replaced in domain
- C-terminal swap - C-terminus altered
- N-terminal swap - N-terminus altered
- C-terminal Deletion - C-terminus truncated
- N-terminal Deletion - N-terminus truncated
- NAGNAG - NAGNAG alternative splicing event
- Two Proteins - Isoform produces two separate proteins
- Homology - Homologous domain variation
Dependencies
- MUSCLE - Multiple sequence alignment tool (must be installed and in PATH)
- Perl - Required for running the Pfam effects analysis script
- APPRIS data - Reference transcript annotations
- SPADE data - Pfam domain annotations
Notes
- The alignment step runs MUSCLE in paired mode (each alternative isoform is aligned against its reference)
- Reference transcripts are selected using a hierarchical priority system
- Domain integrity scores range from 0 (completely damaged) to 1 (fully intact)
- The pfam_score is set to 0 if any domain is in “Lost” state
- Transcripts with fewer than 10 lost residues in Pfam domains are normalized to 0
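The last two notes can be expressed as simple column updates. The sketch below is one reading of them, not the module's code: it assumes that "normalized to 0" means the Lost_residues_pfam value itself is reset to 0 when it is below 10, and apply_score_rules is a hypothetical helper.

```python
import pandas as pd

scores = pd.DataFrame({
    "transcript_id":      ["T1", "T2", "T3"],
    "pfam_score":         [0.8, 0.6, 0.9],
    "Lost_State":         [0, 1, 0],
    "Lost_residues_pfam": [4, 120, 35],
})


def apply_score_rules(df: pd.DataFrame, min_lost: int = 10) -> pd.DataFrame:
    """Apply the two post-processing rules to a copy of df."""
    out = df.copy()
    # Rule 1: any domain in the Lost state forces pfam_score to 0.
    out.loc[out["Lost_State"] > 0, "pfam_score"] = 0.0
    # Rule 2 (assumed reading): small residue losses are reset to 0.
    out.loc[out["Lost_residues_pfam"] < min_lost, "Lost_residues_pfam"] = 0
    return out


adjusted = apply_score_rules(scores)
print(adjusted)
```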