pfam-effects

Overview

The pfam_effects module quantifies the effects of alternative splicing on Pfam protein domains. It performs multiple sequence alignment (MSA) between reference and alternative isoforms, analyzes domain integrity, and generates quantitative scores representing structural impact.

Usage

python -m trifid.preprocessing.pfam_effects \
    --appris ~/data/appris/GRCh38/g27/appris_data.appris.txt \
    --seqs ~/data/appris/GRCh38/g27/appris_data.transl.fa.gz \
    --spade ~/data/appris/GRCh38/g27/appris_method.spade.gtf.gz \
    --outdir data/external/pfam_effects/GRCh38/g27 \
    --jobs 10 \
    --rm

Command-Line Arguments

--appris

string

required

Path to APPRIS scores data file. Used for reference transcript selection and annotation.

--seqs

string

required

Protein sequences file in FASTA format (gzip compressed allowed). Contains translated protein sequences for all isoforms.

--spade

string

required

Path to SPADE file (GTF or TSV format) containing Pfam domain annotations per transcript.

--outdir

string

required

Output directory where results and intermediate files will be stored.

--jobs

integer

required

Number of CPUs to use for running the alignments in parallel.

--rm

boolean

default: false

If set, removes intermediate files (muscle alignments, spade files, etc.) after processing.
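
The arguments above map naturally onto a standard argparse declaration. The following is a minimal sketch, not the module's actual parser: the function name build_parser and the help strings are assumptions made for illustration, and only the flag names, types, and defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(prog="pfam_effects")
    parser.add_argument("--appris", required=True, help="Path to APPRIS scores data file")
    parser.add_argument("--seqs", required=True, help="Protein FASTA file (gzip allowed)")
    parser.add_argument("--spade", required=True, help="SPADE file with Pfam annotations")
    parser.add_argument("--outdir", required=True, help="Output directory")
    parser.add_argument("--jobs", required=True, type=int, help="Number of CPUs")
    # store_true yields a boolean flag that is False unless --rm is passed
    parser.add_argument("--rm", action="store_true", help="Remove intermediate files")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["--appris", "appris.txt", "--seqs", "seqs.fa.gz", "--spade", "spade.gtf.gz",
     "--outdir", "out", "--jobs", "4"]
)
print(args.jobs, args.rm)
```

Passing --rm with no value switches the flag on, which matches the documented default of false.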

Core Functions

annotation_reference

def annotation_reference(appris_path: str, fasta_path: str, outdir: str = None, save: bool = None) -> pd.DataFrame

Selects one reference transcript per gene by applying the criteria below in order of priority. Selection priority:

Protein coding label
Best SPADE (APPRIS score)
TSL 1 (Transcript Support Level 1)
CCDS identifier
Highest number of residues
Lowest CCDS number
APPRIS tag

appris_path

string

Path to APPRIS data file

fasta_path

string

Path to protein FASTA sequences

outdir

string

Output directory for saving reference files

save

boolean

Whether to save intermediate reference files

Returns: pd.DataFrame - Reference annotations with columns: gene_id, transcript_id, pfam_effects_msa, appris, sequence
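
A hierarchical selection like the one listed above can be expressed in pandas as a multi-key sort followed by keeping the first row per gene. This is a sketch under assumptions, not the function's real implementation: the input columns (is_protein_coding, spade, tsl, ccds_number, length) are hypothetical names, and the final APPRIS tag tie-breaker is left out for brevity.

```python
import pandas as pd

# Toy annotation table; all column names except gene_id/transcript_id are hypothetical.
df = pd.DataFrame({
    "gene_id": ["G1", "G1", "G1", "G2", "G2"],
    "transcript_id": ["T1", "T2", "T3", "T4", "T5"],
    "is_protein_coding": [True, True, False, True, True],
    "spade": [120.5, 120.5, 300.0, 80.0, 95.0],
    "tsl": [1, 1, 1, 2, 5],
    "ccds_number": [200.0, 100.0, None, None, None],  # NaN means no CCDS identifier
    "length": [350, 350, 900, 410, 400],
})

ranked = df.assign(
    is_tsl1=df["tsl"].eq(1),
    has_ccds=df["ccds_number"].notna(),
)

# Keys are listed from highest to lowest priority.
sort_keys = ["is_protein_coding", "spade", "is_tsl1", "has_ccds", "length", "ccds_number"]
# True/larger values win for every key except ccds_number, where the lowest wins.
ascending = [False, False, False, False, False, True]

reference = (
    ranked.sort_values(sort_keys, ascending=ascending)
    .drop_duplicates("gene_id", keep="first")  # best-ranked row per gene
    .sort_values("gene_id")
    .reset_index(drop=True)
)
print(reference[["gene_id", "transcript_id"]])
```

In the toy data, T3 has the highest SPADE score but is not protein coding, so it loses to T1 and T2; those two tie on every key until the lowest CCDS number selects T2.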

make_spade

def make_spade(spade_path: str, outdir: str) -> pd.DataFrame

Processes SPADE file and splits Pfam domain annotations into individual transcript files.

spade_path

string

Path to APPRIS SPADE file (.gtf.gz or .txt format)

outdir

string

Directory to store individual Pfam domain files (creates spade3/ subdirectory)

Returns: pd.DataFrame - SPADE data with columns: transcript_id, hmm_name, pep_start, pep_end, feature, score
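
Splitting a domain table into one file per transcript can be sketched with a pandas groupby. The helper name split_by_transcript, the .tsv file naming, and the toy values are assumptions for illustration; only the column names and the spade3/ subdirectory come from the description above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy SPADE-like table using the documented column names; values are illustrative.
spade = pd.DataFrame({
    "transcript_id": ["T1", "T1", "T2"],
    "hmm_name": ["7tm_1", "Pkinase", "7tm_1"],
    "pep_start": [10, 150, 12],
    "pep_end": [120, 260, 118],
    "feature": ["domain", "domain", "domain"],
    "score": [85.2, 40.1, 79.9],
})


def split_by_transcript(df: pd.DataFrame, outdir: str) -> list:
    """Write one tab-separated file per transcript under outdir/spade3/."""
    target = Path(outdir) / "spade3"
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for transcript_id, rows in df.groupby("transcript_id", sort=True):
        path = target / f"{transcript_id}.tsv"  # hypothetical file naming
        rows.to_csv(path, sep="\t", index=False)
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    files = split_by_transcript(spade, tmp)
    names = [p.name for p in files]
    t1_rows = len(pd.read_csv(files[0], sep="\t"))
print(names, t1_rows)
```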

mp_msa

def mp_msa(appris_path: str, fasta_path: str, outdir: str, cpus: int) -> None

Runs MUSCLE in parallel to align the isoforms of each gene against that gene's reference transcript.

appris_path

string

Path to APPRIS data file

fasta_path

string

Path to protein FASTA sequences

outdir

string

Output directory for MSA results (creates muscle/ subdirectory)

cpus

integer

Number of CPUs for parallel processing

Returns: None (writes alignment files to disk)
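
The parallel pattern can be sketched as building one job per reference/alternative pair and mapping the jobs over a worker pool. Everything here is an assumption for illustration: the job layout, the output naming, and the use of threads rather than processes. The real aligner call is replaced by a placeholder, because the MUSCLE command line differs between versions and the module's exact invocation is not shown in this page.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy mapping of genes to their reference and alternative transcripts.
isoforms = {
    "G1": {"reference": "T1", "alternatives": ["T2", "T3"]},
    "G2": {"reference": "T4", "alternatives": ["T5"]},
}


def build_jobs(genes: dict) -> list:
    """Create one (gene, reference, alternative) job per alternative isoform."""
    return [
        (gene_id, entry["reference"], alt)
        for gene_id, entry in genes.items()
        for alt in entry["alternatives"]
    ]


def align_pair(job: tuple) -> str:
    """Placeholder for the aligner call; only returns a hypothetical output name."""
    gene_id, ref, alt = job
    return f"{gene_id}.{ref}.{alt}.aln"


def run_parallel(jobs: list, cpus: int) -> list:
    """Run the jobs on a pool of `cpus` workers, preserving input order."""
    with ThreadPoolExecutor(max_workers=cpus) as pool:
        return list(pool.map(align_pair, jobs))


outputs = run_parallel(build_jobs(isoforms), cpus=2)
print(outputs)
```

Threads are sufficient when each worker only waits on an external aligner process; a process pool would be the alternative if the per-job work were done in Python.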

analyse_transcripts

def analyse_transcripts(outdir: str, outfile: str) -> str

Builds the Perl command that analyzes the MUSCLE alignments and quantifies Pfam domain effects.

outdir

string

Directory containing muscle alignments and spade files

outfile

string

Path for Pfam_effects.tsv output file

Returns: str - Command line string to execute Perl script

load_pfam

def load_pfam(file: str) -> pd.DataFrame

Loads and processes Pfam effects data, calculating domain integrity scores. Processing steps:

Separates alternative and reference transcripts
Converts pfam_score to 0-1 scale (1 = least affected, 0 = totally damaged)
Counts event types (Deletion, Insertion, Substitution, etc.)
Counts domain states (Lost, Damaged, Intact)
Calculates residue loss/gain metrics

file

string

Path to Pfam_effects.tsv file generated by analyse_transcripts

Returns: pd.DataFrame - Processed Pfam effects with aggregated scores per transcript. Output columns include:

Event type counts: {EventType}_Event_type (e.g., Deletion_Event_type, Insertion_Event_type)
State counts: {State}_State (Lost_State, Damaged_State, Intact_State)
n_events - Number of Pfam domain events per transcript
pfam_score - Minimum integrity score (0-1)
Lost_residues_total - Total residues lost/swapped
Gain_residues_total - Total residues gained (insertions)
Lost_residues_pfam - Residues lost within Pfam domains
Gain_residues_pfam - Residues gained within Pfam domains
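
The per-transcript aggregation described above (event counts, state counts, minimum score) can be sketched with pd.crosstab and a grouped aggregation. The input column names event_type and state are assumptions about the raw table; the output column names follow the patterns listed above.

```python
import pandas as pd

# Toy per-event table; input column names are hypothetical.
events = pd.DataFrame({
    "transcript_id": ["T2", "T2", "T3"],
    "event_type": ["Deletion", "Insertion", "Deletion"],
    "state": ["Damaged", "Intact", "Lost"],
    "pfam_score": [0.6, 1.0, 0.0],
})

# One column per event type / state, holding counts per transcript.
event_counts = pd.crosstab(events["transcript_id"], events["event_type"]).add_suffix("_Event_type")
state_counts = pd.crosstab(events["transcript_id"], events["state"]).add_suffix("_State")

# Number of events and the minimum (worst) integrity score per transcript.
summary = events.groupby("transcript_id").agg(
    n_events=("event_type", "count"),
    pfam_score=("pfam_score", "min"),
)

per_transcript = pd.concat([summary, event_counts, state_counts], axis=1)
print(per_transcript)
```

Taking the minimum means a single badly affected domain determines the transcript's pfam_score, which is consistent with the description of pfam_score as the minimum integrity score.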

qpfam_effects

def qpfam_effects(df_reference: pd.DataFrame, df_spade: pd.DataFrame, pfam_effects_filepath: str) -> pd.DataFrame

Quantifies Pfam effects and generates final transcript scores.

df_reference

pd.DataFrame

Reference annotations from annotation_reference function

df_spade

pd.DataFrame

SPADE domain annotations from make_spade function

pfam_effects_filepath

string

Path to Pfam_effects.tsv file

Returns: pd.DataFrame - Final quantification with columns:

gene_id - Gene identifier
transcript_id - Transcript identifier
pfam_score - Domain integrity score (0-1, higher is better)
pfam_domains_impact_score - Overall domain impact score
perc_Damaged_State - Percentage of domains damaged
perc_Lost_State - Percentage of domains lost
Lost_residues_pfam - Number of residues lost in Pfam domains
Gain_residues_pfam - Number of residues gained in Pfam domains
pfam_effects_msa - Whether transcript is “Reference” or “Transcript”
appris - APPRIS annotation tag
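
The perc_Damaged_State and perc_Lost_State columns can be illustrated as state counts divided by the number of domains. This sketch assumes that the denominator is the sum of the three state counts and that values are on a 0-100 scale; the module may define either differently.

```python
import pandas as pd

# Toy state counts per transcript, named as in the load_pfam output.
counts = pd.DataFrame(
    {"Lost_State": [0, 1], "Damaged_State": [1, 1], "Intact_State": [3, 0]},
    index=pd.Index(["T2", "T3"], name="transcript_id"),
)

# Assumption: every domain falls in exactly one of the three states.
n_domains = counts.sum(axis=1)

percentages = (
    counts[["Damaged_State", "Lost_State"]]
    .div(n_domains, axis=0)  # row-wise division by the domain count
    .mul(100)
    .add_prefix("perc_")
)
print(percentages)
```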

Output Files

The module generates several output files in the specified output directory:

qpfam.tsv.gz

Main output file containing quantitative Pfam domain scores per transcript. This is the primary result used in TRIFID scoring.
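
Since the file is a gzip-compressed TSV, it can be read directly with pandas. The sketch below writes a tiny stand-in file first so that it runs on its own; the rows are invented, and only a few of the documented columns are included.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "qpfam.tsv.gz"

    # Stand-in file with invented rows and a subset of the documented columns.
    pd.DataFrame({
        "gene_id": ["G1", "G1"],
        "transcript_id": ["T1", "T2"],
        "pfam_score": [1.0, 0.35],
        "pfam_effects_msa": ["Reference", "Transcript"],
    }).to_csv(path, sep="\t", index=False, compression="gzip")

    # Reading the real output would use the same call with its actual path.
    qpfam = pd.read_csv(path, sep="\t", compression="gzip")

# Example query: alternative isoforms whose domains are strongly affected.
affected = qpfam[(qpfam["pfam_effects_msa"] == "Transcript") & (qpfam["pfam_score"] < 0.5)]
print(affected["transcript_id"].tolist())
```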

Pfam_effects.tsv

Raw Pfam effects data before aggregation (removed if --rm is set).

spade_references.tsv

List of reference transcripts selected per gene (removed if --rm is set).

pc_translations.tsv

Protein translations for all transcripts (removed if --rm is set).

muscle/ directory

Contains the alignments for all gene isoforms in ClustalW format (removed if --rm is set).

spade3/ directory

Individual Pfam domain files per transcript (removed if --rm is set).

Pfam Domain States

The module classifies Pfam domain effects into three states:

Lost

The entire Pfam domain is absent in the alternative isoform.

Damaged

The Pfam domain is present but has structural changes (deletions, insertions, or substitutions).

Intact

The Pfam domain is completely preserved in the alternative isoform.
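
The three states can be illustrated with a small classifier. This is a simplified sketch, not the rule used by the analysis script: the function classify_state and its inputs are hypothetical, and it assumes that a domain with no preserved residues counts as Lost and one with all residues preserved counts as Intact.

```python
def classify_state(domain_length: int, preserved: int) -> str:
    """Classify a domain from how many of its residues survive unchanged.

    Hypothetical rule: 0 preserved -> Lost, all preserved -> Intact,
    anything in between -> Damaged.
    """
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    if not 0 <= preserved <= domain_length:
        raise ValueError("preserved must be between 0 and domain_length")
    if preserved == 0:
        return "Lost"
    if preserved == domain_length:
        return "Intact"
    return "Damaged"


states = [classify_state(100, n) for n in (0, 73, 100)]
print(states)
```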

Event Types

The module identifies several types of splicing events affecting domains:

Deletion - Residues removed from domain
Insertion - Residues added to domain
Substitution - Residues replaced in domain
C-terminal swap - C-terminus altered
N-terminal swap - N-terminus altered
C-terminal Deletion - C-terminus truncated
N-terminal Deletion - N-terminus truncated
NAGNAG - NAGNAG alternative splicing event
Two Proteins - Isoform produces two separate proteins
Homology - Homologous domain variation

Dependencies

MUSCLE - Multiple sequence alignment tool (must be installed and in PATH)
Perl - Required for running the Pfam effects analysis script
APPRIS data - Reference transcript annotations
SPADE data - Pfam domain annotations

Notes

Alignments are run with MUSCLE in paired mode (each alternative isoform against its reference)
Reference transcripts are selected using a hierarchical priority system
Domain integrity scores range from 0 (completely damaged) to 1 (fully intact)
The pfam_score is set to 0 if any domain is in “Lost” state
Transcripts with fewer than 10 lost residues in Pfam domains are normalized to 0
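
The last two notes can be sketched as simple masked assignments in pandas. This reflects one reading of the notes and is an assumption: in particular, it treats the final note as meaning that a Lost_residues_pfam count below 10 is reset to 0.

```python
import pandas as pd

# Toy aggregated table using the documented column names.
df = pd.DataFrame({
    "transcript_id": ["T2", "T3", "T4"],
    "pfam_score": [0.8, 0.4, 1.0],
    "Lost_State": [0, 1, 0],
    "Lost_residues_pfam": [4, 60, 25],
})

adjusted = df.copy()

# Any domain in the Lost state forces the score to 0.
adjusted.loc[adjusted["Lost_State"] > 0, "pfam_score"] = 0.0

# Assumed reading: fewer than 10 lost residues is treated as no loss.
adjusted.loc[adjusted["Lost_residues_pfam"] < 10, "Lost_residues_pfam"] = 0

print(adjusted)
```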
