Overview
The loaders module provides classes and functions for loading genomic data from various annotation sources, including GENCODE, RefSeq, Ensembl, APPRIS, and specialized scoring systems.
Classes
Fasta
Manage and load FASTA sequence files into pandas DataFrames.
Path to the FASTA file
Database name for identifier parsing
load: Returns a pandas DataFrame with columns id and sequence
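To make the shape of the returned table concrete, here is a minimal sketch that reads FASTA text into a DataFrame with id and sequence columns. The helper name fasta_to_frame and the choice to use the first whitespace-delimited token of each header as the id are assumptions made for illustration; the Fasta class may parse identifiers differently depending on the database name it is given.

```python
import io

import pandas as pd


def fasta_to_frame(handle):
    """Hypothetical helper: read FASTA records from a text handle into a DataFrame."""
    records = []
    current_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if current_id is not None:
                records.append((current_id, "".join(chunks)))
            tokens = line[1:].split()
            # Assumption: the id is the first token after ">"
            current_id = tokens[0] if tokens else ""
            chunks = []
        else:
            chunks.append(line)
    if current_id is not None:
        records.append((current_id, "".join(chunks)))
    return pd.DataFrame(records, columns=["id", "sequence"])


# Made-up two-record FASTA input
example = io.StringIO(">seq1 demo\nMKT\nAYI\n>seq2\nMSS\n")
fasta_df = fasta_to_frame(example)
```

Sequence lines that wrap across several rows are joined, so each record ends up as a single row.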
GFF
Parse GFF (General Feature Format) annotation files.
Path to the GFF file
Database type: "gencode", "g", "gn", "refseq", "r", "rs", "uniprot", "u", "up"
load(complete=False): Returns parsed GFF data as a pandas DataFrame
GTF
Manage and parse GTF (Gene Transfer Format) annotation files.
Path to the GTF file
Database type: "gencode", "g", "gn", "refseq", "r", "rs", "ensembl", "e"
load: Returns parsed GTF data as a pandas DataFrame
Functions
load_annotation
Load GTF genome annotations with parsed features and quality indicators.
Path to compressed GTF file (.gz format)
Database type: "g" for GENCODE or "e" for Ensembl
pd.DataFrame
DataFrame with columns including:
transcript_id: Transcript identifier (version stripped)
CCDS: Boolean for Consensus CDS annotation
StartEnd_NF: Boolean for unconfirmed start/end regions
RNA_supported: Boolean for RNA-seq support
basic: Boolean for basic transcript set membership
NAGNAG: Boolean for alternative acceptor sites
readthrough: Boolean for readthrough transcripts
nonsense_mediated_decay: Boolean for NMD transcripts
level: Annotation confidence level
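The sketch below shows one way version-stripped identifiers and boolean quality columns could be derived from per-transcript tag lists. The input frame, the tags column, and the mapping from output column to tag string are all assumptions for illustration; the source does not state which tags load_annotation actually inspects.

```python
import pandas as pd

# Made-up transcripts with GENCODE-style tag lists (hypothetical input layout)
transcripts = pd.DataFrame({
    "transcript_id": ["ENST00000000001.2", "ENST00000000002.1"],
    "tags": [["basic", "CCDS"], ["readthrough_transcript"]],
})

# Strip a trailing version suffix such as ".2" from each identifier
transcripts["transcript_id"] = transcripts["transcript_id"].str.replace(
    r"\.\d+$", "", regex=True
)

# Assumed mapping: output column name -> tag string that sets it to True
flag_tags = {
    "CCDS": "CCDS",
    "basic": "basic",
    "readthrough": "readthrough_transcript",
}
for column, tag in flag_tags.items():
    transcripts[column] = [tag in tags for tags in transcripts["tags"]]
```

Identifiers without a version suffix pass through the replacement unchanged.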
load_appris
Load APPRIS transcript annotations and functional scores.
Path to APPRIS data file from APPRIS Web Server
pd.DataFrame
DataFrame with columns:
gene_id, gene_name: Gene identifiers
transcript_id, translation_id: Transcript and protein IDs
ccdsid: CCDS identifier
tsl: Transcript support level (1-5, or 6 for NA)
length: Protein length
firestar: Functional residue score
matador3d: 3D structure score
corsair: Cross-species conservation score (normalized)
spade: Domain prediction score
thump: Transmembrane/signal peptide score
crash_p, crash_m: Signal peptide and mitochondrial scores
appris: APPRIS annotation category
- Filters for transcripts with TRANSLATION flag
- Separates crash scores into peptide (crash_p) and mitochondrial (crash_m)
- Normalizes corsair scores (values ≤1.5 set to 0)
- Strips version numbers from transcript IDs
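The processing steps above can be sketched with pandas as follows. The input column names flag and crash, the NO_TRANSLATION value, and the comma-separated "peptide,mitochondrial" encoding of the crash score are assumptions for this example; the real APPRIS file layout may differ.

```python
import pandas as pd

# Made-up rows standing in for a parsed APPRIS file (hypothetical layout)
raw = pd.DataFrame({
    "transcript_id": ["ENST00000000001.2", "ENST00000000002.1", "ENST00000000003.4"],
    "flag": ["TRANSLATION", "TRANSLATION", "NO_TRANSLATION"],
    "corsair": [3.0, 1.5, 2.0],
    "crash": ["-1,2", "0,-3", "1,1"],  # assumed "peptide,mitochondrial" encoding
})

# Keep only transcripts carrying the TRANSLATION flag
kept = raw[raw["flag"] == "TRANSLATION"].copy()

# Split the combined crash value into separate peptide and mitochondrial scores
parts = kept["crash"].str.split(",", expand=True)
kept["crash_p"] = parts[0].astype(float)
kept["crash_m"] = parts[1].astype(float)

# Set corsair values of 1.5 or less to 0, keep larger values as they are
kept["corsair"] = kept["corsair"].where(kept["corsair"] > 1.5, 0)

# Strip version numbers from transcript IDs
kept["transcript_id"] = kept["transcript_id"].str.replace(r"\.\d+$", "", regex=True)
```

Note that the threshold is inclusive: a corsair value of exactly 1.5 is zeroed.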
load_corsair_alt
Load ALT-Corsair alternative transcript conservation scores.
Path to compressed ALT-Corsair score file
pd.DataFrame
DataFrame with columns:
transcript_id: Transcript identifier (version stripped)
corsair_alt: Alternative conservation score (NaN filled with 0)
load_corsair_alt_exons
Load ALT-CorsairExons scores aggregated by transcript.
Path to preprocessed ALT-CorsairExons file
pd.DataFrame
DataFrame with columns:
transcript_id: Transcript identifier
minexon_corsair_alt: Minimum exon conservation score per transcript
load_phylocsf
Load PhyloCSF evolutionary conservation scores.
Path to compressed PhyloCSF score file
pd.DataFrame
DataFrame with columns:
transcript_id: Transcript identifier (version stripped)
ScorePerCodon: Codon substitution frequency score (minimum per transcript)
RelBranchLength: Branch length estimation (minimum per transcript)
PhyloCSF_Psi: Alternative CSF representation (minimum per transcript)
- Filters for exons with RelBranchLength > 0.1 and NumCodons > 3
- Aggregates multi-exon scores by taking minimum values
- NMD variants not included in source data
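The exon filter and per-transcript minimum described above can be illustrated with a small pandas sketch. The exon-level table below is made up, and treating each input row as one exon is an assumption about the source file.

```python
import pandas as pd

# Made-up exon-level scores: one row per exon (assumed input layout)
exons = pd.DataFrame({
    "transcript_id": ["ENST00000000001.2"] * 3 + ["ENST00000000002.1"] * 2,
    "ScorePerCodon": [4.0, -2.0, 1.0, 3.0, 5.0],
    "RelBranchLength": [0.9, 0.05, 0.5, 0.8, 0.7],
    "PhyloCSF_Psi": [10.0, -30.0, 2.0, 7.0, 9.0],
    "NumCodons": [20, 15, 8, 3, 12],
})

# Keep exons with RelBranchLength > 0.1 and NumCodons > 3
keep = (exons["RelBranchLength"] > 0.1) & (exons["NumCodons"] > 3)
filtered = exons[keep].copy()

# Strip version numbers so scores group by the unversioned transcript ID
filtered["transcript_id"] = filtered["transcript_id"].str.replace(
    r"\.\d+$", "", regex=True
)

# Collapse multi-exon transcripts by taking the minimum of each score
score_cols = ["ScorePerCodon", "RelBranchLength", "PhyloCSF_Psi"]
per_transcript = filtered.groupby("transcript_id", as_index=False)[score_cols].min()
```

Taking the minimum means a single poorly conserved exon determines the transcript-level score, provided that exon passes the filter.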
load_qpfam
Load QPfam protein domain impact scores.
Path to compressed QPfam output file
pd.DataFrame
DataFrame with columns:
transcript_id: Transcript identifier
pfam_score: Overall Pfam domain score
pfam_domains_impact_score: Domain impact quantification
perc_Damaged_State: Percentage of damaged domains
perc_Lost_State: Percentage of lost domains
Lost_residues_pfam: Number of lost Pfam residues
Gain_residues_pfam: Number of gained Pfam residues
load_qsplice
Load QSplice splice junction coverage scores.
Path to compressed QSplice output file
pd.DataFrame
DataFrame with columns:
transcript_id: Transcript identifier (version stripped)
RNA2sj: RNA-seq to splice junction mapping score (capped at 1)
RNA2sj_cds: CDS-specific splice junction score (capped at 1)
- Replaces "-" with NaN, then fills NaN with 0
- Caps values >1 at 1
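A minimal pandas sketch of this cleanup, assuming the raw score columns arrive as strings in which "-" marks a missing value (the input frame is made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up raw QSplice-style scores, read as strings (assumed input layout)
raw = pd.DataFrame({
    "transcript_id": ["ENST00000000001", "ENST00000000002", "ENST00000000003"],
    "RNA2sj": ["0.4", "-", "2.5"],
    "RNA2sj_cds": ["-", "1.0", "0.75"],
})

score_cols = ["RNA2sj", "RNA2sj_cds"]

# "-" becomes NaN, NaN becomes 0, and anything above 1 is capped at 1
scores = (
    raw[score_cols]
    .replace("-", np.nan)
    .astype(float)
    .fillna(0)
    .clip(upper=1)
)

clean = raw.copy()
clean[score_cols] = scores
```

After this step a missing score and a true score of 0 are indistinguishable, which is worth keeping in mind when interpreting zeros downstream.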
load_reference
Load annotation type reference categories for transcript classification.
Path to compressed reference annotation file
pd.DataFrame
DataFrame with columns:
transcript_id: Transcript identifier
ann_type: Annotation category
transcript_ref: Reference transcript indicator
- Alternative
- Principal
- Alternative.NMD
- Redundant Principal
- Principal Duplication
- Alternative Duplication
- Redundant Alternative
- Principal.RT (readthrough)
- Principal.NMD
- Alternative.RT
load_sequences
Load protein sequences from FASTA files.
Path to compressed FASTA protein sequence file
pd.DataFrame
DataFrame with columns:
transcript_id: Transcript identifier (version stripped)
sequence: Protein amino acid sequence
- Uses Biopython’s SeqIO parser
- Extracts transcript ID from FASTA header (pipe-delimited)
- Strips version numbers from IDs
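The header handling can be sketched without Biopython as below. Which pipe-delimited field holds the transcript ID is not stated here, so the helper (a hypothetical name) assumes it is the first field beginning with "ENST"; the header itself is made up.

```python
def transcript_id_from_header(header):
    """Hypothetical helper: extract an unversioned transcript ID from a
    pipe-delimited FASTA header, or return None if no field matches."""
    fields = header.lstrip(">").split("|")
    # Assumption: the transcript field is the first one starting with "ENST"
    for field in fields:
        if field.startswith("ENST"):
            # Drop the version suffix, e.g. "ENST00000000001.2" -> "ENST00000000001"
            return field.split(".")[0]
    return None


# Made-up header in a GENCODE-like pipe-delimited layout
header = ">ENSP00000000001.1|ENST00000000001.2|ENSG00000000001.3|GENE1|120"
transcript_id = transcript_id_from_header(header)
```

Searching by prefix rather than by fixed position keeps the sketch tolerant of headers whose field order differs.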
load_spade
Load SPADE (APPRIS domain prediction) scores from GTF format.
Path to SPADE GTF file from APPRIS
pd.DataFrame
DataFrame with columns:
seqname, source, feature: GTF standard fields
start, end: Genomic coordinates
score: Domain score
strand, frame: Strand and reading frame
transcript_id, gene_id: Identifiers
hmm_name: Hidden Markov Model domain name
evalue: E-value for domain prediction
pep_start, pep_end: Peptide coordinates
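As an illustration of how one GTF line maps onto these columns, the sketch below splits the eight standard fields and unpacks the attribute column into key/value pairs. The helper names, the sample line, and the assumption that hmm_name, evalue, pep_start, and pep_end are stored as GTF attributes are all hypothetical; this is not the module's own parser.

```python
import re

# Matches `key "value";` or `key value;` pairs in the GTF attribute column.
# Assumes every attribute, including the last, ends with a semicolon.
_ATTR_RE = re.compile(r'(\S+)\s+"?([^";]*)"?\s*;')

_STANDARD_FIELDS = [
    "seqname", "source", "feature", "start", "end", "score", "strand", "frame",
]


def parse_gtf_attributes(field):
    """Hypothetical helper: turn a GTF attribute string into a dict."""
    return {key: value for key, value in _ATTR_RE.findall(field)}


def parse_gtf_line(line):
    """Hypothetical helper: parse one tab-separated GTF line into a flat dict."""
    cols = line.rstrip("\n").split("\t")
    row = dict(zip(_STANDARD_FIELDS, cols[:8]))
    row["start"], row["end"] = int(row["start"]), int(row["end"])
    row.update(parse_gtf_attributes(cols[8]))
    return row


# Made-up record; the source and feature values are placeholders
line = (
    "chr1\tSPADE\tfunctional_domain\t100\t250\t0.9\t+\t.\t"
    'gene_id "ENSG00000000001"; transcript_id "ENST00000000001"; '
    'hmm_name "Demo_domain"; evalue "1.2e-30"; pep_start "10"; pep_end "60";'
)
row = parse_gtf_line(line)
```

A list of such dicts can be passed straight to pd.DataFrame to obtain one row per domain hit.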