Overview
The qsplice module quantifies splice junction coverage from RNA-seq data reported by STAR and maps it to genome positions. It processes SJ.out.tab files, annotates introns, and generates quantitative splice junction scores per gene and transcript.
Usage
Command-Line Arguments
Custom or GENCODE/Ensembl gff annotation file (gzip compressed)
Output directory where results will be stored
Directory containing SJ.out.tab files to be globbed and concatenated. Pattern: {samples}/*/SJ.out.tab
Genome annotation version. For GENCODE: g + version number (e.g., g27)
Experiment identifier for output file naming
Custom splice junctions file (gzip compressed) to use instead of globbing samples directory
Path to file containing tissue identifiers for annotation
If set, removes intermediate files after processing
Core Functions
generate_introns
Path to the gff annotation file
str - Path where introns file has been stored
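As a rough illustration of this step, the sketch below builds a GenomeTools call that adds intron features to a GFF3 file and writes the result to a new file. The helper names (`build_gt_command`, `generate_introns_sketch`), the output file name, and the exact set of `gt` options are assumptions for illustration; the source only states that GenomeTools (`gt`) is used to generate introns.

```python
from pathlib import Path
import subprocess


def build_gt_command(gff_path):
    """Build a GenomeTools command line that adds introns (hypothetical helper)."""
    # `gt gff3 -addintrons` inserts intron features between existing exons;
    # `-retainids` keeps the feature IDs found in the input file.
    return ["gt", "gff3", "-retainids", "-addintrons", str(gff_path)]


def generate_introns_sketch(gff_path, out_dir):
    """Run gt and return the path of the intron-augmented annotation.

    The output naming scheme below is an assumption, not qsplice's own.
    """
    out_path = Path(out_dir) / (Path(gff_path).stem + ".introns.gff3")
    with open(out_path, "w") as handle:
        # gt writes the augmented GFF3 to stdout; check=True raises on failure.
        subprocess.run(build_gt_command(gff_path), stdout=handle, check=True)
    return str(out_path)
```

Keeping the command construction separate from the `subprocess` call makes it possible to inspect the exact invocation without having `gt` installed.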
load_annotations
Path to .gff annotation file
Annotation reference database (GENCODE, RefSeq, or UniProt)
Feature annotations to filter from the gff file
pd.DataFrame - Features extracted from gff with columns: seqname, type, start, end, strand, gene_id, gene_name, transcript_id, exon_id, exon_number
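The following is a minimal sketch of how such a table could be built with pandas, assuming GFF3-style `key=value;` attributes. The function name `load_annotations_sketch`, its `features` parameter, and the example records are hypothetical; the real function's signature and its handling of the different reference databases may differ.

```python
import io

import pandas as pd

# The nine standard GFF columns.
GFF_COLUMNS = ["seqname", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]
ATTRIBUTE_KEYS = ["gene_id", "gene_name", "transcript_id", "exon_id", "exon_number"]


def load_annotations_sketch(handle, features=("exon", "CDS", "intron")):
    """Read GFF records and keep only the requested feature types."""
    df = pd.read_csv(handle, sep="\t", comment="#", header=None, names=GFF_COLUMNS)
    df = df[df["type"].isin(features)].copy()
    for key in ATTRIBUTE_KEYS:
        # Pull "key=value" out of the attributes column; missing keys become NaN.
        df[key] = df["attributes"].str.extract(rf"(?:^|;){key}=([^;]+)", expand=False)
    columns = ["seqname", "type", "start", "end", "strand", *ATTRIBUTE_KEYS]
    return df[columns].reset_index(drop=True)


# Made-up records for illustration only.
shared = "gene_id=G1;gene_name=ABC"
EXAMPLE_GFF = "\n".join([
    "\t".join(["chr1", "HAVANA", "gene", "100", "900", ".", "+", ".", shared]),
    "\t".join(["chr1", "HAVANA", "exon", "100", "200", ".", "+", ".",
               shared + ";transcript_id=T1;exon_id=E1;exon_number=1"]),
    "\t".join(["chr1", "HAVANA", "CDS", "150", "200", ".", "+", "0",
               shared + ";transcript_id=T1;exon_id=E1;exon_number=1"]),
]) + "\n"

annotations = load_annotations_sketch(io.StringIO(EXAMPLE_GFF))
```

Here the `gene` record is dropped by the feature filter, so `annotations` holds only the exon and CDS rows.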
annotate_introns
Input DataFrame with whole introns annotation
pd.DataFrame - DataFrame with introns annotated including CDS coverage information
concat_samples
Annotation file directory pattern (e.g., {samples}/*/SJ.out.tab)
Path to samples identifiers file for tissue annotation
pd.DataFrame - Concatenated SJ.out.tab data with tissue annotations
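A concatenation of this kind could look roughly like the sketch below. STAR's SJ.out.tab has nine tab-separated columns without a header; the column names chosen here, the function name `concat_samples_sketch`, the use of the parent directory name as the sample identifier, and the `tissues` mapping (standing in for the samples identifiers file) are all assumptions for illustration.

```python
import glob
import os

import pandas as pd

# Column order of STAR's SJ.out.tab; the names themselves are our own labels.
SJ_COLUMNS = ["seqname", "start", "end", "strand", "intron_motif", "annotated",
              "unique_reads", "multimap_reads", "max_overhang"]


def concat_samples_sketch(pattern, tissues=None):
    """Glob SJ.out.tab files and stack them with sample and tissue labels."""
    tissues = tissues or {}
    frames = []
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path, sep="\t", header=None, names=SJ_COLUMNS)
        # Assumption: the directory holding SJ.out.tab is named after the sample.
        sample = os.path.basename(os.path.dirname(path))
        df["sample"] = sample
        df["tissue"] = tissues.get(sample, "unknown")
        frames.append(df)
    if not frames:
        # No file matched the pattern: return an empty table with the same columns.
        return pd.DataFrame(columns=SJ_COLUMNS + ["sample", "tissue"])
    return pd.concat(frames, ignore_index=True)
```

A call such as `concat_samples_sketch("samples/*/SJ.out.tab", {"sample_a": "brain"})` would label every junction from `samples/sample_a/SJ.out.tab` with the tissue `brain`; the sample and tissue names are placeholders.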
map_junctions_positions
DataFrame with concatenated RNA-seq splice junctions and reads
- df_sj_max_position - Maximum coverage per junction position
- df_sj_max_tissue - Maximum coverage per junction position and tissue
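The two aggregations can be sketched with pandas group-by operations as follows. This assumes the concatenated table has `seqname`, `start`, `end`, `strand`, `tissue`, and `unique_reads` columns; the function name and the example values are hypothetical, and the actual implementation may aggregate differently.

```python
import pandas as pd

POSITION_KEYS = ["seqname", "start", "end", "strand"]


def map_junctions_positions_sketch(df_sj):
    """Return (max per position, max per position and tissue)."""
    # Highest unique-read count for each junction within each tissue.
    df_sj_max_tissue = df_sj.groupby(
        POSITION_KEYS + ["tissue"], as_index=False
    )["unique_reads"].max()
    # For each junction, keep the tissue row with the highest count overall.
    idx = df_sj_max_tissue.groupby(POSITION_KEYS)["unique_reads"].idxmax()
    df_sj_max_position = df_sj_max_tissue.loc[idx].reset_index(drop=True)
    return df_sj_max_position, df_sj_max_tissue


# Made-up junction counts: one junction seen in two tissues, one in a single tissue.
example_sj = pd.DataFrame({
    "seqname": ["chr1", "chr1", "chr1", "chr1"],
    "start": [100, 100, 100, 300],
    "end": [200, 200, 200, 400],
    "strand": [1, 1, 1, 1],
    "tissue": ["brain", "brain", "liver", "liver"],
    "unique_reads": [5, 9, 7, 2],
})
max_position, max_tissue = map_junctions_positions_sketch(example_sj)
```

Selecting the whole row via `idxmax` rather than calling `max` again keeps the tissue in which the maximum was observed alongside the read count.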
score_per_junction
DataFrame with introns annotations
DataFrame with junction read positions and maximum coverage
pd.DataFrame - DataFrame with scores per exon, gene, and transcript including:
- unique_reads - Number of unique reads
- gene_mean - Mean reads per gene
- gene_mean_cds - Mean reads for CDS regions per gene
- RNA2sj - Ratio of unique reads to gene mean
- RNA2sj_cds - Ratio of unique reads to gene mean CDS
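To make the ratios concrete, here is a small sketch that merges intron annotations with junction coverage and derives the columns listed above. Several details are assumptions rather than documented behaviour: the join on `seqname`, `start`, and `end` only (STAR encodes strand as 0/1/2 while GFF uses +/-), treating introns without a matching junction as zero reads, and using a `cds_coverage` value of `full` to select the introns that count toward the CDS mean.

```python
import pandas as pd

JOIN_KEYS = ["seqname", "start", "end"]


def score_per_junction_sketch(df_introns, df_sj_max_position):
    """Attach coverage to introns and compute per-gene means and ratios."""
    df = df_introns.merge(df_sj_max_position, on=JOIN_KEYS, how="left")
    # Assumption: an intron with no observed junction has zero unique reads.
    df["unique_reads"] = df["unique_reads"].fillna(0)
    df["gene_mean"] = df.groupby("gene_id")["unique_reads"].transform("mean")
    # Assumption: only introns fully covered by CDS enter the CDS mean.
    cds_reads = df["unique_reads"].where(df["cds_coverage"] == "full")
    df["gene_mean_cds"] = cds_reads.groupby(df["gene_id"]).transform("mean")
    df["RNA2sj"] = df["unique_reads"] / df["gene_mean"]
    df["RNA2sj_cds"] = df["unique_reads"] / df["gene_mean_cds"]
    return df


# Made-up data: one gene with three introns, two of which have coverage.
example_introns = pd.DataFrame({
    "seqname": ["chr1", "chr1", "chr1"],
    "start": [201, 401, 601],
    "end": [299, 499, 699],
    "gene_id": ["G1", "G1", "G1"],
    "transcript_id": ["T1", "T1", "T1"],
    "cds_coverage": ["full", "full", "none"],
})
example_coverage = pd.DataFrame({
    "seqname": ["chr1", "chr1"],
    "start": [201, 401],
    "end": [299, 499],
    "unique_reads": [10, 30],
    "tissue": ["brain", "liver"],
})
junction_scores = score_per_junction_sketch(example_introns, example_coverage)
```

In this example the gene mean is (10 + 30 + 0) / 3 and the CDS mean is (10 + 30) / 2 = 20, so the first intron has an `RNA2sj` of 0.75 and an `RNA2sj_cds` of 0.5.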
score_per_transcript
DataFrame with score per junction
pd.DataFrame - Final qsplice scores per transcript
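Since the transcript-level score is the minimum junction score within each transcript, the reduction can be sketched as below. The function name, the choice of `unique_reads` as the column to minimise, and the example values are assumptions for illustration.

```python
import pandas as pd


def score_per_transcript_sketch(df_scores):
    """Keep, for each transcript, the junction row with the fewest unique reads."""
    # idxmin returns the row label of the weakest junction in each transcript.
    idx = df_scores.groupby("transcript_id")["unique_reads"].idxmin()
    return df_scores.loc[idx].reset_index(drop=True)


# Made-up junction scores for two transcripts.
example_scores = pd.DataFrame({
    "transcript_id": ["T1", "T1", "T2"],
    "intron_number": [1, 2, 1],
    "unique_reads": [10, 3, 7],
    "tissue": ["brain", "liver", "brain"],
})
transcript_scores = score_per_transcript_sketch(example_scores)
```

Keeping the full row of the weakest junction preserves its intron number and tissue next to the score, which mirrors the columns of the main output file.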
Output Files
The module generates several output files in the specified output directory:
sj_maxp..tsv.gz
Maximum splice junction coverage per position
sj_maxt..tsv.gz
Maximum splice junction coverage per tissue and position
sj_maxp..mapped.tsv.gz
Splice junctions mapped to gene annotations with scores
qsplice..tsv.gz
Main output file containing quantitative splice junction scores per transcript with columns:
- seqname - Chromosome/sequence name
- gene_id - Gene identifier
- gene_name - Gene name
- transcript_id - Transcript identifier
- intron_number - Intron number
- unique_reads - Number of unique reads (minimum per transcript)
- tissue - Tissue where maximum expression was observed
- gene_mean - Mean coverage across gene
- gene_mean_cds - Mean coverage for CDS regions
- RNA2sj - RNA-to-splice junction ratio
- RNA2sj_cds - RNA-to-splice junction ratio for CDS regions
Example Workflow
- Generate introns from GFF annotation using GenomeTools
- Load annotations and filter for CDS, exons, and introns
- Annotate introns with CDS coverage information
- Concatenate RNA-seq samples from STAR SJ.out.tab files
- Map junction positions to get maximum coverage
- Score junctions by merging annotations with coverage data
- Score transcripts by finding minimum junction score per transcript
Notes
- Requires GenomeTools (gt) to be installed for intron generation
- Input SJ.out.tab files should follow STAR aligner output format
- The module uses the minimum junction score per transcript as the transcript-level score
- CDS coverage is classified as: full, partial, or none