qsplice

Overview

The qsplice module quantifies splice junction coverage from RNA-seq data released by STAR and maps it to genome positions. It processes SJ.out.tab files, annotates introns, and generates quantitative splice junction scores per gene and transcript.

Usage

python -m trifid.preprocessing.qsplice \
    --gff ~/data/genome_annotation/GRCh38/g27/gencode.v27.annotation.gff3.gz \
    --outdir data/external/qsplice/GRCh38/g27 \
    --samples /home/projects/rnaseq/out/E-MTAB-2836/GRCh38/STAR/g27 \
    --version g27 \
    --experiment emtab2836 \
    --rm

Command-Line Arguments

--gff

string

required

Custom or GENCODE/Ensembl gff annotation file (gzip compressed)

--outdir

string

required

Output directory where results will be stored

--samples

string

Directory containing SJ.out.tab files to be globbed and concatenated. Pattern: {samples}/*/SJ.out.tab

--version

string

required

Genome annotation version. For GENCODE: g + version number (e.g., g27)

--experiment

string

default:"emtab2836"

Experiment identifier for output file naming

--file

string

Custom splice junctions file (gzip compressed) to use instead of globbing samples directory

--tissue_ids

string

Path to file containing tissue identifiers for annotation

--rm

boolean

default:"false"

If set, removes intermediate files after processing

Core Functions

generate_introns

def generate_introns(inpath: str) -> str

Generates introns file from gff annotation using GenomeTools.

inpath

string

Path to the gff annotation file

Returns: str - Path where introns file has been stored

load_annotations

def load_annotations(filepath: str, db: str, features: list = ["CDS", "exon", "intron"]) -> pd.DataFrame

Loads features from genome annotation file.

filepath

string

Path to .gff annotation file

string

Annotation reference database (GENCODE, RefSeq, or UniProt)

features

list

default:"[\"CDS\", \"exon\", \"intron\"]"

Feature annotations to filter from the gff file

Returns: pd.DataFrame - Features extracted from gff with columns: seqname, type, start, end, strand, gene_id, gene_name, transcript_id, exon_id, exon_number

annotate_introns

def annotate_introns(df: pd.DataFrame) -> pd.DataFrame

Annotates introns with CDS positions by merging exons and CDS information.

pd.DataFrame

Input DataFrame with whole introns annotation

Returns: pd.DataFrame - DataFrame with introns annotated including CDS coverage information

concat_samples

def concat_samples(indir: str, tissue_ids: str) -> pd.DataFrame

Concatenates several SJ.out.tab files into a single pandas DataFrame.

indir

string

Annotation file directory pattern (e.g., {samples}/*/SJ.out.tab)

tissue_ids

string

Path to samples identifiers file for tissue annotation

Returns: pd.DataFrame - Concatenated SJ.out.tab data with tissue annotations

map_junctions_positions

def map_junctions_positions(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]

Extracts the highest coverage per junction position and per tissue.

pd.DataFrame

DataFrame with concatenated RNA-seq splice junctions and reads

Returns: Tuple of two DataFrames:

df_sj_max_position - Maximum coverage per junction position
df_sj_max_tissue - Maximum coverage per junction position and tissue

score_per_junction

def score_per_junction(df_introns: pd.DataFrame, df_sj_max_position: pd.DataFrame) -> pd.DataFrame

Calculates means and scores per junction.

df_introns

pd.DataFrame

DataFrame with introns annotations

df_sj_max_position

pd.DataFrame

DataFrame with junction read positions and maximum coverage

Returns: pd.DataFrame - DataFrame with scores per exon, gene, and transcript including:

unique_reads - Number of unique reads
gene_mean - Mean reads per gene
gene_mean_cds - Mean reads for CDS regions per gene
RNA2sj - Ratio of unique reads to gene mean
RNA2sj_cds - Ratio of unique reads to gene mean CDS

score_per_transcript

def score_per_transcript(df: pd.DataFrame) -> pd.DataFrame

Calculates minimum junction scores per transcript (qsplice output).

pd.DataFrame

DataFrame with score per junction

Returns: pd.DataFrame - Final qsplice scores per transcript

Output Files

The module generates several output files in the specified output directory:

sj_maxp..tsv.gz

Maximum splice junction coverage per position

sj_maxt..tsv.gz

Maximum splice junction coverage per tissue and position

sj_maxp..mapped.tsv.gz

Splice junctions mapped to gene annotations with scores

qsplice..tsv.gz

Main output file containing quantitative splice junction scores per transcript with columns:

seqname - Chromosome/sequence name
gene_id - Gene identifier
gene_name - Gene name
transcript_id - Transcript identifier
intron_number - Intron number
unique_reads - Number of unique reads (minimum per transcript)
tissue - Tissue where maximum expression was observed
gene_mean - Mean coverage across gene
gene_mean_cds - Mean coverage for CDS regions
RNA2sj - RNA-to-splice junction ratio
RNA2sj_cds - RNA-to-splice junction ratio for CDS regions

Example Workflow

Generate introns from GFF annotation using GenomeTools
Load annotations and filter for CDS, exons, and introns
Annotate introns with CDS coverage information
Concatenate RNA-seq samples from STAR SJ.out.tab files
Map junction positions to get maximum coverage
Score junctions by merging annotations with coverage data
Score transcripts by finding minimum junction score per transcript

Notes

Requires GenomeTools (gt) to be installed for intron generation
Input SJ.out.tab files should follow STAR aligner output format
The module uses the minimum junction score per transcript as the transcript-level score
CDS coverage is classified as: full, partial, or none

Preprocessing

Models

Data

Utils

Visualization

Overview

Usage

Command-Line Arguments

Core Functions

generate_introns

load_annotations

annotate_introns

concat_samples

map_junctions_positions

score_per_junction

score_per_transcript

Output Files

sj_maxp..tsv.gz

sj_maxt..tsv.gz

sj_maxp..mapped.tsv.gz

qsplice..tsv.gz

Example Workflow

Notes

Build docs developers (and LLMs) love

Preprocessing

Models

Data

Utils

Visualization

​Overview

​Usage

​Command-Line Arguments

​Core Functions

​generate_introns

​load_annotations

​annotate_introns

​concat_samples

​map_junctions_positions

​score_per_junction

​score_per_transcript

​Output Files

​sj_maxp..tsv.gz

​sj_maxt..tsv.gz

​sj_maxp..mapped.tsv.gz

​qsplice..tsv.gz

​Example Workflow

​Notes

Build docs developers (and LLMs) love

Overview

Usage

Command-Line Arguments

Core Functions

generate_introns

load_annotations

annotate_introns

concat_samples

map_junctions_positions

score_per_junction

score_per_transcript

Output Files

sj_maxp..tsv.gz

sj_maxt..tsv.gz

sj_maxp..mapped.tsv.gz

qsplice..tsv.gz

Example Workflow

Notes