Utils Module

The trifid.utils.utils module contains convenient utility functions for data processing, feature engineering, and general operations.

Classes

Statistics

A class for calculating TRIFID prediction statistics across APPRIS principal and alternative isoforms.

Constructor:

Statistics(df: list, nr: bool = False)

Parameters:

df (list): DataFrame with TRIFID predictions. The signature annotates this argument as list, but the example below passes a pandas DataFrame.
nr (bool, optional): If True, removes redundant transcripts. Defaults to False.

Methods:

get_stats()

get_stats(cutoff: float = 0.5, norm_double_check: bool = False, cutoff_feature: str = "trifid_score")

Returns statistics for functional and non-functional isoforms.

Returns: DataFrame with counts and percentages for PRINCIPAL, ALTERNATIVE, and Total categories

Example:

from trifid.utils.utils import Statistics
import pandas as pd

predictions = pd.read_csv('trifid_predictions.tsv.gz', sep='\t', compression='gzip')
stats = Statistics(predictions)
print(stats.get_stats())

Functions

balanced_training_set()

Creates a balanced 1:1 proportion labeled training set.

balanced_training_set(df: pd.DataFrame, seed: int = 1) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Labelled dataset
seed (int, optional): Random state. Defaults to 1.

Returns: Balanced dataset

Example:

from trifid.utils.utils import balanced_training_set

balanced_df = balanced_training_set(training_data, seed=123)
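The page does not say how the 1:1 proportion is reached. One plausible reading is downsampling the larger class to the size of the smaller one, sketched below. The function name `balance_sketch` and the `label` column are hypothetical and are not taken from trifid.

```python
import pandas as pd

def balance_sketch(df: pd.DataFrame, label_col: str = "label", seed: int = 1) -> pd.DataFrame:
    """Downsample every class to the size of the smallest one (assumed strategy)."""
    n_smallest = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_smallest, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Restore the original row order after sampling.
    return pd.concat(parts).sort_index()

# Toy data: five negatives and two positives.
toy = pd.DataFrame({"label": [0, 0, 0, 0, 0, 1, 1], "score": range(7)})
balanced = balance_sketch(toy, seed=123)
```

Fixing the seed makes the sampled subset reproducible between runs.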

create_dir()

Python equivalent of mkdir -p.

create_dir(dirpath: str) -> str

Parameters:

dirpath (str): Path to create the new folder

Returns: Absolute path of the new folder
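For readers unfamiliar with mkdir -p, the behaviour described above can be approximated with the standard library alone. This is a minimal sketch, not the trifid implementation; `create_dir_sketch` is a hypothetical name.

```python
import os
import tempfile

def create_dir_sketch(dirpath: str) -> str:
    """Create dirpath and any missing parents, then return its absolute path."""
    # exist_ok=True mirrors mkdir -p: an existing directory is not an error.
    os.makedirs(dirpath, exist_ok=True)
    return os.path.abspath(dirpath)

base = tempfile.mkdtemp()
nested = create_dir_sketch(os.path.join(base, "a", "b", "c"))
again = create_dir_sketch(nested)  # a second call does not raise
```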

delta_score()

Calculates delta scores for features (commonly used for length).

delta_score(
    df: pd.DataFrame, 
    features: list, 
    mode: str = "appris", 
    groupby_feature: str = "gene_id"
) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Input dataset
features (list): List of features to transform
mode (str, optional): Mode to select the reference (‘appris’ or ‘longest’). Defaults to ‘appris’.
groupby_feature (str, optional): Feature to group by when calculating score differences. Defaults to ‘gene_id’.

Returns: DataFrame with delta score columns added

Description: For each feature, takes the difference between the largest value among the isoforms of a gene and the value of the current isoform, then normalizes that difference by the largest isoform length in the gene.
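To make the description concrete, here is a sketch of the length case, where the feature and the normalizing value are the same column. It assumes the sign convention (group maximum minus current value) and the `delta_` column prefix, neither of which is confirmed by this page, and it ignores the ‘appris’ reference mode.

```python
import pandas as pd

def delta_longest_sketch(df: pd.DataFrame, feature: str,
                         groupby_feature: str = "gene_id") -> pd.DataFrame:
    """Per-group delta: (group max - value) / group max. Not the trifid code."""
    group_max = df.groupby(groupby_feature)[feature].transform("max")
    out = df.copy()
    # A group whose maximum is 0 would produce NaN here.
    out[f"delta_{feature}"] = (group_max - out[feature]) / group_max
    return out

toy = pd.DataFrame({
    "gene_id": ["g1", "g1", "g2"],
    "transcript_id": ["t1", "t2", "t3"],
    "length": [100, 50, 80],
})
result = delta_longest_sketch(toy, "length")
```

Under these assumptions the longest isoform of each gene gets a delta of 0, and an isoform half as long gets 0.5.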

fragments_correction()

Corrects fragment isoform values by the homologous sequence.

fragments_correction(df_iso: pd.DataFrame, features: list) -> pd.DataFrame

Parameters:

df_iso (pd.DataFrame): Isoforms database
features (list): List of features to correct

Returns: DataFrame with corrected feature columns

generate_training_set()

Creates a training set from GENCODE isoforms and their proteomics evidence.

generate_training_set(df_features: pd.DataFrame, filepath: str) -> pd.DataFrame

Parameters:

df_features (pd.DataFrame): TRIFID database dataset
filepath (str): Filepath to load the labeled isoforms

Returns: Training set with labels

Example:

from trifid.utils.utils import generate_training_set

training_set = generate_training_set(
    df_features, 
    'data/proteomics_evidence.tsv.gz'
)

generate_trifid_metrics()

Generates TRIFID scores to evaluate predictions.

generate_trifid_metrics(
    df: pd.DataFrame, 
    features: pd.DataFrame, 
    model: object, 
    nmax_norm_median: bool = False
) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Dataset to update with predictions
features (pd.DataFrame): Features for making predictions
model (object): Scikit-learn model to make predictions
nmax_norm_median (bool): Min maximum threshold in normalization. Defaults to False.

Returns: DataFrame with TRIFID metrics (trifid_score, norm_trifid_score)

Example:

from trifid.utils.utils import generate_trifid_metrics
import pickle

# Load the trained model; the with block closes the file handle afterwards.
with open('selected_model.pkl', 'rb') as fh:
    model = pickle.load(fh)
predictions = generate_trifid_metrics(df, features, model)

get_df_info()

Gets pandas DataFrame info for logging.

get_df_info(df: pd.DataFrame) -> str

Parameters:

df (pd.DataFrame): Input dataset

Returns: Info message string
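pandas prints DataFrame.info() to stdout by default, so turning it into a string for a logger usually means redirecting it into a buffer. The sketch below shows that pattern; `df_info_sketch` is a hypothetical name and the real function may format its message differently.

```python
import io
import pandas as pd

def df_info_sketch(df: pd.DataFrame) -> str:
    """Capture the output of DataFrame.info() as a string."""
    buffer = io.StringIO()
    df.info(buf=buffer)
    return buffer.getvalue()

toy = pd.DataFrame({"transcript_id": ["t1", "t2"], "trifid_score": [0.9, 0.1]})
info_text = df_info_sketch(toy)
```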

get_id_patterns()

Returns transcript identifier patterns for different species.

get_id_patterns() -> tuple

Returns: Tuple with regex patterns for species-specific transcript identifiers

Supported species:

Homo sapiens (ENST0)
Mus musculus (ENSMUST)
Danio rerio (ENSDART)
Rattus norvegicus (ENSRNOT)
Sus scrofa (ENSSSCT)
Pan troglodytes (ENSPTRT)
Gallus gallus (ENSGALT)
Bos taurus (ENSBTAT)
Drosophila melanogaster (FBtr)
RefSeq (NM, XM, YP)
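As an illustration of how prefix patterns like these can be used, the sketch below maps a transcript identifier to a species. The regular expressions are hypothetical anchors built from three of the prefixes listed above; the exact patterns and the structure of the tuple returned by get_id_patterns() are not shown on this page.

```python
import re

# Hypothetical patterns built from the listed prefixes.
PREFIX_PATTERNS = {
    "Homo sapiens": r"^ENST0",
    "Mus musculus": r"^ENSMUST",
    "Drosophila melanogaster": r"^FBtr",
}

def guess_species(transcript_id: str):
    """Return the first species whose prefix pattern matches, else None."""
    for species, pattern in PREFIX_PATTERNS.items():
        if re.match(pattern, transcript_id):
            return species
    return None

human = guess_species("ENST00000367770")
mouse = guess_species("ENSMUST00000000001")
unknown = guess_species("XYZ123")
```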

group_normalization()

Normalizes features by group.

group_normalization(
    df: pd.DataFrame, 
    features: list, 
    nmax: int = 0, 
    nmin: int = 0, 
    groupby_feature: str = "gene_id"
) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Input dataset
features (list): List of features to normalize
nmax (int, optional): Maximum score value. Defaults to 0.
nmin (int, optional): Minimum score value. Defaults to 0.
groupby_feature (str, optional): Feature to group by. Defaults to ‘gene_id’.

Returns: DataFrame with normalized features added (prefixed with norm_)
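The normalization formula is not given here. One common reading of "normalizes by group" is min-max scaling within each group, sketched below with the norm_ prefix mentioned above. How nmax and nmin enter the real calculation is not covered, and `group_minmax_sketch` is a hypothetical name.

```python
import pandas as pd

def group_minmax_sketch(df: pd.DataFrame, feature: str,
                        groupby_feature: str = "gene_id") -> pd.DataFrame:
    """Min-max scale one feature within each group (assumed formula)."""
    grouped = df.groupby(groupby_feature)[feature]
    group_min = grouped.transform("min")
    group_max = grouped.transform("max")
    span = group_max - group_min
    out = df.copy()
    # Groups with zero span would divide by zero; mask them and fall back to 0.0.
    # Note that fillna also turns genuinely missing inputs into 0.0.
    scaled = (out[feature] - group_min) / span.where(span != 0)
    out[f"norm_{feature}"] = scaled.fillna(0.0)
    return out

toy = pd.DataFrame({
    "gene_id": ["g1", "g1", "g1", "g2"],
    "score": [2, 6, 10, 4],
})
normalized = group_minmax_sketch(toy, "score")
```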

impute()

Imputes missing values in features.

impute(
    df: pd.DataFrame,
    features: list,
    n: int = None,
    itype: str = "class",
    column: str = None,
    condition: str = None,
    percentile: float = None,
) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Input dataset
features (list): List of features to impute
n (int, optional): Value to impute
itype (str, optional): Type of imputation (‘class’, ‘conditional’, ‘percentile’, ‘same_as_norm’). Defaults to ‘class’.
column (str, optional): Column for conditional imputation
condition (str, optional): Condition value
percentile (float, optional): Percentile for percentile imputation

Returns: DataFrame with imputed values

merge_dataframes()

Merges multiple DataFrames on transcript identifier.

merge_dataframes(
    *args, 
    on_type: str = "transcript_id", 
    how_type: str = "left", 
    pivot_on: int = 0, 
    nimpute: int = None
) -> pd.DataFrame

Parameters:

*args: Variable number of DataFrames to merge
on_type (str, optional): Merge key feature. Defaults to ‘transcript_id’.
how_type (str, optional): Merge method. Defaults to ‘left’.
pivot_on (int, optional): Index of pivot DataFrame for left merge. Defaults to 0.
nimpute (int, optional): Value to fill NaN. Defaults to None.

Returns: Merged DataFrame

Example:

from trifid.utils.utils import merge_dataframes

merged = merge_dataframes(df1, df2, df3, on_type='transcript_id')
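Merging a variable number of DataFrames on one key is often written as a fold over pandas merge calls. The sketch below shows that idea for the default left merge with the first DataFrame as pivot. It is an assumption about the approach, does not cover pivot_on or nimpute, and `merge_all_sketch` is a hypothetical name.

```python
from functools import reduce
import pandas as pd

def merge_all_sketch(*frames: pd.DataFrame, on: str = "transcript_id",
                     how: str = "left") -> pd.DataFrame:
    """Merge the frames left to right on a shared key column."""
    return reduce(lambda left, right: left.merge(right, on=on, how=how), frames)

df1 = pd.DataFrame({"transcript_id": ["t1", "t2"], "a": [1, 2]})
df2 = pd.DataFrame({"transcript_id": ["t1"], "b": [10]})
df3 = pd.DataFrame({"transcript_id": ["t2"], "c": [5]})
merged_sketch = merge_all_sketch(df1, df2, df3)
```

With a left merge, transcripts missing from a later DataFrame keep their row and get NaN in that DataFrame's columns.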

one_hot_encoding()

One-hot encodes selected features.

one_hot_encoding(df: pd.DataFrame, features: list) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Input dataset
features (list): Features to encode

Returns: DataFrame with one-hot encoded features
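For reference, one-hot encoding selected columns in pandas is typically done with pd.get_dummies. The sketch below assumes that approach and uses a hypothetical `level` column; the naming of the encoded columns in trifid may differ.

```python
import pandas as pd

def one_hot_sketch(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Replace each listed column with one indicator column per category."""
    return pd.get_dummies(df, columns=features)

toy = pd.DataFrame({"transcript_id": ["t1", "t2", "t3"], "level": ["a", "b", "a"]})
encoded = one_hot_sketch(toy, ["level"])
```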

open_files()

Opens both compressed and non-compressed files.

open_files(filepath: str) -> object

Parameters:

filepath (str): File path

Returns: Open file object
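A common way to open compressed and plain files through one function is to dispatch on the file extension. The sketch below assumes gzip compression and text mode, which this page does not confirm; `open_text_sketch` is a hypothetical name.

```python
import gzip
import os
import tempfile

def open_text_sketch(filepath: str):
    """Open a .gz file with gzip and anything else with the builtin open."""
    if filepath.endswith(".gz"):
        return gzip.open(filepath, "rt")
    return open(filepath, "r")

# Write the same content to a plain file and a gzip file, then read both back.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "demo.tsv")
gz_path = plain_path + ".gz"
with open(plain_path, "w") as fh:
    fh.write("a\tb\n")
with gzip.open(gz_path, "wt") as fh:
    fh.write("a\tb\n")

with open_text_sketch(plain_path) as fh:
    plain_text = fh.read()
with open_text_sketch(gz_path) as fh:
    gz_text = fh.read()
```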

parse_yaml()

Parses YAML configuration files.

parse_yaml(yaml_file: str) -> dict

Parameters:

yaml_file (str): Config file path in YAML format

Returns: Dictionary with configuration data

Example:

from trifid.utils.utils import parse_yaml

config = parse_yaml('config/config.yaml')

reduce_mem_usage()

Reduces memory usage of pandas DataFrames.

reduce_mem_usage(
    df: pd.DataFrame, 
    verbose: bool = False, 
    round_float: int = False
) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Input dataset
verbose (bool, optional): Verbosity control. Defaults to False.
round_float (int, optional): Round decimals. Defaults to False.

Returns: Tuple of (DataFrame with reduced memory, list of NA columns). This differs from the pd.DataFrame annotation in the signature above.
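Memory reduction of this kind usually works by downcasting numeric columns to the smallest dtype that holds their values. The sketch below shows the idea with pd.to_numeric. It is not the trifid implementation, returns only the DataFrame, and does not handle NA columns, verbose, or round_float.

```python
import numpy as np
import pandas as pd

def downcast_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast integer and float columns to smaller dtypes where possible."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # downcast="float" goes no lower than float32.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

toy = pd.DataFrame({"count": [1, 2, 3], "score": [0.5, 1.5, 2.5]})
small = downcast_sketch(toy)
```

Downcasting floats to float32 trades precision for memory, so it is worth checking that downstream comparisons tolerate the change.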

reorder_cols()

Reorders DataFrame columns with objects first, then numerics.

reorder_cols(df: pd.DataFrame) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Input DataFrame

Returns: Reordered DataFrame
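The reordering can be sketched by splitting columns into numeric and non-numeric with select_dtypes. The sketch treats every non-numeric column as an "object" column, which is an assumption, and keeps the original order within each group; `reorder_sketch` is a hypothetical name.

```python
import pandas as pd

def reorder_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Put non-numeric columns first, followed by numeric columns."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    other_cols = [col for col in df.columns if col not in numeric_cols]
    return df[other_cols + numeric_cols]

toy = pd.DataFrame({
    "length": [100],
    "transcript_id": ["t1"],
    "score": [0.9],
    "gene_id": ["g1"],
})
reordered = reorder_sketch(toy)
```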

round_df_floats()

Rounds all float columns in a DataFrame.

round_df_floats(df: pd.DataFrame, n: int = 4) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Input dataset
n (int, optional): Number of decimal places. Defaults to 4.

Returns: DataFrame with rounded floats

timer()

Context manager for timing code blocks.

@contextmanager
timer(title: str)

Parameters:

title (str): Description of the timed block

Example:

from trifid.utils.utils import timer

with timer("Processing data"):
    # Your code here
    process_data()
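For context, a timing context manager of this shape can be written in a few lines with contextlib. This is a sketch rather than the trifid source: the message format and the `log` parameter are assumptions added so the output can be captured.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer_sketch(title: str, log=print):
    """Report how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log(f"{title}: done in {elapsed:.3f} s")

messages = []
with timer_sketch("Processing data", log=messages.append):
    total = sum(range(1000))
```

The try/finally ensures the elapsed time is reported when the timed block fails as well.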

unity_ranger()

Truncates values to range [0, 1].

unity_ranger(df: pd.DataFrame, features: list) -> pd.DataFrame

Parameters:

df (pd.DataFrame): Input dataset
features (list): Features to truncate

Returns: DataFrame with values bounded between 0 and 1

Description: Values higher than 1 are set to 1, values lower than 0 are set to 0.
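The truncation described above corresponds to what pandas calls clipping. A minimal sketch, assuming the function leaves other columns untouched (`clip_unit_sketch` is a hypothetical name):

```python
import pandas as pd

def clip_unit_sketch(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Bound the listed columns to the range [0, 1]."""
    out = df.copy()
    out[features] = out[features].clip(lower=0, upper=1)
    return out

toy = pd.DataFrame({"score": [-0.2, 0.3, 1.7], "length": [100, 50, 80]})
clipped = clip_unit_sketch(toy, ["score"])
```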
