Utils Module
The trifid.utils.utils module contains utility functions for data processing, feature engineering, and general operations.
Classes
Statistics
A class for calculating TRIFID prediction statistics across APPRIS principal and alternative isoforms.
Constructor:
- df (list): DataFrame with TRIFID predictions
- nr (bool, optional): If True, removes redundant transcripts. Defaults to False.
get_stats()
Functions
balanced_training_set()
Creates a balanced labeled training set with a 1:1 class proportion.
- df (pd.DataFrame): Labeled dataset
- seed (int, optional): Random state. Defaults to 1.
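One common way to reach a 1:1 proportion is to downsample the larger class to the size of the smaller one. The sketch below assumes that strategy and a binary column named label; both are assumptions, and balanced_sketch is an illustrative helper, not the TRIFID implementation.

```python
import pandas as pd


def balanced_sketch(df: pd.DataFrame, seed: int = 1, label_col: str = "label") -> pd.DataFrame:
    """Downsample every class to the size of the smallest one (assumed strategy)."""
    n_smallest = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_smallest, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts).sort_index()


# Toy data: 3 positive and 7 negative isoforms.
toy = pd.DataFrame(
    {"transcript_id": [f"T{i}" for i in range(10)], "label": [1] * 3 + [0] * 7}
)
balanced = balanced_sketch(toy)
class_counts = balanced["label"].value_counts().to_dict()
```

Fixing the seed keeps the sampled subset reproducible between runs.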
create_dir()
Python equivalent of mkdir -p.
- dirpath (str): Path of the folder to create
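The mkdir -p behaviour (create missing parents, do nothing if the folder exists) maps onto os.makedirs with exist_ok=True. A minimal sketch, with create_dir_sketch as a hypothetical stand-in name:

```python
import os
import tempfile


def create_dir_sketch(dirpath: str) -> str:
    """Create dirpath and any missing parents; do nothing if it already exists."""
    os.makedirs(dirpath, exist_ok=True)
    return dirpath


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "a", "b", "c")
    create_dir_sketch(target)
    create_dir_sketch(target)  # second call is a no-op, like mkdir -p
    created = os.path.isdir(target)
```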
delta_score()
Calculates delta scores for features (commonly used for length).
- df (pd.DataFrame): Input dataset
- features (list): List of features to transform
- mode (str): Mode to select the reference (‘appris’ or ‘longest’)
- groupby_feature (str, optional): Feature to calculate score differences. Defaults to ‘gene_id’.
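To illustrate the idea for the ‘longest’ mode only, the sketch below takes the largest value in each group as the reference and subtracts it from every isoform. The sign convention, the delta_ column prefix, and the helper name are assumptions; the ‘appris’ mode is not covered.

```python
import pandas as pd


def delta_longest_sketch(df, features, groupby_feature="gene_id"):
    """Subtract the per-group maximum from each value (assumed 'longest' reference)."""
    out = df.copy()
    for feature in features:
        reference = out.groupby(groupby_feature)[feature].transform("max")
        out[f"delta_{feature}"] = out[feature] - reference  # hypothetical column name
    return out


toy = pd.DataFrame(
    {
        "gene_id": ["g1", "g1", "g1", "g2"],
        "transcript_id": ["t1", "t2", "t3", "t4"],
        "length": [300, 250, 100, 80],
    }
)
deltas = delta_longest_sketch(toy, ["length"])["delta_length"].tolist()
```

With this convention the reference isoform of each gene scores 0 and shorter isoforms score negative.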
fragments_correction()
Corrects fragment isoform values by the homologous sequence.
- df_iso (pd.DataFrame): Isoforms database
- features (list): List of features to correct
generate_training_set()
Creates a training set from GENCODE isoforms and their proteomics evidence.
- df_features (pd.DataFrame): TRIFID database dataset
- filepath (str): Filepath to load the labeled isoforms
generate_trifid_metrics()
Generates TRIFID scores to evaluate predictions.
- df (pd.DataFrame): Dataset to update with predictions
- features (pd.DataFrame): Features for making predictions
- model (object): Scikit-learn model to make predictions
- nmax_norm_median (bool): Min maximum threshold in normalization. Defaults to False.
Output columns: trifid_score, norm_trifid_score.
get_df_info()
Gets pandas DataFrame info for logging.
- df (pd.DataFrame): Input dataset
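DataFrame.info() prints to standard output by default, so capturing it for a log message usually means redirecting it into a buffer. A minimal sketch of that idiom, with a hypothetical helper and logger name:

```python
import io
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("trifid_sketch")  # hypothetical logger name


def df_info_sketch(df: pd.DataFrame) -> str:
    """Return the text that DataFrame.info() would normally print."""
    buffer = io.StringIO()
    df.info(buf=buffer)
    return buffer.getvalue()


toy = pd.DataFrame({"transcript_id": ["t1", "t2"], "length": [300, 250]})
info_text = df_info_sketch(toy)
logger.info("Dataset summary:\n%s", info_text)
```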
get_id_patterns()
Returns transcript identifier patterns for different species.
- Homo sapiens (ENST0)
- Mus musculus (ENSMUST)
- Danio rerio (ENSDART)
- Rattus norvegicus (ENSRNOT)
- Sus scrofa (ENSSSCT)
- Pan troglodytes (ENSPTRT)
- Gallus gallus (ENSGALT)
- Bos taurus (ENSBTAT)
- Drosophila melanogaster (FBtr)
- RefSeq (NM, XM, YP)
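As an illustration of how such patterns can be used, the sketch below stores the prefixes listed above in a dictionary and looks up the source of a transcript identifier. The dictionary layout and the source_for helper are assumptions; the actual return type of get_id_patterns() is not shown here.

```python
from typing import Optional

# Prefixes copied from the list above; the dictionary layout is an assumption.
ID_PREFIXES = {
    "Homo sapiens": ("ENST0",),
    "Mus musculus": ("ENSMUST",),
    "Danio rerio": ("ENSDART",),
    "Rattus norvegicus": ("ENSRNOT",),
    "Sus scrofa": ("ENSSSCT",),
    "Pan troglodytes": ("ENSPTRT",),
    "Gallus gallus": ("ENSGALT",),
    "Bos taurus": ("ENSBTAT",),
    "Drosophila melanogaster": ("FBtr",),
    "RefSeq": ("NM", "XM", "YP"),
}


def source_for(transcript_id: str) -> Optional[str]:
    """Return the first source whose prefix matches, or None if nothing matches."""
    for source, prefixes in ID_PREFIXES.items():
        if transcript_id.startswith(prefixes):  # str.startswith accepts a tuple
            return source
    return None


human = source_for("ENST00000373020")
mouse = source_for("ENSMUST00000000001")
refseq = source_for("NM_000546.6")
unknown = source_for("XYZ_1")
```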
group_normalization()
Normalizes features by group.
- df (pd.DataFrame): Input dataset
- features (list): List of features to normalize
- nmax (int, optional): Maximum score value. Defaults to 0.
- nmin (int, optional): Minimum score value. Defaults to 0.
- groupby_feature (str, optional): Feature to group by. Defaults to ‘gene_id’.
Output columns are prefixed with norm_.
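A per-group normalization can be sketched as min-max scaling within each gene. The scaling formula, the handling of single-value groups (mapped to 0.0), and the helper name are assumptions; the nmax and nmin parameters are not modelled.

```python
import pandas as pd


def group_minmax_sketch(df, features, groupby_feature="gene_id"):
    """Scale each feature to [0, 1] within its group (assumed min-max scheme)."""
    out = df.copy()
    for feature in features:
        grouped = df.groupby(groupby_feature)[feature]
        gmin = grouped.transform("min")
        span = grouped.transform("max") - gmin
        scaled = (df[feature] - gmin) / span
        # Groups with a single distinct value have span 0; map them to 0.0.
        out[f"norm_{feature}"] = scaled.where(span != 0, 0.0)
    return out


toy = pd.DataFrame(
    {"gene_id": ["g1", "g1", "g1", "g2"], "length": [100, 200, 300, 50]}
)
normed = group_minmax_sketch(toy, ["length"])
norm_values = normed["norm_length"].tolist()
```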
impute()
Imputes missing values in features.
- df (pd.DataFrame): Input dataset
- features (list): List of features to impute
- n (int, optional): Value to impute
- itype (str, optional): Type of imputation (‘class’, ‘conditional’, ‘percentile’, ‘same_as_norm’). Defaults to ‘class’.
- column (str, optional): Column for conditional imputation
- condition (str, optional): Condition value
- percentile (float, optional): Percentile for percentile imputation
merge_dataframes()
Merges multiple DataFrames on transcript identifier.
- *args: Variable number of DataFrames to merge
- on_type (str, optional): Merge key feature. Defaults to ‘transcript_id’.
- how_type (str, optional): Merge method. Defaults to ‘left’.
- pivot_on (int, optional): Index of pivot DataFrame for left merge. Defaults to 0.
- nimpute (int, optional): Value to fill NaN. Defaults to None.
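Merging an arbitrary number of DataFrames on one key can be sketched by folding pd.merge over the inputs. The helper below is a simplified illustration that reuses the parameter names above but ignores pivot_on (the first DataFrame is always the left side); it is not the TRIFID implementation.

```python
from functools import reduce

import pandas as pd


def merge_sketch(*frames, on_type="transcript_id", how_type="left", nimpute=None):
    """Merge frames left to right on one key, optionally filling NaN afterwards."""
    merged = reduce(
        lambda left, right: pd.merge(left, right, on=on_type, how=how_type), frames
    )
    if nimpute is not None:
        merged = merged.fillna(nimpute)
    return merged


lengths = pd.DataFrame({"transcript_id": ["t1", "t2", "t3"], "length": [300, 250, 100]})
scores = pd.DataFrame({"transcript_id": ["t1", "t3"], "score": [0.9, 0.4]})
merged = merge_sketch(lengths, scores, nimpute=0)
merged_scores = merged["score"].tolist()
```

With a left merge every row of the first DataFrame is kept, so t2 survives and its missing score is filled with the nimpute value.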
one_hot_encoding()
One-hot encodes selected features.
- df (pd.DataFrame): Input dataset
- features (list): Features to encode
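In pandas, one-hot encoding of selected columns is commonly done with pd.get_dummies. The sketch below assumes that approach; the helper name and the biotype column are hypothetical.

```python
import pandas as pd


def one_hot_sketch(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Replace each listed column with one indicator column per category."""
    return pd.get_dummies(df, columns=features)


toy = pd.DataFrame(
    {"transcript_id": ["t1", "t2", "t3"], "biotype": ["a", "b", "a"]}
)
encoded = one_hot_sketch(toy, ["biotype"])
encoded_cols = list(encoded.columns)
is_a = encoded["biotype_a"].astype(int).tolist()
```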
open_files()
Opens both compressed and non-compressed files.
- filepath (str): File path
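A minimal sketch of transparent opening, assuming gzip compression detected from a .gz extension. Both the detection rule and the helper name are assumptions; other compression formats are not handled.

```python
import gzip
import os
import tempfile


def open_file_sketch(filepath: str):
    """Open a text file, using gzip when the name ends in .gz."""
    if filepath.endswith(".gz"):
        return gzip.open(filepath, "rt", encoding="utf-8")
    return open(filepath, "r", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "ids.txt")
    gz_path = os.path.join(tmp, "ids.txt.gz")
    with open(plain_path, "w", encoding="utf-8") as handle:
        handle.write("ENST00000373020\n")
    with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
        handle.write("ENST00000373020\n")

    # The caller reads both files the same way.
    with open_file_sketch(plain_path) as handle:
        plain_text = handle.read()
    with open_file_sketch(gz_path) as handle:
        gz_text = handle.read()
```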
parse_yaml()
Parses YAML configuration files.
- yaml_file (str): Config file path in YAML format
reduce_mem_usage()
Reduces memory usage of pandas DataFrames.
- df (pd.DataFrame): Input dataset
- verbose (bool, optional): Verbosity control. Defaults to False.
- round_float (int, optional): Round decimals. Defaults to False.
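A typical way to shrink a DataFrame is to downcast numeric columns to the smallest dtype that holds their values. The sketch below uses pd.to_numeric with downcast for this; it is an assumed approach with a hypothetical helper name, and it leaves out the verbose and round_float options.

```python
import pandas as pd


def downcast_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast integer and float columns to smaller dtypes where possible."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


toy = pd.DataFrame({"count": [1, 2, 3], "score": [0.5, 0.25, 0.75]})
small = downcast_sketch(toy)
bytes_before = int(toy.memory_usage(deep=True).sum())
bytes_after = int(small.memory_usage(deep=True).sum())
count_dtype = str(small["count"].dtype)
score_dtype = str(small["score"].dtype)
```

Downcasting floats to float32 trades precision for memory, so it is worth checking that downstream scores tolerate it.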
reorder_cols()
Reorders DataFrame columns with objects first, then numerics.
- df (pd.DataFrame): Input DataFrame
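The reordering can be sketched by splitting columns into non-numeric and numeric groups and concatenating them in that order. Treating every non-numeric column as an "object" column is an assumption, as is the helper name.

```python
import pandas as pd


def reorder_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Place non-numeric columns first, keeping the original order within each group."""
    numeric = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
    other = [c for c in df.columns if c not in numeric]
    return df[other + numeric]


toy = pd.DataFrame(
    {
        "length": [300, 250],
        "transcript_id": ["t1", "t2"],
        "score": [0.9, 0.4],
        "gene_id": ["g1", "g1"],
    }
)
ordered_cols = list(reorder_sketch(toy).columns)
```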
round_df_floats()
Rounds all float columns in a DataFrame.
- df (pd.DataFrame): Input dataset
- n (int, optional): Number of decimal places. Defaults to 4.
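A minimal sketch of rounding only the float columns, leaving integer and text columns untouched. The helper name is hypothetical.

```python
import pandas as pd


def round_floats_sketch(df: pd.DataFrame, n: int = 4) -> pd.DataFrame:
    """Round every float column to n decimals; other columns are left as they are."""
    out = df.copy()
    float_cols = [c for c in out.columns if pd.api.types.is_float_dtype(out[c])]
    if float_cols:
        out[float_cols] = out[float_cols].round(n)
    return out


toy = pd.DataFrame(
    {"transcript_id": ["t1", "t2"], "length": [300, 250], "score": [0.123456, 0.5]}
)
rounded = round_floats_sketch(toy)
rounded_scores = rounded["score"].tolist()
lengths = rounded["length"].tolist()
```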
timer()
Context manager for timing code blocks.
- title (str): Description of the timed block
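A timing context manager of this kind can be sketched with contextlib.contextmanager. The output format and the helper name below are assumptions.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer_sketch(title: str):
    """Print how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{title}: {elapsed:.3f} s")


with timer_sketch("toy loop"):
    total = sum(range(1000))
```

The try/finally ensures the elapsed time is reported when the timed block fails as well.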
unity_ranger()
Truncates values to range [0, 1].
- df (pd.DataFrame): Input dataset
- features (list): Features to truncate
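Reading "truncates" as clipping (values below 0 become 0, values above 1 become 1), the operation can be sketched with DataFrame.clip. That reading and the helper name are assumptions.

```python
import pandas as pd


def clip_unit_sketch(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Clip the listed columns to the closed interval [0, 1]."""
    out = df.copy()
    out[features] = out[features].clip(lower=0, upper=1)
    return out


toy = pd.DataFrame({"transcript_id": ["t1", "t2", "t3"], "score": [-0.2, 0.5, 1.3]})
clipped = clip_unit_sketch(toy, ["score"])["score"].tolist()
```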