predict

Overview

The predict module generates TRIFID scores for transcript isoforms using a trained model. It handles multiple genome assemblies and releases, with automatic feature preprocessing and missing value handling.

Command-Line Usage

python -m trifid.models.predict \
    --config config/config.yaml \
    --features config/features.yaml \
    --model models/trifid.v_1_0_4.pkl \
    --assembly GRCh38 \
    --release g27

Command-Line Arguments

-a, --assembly

string

default:"GRCh38"

Genome assembly version. Supported assemblies:

Human: GRCh38, GRCh37
Mouse: GRCm39, GRCm38
Rat: Rnor_6.0
Zebrafish: GRCz11
Pig: Sscrofa11.1
Chimpanzee: Pan_tro_3.0
Chicken: GRCg6a
Cow: ARS-UCD1.2
Fly: BDGP6
Worm: WBcel235

-c, --config

string

default:"config/config.yaml"

Path to configuration file containing genome and annotation settings

-f, --features

string

default:"config/features.yaml"

Path to the YAML file describing the selected features

-m, --model

string

default:"models/trifid.v_1_0_4.pkl"

Path to pretrained model pickle file

-r, --release

string

default:"g27"

Genome release version (e.g., g27, r110, e104)
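
The options above map naturally onto an argparse parser. The following is a minimal sketch, not the module's actual source: the function name build_parser and the help strings are assumptions, while the flags and defaults are copied from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the five documented options and their defaults."""
    parser = argparse.ArgumentParser(prog="trifid.models.predict")
    parser.add_argument("-a", "--assembly", default="GRCh38",
                        help="Genome assembly version")
    parser.add_argument("-c", "--config", default="config/config.yaml",
                        help="Path to configuration file")
    parser.add_argument("-f", "--features", default="config/features.yaml",
                        help="Path to the features YAML file")
    parser.add_argument("-m", "--model", default="models/trifid.v_1_0_4.pkl",
                        help="Path to pretrained model pickle file")
    parser.add_argument("-r", "--release", default="g27",
                        help="Genome release version")
    return parser


if __name__ == "__main__":
    # Override two options and keep the remaining defaults.
    args = build_parser().parse_args(["--assembly", "GRCm39", "-r", "m27"])
    print(args.assembly, args.release, args.model)
```

Any option left off the command line falls back to its documented default, so a bare invocation targets GRCh38 release g27.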

Core Function

make_predictions()

Generates TRIFID predictions for transcript isoforms.

from trifid.models.predict import make_predictions

make_predictions(
    features=feature_list,
    ids=identifier_list,
    config=config_dict,
    model=trained_model,
    assembly="GRCh38",
    release="g27"
)

Parameters

features

list

required

List of feature column names to use for prediction

ids

list

required

List of identifier column names (e.g., gene_id, transcript_id)

config

dict

required

Configuration dictionary containing:

annotation.genome_version: Genome assembly version
annotation.db: Database directory name

model

object

required

Trained scikit-learn model object loaded from pickle file

assembly

string

required

Genome assembly version identifier

release

string

required

Genome release identifier

Returns

Writes predictions to trifid_predictions.tsv.gz with the following columns:

gene_id

string

Ensembl gene identifier

gene_name

string

Gene symbol/name

transcript_id

string

Ensembl transcript identifier

translation_id

string

Ensembl protein identifier

flags

string

Annotation flags

ccdsid

string

CCDS identifier (if available)

appris

string

APPRIS annotation

ann_type

string

Annotation type

length

integer

Transcript length

trifid_score

float

Raw TRIFID score (probability between 0 and 1)

norm_trifid_score

float

TRIFID score normalized within each gene
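
The exact normalization formula is not stated here. As an illustration only, the sketch below assumes that each raw score is divided by the highest raw score among the isoforms of the same gene; the function name normalize_within_gene and the sample identifiers are hypothetical, and the real module may normalize differently.

```python
from collections import defaultdict


def normalize_within_gene(rows):
    """Return copies of rows with an added norm_trifid_score.

    ASSUMPTION: normalized score = raw score / highest raw score in the gene.
    """
    best = defaultdict(float)
    for row in rows:
        gene = row["gene_id"]
        best[gene] = max(best[gene], row["trifid_score"])

    normalized = []
    for row in rows:
        top = best[row["gene_id"]]
        # Guard against genes whose isoforms all score 0.
        norm = row["trifid_score"] / top if top > 0 else 0.0
        normalized.append({**row, "norm_trifid_score": norm})
    return normalized


if __name__ == "__main__":
    sample = [
        {"gene_id": "G1", "transcript_id": "T1", "trifid_score": 0.8},
        {"gene_id": "G1", "transcript_id": "T2", "trifid_score": 0.4},
        {"gene_id": "G2", "transcript_id": "T3", "trifid_score": 0.5},
    ]
    for row in normalize_within_gene(sample):
        print(row["transcript_id"], row["norm_trifid_score"])
```

Under this assumption the top-scoring isoform of every gene receives 1.0, which makes scores comparable across genes with different raw score ranges.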

Assembly-Specific Preprocessing

The function automatically handles missing values and feature differences across assemblies:

Human Assemblies

GRCh38 (Gencode):

No special preprocessing required
All features available

GRCh37 (Gencode):

Fills missing values with -1

GRCh38/GRCh37 (RefSeq):

Fills RefSeq-unavailable features with -1
Handles CCDS binary encoding
Fills corsair features with 0
Fills RNA2sj features with -1 (GRCh37 only)

Model Organism Assemblies

Mouse (GRCm39, GRCm38):

Fills RNA2sj features with -1

Rat (Rnor_6.0), Zebrafish (GRCz11), Pig (Sscrofa11.1), Chimpanzee (Pan_tro_3.0):

Removes CCDS feature
Loads ACDS (Alternative CCDS) data
Fills RNA2sj features with -1

Chicken (GRCg6a), Cow (ARS-UCD1.2), Fly (BDGP6):

Fills all missing values with -1

Worm (WBcel235):

Fills all missing values with -1 or 0, depending on the feature
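
The fill rules above all reduce to one operation: replace missing values in a chosen set of features with a sentinel. The sketch below shows that operation on plain dictionaries. It is not the module's implementation; the helper name fill_missing and the feature names RNA2sj and corsair are used here as placeholders for the real column names.

```python
def fill_missing(rows, feature_names, fill_value=-1):
    """Return copies of rows where None in the named features becomes fill_value."""
    targets = set(feature_names)
    return [
        {
            key: (fill_value if key in targets and value is None else value)
            for key, value in row.items()
        }
        for row in rows
    ]


if __name__ == "__main__":
    rows = [{"transcript_id": "T1", "RNA2sj": None, "corsair": None, "length": 100}]
    # RefSeq-style handling as described above: corsair -> 0, RNA2sj -> -1.
    rows = fill_missing(rows, ["corsair"], fill_value=0)
    rows = fill_missing(rows, ["RNA2sj"], fill_value=-1)
    print(rows[0])
```

Applying the fills feature by feature keeps the distinction between the two sentinels explicit, which matters where one assembly uses both.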

Example Workflows

Basic Prediction

import pickle
import pandas as pd
from trifid.models.predict import make_predictions
from trifid.utils.utils import parse_yaml

# Load configuration
config = parse_yaml("config/config.yaml")
df_features = pd.DataFrame(parse_yaml("config/features.yaml"))

# Load model (the context manager closes the file handle after loading)
with open("models/trifid.v_1_0_4.pkl", "rb") as handle:
    model = pickle.load(handle)

# Extract feature and identifier lists
features = df_features[
    ~df_features["category"].str.contains("Identifier")
]["feature"].values
ids = df_features[
    df_features["category"].str.contains("Identifier")
]["feature"].values

# Generate predictions
make_predictions(
    features=features,
    ids=ids,
    config=config,
    model=model,
    assembly="GRCh38",
    release="g27"
)

Predict for Multiple Assemblies

# Human GRCh38
python -m trifid.models.predict \
    --assembly GRCh38 \
    --release g27

# Human GRCh37
python -m trifid.models.predict \
    --assembly GRCh37 \
    --release g27

# Mouse GRCm39
python -m trifid.models.predict \
    --assembly GRCm39 \
    --release m27

Custom Model Path

python -m trifid.models.predict \
    --config config/config.yaml \
    --features config/features.yaml \
    --model models/custom_model.pkl \
    --assembly GRCh38 \
    --release g27

Input Requirements

TRIFID Database

The input database (trifid_db.tsv.gz) must contain:

Identifier columns: gene_id, gene_name, transcript_id, translation_id, ccdsid
Feature columns: All features specified in features YAML
Annotation columns: flags, appris, ann_type
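
Before running a prediction it can help to confirm that the database header contains every required column. The following is a rough sketch using only the standard library; the function name missing_columns is hypothetical, and the required column list is taken from the requirements above.

```python
import csv
import gzip

# Identifier and annotation columns listed in the requirements above.
REQUIRED_COLUMNS = [
    "gene_id", "gene_name", "transcript_id", "translation_id", "ccdsid",
    "flags", "appris", "ann_type",
]


def missing_columns(path, feature_names=()):
    """Return required or feature columns absent from a gzipped TSV header."""
    with gzip.open(path, "rt", newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    expected = list(REQUIRED_COLUMNS) + list(feature_names)
    return [name for name in expected if name not in header]
```

An empty list means the header satisfies the requirements; only the first line of the file is read, so the check is cheap even for a full human database.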

Configuration File

The config YAML must specify:

annotation:
  genome_version: "GRCh38"
  db: "g27"

Features File

The features YAML must include:

- feature: "gene_id"
  category: "Identifier"
- feature: "transcript_length"
  category: "Structural"
  refseq: "y"  # or "n" for RefSeq-incompatible features
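
A parsed features file is a list of entries like the ones above, which can be split into identifier and feature columns by category. The sketch below works on plain dictionaries rather than a DataFrame. The function name split_features and the refseq_only switch are assumptions made for illustration, not part of the documented API.

```python
def split_features(entries, refseq_only=False):
    """Split parsed feature entries into (features, ids) by category.

    With refseq_only=True, keep only features marked refseq: "y".
    """
    features, ids = [], []
    for entry in entries:
        name = entry["feature"]
        if "Identifier" in entry.get("category", ""):
            ids.append(name)
        elif not refseq_only or entry.get("refseq") == "y":
            features.append(name)
    return features, ids


if __name__ == "__main__":
    entries = [
        {"feature": "gene_id", "category": "Identifier"},
        {"feature": "transcript_length", "category": "Structural", "refseq": "y"},
    ]
    print(split_features(entries))
```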

Output Format

Predictions are saved to:

data/genomes/{assembly}/{release}/trifid_predictions.tsv.gz

The output file is a compressed TSV with headers and contains one row per transcript isoform.
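
To locate and read the output, the path template above can be filled in directly. This sketch uses only the standard library; the helper names predictions_path and read_predictions are hypothetical, and every value is returned as a string because no type conversion is applied.

```python
import csv
import gzip
from pathlib import Path


def predictions_path(assembly, release, root="data/genomes"):
    """Build the documented output path for an assembly and release."""
    return Path(root) / assembly / release / "trifid_predictions.tsv.gz"


def read_predictions(path):
    """Read a gzipped TSV of predictions into a list of dicts, one per isoform."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    print(predictions_path("GRCh38", "g27").as_posix())
```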

Performance Considerations

The function processes the entire TRIFID database for the specified assembly
Memory usage depends on genome size (human: ~200K transcripts)
Prediction time: ~1-10 seconds for human genome with RandomForest model
Output file size: ~5-15 MB compressed

Related Modules

train: Model training
interpret: Model interpretation and feature importance
select: Model selection and evaluation
