Custom Genome Analysis
Learn how to apply TRIFID to new species, custom genome annotations, or private datasets.
Overview
TRIFID can be applied to any well-annotated eukaryotic genome. This guide covers:
- Preparing custom genome annotations
- Generating required features
- Training species-specific models
- Using pre-trained models with transfer learning
Supported Species
TRIFID has been successfully applied to:
Vertebrates
- Human (Homo sapiens) - GENCODE, RefSeq
- Mouse (Mus musculus) - GENCODE
- Rat (Rattus norvegicus) - Ensembl
- Zebrafish (Danio rerio) - Ensembl
- Chicken (Gallus gallus) - Ensembl
- Chimpanzee (Pan troglodytes) - Ensembl
- Pig (Sus scrofa) - Ensembl
- Cow (Bos taurus) - Ensembl
- Macaque (Macaca mulatta) - Ensembl
Invertebrates
- Fruit fly (Drosophila melanogaster) - FlyBase
- C. elegans (Caenorhabditis elegans) - WormBase
Pre-computed features and predictions are available for these species. See Data Availability.
Requirements for New Genomes
Essential Data
- Genome annotation (GTF/GFF3)
  - Transcript coordinates
  - Exon/intron structure
  - CDS annotations
  - Gene/transcript IDs
- Protein sequences (FASTA)
  - Translated protein sequences for all isoforms
- Principal isoform labels (optional but recommended)
  - From APPRIS, UniProt, or manual curation
Optional but Beneficial
- RNA-seq data
  - STAR alignments (`SJ.out.tab` files)
  - Multiple tissues for comprehensive coverage
- Conservation scores
  - PhyloCSF or similar
  - Cross-species alignments
- Domain annotations
  - Pfam, SMART, or InterPro
  - APPRIS SPADE scores
Workflow for Custom Genomes
Step 1: Prepare Genome Annotation
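Before configuring anything, it helps to confirm that the annotation actually contains the records listed under Essential Data. The following is a minimal sketch, not part of TRIFID: the helper names `features_per_transcript` and `transcripts_without_cds` are hypothetical, and the code assumes a standard nine-column GTF with `transcript_id` attributes.

```python
import io
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: transcript_id "T1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def features_per_transcript(handle):
    """Map each transcript_id to the set of feature types (column 3) seen for it."""
    seen = defaultdict(set)
    for line in handle:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "transcript_id" in attrs:
            seen[attrs["transcript_id"]].add(cols[2])
    return seen


def transcripts_without_cds(seen):
    """Return transcript IDs that have no CDS records (hypothetical helper)."""
    return sorted(t for t, kinds in seen.items() if "CDS" not in kinds)


# Tiny in-memory example; replace with open("your_annotation.gtf") in practice.
example = io.StringIO(
    '1\tensembl\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    '1\tensembl\texon\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    '1\tensembl\tCDS\t150\t450\t.\t+\t0\tgene_id "G1"; transcript_id "T1";\n'
    '1\tensembl\ttranscript\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'
    '1\tensembl\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'
)
seen = features_per_transcript(example)
no_cds = transcripts_without_cds(seen)
print(no_cds)
```

Transcripts without CDS records are not necessarily an error, since non-coding transcripts legitimately lack them, but a coding annotation in which most transcripts appear in this list probably needs attention before feature generation.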
Step 2: Configure TRIFID
Edit `config/config.yaml` to add your genome.
Step 3: Generate Features
3.1 QSplice (RNA-seq junction coverage)
If RNA-seq data are not available:
- QSplice features will be missing
- The model will use default/imputed values
- Performance may be slightly reduced, but the model remains functional
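To make the idea of junction coverage concrete, the sketch below reads STAR `SJ.out.tab` records and reports the weakest supported splice junction of a transcript. It is an illustration only and does not reproduce QSplice's actual scoring: the function names are hypothetical, and the code relies on the STAR column layout (column 1 chromosome, columns 2 and 3 the 1-based first and last intron base, column 7 the number of uniquely mapped reads).

```python
from collections import Counter


def add_junctions(lines, coverage=None):
    """Accumulate uniquely mapped read counts per junction from SJ.out.tab lines."""
    coverage = Counter() if coverage is None else coverage
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # skip malformed or empty lines
        key = (cols[0], int(cols[1]), int(cols[2]))
        coverage[key] += int(cols[6])  # column 7: uniquely mapped reads
    return coverage


def introns_from_exons(chrom, exons):
    """Derive 1-based inclusive intron intervals from (start, end) exon intervals."""
    exons = sorted(exons)
    return [
        (chrom, prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


def min_junction_coverage(chrom, exons, coverage):
    """Lowest read support across a transcript's junctions; None if single-exon."""
    introns = introns_from_exons(chrom, exons)
    if not introns:
        return None
    return min(coverage.get(junction, 0) for junction in introns)


# Two samples (e.g. two tissues); in practice pass open file handles instead.
sample_a = ["chr1\t101\t200\t1\t1\t1\t12\t3\t40", "chr1\t301\t400\t1\t1\t1\t5\t0\t30"]
sample_b = ["chr1\t101\t200\t1\t1\t1\t8\t1\t38"]
coverage = add_junctions(sample_a)
coverage = add_junctions(sample_b, coverage)

exons = [(1, 100), (201, 300), (401, 500)]
weakest = min_junction_coverage("chr1", exons, coverage)
print(weakest)
```

Summing reads across samples is one simple way to combine multiple tissues; other aggregation choices, such as taking the maximum per tissue, are equally plausible and would change the resulting values.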
3.2 Pfam Effects (domain integrity)
If precomputed domain annotations are not available for your species, you can:
- Run Pfam scan directly on protein sequences
- Use InterProScan for domain predictions
- Or skip this step (the model will impute values)
3.3 Fragment Labeling
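One way to flag candidate fragments is to look for the incomplete-CDS and incomplete-mRNA tags that GENCODE and Ensembl GTF files attach to transcripts. The sketch below assumes those tags (`cds_start_NF`, `cds_end_NF`, `mRNA_start_NF`, `mRNA_end_NF`) are present in your annotation; it is not TRIFID's own labeling routine, and `fragment_tags` is a hypothetical helper.

```python
import re

# Tags that mark transcripts whose CDS or mRNA ends could not be confirmed.
FRAGMENT_TAGS = {"cds_start_NF", "cds_end_NF", "mRNA_start_NF", "mRNA_end_NF"}

# A GTF attribute field may carry several entries of the form: tag "value";
TAG_RE = re.compile(r'\btag "([^"]+)"')


def fragment_tags(attribute_field):
    """Return the sorted incomplete-transcript tags found in a GTF attribute field."""
    return sorted(FRAGMENT_TAGS.intersection(TAG_RE.findall(attribute_field)))


attrs_fragment = 'gene_id "G1"; transcript_id "T1"; tag "basic"; tag "cds_start_NF";'
attrs_complete = 'gene_id "G1"; transcript_id "T2"; tag "basic";'

print(fragment_tags(attrs_fragment))
print(fragment_tags(attrs_complete))
```

Annotations that do not carry these tags (for example, some RefSeq-derived files) would need a different criterion, such as checking protein sequences for a missing start methionine.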
Step 4: Build Feature Dataset
Step 5: Choose Prediction Strategy
You have two options:
Option A: Use Pre-trained Human Model (Recommended)
Apply the human-trained model directly.
Advantages:
- No training data required
- Leverages extensive human proteomics evidence
- Generally works well for vertebrates
Limitations:
- May be less accurate for distant species (e.g., invertebrates)
- Species-specific features may not transfer well
Option B: Train Species-Specific Model
Train a new model with species-specific training data.
Advantages:
- Optimized for your species
- Can incorporate species-specific biology
Requirements:
- High-quality training labels
- Sufficient training examples (>500 recommended)
Handling Missing Features
Feature Imputation Strategy
TRIFID can handle missing features through imputation.
Feature Availability by Species
| Feature Category | Human | Mouse | Zebrafish | Fly | Required |
|---|---|---|---|---|---|
| Basic annotation | ✅ | ✅ | ✅ | ✅ | Yes |
| Transcript length | ✅ | ✅ | ✅ | ✅ | Yes |
| Protein sequence | ✅ | ✅ | ✅ | ✅ | Yes |
| APPRIS labels | ✅ | ✅ | ⚠️ | ❌ | No |
| TSL levels | ✅ | ✅ | ✅ | ❌ | No |
| PhyloCSF | ✅ | ✅ | ✅ | ✅ | No |
| RNA-seq (QSplice) | ✅ | ✅ | ⚠️ | ⚠️ | No |
| Pfam domains | ✅ | ✅ | ✅ | ✅ | No |
- ✅ Available
- ⚠️ Limited availability
- ❌ Not available
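For the optional features in the table above that are limited or unavailable for your species, missing values have to be filled before prediction. The sketch below shows column-wise imputation with a median, a nearest-rank percentile, or a constant. The `impute` helper and its strategy names are hypothetical illustrations and are not TRIFID's built-in imputation.

```python
import math
import statistics


def impute(values, strategy="median", percentile=50, constant=0.0):
    """Fill None entries in one feature column using the chosen strategy."""
    observed = sorted(v for v in values if v is not None)
    if not observed or strategy == "constant":
        # Nothing observed (feature entirely absent) or constant requested.
        fill = constant
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "percentile":
        # Nearest-rank percentile of the observed values.
        rank = max(1, math.ceil(percentile / 100 * len(observed)))
        fill = observed[rank - 1]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


column = [1.0, None, 3.0, 5.0, 7.0]
by_median = impute(column)
by_low_percentile = impute(column, strategy="percentile", percentile=25)
all_missing = impute([None, None], constant=0.0)
print(by_median, by_low_percentile, all_missing)
```

A low percentile is a conservative choice for evidence-like features such as junction coverage, because it avoids crediting an isoform with support that was never measured; a median is more neutral.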
Case Study: Zebrafish (Danio rerio)
Here’s a complete example applying TRIFID to zebrafish:
1. Download Ensembl Annotations
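The annotation and protein files for this step follow the standard layout of the Ensembl FTP site. The sketch below only builds the download URLs; `ensembl_urls` is a hypothetical helper, and the release number used here is a placeholder that you should replace with the release you intend to analyse.

```python
def ensembl_urls(species, assembly, release):
    """Build Ensembl FTP URLs for the GTF annotation and the protein FASTA."""
    base = f"https://ftp.ensembl.org/pub/release-{release}"
    name = species.capitalize()  # danio_rerio -> Danio_rerio
    return {
        "gtf": f"{base}/gtf/{species}/{name}.{assembly}.{release}.gtf.gz",
        "pep": f"{base}/fasta/{species}/pep/{name}.{assembly}.pep.all.fa.gz",
    }


# Release 110 is a placeholder; check the FTP site for the release you need.
urls = ensembl_urls("danio_rerio", "GRCz11", 110)
for kind, url in urls.items():
    print(kind, url)
```

The resulting URLs can be fetched with any download tool; keeping the GTF and protein FASTA on the same release avoids transcript ID mismatches later.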
2. Update Configuration
3. Generate Features
4. Build and Predict
RefSeq vs Ensembl vs GENCODE
Annotation Differences
| Feature | GENCODE | Ensembl | RefSeq |
|---|---|---|---|
| TSL levels | ✅ | ✅ | ❌ |
| APPRIS | ✅ | Via APPRIS | Via APPRIS |
| Basic tag | ✅ | ❌ | ❌ |
| Gene levels | ✅ | ❌ | ❌ |
| CCDS | ✅ | ✅ | ✅ |
Handling Annotation-Specific Features
Transcript ID Patterns
TRIFID automatically detects transcript IDs:
- Ensembl: ENST (human), ENSMUST (mouse), ENSDART (zebrafish), etc.
- RefSeq: NM_, XM_, YP_
- FlyBase: FBtr
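The prefixes listed above can be recognised with a few regular expressions. This is a sketch of the idea rather than TRIFID's actual detection code: the patterns and the `detect_source` helper are assumptions built only from the prefixes named in the list.

```python
import re

# Order matters only if patterns overlap; these three do not.
ID_PATTERNS = [
    ("Ensembl", re.compile(r"^ENS[A-Z]*T\d+(\.\d+)?$")),   # ENST, ENSMUST, ENSDART, ...
    ("RefSeq", re.compile(r"^(NM|XM|YP)_\d+(\.\d+)?$")),   # NM_, XM_, YP_
    ("FlyBase", re.compile(r"^FBtr\d+$")),                 # FBtr
]


def detect_source(transcript_id):
    """Return the annotation source implied by a transcript ID, or None."""
    for source, pattern in ID_PATTERNS:
        if pattern.match(transcript_id):
            return source
    return None


for tid in ["ENST00000380152.8", "ENSDART00000012345", "NM_000546.6", "FBtr0070000", "ENSG00000141510"]:
    print(tid, detect_source(tid))
```

IDs that return `None`, such as gene IDs or identifiers from a private annotation, would need an explicit pattern of their own.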
Performance Expectations
Accuracy by Phylogenetic Distance
| Species Category | Expected Performance | Notes |
|---|---|---|
| Mammals (human-trained) | 85-90% accuracy | Best performance |
| Other vertebrates | 80-85% accuracy | Good performance |
| Invertebrates | 70-80% accuracy | Use with caution |
| Plants/Fungi | Not validated | Not recommended |
Minimum Requirements
For reliable predictions:
- ✅ Complete genome annotation
- ✅ Protein sequences for all isoforms
- ✅ At least basic transcript metadata
- ⚠️ RNA-seq data (highly recommended)
- ⚠️ Conservation scores (recommended)
Validation
Validating Custom Predictions
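A simple sanity check is to ask how often the top-scoring isoform of a gene is one of its known principal isoforms (for example, from APPRIS or manual curation). The sketch below assumes predictions are available as `(gene_id, transcript_id, score)` tuples; `principal_agreement` is a hypothetical helper and this metric is only one of several reasonable ways to validate.

```python
def principal_agreement(predictions, principal):
    """Fraction of genes whose top-scoring isoform is a known principal isoform.

    predictions: iterable of (gene_id, transcript_id, score)
    principal:   dict mapping gene_id to a set of principal transcript IDs
    Genes without a known principal isoform are ignored.
    """
    best = {}
    for gene, transcript, score in predictions:
        if gene not in best or score > best[gene][1]:
            best[gene] = (transcript, score)
    evaluated = [gene for gene in best if gene in principal]
    if not evaluated:
        return None  # nothing to validate against
    hits = sum(best[gene][0] in principal[gene] for gene in evaluated)
    return hits / len(evaluated)


predictions = [
    ("G1", "T1", 0.92), ("G1", "T2", 0.15),
    ("G2", "T3", 0.40), ("G2", "T4", 0.55),
    ("G3", "T5", 0.70),
]
principal = {"G1": {"T1"}, "G2": {"T3"}}

agreement = principal_agreement(predictions, principal)
print(agreement)
```

Here the top isoform matches the principal label for G1 but not for G2, and G3 is skipped because it has no label, so the agreement is one out of two genes.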
Troubleshooting
Common Issues
Issue: Model predictions are all similar
- Cause: Missing critical features
- Solution: Check feature completeness; ensure domain and length features are available
- Cause: Imputation values too conservative
- Solution: Adjust the imputation strategy, for example by using percentile-based imputation
- Cause: Large genome with many isoforms
- Solution: Use the `reduce_mem_usage()` utility and process in batches
- Cause: Training features don’t match prediction features
- Solution: Ensure the prediction features exactly match the feature list used in training
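A feature mismatch is easiest to diagnose by comparing the two feature lists directly before calling the model. The sketch below is a generic check, not a TRIFID utility; `compare_features` and the feature names in the example are hypothetical.

```python
def compare_features(training_features, prediction_columns):
    """Return (missing, extra): features absent from, or unexpected in, the prediction data."""
    training = list(training_features)
    available = set(prediction_columns)
    expected = set(training)
    missing = [name for name in training if name not in available]
    extra = [name for name in prediction_columns if name not in expected]
    return missing, extra


# Hypothetical feature names used purely for illustration.
training_features = ["length", "pfam_score", "junction_coverage"]
prediction_columns = ["length", "junction_coverage", "tsl"]

missing, extra = compare_features(training_features, prediction_columns)
print("missing:", missing)
print("extra:", extra)
```

Missing features need to be generated or imputed, and extra columns should be dropped, so that the prediction table ends up with the training features in the training order.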
Best Practices
- Always validate predictions with known functional isoforms
- Check feature completeness before prediction
- Use species-appropriate imputation for missing features
- Compare with orthologous genes in well-studied species
- Integrate with experimental data when available
- Report prediction confidence using normalized scores
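For the last point, one simple way to put scores on a comparable per-gene scale is to divide each isoform's score by the highest score in its gene. This is an assumption about what a normalized score could look like, shown for illustration; check it against how TRIFID defines its own normalized score before reporting. The `normalize_by_gene` helper is hypothetical.

```python
def normalize_by_gene(predictions):
    """Scale each isoform score by the maximum score within its gene.

    predictions: iterable of (gene_id, transcript_id, score)
    Returns a dict mapping transcript_id to a score in [0, 1].
    """
    predictions = list(predictions)
    gene_max = {}
    for gene, _, score in predictions:
        gene_max[gene] = max(score, gene_max.get(gene, 0.0))
    return {
        transcript: (score / gene_max[gene] if gene_max[gene] > 0 else 0.0)
        for gene, transcript, score in predictions
    }


normalized = normalize_by_gene([
    ("G1", "T1", 0.8),
    ("G1", "T2", 0.4),
    ("G2", "T3", 0.0),
])
print(normalized)
```

With this scaling, the best isoform of every gene with a non-zero score gets 1.0, which makes alternative isoforms easy to compare within a gene but hides differences in absolute confidence between genes.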
Example Script: Complete Workflow
Further Resources
- APPRIS Web Server - Principal isoform annotations
- Ensembl FTP - Genome annotations
- GENCODE - Human and mouse annotations
- RefSeq - Reference sequences
Next Steps
- Review End-to-End Tutorial for detailed workflow
- See FGFR1, C1orf112, and NIPAL3 case studies
- Explore Model Interpretation guide