FASTA File Requirements
PROTÉGÉ PD accepts nucleotide sequences in FASTA format. The input file must meet specific requirements for proper processing.
Basic Format
Each sequence begins with a header line starting with > followed by the sequence identifier.
Critical Requirements
1. Protein-Coding Sequences Only
Why protein-coding?
- PROTÉGÉ translates sequences to amino acids for alignment
- Alignment is performed at the protein level
- Consensus is back-translated to nucleotides
- This approach preserves codon structure
Requirements:
- Sequences must be in-frame coding sequences (CDS)
- Must start with a start codon (ATG)
- Must be divisible by 3 (complete codons)
- Should not contain stop codons except at the end
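The requirements above can be checked before a run. The following is a minimal sketch in plain Python, not PROTÉGÉ's own validation: the function name `check_cds` is hypothetical, and the stop codons listed assume the standard genetic code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_cds(seq: str) -> list[str]:
    """Return a list of problems found in one coding sequence (empty if none)."""
    seq = seq.upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    if len(seq) % 3 != 0:
        problems.append(f"length {len(seq)} is not divisible by 3")
    # Split into in-frame codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # A stop codon is only acceptable as the final codon.
    internal = [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    if internal:
        problems.append(f"internal stop codon at codon index {internal[0]}")
    return problems


print(check_cds("ATGGCCAAATAA"))  # complete CDS
print(check_cds("ATGTAAGCCTAA"))  # stop codon in the middle
print(check_cds("GCCAAATA"))      # no start codon, partial final codon
```

An empty list means the sequence passed all three checks; anything else names the first problem of each kind so the record can be fixed or dropped.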
2. Sequence Length Requirements
Must be divisible by 3 (complete codons)
3. Nucleotide Alphabet
Accepted characters:
- A: Adenine
- T: Thymine
- G: Guanine
- C: Cytosine
- N: Any nucleotide (ambiguous)
Avoid:
- Degenerate IUPAC codes in input (R, Y, W, S, etc.)
- Lowercase letters (may work but uppercase preferred)
- Gaps or dashes in unaligned sequences
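One way to find characters outside the accepted alphabet is to count them per sequence and then normalize what can safely be normalized. This is an illustrative sketch; `unexpected_characters` and `normalize` are hypothetical helper names, and treating "." as a gap character is an assumption.

```python
ACCEPTED = set("ATGCN")


def unexpected_characters(seq: str) -> dict[str, int]:
    """Count every character in seq that is not one of A, T, G, C, N."""
    counts: dict[str, int] = {}
    for ch in seq:
        if ch not in ACCEPTED:
            counts[ch] = counts.get(ch, 0) + 1
    return counts


def normalize(seq: str) -> str:
    """Uppercase the sequence and drop gap characters and whitespace."""
    return "".join(ch for ch in seq.upper() if ch not in "-. \t\r\n")


raw = "atgGCC-RAAYtaa"
print(unexpected_characters(raw))       # lowercase letters, a gap, R and Y
cleaned = normalize(raw)
print(cleaned, unexpected_characters(cleaned))  # only R and Y remain
```

Normalizing fixes case and gaps but leaves degenerate codes in place, since choosing a concrete base for R or Y is a judgment call. Removing gaps also changes the length, so the divisible-by-3 check should be run again afterwards.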
File Naming Conventions
Supported Extensions
Naming Best Practices
Use descriptive names that indicate the gene, organism group, or project. Avoid spaces and special characters.
Example FASTA Files
Minimal Example
Production Example
File Preparation Guidelines
1. Extract Coding Sequences
2. Verify Sequence Quality
Check sequence length
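A length check can be sketched with a small FASTA reader from the Python standard library. The reader and the `length_report` helper below are illustrative assumptions, not part of PROTÉGÉ; the in-memory example stands in for a real file opened with `open(...)`.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA file."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # The identifier is the first word after ">".
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_report(handle):
    """Return (identifier, length, length % 3) for every record."""
    return [(name, len(seq), len(seq) % 3) for name, seq in read_fasta(handle)]


# Hypothetical two-record input used only for demonstration.
example = io.StringIO(">seq1 example gene\nATGGCC\nAAATAA\n>seq2\nATGGCCAA\n")
for name, length, remainder in length_report(example):
    status = "ok" if remainder == 0 else f"{remainder} extra base(s)"
    print(f"{name}\t{length}\t{status}")
```

Any record with a non-zero remainder has an incomplete codon and should be trimmed to the correct CDS boundaries or removed.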
Verify translation
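To verify translation without extra dependencies, a codon table for the standard genetic code (NCBI translation table 1) can be built directly. This is a sketch under that assumption; if your sequences use a different genetic code (for example, mitochondrial), the table would need to change.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate an in-frame nucleotide sequence; unknown codons become 'X'."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError(f"length {len(seq)} is not divisible by 3")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq), 3))


protein = translate("ATGGCCAAATAA")
print(protein)               # MAK*
print("*" in protein[:-1])   # False: no stop codon before the end
```

A "*" anywhere other than the last position indicates an internal stop codon, and an "X" indicates a codon containing a character outside A, T, G, C.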
Check for ambiguous bases
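Ambiguous bases can be quantified per sequence so that affected records are easy to spot and remove. The helper names and the `max_fraction` parameter below are hypothetical; the default of 0.0 drops any sequence containing an ambiguous base.

```python
UNAMBIGUOUS = set("ACGT")


def ambiguous_fraction(seq: str) -> float:
    """Fraction of positions that are not A, C, G, or T (case-insensitive)."""
    if not seq:
        return 0.0
    ambiguous = sum(1 for ch in seq.upper() if ch not in UNAMBIGUOUS)
    return ambiguous / len(seq)


def drop_ambiguous(records: dict[str, str], max_fraction: float = 0.0) -> dict[str, str]:
    """Keep only records whose ambiguous fraction is at most max_fraction."""
    return {
        name: seq
        for name, seq in records.items()
        if ambiguous_fraction(seq) <= max_fraction
    }


records = {"seq1": "ATGGCCAAATAA", "seq2": "ATGNNNAAATAA"}
print({name: ambiguous_fraction(seq) for name, seq in records.items()})
print(list(drop_ambiguous(records)))  # ['seq1']
```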
3. Sequence Alignment Preparation
PROTÉGÉ performs its own alignment using MUSCLE. Do not pre-align your sequences - provide unaligned FASTA files.
Common Issues and Solutions
Issue: Translation Error
Causes:
- Sequence length not divisible by 3
- Non-standard nucleotide characters
- Sequence not in correct reading frame
Issue: Internal Stop Codons
Causes:
- Wrong reading frame
- Sequencing errors
- Pseudogenes or non-functional sequences
Solutions:
- Verify gene annotation
- Check sequence orientation (may need reverse complement)
- Remove problematic sequences
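Checking orientation and reading frame can be sketched by scanning all three frames on both strands for one that starts with ATG and has no internal stop codon. The function names here are hypothetical, the standard genetic code is assumed, and only A, C, G, T, and N are complemented (other characters pass through unchanged).

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def reverse_complement(seq: str) -> str:
    """Reverse complement of a sequence made of A, C, G, T, N."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def internal_stops(seq: str, frame: int) -> int:
    """Count stop codons before the final complete codon in a frame (0, 1, 2)."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return sum(1 for codon in codons[:-1] if codon in STOP_CODONS)


def candidate_frames(seq: str) -> list[tuple[str, int]]:
    """Return (strand, frame) pairs that start with ATG and have no internal stop."""
    seq = seq.upper()
    strands = {"+": seq, "-": reverse_complement(seq)}
    hits = []
    for strand, s in strands.items():
        for frame in range(3):
            if s[frame:frame + 3] == "ATG" and internal_stops(s, frame) == 0:
                hits.append((strand, frame))
    return hits


# This toy sequence is the reverse complement of ATGCTATTAGCCTAA.
print(candidate_frames("TTAGGCTAATAGCAT"))  # [('-', 0)]
```

A hit on the "-" strand suggests the record should be reverse-complemented before use; no hit at all suggests a truncated CDS, a sequencing error, or a pseudogene.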
Issue: File Not Found
Causes:
- File not in mounted directory
- Incorrect filename in command
- File permissions issue
Issue: Too Few Sequences
Recommendation: Use at least 5-10 sequences for a meaningful consensus.
Why it matters:
- Fewer sequences give a less reliable consensus
- A small set may not capture sequence diversity
- Primers may not work on related taxa
Solutions:
- Include representatives from the target group
- Balance diversity and conservation
Issue: Sequences Too Divergent
Causes:
- Sequences from distantly related organisms
- Wrong gene region selected
- Mixed gene families
Solutions:
- Use more closely related sequences
- Select more conserved gene regions
- Reduce consensus threshold (-c 80 or lower)
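To build intuition for why a lower threshold helps with divergent input, here is a generic column-wise consensus sketch. It is not PROTÉGÉ's implementation: the function name, the percent interpretation of the threshold, and the "N" fallback are all assumptions made for illustration, and the toy input is already aligned.

```python
from collections import Counter


def threshold_consensus(aligned: list[str], threshold: float = 80.0,
                        fallback: str = "N") -> str:
    """Keep the most common character of each column when its share reaches
    `threshold` percent; otherwise emit `fallback`."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    out = []
    for column in zip(*aligned):
        char, count = Counter(column).most_common(1)[0]
        share = 100.0 * count / len(column)
        out.append(char if share >= threshold else fallback)
    return "".join(out)


aligned = ["ATGGCC", "ATGGCA", "ATGGCT", "ATGACC", "ATGGCC"]
print(threshold_consensus(aligned, 80))  # ATGGCN
print(threshold_consensus(aligned, 60))  # ATGGCC
```

In this toy case the last column agrees in only 3 of 5 sequences, so it resolves at 60 but not at 80. With divergent input, many columns behave this way, which is why lowering the threshold yields more resolved positions at the cost of a less strictly conserved consensus.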
Best Practices Summary
Sequence Selection
- Use protein-coding genes only
- Include 10-50 representative sequences
- Balance conservation and diversity
- Verify all sequences are from the same gene
File Preparation
- Extract complete CDS (start to stop)
- Verify length divisible by 3
- Check translation for stop codons
- Use descriptive filenames
Quality Control
- Remove sequences with ambiguous bases
- Verify correct reading frame
- Check for sequencing errors
- Ensure consistent gene boundaries
Organization
- Keep sequences in dedicated directory
- Use consistent naming conventions
- Document sequence sources
- Back up original files
See Also
- Command Reference - CLI parameters
- Output Files - Understanding results
- Running PROTÉGÉ - Docker usage