
FASTA File Requirements

PROTÉGÉ PD accepts nucleotide sequences in FASTA format. The input file must meet specific requirements for proper processing.

Basic Format

>sequence_identifier_1
ATGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
>sequence_identifier_2
ATGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAA
>sequence_identifier_3
ATGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAC
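Each sequence begins with a header line starting with > followed by the sequence identifier; the nucleotide sequence follows on one or more lines. The placeholder sequences above show the layout only: they contain in-frame TAG codons, so they would not satisfy the coding-sequence requirements below.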
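
To make the layout concrete, here is a minimal sketch of reading this format in plain Python. `read_fasta` is a hypothetical helper written for illustration only (it is not part of PROTÉGÉ PD), and it assumes that a sequence may wrap across several lines and that the identifier is the first whitespace-delimited token after >.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (identifier, sequence) pairs."""
    records = []
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            tokens = line[1:].split()
            identifier = tokens[0] if tokens else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines are concatenated until the next header
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records


example = """>sequence_identifier_1
ATGGCTAGCGGT
ACCGTTAAA
>sequence_identifier_2
ATGGCTAGCGGTTAA
"""

for name, seq in read_fasta(example):
    print(name, len(seq))
```

For real files, a maintained parser such as Biopython's `SeqIO.parse` is the safer choice; the sketch only shows how headers and sequence lines relate.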

Critical Requirements

1. Protein-Coding Sequences Only

PROTÉGÉ PD is designed specifically for protein-coding gene sequences. Non-coding sequences will produce errors or meaningless results.
Why protein-coding?
  • PROTÉGÉ translates sequences to amino acids for alignment
  • Alignment is performed at the protein level
  • Consensus is back-translated to nucleotides
  • This approach preserves codon structure
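
The back-translation idea can be sketched as follows. This is an illustration of the general technique (carrying protein-level gaps back onto whole codons), not PROTÉGÉ's own code; the function name `back_translate_alignment` and its arguments are assumptions made for the example.

```python
def back_translate_alignment(aligned_protein, nucleotides):
    """Map a gapped protein sequence back onto its original codons.

    Each amino acid consumes the next codon from `nucleotides`;
    each gap character "-" becomes a three-base gap "---".
    """
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons):
        raise ValueError("protein length does not match number of codons")
    out, next_codon = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")  # a protein gap spans one whole codon
        else:
            out.append(codons[next_codon])
            next_codon += 1
    return "".join(out)


print(back_translate_alignment("MA-S", "ATGGCTAGC"))  # ATGGCT---AGC
```

Because gaps are always inserted three bases at a time, the reading frame of every sequence is preserved in the nucleotide alignment.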
Requirements:
  • Sequences must be in-frame coding sequences (CDS)
  • Must start with a start codon (ATG)
  • Must be divisible by 3 (complete codons)
  • Should not contain stop codons except at the end
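
The four requirements above can be checked with a short pure-Python sketch. `check_cds` is a hypothetical helper, and the stop codon set assumes the standard genetic code (TAA, TAG, TGA).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_cds(seq):
    """Return a list of problems found in a candidate coding sequence."""
    seq = seq.upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    if len(seq) % 3 != 0:
        problems.append("length not divisible by 3")
    # Split into complete codons, ignoring any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # A stop codon is only acceptable as the final codon
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


print(check_cds("ATGGCTAGCTAA"))  # []
print(check_cds("ATGTAAGCTAGC"))  # ['internal stop codon']
print(check_cds("GCTAGCTAGCT"))   # ['does not start with ATG', 'length not divisible by 3']
```

An empty list means the sequence passed all checks in this sketch; it does not guarantee that the sequence is a genuine gene.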

2. Sequence Length Requirements

Length constraint: must be divisible by 3 (complete codons).
# From protege.py:136
nucLen.append(len(nuc))
# Sequences are translated:
amino = nuc.translate()
Sequences whose length is not divisible by 3 will cause translation errors or frameshift issues.

3. Nucleotide Alphabet

Accepted characters:
  • A - Adenine
  • T - Thymine
  • G - Guanine
  • C - Cytosine
  • N - Any nucleotide (ambiguous)
Not recommended:
  • Degenerate IUPAC codes in input (R, Y, W, S, etc.)
  • Lowercase letters (may work but uppercase preferred)
  • Gaps or dashes in unaligned sequences
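
A quick way to spot characters outside the accepted alphabet is to count them. The sketch below is illustrative only; `unexpected_characters` is a hypothetical helper, and it deliberately treats lowercase letters as unexpected because uppercase is preferred.

```python
from collections import Counter

ACCEPTED = set("ATGCN")  # the accepted characters listed above


def unexpected_characters(seq):
    """Count characters outside the accepted A/T/G/C/N alphabet."""
    return Counter(ch for ch in seq if ch not in ACCEPTED)


report = unexpected_characters("ATGRYN-acgt")
for char, count in sorted(report.items()):
    print(repr(char), count)
```

An empty result means the sequence uses only A, T, G, C, and N. Calling `seq.upper()` before the check would accept lowercase input if that is acceptable for your data.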

File Naming Conventions

Supported Extensions

# All of these work:
genes.fasta
sequences.fna
mydata.fas
samples.fa

Naming Best Practices

gyrB_genes.fasta
recA_cds.fna
rpoB_cds.fas
species_COI.fasta
Use descriptive names that indicate the gene, organism group, or project. Avoid spaces and special characters.
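
These naming rules can be checked before a run. The sketch below is an assumption-laden illustration: `filename_problems` is a hypothetical helper, the extension set contains only the four extensions listed above, and "special characters" is interpreted here as anything other than letters, digits, underscore, dot, and hyphen.

```python
import re
from pathlib import Path

SUPPORTED_EXTENSIONS = {".fasta", ".fna", ".fas", ".fa"}  # extensions listed above


def filename_problems(filename):
    """Return a list of naming problems for a candidate input file."""
    path = Path(filename)
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {path.suffix!r}")
    # Assumption: only letters, digits, "_", "." and "-" count as safe
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", path.name):
        problems.append("name contains spaces or special characters")
    return problems


print(filename_problems("gyrB_genes.fasta"))  # []
print(filename_problems("my genes.txt"))
```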

Example FASTA Files

Minimal Example

>strain_1_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTCGT
>strain_2_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTAAA
>strain_3_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTCAA

Production Example

>Escherichia_coli_K12_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTCGT
GTTGGTGCTGGTGTTCGTGCTGGTAAACGTGGTCGTGCTGGTGCTGGTGTTGCTGGTGCT
>Salmonella_enterica_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTAAA
GTTGGTGCTGGTGTTCGTGCTGGTAAACGTGGTCGTGCTGGTGCTGGTGTTGCTGGTGCT
>Klebsiella_pneumoniae_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTCAA
GTTGGTGCTGGTGTTCGTGCTGGTAAACGTGGTCGTGCTGGTGCTGGTGTTGCTGGTGCT

File Preparation Guidelines

1. Extract Coding Sequences

Step 1: Identify gene boundaries
Use genome annotation or BLAST to find start and stop positions.

Step 2: Extract complete CDS
Include the start codon (ATG) through the stop codon (TAA/TAG/TGA).

Step 3: Verify translation
Ensure sequences translate without internal stop codons.

Step 4: Remove stop codons (optional)
PROTÉGÉ can handle terminal stop codons, but removal is safer.
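
The extraction steps above can be sketched in plain Python. `extract_cds` and `reverse_complement` are hypothetical helpers, and the sketch assumes 1-based inclusive coordinates (as used in GFF and GenBank annotations) and the standard stop codons.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def reverse_complement(seq):
    """Return the reverse complement of an uppercase nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_cds(genome, start, end, strand="+", drop_stop=True):
    """Extract a CDS using 1-based inclusive coordinates.

    Genes on the "-" strand are reverse-complemented so the result
    always reads from start codon to stop codon.
    """
    cds = genome[start - 1:end]
    if strand == "-":
        cds = reverse_complement(cds)
    # Optionally remove a terminal stop codon
    if drop_stop and cds[-3:] in STOP_CODONS:
        cds = cds[:-3]
    return cds


genome = "CCATGGCTAGCTAAGG"
print(extract_cds(genome, 3, 14))                   # ATGGCTAGC
print(extract_cds(genome, 3, 14, drop_stop=False))  # ATGGCTAGCTAA
```

Keeping `drop_stop=True` follows the "removal is safer" advice above; pass `drop_stop=False` to keep the complete CDS.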

2. Verify Sequence Quality

# Check 1 (Python): length is divisible by 3
from Bio import SeqIO

for record in SeqIO.parse("genes.fasta", "fasta"):
    if len(record.seq) % 3 != 0:
        print(f"Error: {record.id} length not divisible by 3")
    else:
        print(f"OK: {record.id} length = {len(record.seq)}")

# Check 2 (Python): no internal stop codons
from Bio import SeqIO

for record in SeqIO.parse("genes.fasta", "fasta"):
    protein = record.seq.translate()
    if "*" in protein[:-1]:  # "*" before the final residue is an internal stop
        print(f"Warning: {record.id} has internal stop codon")
    else:
        print(f"OK: {record.id} translates correctly")

# Check 3 (shell): ambiguous bases
# Count N's in sequence lines (header lines are excluded)
grep -v ">" genes.fasta | grep -o "N" | wc -l

# List the headers of sequences that contain N's
# (matches sequence lines only, and works when sequences wrap over several lines)
awk '/^>/ {hdr=$0; shown=0; next} /N/ && !shown {print hdr; shown=1}' genes.fasta

3. Sequence Alignment Preparation

PROTÉGÉ performs its own alignment using MUSCLE. Do not pre-align your sequences; provide unaligned FASTA files.
PROTÉGÉ’s internal workflow:
# From protege.py:138-139
amino = nuc.translate()
# Sequences are aligned at protein level (lines 179-182)
alnProc = subprocess.run(["muscle_lin", "-in", in_file, "-out", out_file])

Common Issues and Solutions

Issue: Translation Error

Error message: Seq.translate() got an invalid codon
Causes:
  • Sequence length not divisible by 3
  • Non-standard nucleotide characters
  • Sequence not in correct reading frame
Solution:
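# Print each sequence's total length (also correct when a sequence wraps over several lines)
awk '/^>/ {if (id) print id, len; id=$0; len=0; next} {len += length($0)} END {if (id) print id, len}' genes.fasta

# All numbers should be divisible by 3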

Issue: Internal Stop Codons

Symptom: Truncated protein sequences or alignment errors
Causes:
  • Wrong reading frame
  • Sequencing errors
  • Pseudogenes or non-functional sequences
Solution:
  • Verify gene annotation
  • Check sequence orientation (may need reverse complement)
  • Remove problematic sequences
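
When internal stops appear, scanning all six reading frames can help tell a wrong frame or orientation apart from a genuinely damaged sequence. The sketch below is illustrative: `scan_frames` and `internal_stops` are hypothetical helpers, and the standard stop codons are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def internal_stops(seq):
    """Count stop codons in frame 0, ignoring the final codon."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return sum(codon in STOP_CODONS for codon in codons[:-1])


def scan_frames(seq):
    """Return {(strand, offset): internal stop count} for all six frames."""
    seq = seq.upper()
    strands = {"+": seq, "-": seq.translate(COMPLEMENT)[::-1]}
    return {
        (strand, offset): internal_stops(s[offset:])
        for strand, s in strands.items()
        for offset in range(3)
    }


for frame, stops in scan_frames("ATGGCTAGCGGTAAA").items():
    print(frame, stops)
```

A frame with zero internal stops is only a candidate; short sequences can be stop-free in several frames by chance, so the result should be confirmed against the gene annotation.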

Issue: File Not Found

Error message: File not found or cannot open file
Causes:
  • File not in mounted directory
  • Incorrect filename in command
  • File permissions issue
Solution:
# List files in container
docker run --rm \
  --mount type=bind,source=/your/path/,target=/root/. \
  ddelgadillo/protege_base:v1.0.2 \
  ls -la /root/

Issue: Too Few Sequences

Recommendation: Use at least 5-10 sequences for meaningful consensus
Why:
  • Fewer sequences = less reliable consensus
  • May not capture sequence diversity
  • Primers may not work on related taxa
Solution:
  • Include representatives from target group
  • Balance between diversity and conservation

Issue: Sequences Too Divergent

Symptom: No consensus regions found or all primers highly degenerate
Causes:
  • Sequences from distantly related organisms
  • Wrong gene region selected
  • Mixed gene families
Solution:
  • Use more closely related sequences
  • Select more conserved gene regions
  • Reduce consensus threshold (-c 80 or lower)
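
To judge whether a sequence set is too divergent before lowering the threshold, it can help to measure how many alignment columns are well conserved. How PROTÉGÉ PD computes its consensus threshold is not described here, so the sketch below rests on an assumption: a column counts as conserved when its most common character reaches the given percentage of sequences. `conserved_fraction` is a hypothetical helper that expects already-aligned sequences of equal length.

```python
from collections import Counter


def conserved_fraction(aligned, threshold=80.0):
    """Fraction of columns whose most common character reaches `threshold` percent."""
    if not aligned or not aligned[0]:
        raise ValueError("no sequence data given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    conserved = 0
    for column in zip(*aligned):
        # Count of the most frequent character in this column
        top_count = Counter(column).most_common(1)[0][1]
        if 100.0 * top_count / len(column) >= threshold:
            conserved += 1
    return conserved / length


alignment = ["ATGA", "ATGC", "ATGG", "ATGT", "ATGA"]
print(conserved_fraction(alignment, threshold=80.0))  # 0.75
print(conserved_fraction(alignment, threshold=40.0))  # 1.0
```

A low fraction at 80 suggests the set is too divergent for that threshold, which is the situation where using more closely related sequences or a lower threshold is worth trying.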

Best Practices Summary

Sequence Selection

  • Use protein-coding genes only
  • Include 10-50 representative sequences
  • Balance conservation and diversity
  • Verify all sequences are the same gene

File Preparation

  • Extract complete CDS (start to stop)
  • Verify length divisible by 3
  • Check translation for stop codons
  • Use descriptive filenames

Quality Control

  • Remove sequences with ambiguous bases
  • Verify correct reading frame
  • Check for sequencing errors
  • Ensure consistent gene boundaries

Organization

  • Keep sequences in a dedicated directory
  • Use consistent naming conventions
  • Document sequence sources
  • Back up original files
