Input Files

FASTA File Requirements

PROTÉGÉ PD accepts nucleotide sequences in FASTA format. The input file must meet specific requirements for proper processing.

Basic Format

>sequence_identifier_1
ATGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
>sequence_identifier_2
ATGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAA
>sequence_identifier_3
ATGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAC

Each record begins with a header line that starts with > followed by the sequence identifier. The nucleotide sequence follows on one or more lines.
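To make the record layout concrete, here is a minimal sketch of a FASTA reader written with the Python standard library only. It is an illustration of the format, not the parser PROTÉGÉ uses; the function name read_fasta is hypothetical, and taking the first whitespace-delimited token as the identifier is an assumption.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (identifier, sequence) pairs."""
    records = []
    identifier = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            # Assumption: the identifier is the first token after ">"
            tokens = line[1:].split()
            identifier = tokens[0] if tokens else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records


example = """>sequence_identifier_1
ATGGCTAGCTAG
CTAGCTAG
>sequence_identifier_2
ATGGCTAGCTAA
"""
for identifier, sequence in read_fasta(example):
    print(identifier, len(sequence))
```

Wrapped sequence lines are joined, so the length reported for each record is the full sequence length rather than the length of a single line.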

Critical Requirements

1. Protein-Coding Sequences Only

PROTÉGÉ PD is designed specifically for protein-coding gene sequences. Non-coding sequences will produce errors or meaningless results.

Why protein-coding?

PROTÉGÉ translates sequences to amino acids for alignment
Alignment is performed at the protein level
Consensus is back-translated to nucleotides
This approach preserves codon structure
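The sketch below illustrates why aligning at the protein level preserves codon structure: every aligned amino acid maps back to one whole codon, and every protein gap becomes three nucleotide gaps. It is not taken from protege.py; the function name thread_codons is hypothetical, and the sketch assumes the protein is the direct translation of the nucleotide sequence.

```python
def thread_codons(aligned_protein, nucleotides):
    """Map a gapped protein sequence back onto its ungapped coding sequence."""
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    residues = aligned_protein.replace("-", "")
    if len(residues) != len(codons):
        raise ValueError("protein and nucleotide lengths do not match")
    threaded = []
    position = 0
    for symbol in aligned_protein:
        if symbol == "-":
            threaded.append("---")  # one protein gap = three nucleotide gaps
        else:
            threaded.append(codons[position])
            position += 1
    return "".join(threaded)


print(thread_codons("MA-S", "ATGGCTAGC"))  # ATGGCT---AGC
```

Because gaps are only ever inserted in multiples of three, the reading frame of every sequence stays intact in the nucleotide alignment.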

Requirements:

Sequences must be in-frame coding sequences (CDS)
Must start with a start codon (ATG)
Must be divisible by 3 (complete codons)
Should not contain stop codons except at the end
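The four requirements above can be checked together before running the tool. The following is a minimal sketch using only the standard library; the function name check_cds is hypothetical, and the stop codon set assumes the standard genetic code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def check_cds(sequence):
    """Return a list of problems found in one coding sequence (empty if none)."""
    seq = sequence.upper()
    problems = []
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    if len(seq) % 3 != 0:
        problems.append("length not divisible by 3")
    # Split into complete codons, ignoring any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


print(check_cds("ATGGCTAGCTAA"))  # []
print(check_cds("ATGTAAGCTAG"))   # ['length not divisible by 3', 'internal stop codon']
```

A terminal stop codon is allowed here because only the codons before the last one are inspected.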

2. Sequence Length Requirements

Length constraint: must be divisible by 3 (complete codons)

# From protege.py:136
nucLen.append(len(nuc))
# Sequences are translated:
amino = nuc.translate()

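A sequence whose length is not a multiple of 3 ends in an incomplete codon. Depending on the Biopython version this produces a translation warning or error, and it usually indicates a frameshift or an incorrectly trimmed sequence.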

3. Nucleotide Alphabet

Accepted characters:

A - Adenine
T - Thymine
G - Guanine
C - Cytosine
N - Any nucleotide (ambiguous)

Not recommended:

Degenerate IUPAC codes in input (R, Y, W, S, etc.)
Lowercase letters (may work but uppercase preferred)
Gaps or dashes in unaligned sequences
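A quick way to audit a sequence against the alphabet rules above is to count its characters by category. This is a sketch, not part of PROTÉGÉ: the function name and category labels are hypothetical, and treating "." as a gap character is an assumption.

```python
from collections import Counter

ACCEPTED = set("ACGTN")
DEGENERATE = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N
GAPS = set("-.")                # assumption: "." is also treated as a gap


def classify_characters(sequence):
    """Count the characters of a sequence by category."""
    counts = Counter()
    for char in sequence:
        upper = char.upper()
        if upper in ACCEPTED:
            category = "accepted" if char.isupper() else "lowercase"
        elif upper in DEGENERATE:
            category = "degenerate"
        elif char in GAPS:
            category = "gap"
        else:
            category = "invalid"
        counts[category] += 1
    return dict(counts)


print(classify_characters("ATGNNRYacg-X"))
```

Any count outside the "accepted" category points to input that is worth cleaning up before analysis.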

File Naming Conventions

Supported Extensions

# All of these work:
genes.fasta
sequences.fna
mydata.fas
samples.fa

Naming Best Practices

gyrB_genes.fasta
recA_unaligned.fna
rpoB_genes.fas
species_COI.fasta

Use descriptive names that indicate the gene, organism group, or project. Avoid spaces and special characters.
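The naming advice can be turned into a small pre-flight check. The sketch below is illustrative only: the extension list comes from the section above, while the function name, the case-insensitive extension match, and the definition of "safe" characters (letters, digits, underscore, dot, hyphen) are assumptions.

```python
import re
from pathlib import Path

SUPPORTED_EXTENSIONS = {".fasta", ".fna", ".fas", ".fa"}
SAFE_STEM = re.compile(r"^[A-Za-z0-9_.-]+$")  # assumption: what counts as "safe"


def check_filename(name):
    """Return a list of naming problems for an input file name (empty if none)."""
    path = Path(name)
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension {path.suffix!r}")
    if not SAFE_STEM.match(path.stem):
        problems.append("name contains spaces or special characters")
    return problems


print(check_filename("gyrB_genes.fasta"))   # []
print(check_filename("my genes (v2).txt"))
```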

Example FASTA Files

Minimal Example

>strain_1_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTCGT
>strain_2_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTAAA
>strain_3_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTCAA

Production Example

>Escherichia_coli_K12_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTCGT
GTTGGTGCTGGTGTTCGTGCTGGTAAACGTGGTCGTGCTGGTGCTGGTGTTGCTGGTGCT
>Salmonella_enterica_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTAAA
GTTGGTGCTGGTGTTCGTGCTGGTAAACGTGGTCGTGCTGGTGCTGGTGTTGCTGGTGCT
>Klebsiella_pneumoniae_gyrB
ATGTCGGTTCGTGACGTGAAACCGGTCGCTGAAGGTATCGGTGCTGGTAAACGTGCTGGT
GCTGGTGTTGCTGGTGCTAAACGTGGTCGTGCTGGTGCTGGTGTTCGTGCTCGTGCTCAA
GTTGGTGCTGGTGTTCGTGCTGGTAAACGTGGTCGTGCTGGTGCTGGTGTTGCTGGTGCT

File Preparation Guidelines

1. Extract Coding Sequences

Identify gene boundaries: use genome annotation or BLAST to find start and stop positions
Extract complete CDS: include the start codon (ATG) through the stop codon (TAA/TAG/TGA)
Verify translation: ensure sequences translate without internal stop codons
Remove stop codons (optional): PROTÉGÉ can handle terminal stop codons, but removal is safer
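The extraction steps above can be sketched as a short function. It assumes 1-based inclusive coordinates and a "+" or "-" strand, as found in most annotation formats. The function names and the toy genome string are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def extract_cds(genome, start, end, strand="+", strip_stop=True):
    """Extract a CDS given 1-based inclusive coordinates and a strand."""
    cds = genome[start - 1:end]
    if strand == "-":
        cds = reverse_complement(cds)
    # Optionally drop the terminal stop codon
    if strip_stop and cds[-3:].upper() in STOP_CODONS:
        cds = cds[:-3]
    return cds


genome = "CCATGGCTAGCTAAGG"  # toy example, not real data
print(extract_cds(genome, 3, 14))  # ATGGCTAGC
```

Genes annotated on the minus strand are reverse-complemented so that every extracted sequence reads from start codon to stop codon.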

2. Verify Sequence Quality

Check sequence length

# Quick Python check
from Bio import SeqIO

for record in SeqIO.parse("genes.fasta", "fasta"):
    if len(record.seq) % 3 != 0:
        print(f"Error: {record.id} length not divisible by 3")
    else:
        print(f"OK: {record.id} length = {len(record.seq)}")

Verify translation

from Bio import SeqIO

for record in SeqIO.parse("genes.fasta", "fasta"):
    protein = record.seq.translate()
    if "*" in protein[:-1]:  # Stop codon in middle
        print(f"Warning: {record.id} has internal stop codon")
    else:
        print(f"OK: {record.id} translates correctly")

Check for ambiguous bases

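# Count N's in sequence lines (headers excluded)
grep -v "^>" genes.fasta | grep -o "[Nn]" | wc -l

# List headers of sequences that contain N's (also works for wrapped sequences)
awk '/^>/ {id = $0; next} /[Nn]/ {print id}' genes.fasta | uniq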

3. Sequence Alignment Preparation

PROTÉGÉ performs its own alignment using MUSCLE. Do not pre-align your sequences - provide unaligned FASTA files.

PROTÉGÉ’s internal workflow:

# From protege.py:138-139
amino = nuc.translate()
# Sequences are aligned at protein level (lines 179-182)
alnProc = subprocess.run(["muscle_lin", "-in", in_file, "-out", out_file])

Common Issues and Solutions

Issue: Translation Error

Error message: Seq.translate() got an invalid codon

Causes:

Sequence length not divisible by 3
Non-standard nucleotide characters
Sequence not in correct reading frame

Solution:

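# Print the total length of each sequence (handles sequences wrapped over several lines)
awk '/^>/ {if (id) print id, len; id = $0; len = 0; next} {len += length($0)} END {if (id) print id, len}' genes.fasta

# All lengths should be divisible by 3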

Issue: Internal Stop Codons

Symptom: Truncated protein sequences or alignment errors

Causes:

Wrong reading frame
Sequencing errors
Pseudogenes or non-functional sequences

Solution:

Verify gene annotation
Check sequence orientation (may need reverse complement)
Remove problematic sequences
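To check whether a problematic sequence is simply in the wrong orientation or frame, all six reading frames can be scanned for one that starts with ATG and has no internal stop codon. This is a minimal sketch under the standard genetic code; the function names are hypothetical, and short sequences can produce spurious candidates.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def internal_stops(sequence):
    """Count stop codons that occur before the final complete codon."""
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    return sum(codon in STOP_CODONS for codon in codons[:-1])


def candidate_frames(sequence):
    """Return (strand, offset) pairs that start with ATG and lack internal stops."""
    forward = sequence.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    hits = []
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frame = seq[offset:]
            if frame.startswith("ATG") and internal_stops(frame) == 0:
                hits.append((strand, offset))
    return hits


# A coding sequence supplied as its reverse complement
print(candidate_frames("ACCTGATAGTAACAT"))  # [('-', 0)]
```

A result of ('-', 0) suggests the sequence should be reverse-complemented before it is added to the input file.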

Issue: File Not Found

Error message: File not found or cannot open file

Causes:

File not in mounted directory
Incorrect filename in command
File permissions issue

Solution:

# List files in container
docker run --rm \
  --mount type=bind,source=/your/path/,target=/root/. \
  ddelgadillo/protege_base:v1.0.2 \
  ls -la /root/

Issue: Too Few Sequences

Recommendation: Use at least 5-10 sequences for meaningful consensus

Why:

Fewer sequences yield a less reliable consensus
The set may not capture sequence diversity
Primers may not work on related taxa

Solution:

Include representatives from target group
Balance between diversity and conservation

Issue: Sequences Too Divergent

Symptom: No consensus regions found or all primers highly degenerate

Causes:

Sequences from distantly related organisms
Wrong gene region selected
Mixed gene families

Solution:

Use more closely related sequences
Select more conserved gene regions
Reduce consensus threshold (-c 80 or lower)
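To build intuition for how a consensus threshold interacts with divergence, the sketch below computes the percentage of sequences that share the most common symbol in each alignment column and then reports the fraction of columns that meet a given threshold. This is an assumption about how a percentage threshold such as -c behaves, not PROTÉGÉ's actual consensus code; the function names and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(aligned_sequences):
    """Return, per alignment column, the percentage held by the most common symbol."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    percentages = []
    for column in zip(*aligned_sequences):
        top_count = Counter(column).most_common(1)[0][1]
        percentages.append(100.0 * top_count / len(column))
    return percentages


def conserved_fraction(aligned_sequences, threshold):
    """Fraction of columns whose top symbol reaches the threshold percentage."""
    percentages = column_conservation(aligned_sequences)
    return sum(p >= threshold for p in percentages) / len(percentages)


alignment = ["ATGGCT", "ATGGCA", "ATGACT", "ATGGCT", "ATGGCC"]  # toy data
for threshold in (90, 80, 60):
    print(threshold, round(conserved_fraction(alignment, threshold), 2))
```

In this toy alignment, lowering the threshold from 90 to 60 raises the share of columns that count as conserved, which is the trade-off behind reducing the threshold for divergent input.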

Best Practices Summary

Sequence Selection

Use protein-coding genes only
Include 10-50 representative sequences
Balance conservation and diversity
Verify all sequences are same gene

File Preparation

Extract complete CDS (start to stop)
Verify length divisible by 3
Check translation for stop codons
Use descriptive filenames

Quality Control

Remove sequences with ambiguous bases
Verify correct reading frame
Check for sequencing errors
Ensure consistent gene boundaries

Organization

Keep sequences in dedicated directory
Use consistent naming conventions
Document sequence sources
Back up original files
