Skip to main content

Overview

Sequence alignment is the foundation of phylogenetic primer design in PROTÉGÉ PD. The tool uses a codon-aware alignment strategy that aligns sequences at the protein level, then back-translates to preserve the nucleotide information needed for primer design.

Why Align at the Protein Level?

Protein-coding genes are more conserved at the amino acid level than at the nucleotide level due to:
  • Synonymous substitutions - multiple codons encode the same amino acid
  • Conservative substitutions - chemically similar amino acids are often exchangeable
  • Functional constraints - protein structure/function limits variation
Aligning DNA sequences directly can produce misalignments when synonymous substitutions occur. Protein-level alignment avoids this problem.
Example: The codons ATG, ATA, ATC, and ATT all encode similar amino acids (Met/Ile). At the DNA level, these look different, but at the protein level, the alignment is clear.

The Alignment Workflow

PROTÉGÉ PD implements a four-step alignment process:

1. Translation to Protein Sequences

Input nucleotide sequences are translated to amino acids using the standard genetic code:
# From protege.py:132-145
for seq_record in SeqIO.parse(genFile, 'fasta'):
    idList.append(seq_record.id)
    nuc = seq_record.seq._data.decode("utf-8")
    seqList.append(nuc)
    nucLen.append(len(nuc))
    nuc = Seq(nuc)
    amino = nuc.translate()
    amino = amino._data.decode("utf-8")
    aminoList.append(amino)
    aminoLen.append(len(amino))
Input: Nucleotide FASTA file
>Species_A
ATGGCTACCGTAAA
>Species_B  
ATGGCGACTAAGAA
Output: Translated amino acid sequences
>Species_A
MATVK
>Species_B
MATKK

2. Multiple Sequence Alignment with MUSCLE

Protein sequences are aligned using the MUSCLE algorithm (Multiple Sequence Comparison by Log-Expectation):
# From protege.py:163-183
muscle = pPath + '/muscle/muscle_lin'
in_file = translatedName
out_file = 'aligned_muscle_pl_' + translatedName

muscle_cline = str(muscle) + ' -in ' + str(in_file) + ' -out ' + out_file

alnProc = subprocess.run(["muscle_lin", "-in", in_file, "-out", out_file], 
                         capture_output=True)
align = AlignIO.read(open(out_file), "fasta")
MUSCLE is a fast, accurate multiple sequence alignment algorithm that uses progressive alignment with iterative refinement. It’s particularly effective for protein sequences and handles divergent sequences well.
Output: Aligned protein sequences with gaps
>Species_A
M-ATVK
>Species_B
MATVK-

3. Back-Translation to Nucleotide Alignment

The aligned protein sequences are back-translated to nucleotide codons, preserving the alignment structure:
# From protege.py:221-237
AlbyCod = []
for j in range(0, len(sequences)):
    tEq = aminoEq[j]  # Original codon equivalents
    tAl = sequences.al_amino_seq[j]  # Aligned amino acid sequence
    tAlbyCod = ''
    aminoCont = 0
    for i in range(0, len(tAl)):
        if tAl[i] == '-':  # Gap in alignment
            tAlbyCod = tAlbyCod + '---'  # Insert triple gap
        else:
            if tAl[i] == tEq[aminoCont][0]:
                tAlbyCod = tAlbyCod + str(tEq[aminoCont][1])  # Original codon
                aminoCont = aminoCont + 1
    AlbyCod.append(tAlbyCod)
Key principle: Each amino acid gap becomes a 3-nucleotide gap (---), and each aligned amino acid is replaced with its original codon. Output: Codon-aligned nucleotide sequences
>Species_A
ATG---GCTACCGTAAAA
>Species_B
ATGGCGACT------AAGAA

4. Consensus Sequence Generation

From the nucleotide alignment, a consensus sequence is built position-by-position:
# From protege.py:246-300
sConsensus = ''
for i in range(0, codAlLen):
    pcList = [0,0,0,0,0]  # [gap, G, A, C, T] counts
    for h in range(0, len(AlbyCod)):
        if AlbyCod[h][i] == '-':
            pcList[0] += 1
        elif AlbyCod[h][i] == 'G':
            pcList[1] += 1
        elif AlbyCod[h][i] == 'A':
            pcList[2] += 1
        elif AlbyCod[h][i] == 'C':
            pcList[3] += 1
        elif AlbyCod[h][i] == 'T':
            pcList[4] += 1
    
    # Convert to percentages
    pcList = np.asarray(pcList)
    pcList = (pcList / genNumber) * 100
    pcList = pcList.round(2)
    orderList = list(np.argsort(-pcList))
At each position, nucleotide frequencies are calculated and the consensus is determined based on:
  1. Single nucleotide consensus (protege.py:275-277)
    if pcList[orderList[0]] >= consensusPerc:
        sConsensus = sConsensus + str(sList[orderList[0]])
    
    If one nucleotide exceeds the threshold (default 90%), use it.
  2. Gap-dominant position (protege.py:278-280)
    elif pcList[0] >= (100 - consensusPerc):
        sConsensus = sConsensus + str(sList[orderList[0]])
    
    If gaps dominate (>10%), mark as gap unless --nogapconsensus is set.
  3. Degenerate consensus (protege.py:281-296)
    acc = pcList[orderList[0]]
    accNuc = sList[orderList[0]]
    for k in range(1, len(orderList)):
        acc += pcList[orderList[k]]
        accNuc = accNuc + sList[orderList[k]]
        if acc > consensusPerc:
            accNuc = sorted(list(accNuc))
            accNuc = ''.join(map(str, accNuc))
            accDeg = pt.degEquivalent(accNuc)  # Convert to IUPAC code
            sConsensus = sConsensus + accDeg
            break
    
    If no single nucleotide exceeds threshold, accumulate nucleotides until the sum exceeds the threshold, then use the appropriate degeneracy code.
Example consensus calculation:
Position 42:
  - G: 45%
  - A: 48%  
  - C: 5%
  - T: 2%
  
With 90% threshold:
  G + A = 93% > 90%
  Nucleotides: AG (sorted)
  Degeneracy code: R
  
Consensus[42] = 'R'

Gap Handling

The --nogapconsensus flag controls how alignment gaps are treated:

With --nogapconsensus (default: True)

# From protege.py:268-270
if not gapConsensus and pcList[0] > 0:
    sConsensus = sConsensus + '-'
Any position with any gap (even 1%) is marked as - in the consensus. This ensures primers avoid indel-prone regions. Use case: Conservative primer design - avoid regions with insertions/deletions

Without --nogapconsensus (flag not set)

Gaps are treated as just another character in the frequency calculation. A position with 15% gaps and 85% ‘A’ would yield ‘A’ as consensus. Use case: Permissive primer design - accept some indel variation
Recommendation: Keep gap filtering enabled (default) for phylogenetic primers. Indel regions typically indicate structural variation that can cause primer binding failures.

Alignment Quality Metrics

After alignment, PROTÉGÉ PD reports:
# From protege.py:186-188
print("Alignment length %i" % align.get_alignment_length())
for record in align:
   print(record.seq._data.decode('utf-8') + " " + record.id)
Key metrics to evaluate:
  • Alignment length - should be approximately (input length / 3) for proper codon alignment
  • Gap distribution - excessive gaps may indicate misalignment or highly divergent sequences
  • Conserved regions - long stretches without gaps are ideal for primer placement

Troubleshooting Alignment Issues

Cause: Input sequences contain incomplete codons or are not in-frame.Solution: Ensure all input sequences:
  • Start at the beginning of the gene (start codon)
  • End at a complete codon
  • Do not contain frameshifts
Cause: Input sequences are highly divergent or include paralogs instead of orthologs.Solution:
  • Use more closely related sequences
  • Verify sequences are orthologous (same gene across species)
  • Remove obvious outlier sequences
Cause: Insufficient sequence conservation or low consensus threshold.Solution:
  • Increase the consensus threshold with -c flag (e.g., -c 95)
  • Use sequences from a narrower taxonomic range
  • Select a different gene with higher conservation

Output Files

The alignment process generates several intermediate files:
FileDescription
translated_seqs_pL.fasTranslated amino acid sequences
aligned_muscle_pl_translated_seqs_pL.fasMUSCLE-aligned protein sequences
sequences.csvDataFrame with nucleotide and amino acid sequences
alSequences.csvDataFrame with aligned sequences
These files are useful for:
  • Quality control - manually inspect alignments
  • Troubleshooting - identify problematic sequences
  • Downstream analysis - use aligned sequences for phylogenetic trees

Best Practices

Use orthologous sequences - same gene from different species, not paralogs Pre-filter sequences - remove partial sequences or pseudogenes before alignment Check reading frame - ensure all sequences are in the same frame Inspect alignment - visually check the alignment for obvious errors Use appropriate consensus threshold - balance between specificity (high threshold) and taxonomic coverage (low threshold)

Build docs developers (and LLMs) love