Sequence Alignment

Overview

Sequence alignment is the foundation of phylogenetic primer design in PROTÉGÉ PD. The tool uses a codon-aware alignment strategy that aligns sequences at the protein level, then back-translates to preserve the nucleotide information needed for primer design.

Why Align at the Protein Level?

Protein-coding genes are more conserved at the amino acid level than at the nucleotide level due to:

Synonymous substitutions - multiple codons encode the same amino acid
Conservative substitutions - chemically similar amino acids are often exchangeable
Functional constraints - protein structure/function limits variation

Aligning DNA sequences directly can produce misalignments when synonymous substitutions occur. Protein-level alignment avoids this problem.

Example: The codons ATG, ATA, ATC, and ATT all encode similar amino acids (Met/Ile). At the DNA level, these look different, but at the protein level, the alignment is clear.

The Alignment Workflow

PROTÉGÉ PD implements a four-step alignment process:

1. Translation to Protein Sequences

Input nucleotide sequences are translated to amino acids using the standard genetic code:

# From protege.py:132-145
for seq_record in SeqIO.parse(genFile, 'fasta'):
    idList.append(seq_record.id)
    nuc = seq_record.seq._data.decode("utf-8")
    seqList.append(nuc)
    nucLen.append(len(nuc))
    nuc = Seq(nuc)
    amino = nuc.translate()
    amino = amino._data.decode("utf-8")
    aminoList.append(amino)
    aminoLen.append(len(amino))

Input: Nucleotide FASTA file

>Species_A
ATGGCTACCGTAAA
>Species_B  
ATGGCGACTAAGAA

Output: Translated amino acid sequences

>Species_A
MATVK
>Species_B
MATKK

2. Multiple Sequence Alignment with MUSCLE

Protein sequences are aligned using the MUSCLE algorithm (Multiple Sequence Comparison by Log-Expectation):

# From protege.py:163-183
muscle = pPath + '/muscle/muscle_lin'
in_file = translatedName
out_file = 'aligned_muscle_pl_' + translatedName

muscle_cline = str(muscle) + ' -in ' + str(in_file) + ' -out ' + out_file

alnProc = subprocess.run(["muscle_lin", "-in", in_file, "-out", out_file], 
                         capture_output=True)
align = AlignIO.read(open(out_file), "fasta")

MUSCLE is a fast, accurate multiple sequence alignment algorithm that uses progressive alignment with iterative refinement. It’s particularly effective for protein sequences and handles divergent sequences well.

Output: Aligned protein sequences with gaps

>Species_A
M-ATVK
>Species_B
MATVK-

3. Back-Translation to Nucleotide Alignment

The aligned protein sequences are back-translated to nucleotide codons, preserving the alignment structure:

# From protege.py:221-237
AlbyCod = []
for j in range(0, len(sequences)):
    tEq = aminoEq[j]  # Original codon equivalents
    tAl = sequences.al_amino_seq[j]  # Aligned amino acid sequence
    tAlbyCod = ''
    aminoCont = 0
    for i in range(0, len(tAl)):
        if tAl[i] == '-':  # Gap in alignment
            tAlbyCod = tAlbyCod + '---'  # Insert triple gap
        else:
            if tAl[i] == tEq[aminoCont][0]:
                tAlbyCod = tAlbyCod + str(tEq[aminoCont][1])  # Original codon
                aminoCont = aminoCont + 1
    AlbyCod.append(tAlbyCod)

Key principle: Each amino acid gap becomes a 3-nucleotide gap (---), and each aligned amino acid is replaced with its original codon. Output: Codon-aligned nucleotide sequences

>Species_A
ATG---GCTACCGTAAAA
>Species_B
ATGGCGACT------AAGAA

4. Consensus Sequence Generation

From the nucleotide alignment, a consensus sequence is built position-by-position:

# From protege.py:246-300
sConsensus = ''
for i in range(0, codAlLen):
    pcList = [0,0,0,0,0]  # [gap, G, A, C, T] counts
    for h in range(0, len(AlbyCod)):
        if AlbyCod[h][i] == '-':
            pcList[0] += 1
        elif AlbyCod[h][i] == 'G':
            pcList[1] += 1
        elif AlbyCod[h][i] == 'A':
            pcList[2] += 1
        elif AlbyCod[h][i] == 'C':
            pcList[3] += 1
        elif AlbyCod[h][i] == 'T':
            pcList[4] += 1
    
    # Convert to percentages
    pcList = np.asarray(pcList)
    pcList = (pcList / genNumber) * 100
    pcList = pcList.round(2)
    orderList = list(np.argsort(-pcList))

At each position, nucleotide frequencies are calculated and the consensus is determined based on:

Consensus Rules

Single nucleotide consensus (protege.py:275-277)

if pcList[orderList[0]] >= consensusPerc:
    sConsensus = sConsensus + str(sList[orderList[0]])

If one nucleotide exceeds the threshold (default 90%), use it.

Gap-dominant position (protege.py:278-280)

elif pcList[0] >= (100 - consensusPerc):
    sConsensus = sConsensus + str(sList[orderList[0]])

If gaps dominate (>10%), mark as gap unless --nogapconsensus is set.

Degenerate consensus (protege.py:281-296)

acc = pcList[orderList[0]]
accNuc = sList[orderList[0]]
for k in range(1, len(orderList)):
    acc += pcList[orderList[k]]
    accNuc = accNuc + sList[orderList[k]]
    if acc > consensusPerc:
        accNuc = sorted(list(accNuc))
        accNuc = ''.join(map(str, accNuc))
        accDeg = pt.degEquivalent(accNuc)  # Convert to IUPAC code
        sConsensus = sConsensus + accDeg
        break

If no single nucleotide exceeds threshold, accumulate nucleotides until the sum exceeds the threshold, then use the appropriate degeneracy code.

Example consensus calculation:

Position 42:
  - G: 45%
  - A: 48%  
  - C: 5%
  - T: 2%
  
With 90% threshold:
  G + A = 93% > 90%
  Nucleotides: AG (sorted)
  Degeneracy code: R
  
Consensus[42] = 'R'

Gap Handling

The --nogapconsensus flag controls how alignment gaps are treated:

With `--nogapconsensus` (default: True)

# From protege.py:268-270
if not gapConsensus and pcList[0] > 0:
    sConsensus = sConsensus + '-'

Any position with any gap (even 1%) is marked as - in the consensus. This ensures primers avoid indel-prone regions. Use case: Conservative primer design - avoid regions with insertions/deletions

Without `--nogapconsensus` (flag not set)

Gaps are treated as just another character in the frequency calculation. A position with 15% gaps and 85% ‘A’ would yield ‘A’ as consensus. Use case: Permissive primer design - accept some indel variation

Recommendation: Keep gap filtering enabled (default) for phylogenetic primers. Indel regions typically indicate structural variation that can cause primer binding failures.

Alignment Quality Metrics

After alignment, PROTÉGÉ PD reports:

# From protege.py:186-188
print("Alignment length %i" % align.get_alignment_length())
for record in align:
   print(record.seq._data.decode('utf-8') + " " + record.id)

Key metrics to evaluate:

Alignment length - should be approximately (input length / 3) for proper codon alignment
Gap distribution - excessive gaps may indicate misalignment or highly divergent sequences
Conserved regions - long stretches without gaps are ideal for primer placement

Troubleshooting Alignment Issues

Problem: Alignment length is not divisible by 3

Cause: Input sequences contain incomplete codons or are not in-frame.Solution: Ensure all input sequences:

Start at the beginning of the gene (start codon)
End at a complete codon
Do not contain frameshifts

Problem: Excessive gaps in alignment

Cause: Input sequences are highly divergent or include paralogs instead of orthologs.Solution:

Use more closely related sequences
Verify sequences are orthologous (same gene across species)
Remove obvious outlier sequences

Problem: Poor consensus (all degenerate or gaps)

Cause: Insufficient sequence conservation or low consensus threshold.Solution:

Increase the consensus threshold with -c flag (e.g., -c 95)
Use sequences from a narrower taxonomic range
Select a different gene with higher conservation

Output Files

The alignment process generates several intermediate files:

File	Description
`translated_seqs_pL.fas`	Translated amino acid sequences
`aligned_muscle_pl_translated_seqs_pL.fas`	MUSCLE-aligned protein sequences
`sequences.csv`	DataFrame with nucleotide and amino acid sequences
`alSequences.csv`	DataFrame with aligned sequences

These files are useful for:

Quality control - manually inspect alignments
Troubleshooting - identify problematic sequences
Downstream analysis - use aligned sequences for phylogenetic trees

Best Practices

✓ Use orthologous sequences - same gene from different species, not paralogs ✓ Pre-filter sequences - remove partial sequences or pseudogenes before alignment ✓ Check reading frame - ensure all sequences are in the same frame ✓ Inspect alignment - visually check the alignment for obvious errors ✓ Use appropriate consensus threshold - balance between specificity (high threshold) and taxonomic coverage (low threshold)

PhyloTag Approach - Overview of the methodology
Primer Degeneracy - How degenerate positions are encoded
Melting Temperature - Calculating primer binding temperatures

Getting Started

Usage Guide

Core Concepts

Web Interface

Advanced

Overview

Why Align at the Protein Level?

The Alignment Workflow

1. Translation to Protein Sequences

2. Multiple Sequence Alignment with MUSCLE

3. Back-Translation to Nucleotide Alignment

4. Consensus Sequence Generation

Gap Handling

With `--nogapconsensus` (default: True)

Without `--nogapconsensus` (flag not set)

Alignment Quality Metrics

Troubleshooting Alignment Issues

Output Files

Best Practices

Build docs developers (and LLMs) love

Getting Started

Usage Guide

Core Concepts

Web Interface

Advanced

​Overview

​Why Align at the Protein Level?

​The Alignment Workflow

​1. Translation to Protein Sequences

​2. Multiple Sequence Alignment with MUSCLE

​3. Back-Translation to Nucleotide Alignment

​4. Consensus Sequence Generation

​Gap Handling

​With --nogapconsensus (default: True)

​Without --nogapconsensus (flag not set)

​Alignment Quality Metrics

​Troubleshooting Alignment Issues

​Output Files

​Best Practices

​Related Concepts

Build docs developers (and LLMs) love

Overview

Why Align at the Protein Level?

The Alignment Workflow

1. Translation to Protein Sequences

2. Multiple Sequence Alignment with MUSCLE

3. Back-Translation to Nucleotide Alignment

4. Consensus Sequence Generation

Gap Handling

With `--nogapconsensus` (default: True)

Without `--nogapconsensus` (flag not set)

Alignment Quality Metrics

Troubleshooting Alignment Issues

Output Files

Best Practices

Related Concepts