
Output Files Overview

PROTÉGÉ PD generates multiple output files during the primer design process. All files are written to the same directory as your input FASTA file (the mounted directory).
Output files are automatically created in the directory you bind-mount with --mount type=bind,source=/your/path/,target=/root/.

Primary Output Files

1. protege_consensus.csv

Purpose: Main primer design results with all candidate primers and degeneracy calculations.
Location: Same directory as input file
Generated by: protege.py:326
phyloDF.to_csv('protege_consensus.csv',sep = ';', index = True)

File Structure

Column: position
string
Primer position in the consensus sequence as a range.
Format: start-end
Example: 1-21, 2-22, 3-23
Position numbers correspond to nucleotide positions in the aligned consensus sequence.
Column: degeneracies
integer
Number of degenerate primer variants for this position.
Calculation: Product of the degeneracies of all bases in the primer sequence.
Examples:
  • 1 = No degeneracies (all standard nucleotides)
  • 2 = One 2-fold degenerate position (R, Y, W, S, K, or M)
  • 3 = One 3-fold degenerate position (B, D, H, or V)
  • 4 = One 4-fold degenerate position (N) or two 2-fold degenerate positions
  • 8 = Three 2-fold degenerate positions, or one 4-fold and one 2-fold position
Special case: 0 = Position contains gaps, primer not viable
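The product rule above can be sketched in a few lines of Python. This is an illustrative helper, not the code in protege.py; the names IUPAC_FOLD and primer_degeneracy are assumptions introduced for this example.

```python
from math import prod

# Number of nucleotides each IUPAC code can stand for.
IUPAC_FOLD = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "W": 2, "S": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def primer_degeneracy(primer: str) -> int:
    """Return the number of distinct sequences a degenerate primer encodes.

    Returns 0 when the primer contains a gap, mirroring the special case
    described for the degeneracies column.
    """
    if "-" in primer:
        return 0
    return prod(IUPAC_FOLD[base] for base in primer.upper())


print(primer_degeneracy("ATGTCGGTTCGTGACGTGAAA"))   # 1
print(primer_degeneracy("GTCGGTTCGTGACGTGAAACR"))   # 2
print(primer_degeneracy("CGGTTCGTGACGTGAAACGGT---"))  # 0
```

Recomputing the value this way is a quick cross-check when a row's degeneracies count looks inconsistent with its primer sequence.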
Column: forwardPrimer
string
Forward primer sequence with IUPAC degenerate codes.
IUPAC Codes Used:
  • A, T, G, C - Standard nucleotides
  • R = A or G (puRine)
  • Y = C or T (pYrimidine)
  • W = A or T (Weak)
  • S = G or C (Strong)
  • K = G or T (Keto)
  • M = A or C (aMino)
  • N = Any nucleotide
  • - = Gap (primer not usable)
Example: ATGCGRAAYWSKMNTGCAT
Column: reversePrimer
string
Reverse complement of the forward primer.
Generated by: protege.py:310-311
frwd = Seq(frwdPrimer)
rvrsPrimer = frwd.reverse_complement()
Usage: Use this sequence when ordering the reverse PCR primer.
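
To cross-check a reversePrimer value without installing Biopython, the reverse complement can be approximated with a lookup table over the IUPAC codes listed for forwardPrimer. This is a minimal sketch; COMPLEMENT and reverse_complement are names made up for this example, and PROTÉGÉ itself uses Biopython's Seq.reverse_complement() as shown above.

```python
# Complement of each IUPAC code used in the output; W, S and N are
# their own complements, and gaps are passed through unchanged.
COMPLEMENT = {
    "A": "T", "T": "A", "G": "C", "C": "G",
    "R": "Y", "Y": "R", "K": "M", "M": "K",
    "W": "W", "S": "S", "N": "N", "-": "-",
}


def reverse_complement(primer: str) -> str:
    """Reverse the primer and complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(primer.upper()))


forward = "GTCGGTTCGTGACGTGAAACR"
print(reverse_complement(forward))  # YGTTTCACGTCACGAACCGAC
```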

Example Content

;position;degeneracies;forwardPrimer;reversePrimer
0;1-21;1;ATGTCGGTTCGTGACGTGAAA;TTTCACGTCACGAACCGACAT
1;2-22;1;TGTCGGTTCGTGACGTGAAAC;GTTTCACGTCACGAACCGACA
2;3-23;2;GTCGGTTCGTGACGTGAAACR;YGTTTCACGTCACGAACCGAC
3;4-24;4;TCGGTTCGTGACGTGAAACGG;CCGTTTCACGTCACGAACCGA
4;5-25;0;CGGTTCGTGACGTGAAACGGT---;---ACCGTTTCACGTCACGAACCG
The first column (unnamed) is the row index. The semicolon (;) is used as the delimiter.
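
Because position is stored as a start-end string, numeric filtering needs the two numbers split out first. The sketch below parses two rows copied from the example with Python's standard csv module; read_candidates is a hypothetical helper name, and a real run would pass an open file handle instead of the inline sample.

```python
import csv
import io

# Two rows copied from the example content above.
SAMPLE = """\
;position;degeneracies;forwardPrimer;reversePrimer
0;1-21;1;ATGTCGGTTCGTGACGTGAAA;TTTCACGTCACGAACCGACAT
2;3-23;2;GTCGGTTCGTGACGTGAAACR;YGTTTCACGTCACGAACCGAC
"""


def read_candidates(handle):
    """Parse semicolon-delimited rows into dicts with numeric positions."""
    candidates = []
    for row in csv.DictReader(handle, delimiter=";"):
        start, end = (int(part) for part in row["position"].split("-"))
        candidates.append({
            "start": start,
            "end": end,
            "degeneracies": int(row["degeneracies"]),
            "forwardPrimer": row["forwardPrimer"],
        })
    return candidates


candidates = read_candidates(io.StringIO(SAMPLE))
for c in candidates:
    print(c["start"], c["end"], c["degeneracies"], c["forwardPrimer"])
```

For the real file, open it with open('protege_consensus.csv', newline='') and pass the handle to the same function.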

2. sequences.csv

Purpose: Original sequence information with nucleotide and amino acid sequences.
Location: Same directory as input file
Generated by: protege.py:155
sequences.to_csv('sequences.csv',sep = ';', index = True)

File Structure

Column: id
string
Sequence identifier from FASTA header (everything after >).
Column: nuc_seq
string
Original nucleotide sequence from input file.
Column: nuc_lenght
integer
Length of nucleotide sequence in base pairs.
Note the spelling: “nuc_lenght” (not “length”) as in the source code.
Column: amino_seq
string
Translated amino acid sequence.
Generated by: protege.py:138
amino = nuc.translate()
Column: amino_lenght
integer
Length of amino acid sequence (nuc_lenght / 3).

Example Content

;id;nuc_seq;nuc_lenght;amino_seq;amino_lenght
0;strain_1_gyrB;ATGTCGGTTCGTGACGTGAAA...;180;MSVRDVKPVAEGIGA...;60
1;strain_2_gyrB;ATGTCGGTTCGTGACGTGAAA...;180;MSVRDVKPVAEGIGA...;60
2;strain_3_gyrB;ATGTCGGTTCGTGACGTGAAA...;180;MSVRDVKPVAEGIGA...;60
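
A quick sanity check on sequences.csv is to confirm that the recorded lengths agree with the sequences themselves. The sketch below assumes that each length column holds the character count of its sequence column and that amino_lenght equals nuc_lenght divided by 3 (rounded down); the sample row and the helper name length_mismatches are invented for illustration. Note that the misspelled column names must be used as-is.

```python
import csv
import io

# Hypothetical one-row sample; column names keep the "lenght" spelling.
SAMPLE = """\
;id;nuc_seq;nuc_lenght;amino_seq;amino_lenght
0;strain_1_gyrB;ATGTCGGTTCGTGACGTGAAA;21;MSVRDVK;7
"""


def length_mismatches(handle):
    """Return the ids of rows whose recorded lengths look inconsistent."""
    bad_ids = []
    for row in csv.DictReader(handle, delimiter=";"):
        nuc_len = int(row["nuc_lenght"])
        amino_len = int(row["amino_lenght"])
        consistent = (
            len(row["nuc_seq"]) == nuc_len
            and len(row["amino_seq"]) == amino_len
            and amino_len == nuc_len // 3
        )
        if not consistent:
            bad_ids.append(row["id"])
    return bad_ids


print(length_mismatches(io.StringIO(SAMPLE)))  # []
```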

3. alSequences.csv

Purpose: Aligned amino acid sequences after MUSCLE alignment.
Location: Same directory as input file
Generated by: protege.py:205
alSequences.to_csv('alSequences.csv',sep = ';', index = True)

File Structure

Column: id
string
Sequence identifier (matches sequences.csv).
Column: al_amino_seq
string
Aligned amino acid sequence with gaps (-).
All sequences in this column have the same length due to alignment.
Column: al_amino_lenght
integer
Length of aligned sequence (including gaps).
This value is the same for all sequences in a run.

Example Content

;id;al_amino_seq;al_amino_lenght
0;strain_1_gyrB;MSVRDVKPVAEGIGA---LLAVA...;65
1;strain_2_gyrB;MSVRDVKPVAEGIGALLAVA...;65
2;strain_3_gyrB;MSVRDVKPVAEGIGARLLAVA...;65
Gaps in aligned sequences indicate insertions/deletions between sequences and affect primer design.
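
To see where gaps fall, one option is to scan the aligned sequences column by column. The following is a minimal sketch with made-up sequences and a hypothetical helper name gap_columns; it also verifies the equal-length property described for al_amino_seq. Column indices are 0-based.

```python
def gap_columns(aligned):
    """Return 0-based alignment columns where any sequence has a gap.

    `aligned` maps sequence id -> aligned amino acid string.
    Raises ValueError if the sequences are not all the same length.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError(f"aligned sequences differ in length: {sorted(lengths)}")
    length = lengths.pop()
    return [
        col for col in range(length)
        if any(seq[col] == "-" for seq in aligned.values())
    ]


# Hypothetical alignment, for illustration only.
example = {
    "strain_1": "MSVRD-KP",
    "strain_2": "MSVRDVKP",
    "strain_3": "MSVRDVKP",
}
print(gap_columns(example))  # [5]
```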

Intermediate Files

4. translated_seqs_pL.fas

Purpose: Amino acid sequences in FASTA format for MUSCLE alignment.
Location: Same directory as input file
Generated by: protege.py:157-161
f = open(translatedName, 'w')
for i in range(0,len(sequences)):
    f.write('>' + sequences.id[i] + '\n')
    f.write(sequences.amino_seq[i] + '\n')
f.close()

Example Content

>strain_1_gyrB
MSVRDVKPVAEGIGAGRAGVAGAKRGRAGAGVRARAR
>strain_2_gyrB
MSVRDVKPVAEGIGAGRAGVAGAKRGRAGAGVRARARK
>strain_3_gyrB
MSVRDVKPVAEGIGAGRAGVAGAKRGRAGAGVRARARQ
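
The intermediate FASTA files can be read back with a few lines of standard-library Python. This is a minimal sketch (read_fasta is a name chosen for this example, not a PROTÉGÉ function); Biopython's SeqIO would work equally well.

```python
import io


def read_fasta(handle):
    """Read FASTA records into a dict of id -> sequence.

    Sequence lines belonging to the same record are concatenated.
    """
    records = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


# Content copied from the example above.
sample = io.StringIO(
    ">strain_1_gyrB\n"
    "MSVRDVKPVAEGIGAGRAGVAGAKRGRAGAGVRARAR\n"
    ">strain_2_gyrB\n"
    "MSVRDVKPVAEGIGAGRAGVAGAKRGRAGAGVRARARK\n"
)
records = read_fasta(sample)
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```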

5. aligned_muscle_pl_*.fas

Purpose: MUSCLE-aligned amino acid sequences.
Filename pattern: aligned_muscle_pl_translated_seqs_pL.fas
Location: Same directory as input file
Generated by: protege.py:179 (MUSCLE alignment)
alnProc = subprocess.run(["muscle_lin", "-in", in_file, "-out", out_file])

Example Content

>strain_1_gyrB
MSVRDVKPVAEGIGA---GRAGVAGAKRGRAGAGVRARARV
>strain_2_gyrB
MSVRDVKPVAEGIGAGRAGVAGAKRGRAGAGVRARARK
>strain_3_gyrB
MSVRDVKPVAEGIGARGRAGVAGAKRGRAGAGVRARARQ
This file shows the protein-level alignment used for consensus calculations. Gaps (---) represent insertions/deletions.

Downloading Files from Web Interface

When running PROTÉGÉ with the web interface (Dash), you can download results directly from the browser.

Access Output Files

1. Access web interface
   Open http://127.0.0.1:8050 in your browser after starting PROTÉGÉ.

2. Wait for processing
   PROTÉGÉ will display processing progress. Wait for completion message.

3. View results
   Interactive visualization of primer candidates will appear.

4. Download files
   Use the download buttons or links in the interface to save result files.

Access Files from Command Line

All output files are written to your mounted directory:
# List all output files
ls -lh /your/mounted/path/

# Expected files:
# - protege_consensus.csv (main results)
# - sequences.csv
# - alSequences.csv
# - translated_seqs_pL.fas
# - aligned_muscle_pl_*.fas

File Locations

All output files are created in the directory you mounted to /root/ in the Docker container.
Example:
# If you ran:
docker run --mount type=bind,source=/home/user/data/,target=/root/. ...

# Files are created in:
/home/user/data/protege_consensus.csv
/home/user/data/sequences.csv
/home/user/data/alSequences.csv
/home/user/data/translated_seqs_pL.fas
/home/user/data/aligned_muscle_pl_*.fas
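
After a run, a short script can confirm that every expected file landed in the mounted directory. The sketch below uses the file names listed above; missing_outputs is a hypothetical helper, and the temporary directory only stands in for your real mount path.

```python
import tempfile
from pathlib import Path

# File names as documented above; the aligned file is matched by pattern.
EXPECTED = [
    "protege_consensus.csv",
    "sequences.csv",
    "alSequences.csv",
    "translated_seqs_pL.fas",
]
ALIGNED_PATTERN = "aligned_muscle_pl_*.fas"


def missing_outputs(directory):
    """Return the expected output names that are absent from `directory`."""
    directory = Path(directory)
    missing = [name for name in EXPECTED if not (directory / name).is_file()]
    if not any(directory.glob(ALIGNED_PATTERN)):
        missing.append(ALIGNED_PATTERN)
    return missing


# Demonstration with a throwaway directory containing one output file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sequences.csv").touch()
    print(missing_outputs(tmp))
```

In practice, call missing_outputs('/home/user/data/') with the source path you passed to --mount; an empty list means all outputs are present.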

Understanding Results

Selecting Best Primers

# Sort viable primers by degeneracy (lower is better);
# skips the header row and gapped rows (degeneracies = 0)
awk -F';' 'NR > 1 && $3 > 0' protege_consensus.csv | sort -t';' -k3,3n | head -20
Best primers:
  • Degeneracies = 1 (no degeneracy)
  • Degeneracies = 2-4 (low degeneracy, good)
  • Degeneracies = 8-16 (moderate, acceptable)
  • Degeneracies > 32 (high, may be problematic)
# Show only primers without gaps (skips the header row)
awk -F';' 'NR > 1 && $4 !~ /-/' protege_consensus.csv
Primers containing gaps (degeneracies = 0) cannot be synthesized.
# Get primers lying entirely within a specific region (e.g., positions 100-200)
awk -F';' 'NR > 1 { split($2, p, "-"); if (p[1] >= 100 && p[2] <= 200) print }' protege_consensus.csv
Select primers from conserved gene regions if known.

Primer Quality Metrics

Excellent Primers

  • Degeneracies: 1
  • No gaps
  • From conserved regions
  • Standard nucleotides only

Good Primers

  • Degeneracies: 2-8
  • No gaps
  • Limited degenerate positions
  • Mostly standard nucleotides

Acceptable Primers

  • Degeneracies: more than 8, up to 32
  • No gaps
  • Multiple degenerate positions
  • May require optimization

Problematic Primers

  • Degeneracies: >32 or 0
  • Contains gaps
  • Highly degenerate
  • Difficult to synthesize
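
The four tiers can be expressed as a small classifier for use in scripts. This is a sketch under stated assumptions: it considers only the degeneracy count and the presence of gaps (not conservation of the region), it treats a value of exactly 8 as "good", and classify_primer is a name invented for this example.

```python
def classify_primer(degeneracies: int, primer: str) -> str:
    """Map a candidate primer to one of the quality tiers listed above."""
    if degeneracies == 0 or "-" in primer:
        return "problematic"   # gaps: primer not viable
    if degeneracies == 1:
        return "excellent"     # standard nucleotides only
    if degeneracies <= 8:
        return "good"
    if degeneracies <= 32:
        return "acceptable"
    return "problematic"       # highly degenerate


print(classify_primer(1, "ATGTCGGTTCGTGACGTGAAA"))     # excellent
print(classify_primer(2, "GTCGGTTCGTGACGTGAAACR"))     # good
print(classify_primer(0, "CGGTTCGTGACGTGAAACGGT---"))  # problematic
```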

Post-Processing Analysis

Import into Excel/Spreadsheet

# Convert semicolon-delimited to comma-delimited
sed 's/;/,/g' protege_consensus.csv > protege_consensus_comma.csv

# Open in Excel, LibreOffice, or Google Sheets
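
The sed one-liner is fine as long as no field contains a comma. If a field might contain one (for example, a FASTA header copied into the id column of sequences.csv), a safer alternative is Python's csv module, which quotes such fields automatically. The function name semicolon_to_comma and the sample row are illustrative assumptions.

```python
import csv
import io


def semicolon_to_comma(src, dst):
    """Copy semicolon-delimited rows from `src` to `dst` as comma-delimited CSV.

    Fields that contain a comma are quoted so the column layout is preserved.
    """
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, delimiter=",", lineterminator="\n")
    writer.writerows(reader)


# In-memory demonstration; the second row has a comma inside the id field.
source = io.StringIO(";id;nuc_lenght\n0;strain_1, gyrB;180\n")
target = io.StringIO()
semicolon_to_comma(source, target)
print(target.getvalue())
```

With real files, open both with newline='' (for example open('protege_consensus.csv', newline='')) and pass the handles to the same function.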

Python Analysis

import pandas as pd

# Load results
df = pd.read_csv('protege_consensus.csv', sep=';', index_col=0)

# Filter primers with degeneracy ≤ 4
good_primers = df[df['degeneracies'] <= 4]
good_primers = good_primers[good_primers['degeneracies'] > 0]

print(f"Found {len(good_primers)} high-quality primers")
print(good_primers[['position', 'degeneracies', 'forwardPrimer']].head(10))

# Export filtered results
good_primers.to_csv('selected_primers.csv', sep=';')

R Analysis

# Load results
df <- read.csv('protege_consensus.csv', sep=';')

# Filter and sort
good_primers <- df[df$degeneracies > 0 & df$degeneracies <= 8, ]
good_primers <- good_primers[order(good_primers$degeneracies), ]

# View top candidates
head(good_primers, 20)

# Export
write.csv(good_primers, 'selected_primers.csv', row.names=FALSE)

Troubleshooting

Output files are not created
Possible causes:
  • PROTÉGÉ encountered an error during processing
  • Insufficient disk space
  • Permission issues in mounted directory
Solutions:
  • Check terminal output for error messages
  • Verify write permissions: ls -la /your/path/
  • Ensure adequate free space: df -h
Primers are highly degenerate
Possible causes:
  • Input sequences too divergent
  • Consensus threshold too low
  • Wrong gene region selected
Solutions:
  • Increase consensus threshold: -c 95
  • Use more closely related sequences
  • Select more conserved genes
  • Check sequence quality and alignment
Many primers contain gaps
Possible causes:
  • Sequences have many insertions/deletions
  • Gap consensus enabled with variable sequences
Solutions:
  • Use -g flag to exclude gap positions
  • Trim sequences to conserved regions
  • Remove outlier sequences with many indels
CSV files do not open correctly in a spreadsheet
Possible causes:
  • Semicolon delimiter not recognized
  • Regional settings expect different delimiter
Solutions:
  • Convert to comma: sed 's/;/,/g' file.csv > file_comma.csv
  • Use Excel import wizard and specify ; delimiter
  • Open in Python/R with sep=';' parameter
