## Project Overview

PROTÉGÉ PD (PROTEin coding GEne for phylogenetic tag and identification - Primer Design tool) is an open-source tool for phylogenetic primer design based on the Phylotag approach. The project welcomes contributions from the bioinformatics and software development communities.

- **Repository:** https://github.com/ddelgadillod/ProtegePD
- **Maintainer:** Diego Delgadillo Duran
- **Contact:** [email protected]

## Ways to Contribute

### Reporting Issues

Help improve PROTÉGÉ PD by reporting bugs, documentation errors, or unexpected behavior.

**Before submitting an issue:**
1. Search existing issues to avoid duplicates
2. Verify the issue with the latest version (v1.0.2)
3. Collect relevant information (error messages, input files, system details)

**Creating an effective issue report:**

**Description:**
Clear, concise description of the issue

**Steps to Reproduce:**
1. Run command: `docker run ... protege-pd -s sequences.fna`
2. Observe error at alignment step
3. ...

**Expected Behavior:**
What should happen

**Actual Behavior:**
What actually happens

**Environment:**
- OS: Ubuntu 22.04 / macOS 13 / Windows 11
- Docker version: 24.0.5
- PROTÉGÉ PD version: v1.0.2
- Number of sequences: 150
- Average sequence length: 1200 bp

**Error Messages:**
```text
Paste error messages or logs here
```

**Additional Context:**
Any other relevant information

<Accordion title="Example: Good Bug Report">
  **Title:** MUSCLE alignment fails with >500 sequences

  **Description:**
  When processing datasets with more than 500 sequences, the MUSCLE alignment step crashes with a memory error.

  **Steps to Reproduce:**
  1. Prepare FASTA file with 550 sequences (avg length 1500 bp)
  2. Run: `docker run --rm --mount type=bind,source=$(pwd),target=/root/. --name protege -p 127.0.0.1:8050:8050 --cpus 4 ddelgadillo/protege_base:v1.0.2 protege-pd -s large_dataset.fna`
  3. Wait for alignment step

  **Expected:** Alignment completes successfully  
  **Actual:** Container crashes with "Killed" message

  **Environment:**
  - OS: Ubuntu 22.04
  - Docker: 24.0.5 (8GB RAM allocated)
  - PROTÉGÉ: v1.0.2
  - Dataset: 550 sequences, 1500 bp average

  **Logs:**
  ```text
  Alignment length 520
  Killed
  ```
</Accordion>

### Suggesting Features

Propose new features or enhancements to existing functionality.

**Feature request template:**

```text
**Feature Description:**
What functionality would you like to see?

**Use Case:**
Why is this feature valuable? What problem does it solve?

**Proposed Implementation:**
(Optional) How might this be implemented?

**Alternatives Considered:**
What other solutions have you considered?

**Priority:**
How critical is this feature for your workflow?
```

<Note>
  Feature requests should align with PROTÉGÉ PD's core mission: phylogenetic primer design for protein-coding genes.
</Note>

### Improving Documentation

Documentation improvements are highly valuable:

- Fix typos or unclear explanations
- Add examples for common use cases
- Improve installation instructions
- Add troubleshooting tips
- Translate documentation (if applicable)

**To contribute documentation:**
1. Fork the repository
2. Edit README.md or create new documentation files
3. Submit a pull request with a clear description of your changes

## Development Setup

### Prerequisites

**Required software:**
- Python 3.10+
- Git
- Docker (for testing containerized version)
- Text editor or IDE (VSCode, PyCharm, etc.)

**Optional tools:**
- `black` for code formatting
- `pylint` for linting
- `pytest` for testing

### Clone the Repository

```bash
git clone https://github.com/ddelgadillod/ProtegePD.git
cd ProtegePD
```

### Local Installation

**Install MUSCLE binary:**

```bash
chmod +x muscle/muscle_lin  # Linux
# Or download appropriate version for your OS
```

**Install Python dependencies:**

```bash
pip3 install -r requirements.txt
```

**Key dependencies:**
- `biopython==1.83` - Sequence analysis and MUSCLE integration
- `dash==2.14.2` - Web interface framework
- `pandas==2.2.0` - Data manipulation
- `plotly==5.18.0` - Interactive visualizations
- `numpy==1.26.3` - Numerical operations
- `scipy==1.12.0` - Statistical computations

### Running Locally

```bash
# Make scripts executable
chmod +x protege.py phl.py

# Run with test data
./protege.py -s test_files/test.fasta

# Or use Python directly
python3 protege.py -s test_files/test.fasta
```

**Test with provided datasets:**

```bash
# Small test dataset
python3 protege.py -s test_files/test.fasta

# Medium dataset (gyrB gene)
python3 protege.py -s test_files/gyrB_genes.fas -c 90 -d 7

# Aligned sequences
python3 protege.py -s test_files/aligned_by_codons.fas
```

The `test_files/` directory contains sample datasets for development and testing. These range from small test files to larger gyrB gene datasets.

## Code Structure

### Main Components

**`protege.py`** - Main application file containing:
- Command-line argument parsing
- Sequence reading and translation
- MUSCLE alignment execution
- Consensus calculation algorithm
- Dash web application setup
- Interactive plotting callbacks

**Key functions in `protege.py`:**
- Sequence translation: Lines 132-161
- MUSCLE alignment: Lines 179-183
- Consensus calculation: Lines 246-300
- Degeneracy computation: Lines 305-323
- Dash callbacks: Lines 584-823
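
To make the consensus step listed above more concrete, here is a minimal sketch of how a thresholded consensus at a single alignment column could be computed. It is not the code in `protege.py`. The function name `consensus_base`, the layout of the `IUPAC` table, the choice to ignore gaps, and the rule of adding bases from most to least frequent until the threshold is reached are all assumptions made for illustration.

```python
from collections import Counter

# IUPAC nucleotide codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus_base(column, threshold=90.0):
    """Return the IUPAC code covering at least `threshold` percent of a column.

    Hypothetical helper: gaps are ignored, and bases are added from most to
    least frequent until their combined share reaches the threshold.
    A column that contains only gaps returns "-".
    """
    bases = [b.upper() for b in column if b.upper() in "ACGT"]
    if not bases:
        return "-"
    counts = Counter(bases)
    chosen, covered = set(), 0
    for base, count in counts.most_common():
        chosen.add(base)
        covered += count
        if 100.0 * covered / len(bases) >= threshold:
            break
    return IUPAC[frozenset(chosen)]


print(consensus_base("AAAAAAAAAG"))  # 90% A meets the threshold -> A
print(consensus_base("AAAAAGGGGG"))  # A and G are both needed -> R
```

Lowering the threshold makes the consensus less degenerate, which is the trade-off the `-c` option exposes on the command line.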
**`phl.py`** - Helper module with primer analysis classes and functions:
- `primerDeg` class (lines 16-193): Primer degeneracy calculations
  - `primerCheck()`: Validate primer sequence
  - `primerNP()`: Count possible primers
  - `primerComb()`: Generate all primer combinations
  - `TmWallace()`, `TmAp2()`, `TmAp3()`, `TmNN()`: Melting temperature calculations
- Plotting functions:
  - `posDegScatter()`: Forward primer scatter plot (lines 196-231)
  - `zoomDegScatter()`: Reverse primer scatter plot (lines 234-264)
- Utility functions:
  - `filterDF()`: Filter by degeneracy range (lines 269-272)
  - `degEquivalent()`: Convert nucleotide combinations to IUPAC codes (lines 321-333)
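
The degeneracy helpers listed above revolve around two operations: counting how many concrete primers a degenerate sequence encodes, and enumerating them. The following sketch illustrates both ideas with the standard IUPAC codes. The names `count_primers` and `expand_primer` are hypothetical and do not reproduce the signatures of `primerNP()` or `primerComb()` in `phl.py`.

```python
from itertools import product
from math import prod

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_primers(primer):
    """Degeneracy: number of distinct sequences a degenerate primer encodes."""
    return prod(len(IUPAC_BASES[base]) for base in primer.upper())


def expand_primer(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    options = [IUPAC_BASES[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*options)]


print(count_primers("ATGRYN"))  # 2 * 2 * 4 = 16
print(expand_primer("ARY"))     # ['AAC', 'AAT', 'AGC', 'AGT']
```

Because the count grows multiplicatively with each ambiguous position, counting first and expanding only when the count is small avoids building very large lists.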

### Directory Structure

```text
ProtegePD/
├── protege.py          # Main application
├── phl.py              # Primer analysis module
├── Dockerfile          # Container configuration
├── requirements.txt    # Python dependencies
├── install.sh          # Local installation script
├── README.md           # Project documentation
├── assets/             # Web interface assets (CSS, images)
├── muscle/
│   └── muscle_lin      # MUSCLE alignment binary (Linux)
├── src/                # Source code (alternative location)
│   ├── main.py
│   ├── protege.py
│   └── phl.py
└── test_files/         # Test datasets
    ├── test.fasta
    ├── gyrB_genes.fas
    ├── aligned_by_codons.fas
    └── *.fas (various test cases)
```

### Algorithm Overview

**Primer design workflow:**

1. **Sequence Input** (`protege.py:132-144`)
   - Read FASTA file
   - Translate nucleotide sequences to amino acids
   - Store both nucleotide and amino acid sequences

2. **Multiple Sequence Alignment** (`protege.py:179-183`)
   - Align amino acid sequences using MUSCLE
   - Command: `muscle_lin -in translated.fas -out aligned.fas`

3. **Back-translation** (`protege.py:207-235`)
   - Map amino acid alignment to nucleotide codons
   - Preserve alignment gaps in nucleotide sequences

4. **Consensus Calculation** (`protege.py:246-300`)
   - Calculate nucleotide frequency at each position
   - Apply consensus threshold (default 90%)
   - Generate IUPAC degeneracy codes for ambiguous positions

5. **Primer Candidate Generation** (`protege.py:305-323`)
   - Slide window across consensus (default 21 bp = 7 codons)
   - Calculate degeneracy for each candidate
   - Filter out candidates that contain gaps

6. **Melting Temperature Analysis** (`phl.py:162-193`)
   - Four calculation methods (`TmWallace()`, `TmAp2()`, `TmAp3()`, `TmNN()`) spanning the Wallace rule, GC-based formulas, and Bio.SeqUtils
   - Generate distribution for all primer combinations

7. **Interactive Visualization** (`protege.py:437-577`)
   - Dash web interface on port 8050
   - Plotly scatter plots for primer selection
   - Real-time Tm distribution updates
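
Step 3 is the least obvious part of the workflow, so here is a minimal sketch of back-translation under stated assumptions: each unaligned nucleotide sequence is in frame, its length is a multiple of three, and a gap in the protein alignment becomes a three-character gap in the codon alignment. The function name `back_translate` is hypothetical and is not taken from `protege.py`.

```python
def back_translate(aligned_protein, nucleotides, gap="-"):
    """Thread the codons of `nucleotides` onto an aligned protein sequence.

    Each residue consumes the next codon; each alignment gap is written
    as three gap characters so the codon frame is preserved.
    """
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if residues != len(codons):
        raise ValueError("protein and nucleotide lengths do not match")
    codon_iter = iter(codons)
    return "".join(
        gap * 3 if aa == gap else next(codon_iter) for aa in aligned_protein
    )


print(back_translate("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
```

Aligning at the protein level and then threading codons back keeps gaps on codon boundaries, which is what allows the default 21 bp window to correspond to exactly 7 codons.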

## Making Code Contributions

### Development Workflow

1. **Fork the repository**
   - Visit https://github.com/ddelgadillod/ProtegePD
   - Click "Fork" button (top right)

2. **Create a feature branch**
   ```bash
   git checkout -b feature/your-feature-name
   # or
   git checkout -b bugfix/issue-description
   ```

3. **Make your changes**
   - Write clean, documented code
   - Follow existing code style
   - Add comments for complex logic

4. **Test your changes**
   ```bash
   # Test with multiple datasets
   python3 protege.py -s test_files/test.fasta
   python3 protege.py -s test_files/gyrB_genes.fas

   # Test different parameters
   python3 protege.py -s test_files/test.fasta -c 80 -d 6
   ```

5. **Commit your changes**
   ```bash
   git add .
   git commit -m "Add feature: brief description"

   # Good commit messages:
   # "Fix MUSCLE alignment memory error for large datasets"
   # "Add support for custom melting temperature thresholds"
   # "Update documentation for Windows installation"
   ```

6. **Push to your fork**
   ```bash
   git push origin feature/your-feature-name
   ```

7. **Create a Pull Request**
   - Visit your fork on GitHub
   - Click "New Pull Request"
   - Provide a clear description of changes

### Pull Request Guidelines

**PR description template:**

```text
## Description
Clear description of what this PR does

## Changes
- List of specific changes made
- Modified files and why

## Testing
- What tests were performed
- Test datasets used
- Expected vs actual results

## Related Issues
Fixes #123 (if applicable)

## Checklist
- [ ] Code follows existing style
- [ ] Comments added for complex logic
- [ ] Tested with multiple datasets
- [ ] Documentation updated (if needed)
- [ ] No breaking changes (or documented if necessary)
```

<Warning>
  **Breaking changes** (changes that break backward compatibility) should be clearly documented and discussed with maintainers before implementation.
</Warning>

### Code Style Guidelines

**Python style:**
- Follow PEP 8 conventions
- Use descriptive variable names
- Add docstrings to functions and classes
- Keep functions focused (single responsibility)

**Example:**

```python
# Good
def calculate_consensus_position(nucleotide_list, threshold=90, allow_gaps=True):
    """
    Calculate consensus nucleotide at a specific alignment position.
    
    Args:
        nucleotide_list: List of nucleotides at this position
        threshold: Minimum percentage for consensus (default 90)
        allow_gaps: Whether to consider gaps in consensus (default True)
    
    Returns:
        str: Consensus nucleotide or IUPAC degeneracy code
    """
    # Implementation...
    pass

# Avoid
def calc(nl, t=90, g=True):
    # Implementation without documentation
    pass
```

**Comments:**
- Explain why, not what
- Document complex algorithms
- Reference research papers for scientific methods

```python
# Good
# Use Wallace's rule of thumb: Tm = 4(G+C) + 2(A+T)
# More accurate for primers 14-20 bp (Rychlik, 1990)
temp = 4 * gc_count + 2 * at_count

# Less helpful
# Calculate temperature
temp = 4 * gc_count + 2 * at_count
```
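
For reference, the Wallace rule quoted in the comment above can be written as a small self-contained function. This is an illustrative sketch: `wallace_tm` is a hypothetical name and is not the `TmWallace()` method in `phl.py`. It assumes a non-degenerate primer containing only A, C, G, and T.

```python
def wallace_tm(primer):
    """Estimate melting temperature in degrees Celsius with the Wallace rule.

    Tm = 4 * (G + C) + 2 * (A + T); intended for short oligonucleotides.
    """
    seq = primer.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("primer must contain only A, C, G, and T")
    gc_count = seq.count("G") + seq.count("C")
    at_count = seq.count("A") + seq.count("T")
    return 4 * gc_count + 2 * at_count


print(wallace_tm("ATGCATGCATGCATGC"))  # 8 G/C and 8 A/T -> 48
```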

## Testing

### Manual Testing

**Test with provided datasets:**

```bash
# Small dataset (quick test)
python3 protege.py -s test_files/test.fasta

# Verify outputs created:
ls -la sequences.csv alSequences.csv protege_consensus.csv

# Check web interface:
# Open browser to http://127.0.0.1:8050
# Verify plots render correctly
```

**Test parameter variations:**

```bash
# Lower consensus threshold
python3 protege.py -s test_files/test.fasta -c 80

# Different codon length
python3 protege.py -s test_files/test.fasta -d 6

# Disable gap consensus
python3 protege.py -s test_files/test.fasta -g

# Verbose output
python3 protege.py -s test_files/test.fasta -v
```

**Testing checklist:**
- [ ] Application starts without errors
- [ ] MUSCLE alignment completes successfully
- [ ] Output files created (`sequences.csv`, `alSequences.csv`, `protege_consensus.csv`)
- [ ] Web interface loads at http://127.0.0.1:8050
- [ ] Scatter plots display data points
- [ ] Primer selection updates temperature distribution
- [ ] CSV download works
- [ ] No Python warnings or errors in console
- [ ] Tested on multiple datasets (small, medium, large)
- [ ] Parameter variations work as expected
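
Some of these checks can be automated. The sketch below verifies that the three CSV files named above exist and are non-empty after a run. The helper name `missing_outputs` and the assumption that the files are written to the current working directory are illustrative, not documented behavior of PROTÉGÉ PD.

```python
from pathlib import Path

# Output files named in the manual testing steps.
EXPECTED_OUTPUTS = ("sequences.csv", "alSequences.csv", "protege_consensus.csv")


def missing_outputs(directory="."):
    """Return the expected output files that are absent or empty."""
    base = Path(directory)
    return [
        name for name in EXPECTED_OUTPUTS
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All expected output files are present")
```

Run the script from the directory where `protege.py` was executed. The browser-based items in the list still need a manual check.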

### Docker Testing

**Test containerized version:**

```bash
# Build local image
docker build -t protege-test:latest .

# Run with test data
cd test_files
docker run --rm \
  --mount type=bind,source=$(pwd),target=/root/. \
  --name protege-test -p 127.0.0.1:8050:8050 --cpus 4 \
  protege-test:latest \
  protege-pd -s test.fasta

# Verify container behavior (run from a second terminal while the container is up)
docker logs protege-test
docker stats protege-test
```

## Contribution Areas

### High-Priority Improvements

**Performance**

Opportunities:
- Parallelize consensus calculation for large alignments
- Optimize degeneracy computation (currently O(n²))
- Implement caching for temperature calculations
- Add progress bars for long-running operations

Implementation notes:
- Use multiprocessing for CPU-bound tasks
- Consider numba for numerical computations
- Maintain backward compatibility with existing parameters
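
As one possible starting point for the caching idea above, Python's `functools.lru_cache` can memoize a pure melting temperature function so that repeated primers are computed only once. The function `cached_tm` and its Wallace-rule body are placeholders for illustration; the real calculations live in `phl.py`.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_tm(primer):
    """Wallace-rule Tm, memoized per primer string (placeholder calculation)."""
    gc_count = primer.count("G") + primer.count("C")
    at_count = primer.count("A") + primer.count("T")
    return 4 * gc_count + 2 * at_count


primers = ["ATGCATGC", "GGGCCCAT", "ATGCATGC"]
temps = [cached_tm(p) for p in primers]
print(temps)                   # [24, 28, 24]
print(cached_tm.cache_info())  # reports one hit for the repeated primer
```

Memoization only helps when the same primer string recurs, for example when overlapping degenerate candidates expand to shared sequences, so it is worth measuring the hit rate before adopting it.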

**Primer analysis**

Opportunities:
- Add primer specificity checking (BLAST integration)
- Calculate GC clamps and secondary structures
- Implement primer dimer detection
- Add primer3 integration for comprehensive analysis

Implementation notes:
- Keep as optional features (don't break core workflow)
- Consider external dependencies carefully
- Provide clear documentation for new features

**User interface**

Opportunities:
- Add primer pair validation visualization
- Implement alignment viewer
- Add export options (PDF, PNG for plots)
- Improve mobile responsiveness

Implementation notes:
- Use existing Plotly/Dash components
- Test on multiple browsers
- Maintain clean, minimal design aesthetic

**Platform support**

Opportunities:
- Add MUSCLE binaries for macOS and Windows
- Create native installation options (pip package)
- Improve Windows path handling
- Add conda distribution

Implementation notes:
- Test on all target platforms
- Provide platform-specific documentation
- Maintain Docker as primary distribution method

### Documentation Needs

- **Tutorial videos** - Walkthrough of complete workflow
- **Use case examples** - Real-world phylogenetic studies
- **API documentation** - For using modules programmatically
- **Troubleshooting guide expansion** - More edge cases
- **Comparative analysis** - PROTÉGÉ PD vs other primer design tools

## Community Guidelines

### Code of Conduct

- Be respectful and constructive in discussions
- Welcome newcomers and help them get started
- Focus on the scientific merit and technical quality
- Credit others' contributions appropriately
- Maintain professional communication

### Getting Help

**For development questions:**
- Open a GitHub Discussion (preferred for general questions)
- Email maintainer: [email protected]
- Reference relevant code sections with line numbers

**For scientific questions:**
- Refer to Phylotag paper: Caro-Quintero, 2015
- Discuss phylogenetic primer design principles
- Share use cases and results

### Recognition

Contributors will be recognized in:
- GitHub contributors list
- Future release notes
- README acknowledgments section (for significant contributions)

All contributions, big or small, are valuable. Whether you fix a typo, report a bug, or implement a major feature, thank you for helping improve PROTÉGÉ PD!

## License

By contributing to PROTÉGÉ PD, you agree that your contributions will be licensed under the same license as the project. Please verify the current license in the repository's LICENSE file before contributing.
