# Contributing to PROTÉGÉ PD

## Project Overview

PROTÉGÉ PD is an open-source tool for phylogenetic primer design based on the Phylotag approach. The project welcomes contributions from the bioinformatics and software development communities.

- **Repository:** https://github.com/ddelgadillod/ProtegePD
- **Maintainer:** Diego Delgadillo Duran
- **Contact:** [email protected]

PROTÉGÉ PD stands for PROTEin coding GEne for phylogenetic tag and identification - Primer Design tool.

## Ways to Contribute

### Reporting Issues

Help improve PROTÉGÉ PD by reporting bugs, documentation errors, or unexpected behavior. Before submitting an issue:

1. Search existing issues to avoid duplicates
2. Verify the issue with the latest version (v1.0.2)
3. Collect relevant information (error messages, input files, system details)

**Creating an effective issue report:**

````text
**Description:**
Clear, concise description of the issue

**Steps to Reproduce:**
1. Run command: `docker run ... protege-pd -s sequences.fna`
2. Observe error at alignment step
3. ...

**Expected Behavior:**
What should happen

**Actual Behavior:**
What actually happens

**Environment:**
- OS: Ubuntu 22.04 / macOS 13 / Windows 11
- Docker version: 24.0.5
- PROTÉGÉ PD version: v1.0.2
- Number of sequences: 150
- Average sequence length: 1200 bp

**Error Messages:**
```text
Paste error messages or logs here
```

**Additional Context:**
Any other relevant information
````

<Accordion title="Example: Good Bug Report">
  **Title:** MUSCLE alignment fails with >500 sequences

  **Description:**
  When processing datasets with more than 500 sequences, the MUSCLE alignment step crashes with a memory error.

  **Steps to Reproduce:**
  1. Prepare FASTA file with 550 sequences (avg length 1500 bp)
  2. Run: `docker run --rm --mount type=bind,source=$(pwd),target=/root/. --name protege -p 127.0.0.1:8050:8050 --cpus 4 ddelgadillo/protege_base:v1.0.2 protege-pd -s large_dataset.fna`
  3. Wait for alignment step

  **Expected:** Alignment completes successfully  
  **Actual:** Container crashes with "Killed" message

  **Environment:**
  - OS: Ubuntu 22.04
  - Docker: 24.0.5 (8GB RAM allocated)
  - PROTÉGÉ: v1.0.2
  - Dataset: 550 sequences, 1500 bp average

  **Logs:**
  ```text
  Alignment length 520
  Killed
  ```
</Accordion>

### Suggesting Features

Propose new features or enhancements to existing functionality.

**Feature request template:**

```text
**Feature Description:**
What functionality would you like to see?

**Use Case:**
Why is this feature valuable? What problem does it solve?

**Proposed Implementation:**
(Optional) How might this be implemented?

**Alternatives Considered:**
What other solutions have you considered?

**Priority:**
How critical is this feature for your workflow?
```

<Note>
  Feature requests should align with PROTÉGÉ PD's core mission: phylogenetic primer design for protein-coding genes.
</Note>

### Improving Documentation

Documentation improvements are highly valuable:

- Fix typos or unclear explanations
- Add examples for common use cases
- Improve installation instructions
- Add troubleshooting tips
- Translate documentation (if applicable)

**To contribute documentation:**
1. Fork the repository
2. Edit README.md or create new documentation files
3. Submit a pull request with a clear description of the changes

## Development Setup

### Prerequisites

**Required software:**
- Python 3.10+
- Git
- Docker (for testing containerized version)
- Text editor or IDE (VSCode, PyCharm, etc.)

**Optional tools:**
- `black` for code formatting
- `pylint` for linting
- `pytest` for testing

### Clone the Repository

```bash
git clone https://github.com/ddelgadillod/ProtegePD.git
cd ProtegePD
```

### Local Installation

**Install MUSCLE binary:**

```bash
chmod +x muscle/muscle_lin  # Linux
# Or download appropriate version for your OS
```

**Install Python dependencies:**

```bash
pip3 install -r requirements.txt
```

**Key dependencies:**
- `biopython==1.83` - Sequence analysis and MUSCLE integration
- `dash==2.14.2` - Web interface framework
- `pandas==2.2.0` - Data manipulation
- `plotly==5.18.0` - Interactive visualizations
- `numpy==1.26.3` - Numerical operations
- `scipy==1.12.0` - Statistical computations

### Running Locally

```bash
# Make scripts executable
chmod +x protege.py phl.py

# Run with test data
./protege.py -s test_files/test.fasta

# Or use Python directly
python3 protege.py -s test_files/test.fasta
```

**Test with provided datasets:**

```bash
# Small test dataset
python3 protege.py -s test_files/test.fasta

# Medium dataset (gyrB gene)
python3 protege.py -s test_files/gyrB_genes.fas -c 90 -d 7

# Aligned sequences
python3 protege.py -s test_files/aligned_by_codons.fas
```

The `test_files/` directory contains sample datasets for development and testing. These range from small test files to larger gyrB gene datasets.

## Code Structure

### Main Components

**`protege.py`** is the main application file. It contains:

- Command-line argument parsing
- Sequence reading and translation
- MUSCLE alignment execution
- Consensus calculation algorithm
- Dash web application setup
- Interactive plotting callbacks

**Key functions in `protege.py`:**
- Sequence translation: Lines 132-161
- MUSCLE alignment: Lines 179-183
- Consensus calculation: Lines 246-300
- Degeneracy computation: Lines 305-323
- Dash callbacks: Lines 584-823

**`phl.py`** is a helper module with primer analysis classes and functions:

- **`primerDeg` class** (lines 16-193): Primer degeneracy calculations
  - `primerCheck()`: Validate primer sequence
  - `primerNP()`: Count possible primers
  - `primerComb()`: Generate all primer combinations
  - `TmWallace()`, `TmAp2()`, `TmAp3()`, `TmNN()`: Melting temperature calculations
- **Plotting functions:**
  - `posDegScatter()`: Forward primer scatter plot (lines 196-231)
  - `zoomDegScatter()`: Reverse primer scatter plot (lines 234-264)
- **Utility functions:**
  - `filterDF()`: Filter by degeneracy range (lines 269-272)
  - `degEquivalent()`: Convert nucleotide combinations to IUPAC codes (lines 321-333)

### Directory Structure

```text
ProtegePD/
├── protege.py          # Main application
├── phl.py              # Primer analysis module
├── Dockerfile          # Container configuration
├── requirements.txt    # Python dependencies
├── install.sh          # Local installation script
├── README.md           # Project documentation
├── assets/             # Web interface assets (CSS, images)
├── muscle/
│   └── muscle_lin      # MUSCLE alignment binary (Linux)
├── src/                # Source code (alternative location)
│   ├── main.py
│   ├── protege.py
│   └── phl.py
└── test_files/         # Test datasets
    ├── test.fasta
    ├── gyrB_genes.fas
    ├── aligned_by_codons.fas
    └── *.fas (various test cases)
```

### Algorithm Overview

**Primer design workflow:**

1. **Sequence Input** (`protege.py:132-144`)
   - Read FASTA file
   - Translate nucleotide sequences to amino acids
   - Store both nucleotide and amino acid sequences

2. **Multiple Sequence Alignment** (`protege.py:179-183`)
   - Align amino acid sequences using MUSCLE
   - Command: `muscle_lin -in translated.fas -out aligned.fas`

3. **Back-translation** (`protege.py:207-235`)
   - Map amino acid alignment to nucleotide codons
   - Preserve alignment gaps in nucleotide sequences

4. **Consensus Calculation** (`protege.py:246-300`)
   - Calculate nucleotide frequency at each position
   - Apply consensus threshold (default 90%)
   - Generate IUPAC degeneracy codes for ambiguous positions

5. **Primer Candidate Generation** (`protege.py:305-323`)
   - Slide window across consensus (default 21 bp = 7 codons)
   - Calculate degeneracy for each candidate
   - Filter candidates with gaps

6. **Melting Temperature Analysis** (`phl.py:162-193`)
   - Four calculation methods in `phl.py`: `TmWallace()`, `TmAp2()`, `TmAp3()`, and `TmNN()` (Wallace, GC-based, and Bio.SeqUtils approaches)
   - Generate distribution for all primer combinations

7. **Interactive Visualization** (`protege.py:437-577`)
   - Dash web interface on port 8050
   - Plotly scatter plots for primer selection
   - Real-time Tm distribution updates
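
The consensus and candidate-generation steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code in `protege.py`: the names `consensus_symbol`, `consensus`, and `primer_candidates` are hypothetical, the rule for choosing bases (add the most frequent symbols until the threshold is covered) is only one plausible reading of the threshold step, the window is assumed to slide one base at a time, windows that contain a gap are assumed to be skipped, and the input is assumed to contain only `A`, `C`, `G`, `T`, and `-`.

```python
from collections import Counter

# IUPAC codes keyed by the sorted set of bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
# Number of concrete bases each IUPAC code represents.
DEGENERACY = {code: len(bases) for bases, code in IUPAC.items()}


def consensus_symbol(column, threshold=90):
    """Return an IUPAC code covering at least `threshold` percent of a column.

    Assumption: symbols are added from most to least frequent until their
    cumulative share reaches the threshold. If a gap is needed to reach it,
    the position is reported as "-".
    """
    counts = Counter(column.upper())
    total = sum(counts.values())
    chosen, covered = [], 0
    for symbol, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        chosen.append(symbol)
        covered += count
        if 100 * covered >= threshold * total:
            break
    if "-" in chosen:
        return "-"
    return IUPAC["".join(sorted(chosen))]


def consensus(alignment, threshold=90):
    """Build a consensus string from equal-length aligned sequences."""
    return "".join(
        consensus_symbol("".join(column), threshold) for column in zip(*alignment)
    )


def primer_candidates(consensus_seq, codons=7):
    """Yield (start, window, degeneracy) for gap-free windows of codons * 3 bp."""
    width = codons * 3
    for start in range(len(consensus_seq) - width + 1):
        window = consensus_seq[start:start + width]
        if "-" in window:
            continue  # assumption: candidates overlapping a gap are dropped
        degeneracy = 1
        for code in window:
            degeneracy *= DEGENERACY[code]
        yield start, window, degeneracy
```

With `codons=7` the window is 21 bp, matching the default described above. Under these assumptions, `consensus(["ATGA", "ATGC", "ATGA", "ACGA"])` yields `AYGM`, and lowering the threshold to 75 yields `ATGA`.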

## Making Code Contributions

### Development Workflow

1. **Fork the repository**
   - Visit https://github.com/ddelgadillod/ProtegePD
   - Click "Fork" button (top right)

2. **Create a feature branch**
   ```bash
   git checkout -b feature/your-feature-name
   # or
   git checkout -b bugfix/issue-description
   ```

3. **Make your changes**
   - Write clean, documented code
   - Follow existing code style
   - Add comments for complex logic

4. **Test your changes**
   ```bash
   # Test with multiple datasets
   python3 protege.py -s test_files/test.fasta
   python3 protege.py -s test_files/gyrB_genes.fas

   # Test different parameters
   python3 protege.py -s test_files/test.fasta -c 80 -d 6
   ```

5. **Commit your changes**
   ```bash
   git add .
   git commit -m "Add feature: brief description"

   # Good commit messages:
   # "Fix MUSCLE alignment memory error for large datasets"
   # "Add support for custom melting temperature thresholds"
   # "Update documentation for Windows installation"
   ```

6. **Push to your fork**
   ```bash
   git push origin feature/your-feature-name
   ```

7. **Create a Pull Request**
   - Visit your fork on GitHub
   - Click "New Pull Request"
   - Provide a clear description of changes

### Pull Request Guidelines

**PR description template:**

```text
## Description
Clear description of what this PR does

## Changes
- List of specific changes made
- Modified files and why

## Testing
- What tests were performed
- Test datasets used
- Expected vs actual results

## Related Issues
Fixes #123 (if applicable)

## Checklist
- [ ] Code follows existing style
- [ ] Comments added for complex logic
- [ ] Tested with multiple datasets
- [ ] Documentation updated (if needed)
- [ ] No breaking changes (or documented if necessary)
```

<Warning>
  **Breaking changes** (changes that break backward compatibility) should be clearly documented and discussed with maintainers before implementation.
</Warning>

### Code Style Guidelines

**Python style:**
- Follow PEP 8 conventions
- Use descriptive variable names
- Add docstrings to functions and classes
- Keep functions focused (single responsibility)

**Example:**

```python
# Good
def calculate_consensus_position(nucleotide_list, threshold=90, allow_gaps=True):
    """
    Calculate consensus nucleotide at a specific alignment position.

    Args:
        nucleotide_list: List of nucleotides at this position
        threshold: Minimum percentage for consensus (default 90)
        allow_gaps: Whether to consider gaps in consensus (default True)

    Returns:
        str: Consensus nucleotide or IUPAC degeneracy code
    """
    # Implementation...
    pass

# Avoid
def calc(nl, t=90, g=True):
    # Implementation without documentation
    pass
```

**Comments:**
- Explain why, not what
- Document complex algorithms
- Reference research papers for scientific methods

```python
# Good
# Use Wallace's rule of thumb: Tm = 4(G+C) + 2(A+T)
# More accurate for primers 14-20 bp (Rychlik, 1990)
temp = 4 * gc_count + 2 * at_count

# Less helpful
# Calculate temperature
temp = 4 * gc_count + 2 * at_count
```
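
To show how the docstring and comment guidelines combine in practice, here is a small runnable sketch built around the Wallace formula quoted above. The function name `wallace_tm` and its input validation are hypothetical and chosen for illustration. It is not the `TmWallace()` method in `phl.py`, whose signature and behavior may differ.

```python
def wallace_tm(primer):
    """Estimate the melting temperature (deg C) of a primer with the Wallace rule.

    Args:
        primer: Non-degenerate primer sequence containing only A, C, G, T.

    Returns:
        int: Tm computed as 4 * (G + C) + 2 * (A + T).

    Raises:
        ValueError: If the primer contains any other character.
    """
    seq = primer.upper()
    invalid = set(seq) - set("ACGT")
    if invalid:
        # Degenerate codes have no single Tm; expand them before calling this.
        raise ValueError(f"Unsupported bases for the Wallace rule: {sorted(invalid)}")
    gc_count = seq.count("G") + seq.count("C")
    at_count = seq.count("A") + seq.count("T")
    # Wallace rule of thumb: each G/C pair contributes 4 deg C, each A/T pair 2.
    return 4 * gc_count + 2 * at_count
```

For a 16-mer with eight G/C and eight A/T bases, such as `ATGCATGCATGCATGC`, this returns 48.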

## Testing

### Manual Testing

**Test with provided datasets:**

```bash
# Small dataset (quick test)
python3 protege.py -s test_files/test.fasta

# Verify outputs created:
ls -la sequences.csv alSequences.csv protege_consensus.csv

# Check web interface:
# Open browser to http://127.0.0.1:8050
# Verify plots render correctly
```

**Test parameter variations:**

```bash
# Lower consensus threshold
python3 protege.py -s test_files/test.fasta -c 80

# Different codon length
python3 protege.py -s test_files/test.fasta -d 6

# Disable gap consensus
python3 protege.py -s test_files/test.fasta -g

# Verbose output
python3 protege.py -s test_files/test.fasta -v
```
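
The output check in the manual test can also be scripted. The sketch below is a hypothetical helper (`missing_outputs` is not part of the repository). It assumes the three CSV files named in the `ls` command are written to the directory the tool was started from, and it treats an empty file as a failure.

```python
from pathlib import Path

# Output names taken from the `ls` check in the manual test.
EXPECTED_OUTPUTS = ("sequences.csv", "alSequences.csv", "protege_consensus.csv")


def missing_outputs(run_dir=".", expected=EXPECTED_OUTPUTS):
    """Return the expected output files that are absent or empty in `run_dir`."""
    base = Path(run_dir)
    return [
        name
        for name in expected
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_outputs()
    print("All outputs present" if not missing else f"Missing or empty: {missing}")
```

Run it from a second terminal once the pipeline has reached the web-interface stage. An empty list means every expected file exists and is non-empty.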


### Docker Testing

**Test containerized version:**

```bash
# Build local image
docker build -t protege-test:latest .

# Run with test data
cd test_files
docker run --rm \
  --mount type=bind,source=$(pwd),target=/root/. \
  --name protege-test -p 127.0.0.1:8050:8050 --cpus 4 \
  protege-test:latest \
  protege-pd -s test.fasta

# Verify container behavior (from a second terminal, while the container
# is still running; --rm removes the container as soon as it exits)
docker logs protege-test
docker stats protege-test
```

## Contribution Areas

### High-Priority Improvements

#### Performance Optimization

**Opportunities:**
- Parallelize consensus calculation for large alignments
- Optimize degeneracy computation (currently O(n²))
- Implement caching for temperature calculations
- Add progress bars for long-running operations

**Implementation notes:**
- Use `multiprocessing` for CPU-bound tasks
- Consider `numba` for numerical computations
- Maintain backward compatibility with existing parameters
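
As one possible starting point for the caching idea above, the sketch below memoizes a per-sequence Tm calculation with `functools.lru_cache` while expanding a degenerate primer into its concrete sequences. All names (`expand_primer`, `cached_tm`, `tm_distribution`) are hypothetical and do not mirror the existing `phl.py` API. The Wallace rule stands in for whichever Tm method is actually being cached.

```python
from functools import lru_cache
from itertools import product

# Bases represented by each IUPAC nucleotide code.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_primer(degenerate_primer):
    """Yield every concrete sequence encoded by a degenerate primer."""
    choices = [IUPAC_BASES[code] for code in degenerate_primer.upper()]
    for bases in product(*choices):
        yield "".join(bases)


@lru_cache(maxsize=None)
def cached_tm(sequence):
    """Wallace-rule Tm, memoized so a repeated sequence is computed only once."""
    gc_count = sequence.count("G") + sequence.count("C")
    return 4 * gc_count + 2 * (len(sequence) - gc_count)


def tm_distribution(degenerate_primer):
    """Return the sorted Tm values for all combinations of a degenerate primer."""
    return sorted(cached_tm(seq) for seq in expand_primer(degenerate_primer))
```

The Wallace rule is cheap, so the benefit here is mostly illustrative. Caching pays off more for costlier methods and for overlapping windows that re-evaluate the same sequences. `maxsize=None` leaves the cache unbounded, so a bounded size may be preferable for highly degenerate primers.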

#### Enhanced Primer Analysis

**Opportunities:**
- Add primer specificity checking (BLAST integration)
- Calculate GC clamps and secondary structures
- Implement primer dimer detection
- Add primer3 integration for comprehensive analysis

**Implementation notes:**
- Keep as optional features (don't break core workflow)
- Consider external dependencies carefully
- Provide clear documentation for new features

#### Improved User Interface

**Opportunities:**
- Add primer pair validation visualization
- Implement alignment viewer
- Add export options (PDF, PNG for plots)
- Improve mobile responsiveness

**Implementation notes:**
- Use existing Plotly/Dash components
- Test on multiple browsers
- Maintain clean, minimal design aesthetic

#### Cross-Platform Support

**Opportunities:**
- Add MUSCLE binaries for macOS and Windows
- Create native installation options (pip package)
- Improve Windows path handling
- Add conda distribution

**Implementation notes:**
- Test on all target platforms
- Provide platform-specific documentation
- Maintain Docker as primary distribution method

### Documentation Needs

- **Tutorial videos** - Walkthrough of complete workflow
- **Use case examples** - Real-world phylogenetic studies
- **API documentation** - For using modules programmatically
- **Troubleshooting guide expansion** - More edge cases
- **Comparative analysis** - PROTÉGÉ PD vs other primer design tools

## Community Guidelines

### Code of Conduct

- Be respectful and constructive in discussions
- Welcome newcomers and help them get started
- Focus on the scientific merit and technical quality
- Credit others' contributions appropriately
- Maintain professional communication

### Getting Help

**For development questions:**
- Open a GitHub Discussion (preferred for general questions)
- Email maintainer: [email protected]
- Reference relevant code sections with line numbers

**For scientific questions:**
- Refer to Phylotag paper: Caro-Quintero, 2015
- Discuss phylogenetic primer design principles
- Share use cases and results

### Recognition

Contributors will be recognized in:
- GitHub contributors list
- Future release notes
- README acknowledgments section (for significant contributions)

All contributions, big or small, are valuable. Whether you fix a typo, report a bug, or implement a major feature, thank you for helping improve PROTÉGÉ PD!

## License

By contributing to PROTÉGÉ PD, you agree that your contributions will be licensed under the same license as the project. Please verify the current license in the repository's LICENSE file before contributing.
