
Dockerfile Architecture

PROTÉGÉ PD’s Docker configuration is built on Ubuntu 22.04 and provides a complete Python 3.10 environment with all necessary bioinformatics dependencies.

Base Image and Environment

The Dockerfile starts with Ubuntu 22.04 LTS and sets up three key environment variables:
FROM ubuntu:22.04

ENV SRC /usr/local/src/
ENV BIN /usr/local/bin/
ENV HOME /root/
These environment variables define standard locations for source code (/usr/local/src/), binaries (/usr/local/bin/), and the home directory (/root/).

System Dependencies

The container installs essential system packages:
RUN apt-get update && apt-get install -y python3.10 python3-pip
RUN apt-get install -y git
RUN apt-get install -y vim
RUN apt-get install -y htop
RUN apt-get install -y wget
Installed packages:
  • python3.10 - Required Python version for Biopython compatibility
  • python3-pip - Package manager for Python dependencies
  • git - Version control for cloning the repository
  • vim - Text editor for container debugging
  • htop - Process monitoring tool
  • wget - Download utility for MUSCLE binary

MUSCLE Binary Installation

PROTÉGÉ PD requires MUSCLE (Multiple Sequence Comparison by Log-Expectation) for protein sequence alignment:
WORKDIR $BIN
RUN wget https://github.com/ddelgadillod/ProtegePD/raw/main/muscle/muscle_lin
RUN chmod +x muscle_lin
The muscle_lin binary is the Linux build of MUSCLE v3.8.31. The Dockerfile downloads it into /usr/local/bin/ and marks it executable, so it can be called from any directory in the container.

Application Setup

The Dockerfile clones the repository and installs all Python dependencies:
WORKDIR $SRC
RUN git clone https://github.com/ddelgadillod/ProtegePD
RUN pip3 install -r ProtegePD/requirements.txt
RUN cp ProtegePD/*.py $BIN
RUN cp -r ProtegePD/assets $BIN
RUN cp -r ProtegePD/test_files $HOME
Python dependencies (from requirements.txt):
  • biopython 1.83 - Core bioinformatics library for sequence manipulation
  • dash 2.14.2 - Web application framework for the GUI
  • pandas 2.2.0 - Data manipulation and analysis
  • plotly 5.18.0 - Interactive visualization library
  • numpy 1.26.3 - Numerical computing
  • scipy 1.12.0 - Scientific computing algorithms
  • Flask 3.0.1 - Web server backend
The setup also creates a symbolic link for easy command access:
WORKDIR $BIN
RUN ln -s protege.py protege-pd
WORKDIR $HOME

Container Runtime Configuration

Port Mapping

The application runs a Dash web server on port 8050:
-p 127.0.0.1:8050:8050
  • 127.0.0.1 - Binds to localhost only (security best practice)
  • 8050 (first) - Host machine port
  • 8050 (second) - Container internal port
This configuration ensures the web interface is only accessible from your local machine, not from external networks.
Changing 127.0.0.1 to 0.0.0.0 would expose the application to your entire network. Only do this if you understand the security implications.
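If port 8050 is already taken on the host, only the first port in the mapping needs to change. The following is a minimal sketch of a helper that assembles the value for -p; the function name publish_arg is hypothetical and not part of PROTÉGÉ PD.

```shell
# publish_arg [HOST_IP] [HOST_PORT] [CONTAINER_PORT]
# Prints a value for docker's -p flag. Defaults match the mapping shown above.
# (Hypothetical helper for illustration only.)
publish_arg() (
  ip=${1:-127.0.0.1}
  host_port=${2:-8050}
  container_port=${3:-8050}
  if [ "$ip" = "0.0.0.0" ]; then
    # 0.0.0.0 publishes the port on every host interface
    echo "warning: the interface will be reachable from other machines" >&2
  fi
  printf '%s:%s:%s\n' "$ip" "$host_port" "$container_port"
)

# Example (not executed here): serve the interface on host port 9000.
#   docker run --rm -p "$(publish_arg 127.0.0.1 9000)" \
#     --mount type=bind,source=/your/files/path/,target=/root/. \
#     ddelgadillo/protege_base:v1.0.2 protege-pd -s myseqs.fna
```

With this mapping the interface would be opened at http://127.0.0.1:9000 on the host, while the application inside the container still listens on 8050.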

Volume Mounts

PROTÉGÉ PD uses bind mounts to access your FASTA files:
--mount type=bind,source=/your/files/path/,target=/root/.
Bind mount parameters:
  • type=bind - Creates a direct mount of a host directory
  • source - Absolute path on your host machine (e.g., /home/user/data/)
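  • target - Mount location inside the container (/root/.). While the mount is active it hides anything the image already stores in /root/, including the test_files directory copied there during the build.
A complete command that combines the bind mount with the other options: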
docker run --rm \
  --mount type=bind,source=/home/user/sequences/,target=/root/. \
  --name protege -p 127.0.0.1:8050:8050 --cpus 4 \
  ddelgadillo/protege_base:v1.0.2 \
  protege-pd -s myseqs.fna
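Docker rejects a --mount bind whose source is a relative path or does not exist, so it can help to check the directory before starting the container. The sketch below is one way to do that; mount_arg is a hypothetical helper name, and the target is fixed to /root/. as in the command above.

```shell
# mount_arg DIR
# Validates DIR and prints a value for docker's --mount flag.
# (Hypothetical helper for illustration only.)
mount_arg() (
  dir=$1
  case $dir in
    /*) ;;  # absolute path: acceptable
    *)
      echo "error: source must be an absolute path: $dir" >&2
      exit 1
      ;;
  esac
  if [ ! -d "$dir" ]; then
    echo "error: source directory does not exist: $dir" >&2
    exit 1
  fi
  printf 'type=bind,source=%s,target=/root/.\n' "$dir"
)

# Example (not executed here):
#   docker run --rm --mount "$(mount_arg /home/user/sequences)" \
#     --name protege -p 127.0.0.1:8050:8050 --cpus 4 \
#     ddelgadillo/protege_base:v1.0.2 protege-pd -s myseqs.fna
```

The helper does not handle paths that contain commas, because --mount uses commas to separate its fields.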

CPU Allocation

The --cpus flag limits CPU resources:
--cpus 4
Recommended CPU allocation:
  • Small datasets (under 50 sequences): 2 CPUs
  • Medium datasets (50-200 sequences): 4 CPUs
  • Large datasets (over 200 sequences): 6-8 CPUs
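MUSCLE alignment is CPU-intensive, so larger datasets warrant a higher limit. The --cpus value is an upper bound on the CPU time the container may use; it does not reserve cores for it.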
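The recommendations above can be applied automatically by counting the sequences in the input file. This is a rough sketch that assumes a FASTA file in which every sequence starts with a ">" header line; suggest_cpus is a hypothetical name, and for large datasets it returns 6, the lower end of the suggested 6-8 range.

```shell
# suggest_cpus FASTA_FILE
# Prints a --cpus value based on the number of FASTA header lines.
# (Hypothetical helper; thresholds follow the recommendations above.)
suggest_cpus() (
  file=$1
  if [ ! -r "$file" ]; then
    echo "error: cannot read $file" >&2
    exit 1
  fi
  # grep -c exits non-zero when nothing matches but still prints 0
  n=$(grep -c '^>' "$file" || true)
  if [ "$n" -lt 50 ]; then
    echo 2
  elif [ "$n" -le 200 ]; then
    echo 4
  else
    echo 6
  fi
)

# Example (not executed here):
#   docker run --rm --cpus "$(suggest_cpus /home/user/sequences/myseqs.fna)" \
#     --mount type=bind,source=/home/user/sequences/,target=/root/. \
#     --name protege -p 127.0.0.1:8050:8050 \
#     ddelgadillo/protege_base:v1.0.2 protege-pd -s myseqs.fna
```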

Container Cleanup

The --rm flag automatically removes the container after it stops:
--rm
This prevents accumulation of stopped containers and saves disk space.

Building Custom Images

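You can build a custom image with modifications to the codebase. Note that the Dockerfile described above runs git clone against the GitHub repository instead of copying files from the build context, so local edits reach the image only if you change that step, for example by cloning your own fork.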

Clone and Modify

git clone https://github.com/ddelgadillod/ProtegePD.git
cd ProtegePD
# Make your modifications to the code

Build Custom Image

docker build -t protege-custom:latest .
Command options:
  • -t protege-custom:latest - Tag the image with a name and version
  • . - Build context (current directory)
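Because the Dockerfile shown earlier clones https://github.com/ddelgadillod/ProtegePD during the build, one way to get your own changes into the image is to push them to a fork and point the clone step at it. The sketch below rewrites only the git clone line and prints the result; use_fork, the fork URL, and the Dockerfile.fork file name are all hypothetical.

```shell
# use_fork FORK_URL [DOCKERFILE]
# Prints DOCKERFILE with the upstream URL on the "git clone" line replaced
# by FORK_URL. Other lines (such as the MUSCLE download) are left unchanged.
# (Hypothetical helper for illustration only.)
use_fork() (
  fork_url=$1
  dockerfile=${2:-Dockerfile}
  sed "/git clone/ s#https://github.com/ddelgadillod/ProtegePD#${fork_url}#" "$dockerfile"
)

# Example (not executed here; the fork URL is a placeholder):
#   use_fork https://github.com/your-user/ProtegePD > Dockerfile.fork
#   docker build -f Dockerfile.fork -t protege-custom:latest .
```

An alternative with the same effect is to edit the Dockerfile by hand and replace the clone step with a COPY of the local files.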

Run Custom Image

docker run --rm \
  --mount type=bind,source="$(pwd)"/data,target=/root/. \
  --name protege -p 127.0.0.1:8050:8050 --cpus 4 \
  protege-custom:latest \
  protege-pd -s sequences.fna

Environment Variables

While PROTÉGÉ PD doesn’t require custom environment variables, you can pass them if needed:
docker run --rm \
  -e CONSENSUS_PERCENTAGE=85 \
  -e CODON_LENGTH=6 \
  --mount type=bind,source=/path/to/data,target=/root/. \
  --name protege -p 127.0.0.1:8050:8050 --cpus 4 \
  ddelgadillo/protege_base:v1.0.2 \
  protege-pd -s myseqs.fna -c 85 -d 6
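The application does not currently read these variables; in the command above, the -c 85 and -d 6 flags are what reach protege-pd. Setting the variables can still be convenient for wrapper scripts and automation.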
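A wrapper script can bridge the two by translating the variables into flags on the host. The sketch below assumes, as the example above suggests, that CONSENSUS_PERCENTAGE corresponds to -c and CODON_LENGTH to -d; protege_args is a hypothetical helper name.

```shell
# protege_args SEQUENCE_FILE
# Prints protege-pd arguments, adding -c and -d only when the matching
# environment variable is set. (Hypothetical helper; the variable-to-flag
# mapping is an assumption taken from the example above.)
protege_args() (
  args="-s $1"
  if [ -n "${CONSENSUS_PERCENTAGE:-}" ]; then
    args="$args -c $CONSENSUS_PERCENTAGE"
  fi
  if [ -n "${CODON_LENGTH:-}" ]; then
    args="$args -d $CODON_LENGTH"
  fi
  echo "$args"
)

# Example (not executed here). The command substitution is left unquoted
# on purpose so the shell splits it into separate arguments:
#   docker run --rm \
#     --mount type=bind,source=/path/to/data,target=/root/. \
#     --name protege -p 127.0.0.1:8050:8050 --cpus 4 \
#     ddelgadillo/protege_base:v1.0.2 \
#     protege-pd $(protege_args myseqs.fna)
```

Because the result is word-split, this approach does not work for file names that contain spaces.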

Container Resource Monitoring

Check Container Resources

While PROTÉGÉ PD is running, monitor resource usage:
docker stats protege
This shows real-time CPU, memory, network, and disk I/O statistics.

Access Container Shell

For debugging, you can access a shell inside the running container:
docker exec -it protege bash
# The remaining commands run inside the container shell
# Check MUSCLE version
muscle_lin -version

# View Python packages
pip3 list

# Check file locations
ls -la /usr/local/bin/

# Monitor processes
htop
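The manual checks above can be condensed into one function that reports which expected commands are missing from the PATH. This is a sketch meant to be pasted into the container shell; missing_tools is a hypothetical name, and the list of tools to pass is taken from the Dockerfile description earlier on this page.

```shell
# missing_tools NAME...
# Prints each NAME that is not found on the PATH and returns non-zero
# if any are missing. (Hypothetical helper for illustration only.)
missing_tools() (
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  exit "$status"
)

# Example, inside the container shell (not executed here):
#   missing_tools muscle_lin protege-pd python3 pip3 git wget
```

No output and a zero exit status mean that every listed command was found.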

Multi-Stage Builds (Advanced)

For a smaller image size, you can create a multi-stage Dockerfile:
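# Build stage: has the tools needed to fetch code, packages, and MUSCLE
FROM ubuntu:22.04 AS builder
ENV SRC /usr/local/src/
RUN apt-get update && apt-get install -y python3.10 python3-pip git wget
WORKDIR $SRC
RUN git clone https://github.com/ddelgadillod/ProtegePD
RUN pip3 install --user -r ProtegePD/requirements.txt
# Download MUSCLE here so the production stage does not need wget
RUN wget https://github.com/ddelgadillod/ProtegePD/raw/main/muscle/muscle_lin
RUN chmod +x muscle_lin

# Production stage: runtime files only
FROM ubuntu:22.04
ENV BIN /usr/local/bin/
ENV HOME /root/
# python3 supplies the /usr/bin/python3 command in case the scripts expect it
RUN apt-get update && apt-get install -y python3.10 python3
COPY --from=builder /root/.local /root/.local
COPY --from=builder /usr/local/src/muscle_lin $BIN/
COPY --from=builder /usr/local/src/ProtegePD/*.py $BIN/
COPY --from=builder /usr/local/src/ProtegePD/assets $BIN/assets
# Recreate the protege-pd command that CMD relies on
WORKDIR $BIN
RUN ln -s protege.py protege-pd
WORKDIR $HOME
CMD ["protege-pd", "--help"]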
The production stage leaves out git, wget, pip, vim, and htop, so the image is smaller but harder to debug from the inside. Use a multi-stage build only if disk space is critical.
