Docker Configuration

Dockerfile Architecture

PROTÉGÉ PD’s Docker configuration is built on Ubuntu 22.04 and provides a complete Python 3.10 environment with all necessary bioinformatics dependencies.

Base Image and Environment

The Dockerfile starts with Ubuntu 22.04 LTS and sets up three key environment variables:

FROM ubuntu:22.04

ENV SRC /usr/local/src/
ENV BIN /usr/local/bin/
ENV HOME /root/

These environment variables define standard locations for source code (/usr/local/src/), binaries (/usr/local/bin/), and the home directory (/root/).

System Dependencies

The container installs essential system packages:

RUN apt-get update && apt-get install -y python3.10 python3-pip
RUN apt-get install -y git
RUN apt-get install -y vim
RUN apt-get install -y htop
RUN apt-get install -y wget

Installed packages:

python3.10 - Python interpreter for the application; 3.10 is the default Python 3 on Ubuntu 22.04
python3-pip - Package manager for Python dependencies
git - Version control for cloning the repository
vim - Text editor for container debugging
htop - Process monitoring tool
wget - Download utility for MUSCLE binary

MUSCLE Binary Installation

PROTÉGÉ PD requires MUSCLE (Multiple Sequence Comparison by Log-Expectation) for protein sequence alignment:

WORKDIR $BIN
RUN wget https://github.com/ddelgadillod/ProtegePD/raw/main/muscle/muscle_lin
RUN chmod +x muscle_lin

The muscle_lin binary is the Linux build of MUSCLE v3.8.31. Because the working directory is $BIN, wget saves it directly into /usr/local/bin/, and chmod +x makes it executable, so muscle_lin can be called from any directory.
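One way to confirm the result of this step from a shell inside the container is to check that the command resolves. The helper below is a minimal sketch; the name check_tool is hypothetical and not part of PROTÉGÉ PD.

```shell
#!/bin/sh
# check_tool NAME: succeed if the shell can resolve NAME as a command,
# printing where it was found; fail with a message on stderr otherwise.
# Hypothetical helper for illustration; not part of PROTÉGÉ PD.
check_tool() {
    tool_path=$(command -v "$1") || {
        echo "$1: not found on PATH" >&2
        return 1
    }
    echo "$1 -> $tool_path"
}

# Inside the container you would run, for example:
#   check_tool muscle_lin && muscle_lin -version
```

If the download or the chmod step had failed during the build, the first command would report the problem before any alignment is attempted.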

Application Setup

The Dockerfile clones the repository and installs all Python dependencies:

WORKDIR $SRC
RUN git clone https://github.com/ddelgadillod/ProtegePD
RUN pip3 install -r ProtegePD/requirements.txt
RUN cp ProtegePD/*.py $BIN
RUN cp -r ProtegePD/assets $BIN
RUN cp -r ProtegePD/test_files $HOME

Python dependencies (from requirements.txt):

biopython 1.83 - Core bioinformatics library for sequence manipulation
dash 2.14.2 - Web application framework for the GUI
pandas 2.2.0 - Data manipulation and analysis
plotly 5.18.0 - Interactive visualization library
numpy 1.26.3 - Numerical computing
scipy 1.12.0 - Scientific computing algorithms
Flask 3.0.1 - Web server backend

The setup also creates a symbolic link for easy command access:

WORKDIR $BIN
RUN ln -s protege.py protege-pd
WORKDIR $HOME

Container Runtime Configuration

Port Mapping

The application runs a Dash web server on port 8050:

-p 127.0.0.1:8050:8050

Port Mapping Breakdown

127.0.0.1 - Binds to localhost only (security best practice)
8050 (first) - Host machine port
8050 (second) - Container internal port

This configuration ensures the web interface is only accessible from your local machine, not from external networks.

Changing 127.0.0.1 to 0.0.0.0 publishes the port on every network interface of the host, so any machine that can reach the host can open the application. Only do this if you understand the security implications.
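The three parts of the mapping can be assembled in a small wrapper that also flags the risky bind address. This is a sketch only; publish_flag is a hypothetical helper, not something shipped with PROTÉGÉ PD or Docker.

```shell
#!/bin/sh
# publish_flag BIND_IP HOST_PORT CONTAINER_PORT
# Prints the -p argument for docker run and warns on stderr when the bind
# address would publish the port on every host interface.
# Hypothetical helper for illustration; not part of PROTÉGÉ PD.
publish_flag() {
    if [ "$1" = "0.0.0.0" ]; then
        echo "warning: port $2 will be reachable from other machines" >&2
    fi
    printf '%s %s:%s:%s\n' -p "$1" "$2" "$3"
}

publish_flag 127.0.0.1 8050 8050   # prints: -p 127.0.0.1:8050:8050
```

Only the host port needs to change if 8050 is already taken on your machine: publish_flag 127.0.0.1 8080 8050 would serve the interface at http://127.0.0.1:8080 while the container keeps listening on 8050.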

Volume Mounts

PROTÉGÉ PD uses bind mounts to access your FASTA files:

--mount type=bind,source=/your/files/path/,target=/root/.

Bind mount parameters:

type=bind - Creates a direct mount of a host directory
source - Absolute path on your host machine (e.g., /home/user/data/)
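target - Mount location inside the container (/root/.)

Because /root/ is also the container's working directory, a file in the mounted directory can be passed to protege-pd by name alone: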

docker run --rm \
  --mount type=bind,source=/home/user/sequences/,target=/root/. \
  --name protege -p 127.0.0.1:8050:8050 --cpus 4 \
  ddelgadillo/protege_base:v1.0.2 \
  protege-pd -s myseqs.fna
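Since the source must be an absolute path, a wrapper can derive it from wherever the FASTA file lives. The following is a minimal sketch under that assumption; run_cmd_for is a hypothetical helper that only prints the command, and it does not handle paths containing spaces or commas.

```shell
#!/bin/sh
# run_cmd_for FASTA_PATH
# Prints a docker run command that bind-mounts the file's directory (as an
# absolute path) at /root/. and passes the bare file name to protege-pd.
# Hypothetical helper for illustration; it prints the command, it does not run it.
run_cmd_for() (
    # Resolve the directory to an absolute path; fail if it does not exist.
    dir=$(cd "$(dirname "$1")" && pwd) || exit 1
    file=$(basename "$1")
    printf 'docker run --rm --mount type=bind,source=%s/,target=/root/. ' "$dir"
    printf '%s ' '--name protege -p 127.0.0.1:8050:8050 --cpus 4'
    printf 'ddelgadillo/protege_base:v1.0.2 protege-pd -s %s\n' "$file"
)

# Example: run_cmd_for ./sequences/myseqs.fna
```

Printing the command first makes it easy to inspect the resolved mount path before running anything.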

CPU Allocation

The --cpus flag caps the CPU time the container may use, expressed as a number of cores:

--cpus 4

Recommended CPU allocation:

Small datasets (under 50 sequences): 2 CPUs
Medium datasets (50-200 sequences): 4 CPUs
Large datasets (over 200 sequences): 6-8 CPUs

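MUSCLE alignment is CPU-intensive. MUSCLE v3.8.31 runs each alignment on a single thread, so additional cores help only where other work, such as the Dash web server, runs alongside it.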
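The recommendations above can be turned into a small script that counts the records in a FASTA file and picks a value for --cpus. This is a sketch: both function names are hypothetical, and returning 6 for large datasets (the lower end of the 6-8 range) is an arbitrary choice.

```shell
#!/bin/sh
# count_sequences FASTA: number of records, counted as lines starting with '>'.
count_sequences() {
    grep -c '^>' "$1"
}

# recommend_cpus N: CPU count following the recommendations above.
# For more than 200 sequences it returns 6, the lower end of the 6-8 range
# (an assumption of this sketch).
recommend_cpus() {
    if [ "$1" -lt 50 ]; then
        echo 2
    elif [ "$1" -le 200 ]; then
        echo 4
    else
        echo 6
    fi
}

# Example:
#   docker run ... --cpus "$(recommend_cpus "$(count_sequences myseqs.fna)")" ...
```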

Container Cleanup

The --rm flag automatically removes the container after it stops:

--rm

This prevents accumulation of stopped containers and saves disk space.

Building Custom Images

You can build a custom image with modifications to the codebase:

Clone and Modify

git clone https://github.com/ddelgadillod/ProtegePD.git
cd ProtegePD
# Make your modifications to the code

Build Custom Image

docker build -t protege-custom:latest .

Command options:

-t protege-custom:latest - Names the image protege-custom and gives it the tag latest
. - Build context (current directory); Docker looks for the Dockerfile here by default

Run Custom Image

docker run --rm \
  --mount type=bind,source="$(pwd)/data",target=/root/. \
  --name protege -p 127.0.0.1:8050:8050 --cpus 4 \
  protege-custom:latest \
  protege-pd -s sequences.fna

Environment Variables

While PROTÉGÉ PD doesn’t require custom environment variables, you can pass them if needed:

docker run --rm \
  -e CONSENSUS_PERCENTAGE=85 \
  -e CODON_LENGTH=6 \
  --mount type=bind,source=/path/to/data,target=/root/. \
  --name protege -p 127.0.0.1:8050:8050 --cpus 4 \
  ddelgadillo/protege_base:v1.0.2 \
  protege-pd -s myseqs.fna -c 85 -d 6

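The application does not read these variables; the values that take effect are the ones passed on the command line (-c 85 -d 6 in the example above). Setting them with -e is useful only for scripts and automation that run inside the container.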
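For automation on the host, a wrapper can translate such variables into command-line options. The sketch below assumes that CONSENSUS_PERCENTAGE corresponds to -c and CODON_LENGTH to -d, as the example above suggests; protege_args is a hypothetical helper.

```shell
#!/bin/sh
# protege_args FASTA
# Prints the protege-pd arguments, adding -c and -d only when the matching
# environment variable is set and non-empty. The variable-to-option mapping
# is an assumption of this sketch, taken from the example above.
protege_args() {
    printf '%s' "-s $1"
    printf '%s' "${CONSENSUS_PERCENTAGE:+ -c $CONSENSUS_PERCENTAGE}"
    printf '%s\n' "${CODON_LENGTH:+ -d $CODON_LENGTH}"
}

# Example (the substitution is left unquoted so it splits into words):
#   docker run ... ddelgadillo/protege_base:v1.0.2 protege-pd $(protege_args myseqs.fna)
```

When neither variable is set, only -s is passed and protege-pd falls back to its own defaults.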

Container Resource Monitoring

Check Container Resources

While PROTÉGÉ PD is running, monitor resource usage:

docker stats protege

This shows real-time CPU, memory, network, and disk I/O statistics.

Access Container Shell

For debugging, you can access a shell inside the running container:

docker exec -it protege bash

Useful Container Commands

# Check MUSCLE version
muscle_lin -version

# View Python packages
pip3 list

# Check file locations
ls -la /usr/local/bin/

# Monitor processes
htop

Multi-Stage Builds (Advanced)

For a smaller image size, you can create a multi-stage Dockerfile:

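# Build stage: fetch the source, the Python dependencies, and the MUSCLE binary
FROM ubuntu:22.04 AS builder
ENV SRC /usr/local/src/
RUN apt-get update && apt-get install -y python3.10 python3-pip git wget
WORKDIR $SRC
RUN git clone https://github.com/ddelgadillod/ProtegePD
RUN pip3 install --user -r ProtegePD/requirements.txt
# Download MUSCLE here so the production stage does not need wget
RUN wget https://github.com/ddelgadillod/ProtegePD/raw/main/muscle/muscle_lin
RUN chmod +x muscle_lin

# Production stage: runtime files only, without git, wget, or pip
FROM ubuntu:22.04
ENV BIN /usr/local/bin/
ENV HOME /root/
# python3 adds the /usr/bin/python3 command, which python3.10 alone does not provide
RUN apt-get update && apt-get install -y python3.10 python3
COPY --from=builder /root/.local /root/.local
COPY --from=builder /usr/local/src/muscle_lin $BIN
COPY --from=builder /usr/local/src/ProtegePD/*.py $BIN/
COPY --from=builder /usr/local/src/ProtegePD/assets $BIN/assets
# Same symbolic link as in the single-stage Dockerfile
WORKDIR $BIN
RUN ln -s protege.py protege-pd
WORKDIR $HOME
CMD ["protege-pd", "--help"]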

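Multi-stage builds reduce image size but make debugging harder: the production stage contains no git, wget, pip, vim, or htop, so commands that rely on them are unavailable inside the container. Use this approach only if disk space is critical.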
