Running Pipeline in Stages
AlphaFold 3 can be executed in stages, separating the CPU-intensive data pipeline from the GPU-intensive inference. This enables optimal resource utilization and efficient reuse of computed MSAs and templates.
Overview
The complete AlphaFold 3 workflow consists of two main stages:
Data Pipeline (CPU-only)
Generate Multiple Sequence Alignments (MSAs) and search for structural templates using genetic databases.
Resource Requirements:
- High CPU utilization
- Significant RAM (64+ GB recommended)
- Fast disk I/O (SSD recommended)
- No GPU required
Featurization & Inference (GPU)
Convert processed data into features and run the neural network model to predict structures.
Resource Requirements:
- GPU (A100 80GB or H100 80GB recommended)
- Moderate CPU
- Moderate RAM
Why Run in Stages?
Run the expensive data pipeline on cheaper CPU-only machines, then move to GPU machines only for inference.
# On CPU machine (no GPU needed)
python run_alphafold.py \
--json_path=input.json \
--output_dir=output \
--norun_inference
# On GPU machine (reuse computed MSA/templates)
python run_alphafold.py \
--json_path=output/fold_input.json \
--output_dir=output \
--norun_data_pipeline
Compute MSAs once, then run multiple inference variations (different seeds, added ligands, partner chains).
# Compute MSAs once
python run_alphafold.py \
--json_path=protein_a.json \
--norun_inference
# Run multiple inferences with different seeds
for seed in 1 2 3 4 5; do
  # Write a copy of the augmented JSON with the new seed (jq shown as one option)
  jq --argjson s "[$seed]" '.modelSeeds = $s' protein_a_with_msa.json \
    > "protein_a_seed_${seed}.json"
  python run_alphafold.py \
    --json_path="protein_a_seed_${seed}.json" \
    --output_dir="output_seed_${seed}" \
    --norun_data_pipeline
done
Compute MSAs for individual chains once, then efficiently fold all pairwise combinations. See the Batch Processing guide for details.
Stage 1: Data Pipeline Only
Run the data pipeline without inference:
python run_alphafold.py \
--json_path=input.json \
--output_dir=output \
--norun_inference
What Happens
Input Parsing
The input JSON is parsed and validated.
MSA Generation
For each protein chain:
- Jackhmmer searches against UniRef90, MGnify, small BFD, and UniProt
- Paired and unpaired MSAs are generated
For each RNA chain:
- Nhmmer searches against NT-RNA, Rfam, RNACentral
- Unpaired MSAs are generated
Template Search
For each protein chain:
- Hmmsearch against the PDB sequence database (pdb_seqres)
- Top templates are selected and processed
Output Generation
Augmented JSON with MSAs and templates is written to:
output/fold_<job_name>_input.json
Output Structure
After data pipeline:
output/
└── fold_my_protein_input.json # Augmented JSON with MSAs and templates
The augmented JSON includes:
{
"name": "my_protein",
"sequences": [
{
"protein": {
"id": "A",
"sequence": "MQIFVKTLTGKTITLEVEPS",
"unpairedMsa": ">query\nMQIFVKTLTGKTITLEVEPS\n>hit1\n...",
"pairedMsa": ">query\nMQIFVKTLTGKTITLEVEPS\n...",
"templates": [
{
"mmcif": "data_template\n...",
"queryIndices": [0, 1, 2, ...],
"templateIndices": [0, 1, 2, ...]
}
]
}
}
],
"modelSeeds": [42],
"dialect": "alphafold3",
"version": 4
}
This augmented JSON can be used directly as input for inference-only runs.
Stage 2: Inference Only
Run inference using pre-computed MSAs and templates:
python run_alphafold.py \
--json_path=output/fold_my_protein_input.json \
--output_dir=output \
--norun_data_pipeline
Requirements
The input JSON must contain pre-computed MSAs and templates:
- For protein chains: unpairedMsa, pairedMsa, and templates must be set
- For RNA chains: unpairedMsa must be set
- Empty strings are valid (for MSA-free or template-free predictions)
- null values will cause an error
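These checks can be automated before launching an inference-only run. The sketch below is illustrative (not part of AlphaFold 3); the field names follow the augmented JSON format shown earlier, and `missing_fields` is a hypothetical helper:

```python
import json

# Fields that must be present (empty is fine, null is not) before an
# inference-only run, per chain type, as listed above.
REQUIRED = {"protein": ("unpairedMsa", "pairedMsa", "templates"),
            "rna": ("unpairedMsa",)}

def missing_fields(fold_input: dict) -> list:
    """Return 'chain_id: field' strings for absent or null fields."""
    problems = []
    for entry in fold_input.get("sequences", []):
        for chain_type, chain in entry.items():
            for field in REQUIRED.get(chain_type, ()):
                if chain.get(field) is None:  # covers missing and null
                    problems.append(f"{chain.get('id', '?')}: {field}")
    return problems

doc = {"sequences": [
    {"protein": {"id": "A", "sequence": "MQ", "unpairedMsa": "",
                 "pairedMsa": "", "templates": []}},
    {"rna": {"id": "B", "sequence": "AUGC"}}]}
print(missing_fields(doc))  # ['B: unpairedMsa']
```

An empty return value means every chain has the fields the inference-only run expects.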
What Happens
Input Validation
Verifies that all required MSA and template fields are present.
Featurization
Converts sequences, MSAs, and templates into neural network input features.
Model Inference
Runs the AlphaFold 3 model on GPU for each specified seed.
Output Generation
Generates prediction outputs (CIF files, confidence metrics, JSON summaries).
Output Structure
After inference:
output/
├── fold_my_protein_input.json
├── fold_my_protein_model.cif # Predicted structure
├── fold_my_protein_summary_confidences.json
└── fold_my_protein_data.json
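When several inference-only runs have finished (for example across seeds), their summaries can be compared programmatically. A hedged sketch: `best_prediction` is an illustrative helper, and the `ranking_score` key is an assumption about the summary_confidences.json layout that may differ between releases:

```python
import glob
import json

def best_prediction(pattern: str) -> tuple:
    """Return (path, score) for the highest-scoring summary JSON.

    'ranking_score' is assumed here; adjust the key if your release
    names the overall ranking metric differently.
    """
    scored = []
    for path in glob.glob(pattern):
        with open(path) as fh:
            scored.append((path, json.load(fh)["ranking_score"]))
    return max(scored, key=lambda item: item[1])

# e.g. best_prediction("output_seed_*/fold_*_summary_confidences.json")
```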
Advanced: Pre-computing for Multimers
For efficient multimer screening, compute MSAs for individual chains once, then combine them:
Step 1: Compute Individual MSAs
Step 2: Create Dimer JSONs
Step 3: Run Inference Only
# Chain A
python run_alphafold.py \
--json_path=chain_a.json \
--output_dir=msas \
--norun_inference
# Chain B
python run_alphafold.py \
--json_path=chain_b.json \
--output_dir=msas \
--norun_inference
# Chain C
python run_alphafold.py \
--json_path=chain_c.json \
--output_dir=msas \
--norun_inference
Combine MSAs from individual chains:
{
"name": "dimer_ab",
"sequences": [
{
"protein": {
"id": "A",
"sequence": "...",
"unpairedMsa": "<from fold_chain_a_input.json>",
"pairedMsa": "<from fold_chain_a_input.json>",
"templates": []
}
},
{
"protein": {
"id": "B",
"sequence": "...",
"unpairedMsa": "<from fold_chain_b_input.json>",
"pairedMsa": "<from fold_chain_b_input.json>",
"templates": []
}
}
],
"modelSeeds": [42],
"dialect": "alphafold3",
"version": 4
}
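Assembling these dimer JSONs by hand scales poorly. A hedged Python sketch that reuses per-chain augmented entries; `load_chain`, `make_dimer`, and the inline chain data are illustrative, not part of AlphaFold 3:

```python
import json

def load_chain(path: str) -> dict:
    """Extract the single augmented protein entry from a per-chain
    fold_<name>_input.json produced by a data-pipeline-only run."""
    with open(path) as fh:
        return json.load(fh)["sequences"][0]["protein"]

def make_dimer(name: str, chain_a: dict, chain_b: dict, seed: int = 42) -> dict:
    """Assemble a two-chain input JSON that reuses pre-computed MSAs.

    Templates are cleared to match the example above; keep them if
    you want template-based dimer predictions.
    """
    a, b = dict(chain_a), dict(chain_b)
    a["id"], b["id"] = "A", "B"
    a["templates"], b["templates"] = [], []
    return {"name": name,
            "sequences": [{"protein": a}, {"protein": b}],
            "modelSeeds": [seed],
            "dialect": "alphafold3",
            "version": 4}  # match the dialect version of your inputs

# Inline stand-ins for entries loaded via load_chain(...):
chain_a = {"id": "A", "sequence": "MQIF", "unpairedMsa": ">q\nMQIF\n", "pairedMsa": ""}
chain_b = {"id": "A", "sequence": "LEVE", "unpairedMsa": ">q\nLEVE\n", "pairedMsa": ""}
dimer = make_dimer("dimer_ab", chain_a, chain_b)
print([s["protein"]["id"] for s in dimer["sequences"]])  # ['A', 'B']
```

Looping `make_dimer` over all chain pairs (e.g. with `itertools.combinations`) generates every dimer JSON for the inference-only runs above.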
# AB dimer
python run_alphafold.py \
--json_path=dimer_ab.json \
--output_dir=dimers \
--norun_data_pipeline
# AC dimer
python run_alphafold.py \
--json_path=dimer_ac.json \
--output_dir=dimers \
--norun_data_pipeline
# BC dimer
python run_alphafold.py \
--json_path=dimer_bc.json \
--output_dir=dimers \
--norun_data_pipeline
With this approach you need only 3 single-chain data pipeline runs plus 3 inference-only runs, instead of computing each chain's MSA twice across 3 full runs.
Combinatorial Efficiency
For n first chains and m second chains:
- Without stages: n × m full runs
- With stages: n + m data pipeline runs + n × m inference runs
For 10 chains × 10 chains:
- Without stages: 100 full runs
- With stages: 20 data pipeline + 100 inference (much faster!)
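A quick sketch of the arithmetic (the helper name is illustrative):

```python
def run_counts(n: int, m: int) -> dict:
    """Runs needed to fold every (first chain, second chain) pair."""
    return {"full_runs": n * m,
            "staged_pipeline_runs": n + m,
            "staged_inference_runs": n * m}

print(run_counts(10, 10))
# {'full_runs': 100, 'staged_pipeline_runs': 20, 'staged_inference_runs': 100}
```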
MSA-Free and Template-Free Modes
You can skip data pipeline stages by providing empty MSAs/templates:
Completely MSA-Free
{
"protein": {
"id": "A",
"sequence": "MQIFVKTLTGKTITLEVEPS",
"unpairedMsa": "",
"pairedMsa": "",
"templates": []
}
}
Run with --norun_data_pipeline.
Template-Free Only
{
"protein": {
"id": "A",
"sequence": "MQIFVKTLTGKTITLEVEPS",
"templates": []
}
}
Run normally (will compute MSAs but not templates).
Custom MSA, No Templates
{
"protein": {
"id": "A",
"sequence": "MQIFVKTLTGKTITLEVEPS",
"unpairedMsa": "<your MSA>",
"pairedMsa": "",
"templates": []
}
}
Run with --norun_data_pipeline.
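For the custom-MSA case, pasting a large alignment into JSON by hand is error-prone. A hedged sketch: `set_custom_msa` is an illustrative helper, and it assumes the MSA file is already in the FASTA/A3M text form the unpairedMsa field expects:

```python
import json

def set_custom_msa(input_path: str, output_path: str, msa_path: str,
                   chain_index: int = 0) -> None:
    """Copy an input JSON, pasting an MSA file's text into unpairedMsa.

    Targets a protein chain; pairedMsa is left empty and templates
    disabled, matching the example above.
    """
    with open(input_path) as fh:
        doc = json.load(fh)
    with open(msa_path) as fh:
        msa_text = fh.read()
    protein = doc["sequences"][chain_index]["protein"]
    protein["unpairedMsa"] = msa_text
    protein["pairedMsa"] = ""
    protein["templates"] = []
    with open(output_path, "w") as fh:
        json.dump(doc, fh, indent=2)
```

The resulting file can then be run with --norun_data_pipeline as shown above.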
Performance Notes
From performance.md:70-84:
Data pipeline runtime varies significantly based on:
- Input size
- Number of homologous sequences
- Available hardware (CPU cores, disk speed)
- Database size and sharding
For deep MSAs, Jackhmmer/Nhmmer may need substantial RAM beyond 64 GB.
Optimization Tips
Use Fast Storage
Place databases on fast SSD or RAM-backed filesystem:
# Create RAM disk (Linux)
sudo mkdir /mnt/ramdisk
sudo mount -t tmpfs -o size=512G tmpfs /mnt/ramdisk
cp -r /path/to/databases/* /mnt/ramdisk/
Increase Parallelization
python run_alphafold.py \
--json_path=input.json \
--jackhmmer_n_cpu=8 \
--nhmmer_n_cpu=8 \
--norun_inference
Complete Example Workflow
# Step 1: Data pipeline only (on CPU machine)
python run_alphafold.py \
--json_path=input.json \
--output_dir=pipeline_output \
--db_dir=/databases \
--norun_inference
# Transfer output to GPU machine
scp pipeline_output/fold_my_protein_input.json gpu_machine:/inference_input/
# Step 2: Inference only (on GPU machine)
python run_alphafold.py \
--json_path=/inference_input/fold_my_protein_input.json \
--output_dir=/inference_output \
--model_dir=/models \
--norun_data_pipeline
# Step 3: Change the seed (edit modelSeeds in the JSON) and run again, reusing MSAs
python run_alphafold.py \
--json_path=/inference_input/fold_my_protein_input.json \
--output_dir=/inference_output_seed2 \
--model_dir=/models \
--norun_data_pipeline
Code Reference
From performance.md:3-18:
## Running the Pipeline in Stages
The `run_alphafold.py` script can be executed in stages to optimise
resource utilisation. This can be useful for:
1. Splitting the CPU-only data pipeline from model inference (which
requires a GPU), to optimise cost and resource usage.
2. Generating the JSON output file from the data pipeline only run
and then using it for multiple different inference only runs
across seeds or across variations of other features.
3. Generating the JSON output for multiple individual monomer chains,
then running the inference on all possible chain pairs.