Overview
The pipeline.py module runs MSA (Multiple Sequence Alignment) generation and template search tools for AlphaFold 3. It processes protein and RNA chains to generate evolutionary information and structural templates needed for structure prediction.
DataPipeline Class
Main class that orchestrates MSA generation and template search.
class DataPipeline:
    def __init__(self, data_pipeline_config: DataPipelineConfig)
data_pipeline_config
DataPipelineConfig
required
Configuration specifying database paths, binary paths, and search parameters.
Methods
process
Main method to process a fold input through the data pipeline.
def process(
    self,
    fold_input: folding_input.Input
) -> folding_input.Input
fold_input
folding_input.Input
required
Input containing chains to process. MSA and template fields should be None or empty.
return
folding_input.Input
New Input with MSAs and templates populated for all chains.
Processing Logic:
- Protein chains: Runs Jackhmmer for MSA, Hmmsearch for templates
- RNA chains: Runs Nhmmer for MSA
- DNA chains: No processing (passed through)
- Ligands: No processing (passed through)
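The dispatch rules above can be sketched with stand-in types. The chain classes and the two search callables below are hypothetical placeholders, not the real folding_input API; the sketch only illustrates that protein and RNA chains are transformed while DNA chains and ligands are returned unchanged.

```python
import dataclasses
from typing import Callable, Optional, Sequence, Union


# Hypothetical stand-ins for the folding_input chain types.
@dataclasses.dataclass(frozen=True)
class ProteinChain:
    id: str
    sequence: str
    unpaired_msa: Optional[str] = None


@dataclasses.dataclass(frozen=True)
class RnaChain:
    id: str
    sequence: str
    unpaired_msa: Optional[str] = None


@dataclasses.dataclass(frozen=True)
class DnaChain:
    id: str
    sequence: str


@dataclasses.dataclass(frozen=True)
class Ligand:
    id: str
    ccd_ids: tuple


Chain = Union[ProteinChain, RnaChain, DnaChain, Ligand]


def process_chains(
    chains: Sequence[Chain],
    run_protein_search: Callable[[ProteinChain], ProteinChain],
    run_rna_search: Callable[[RnaChain], RnaChain],
) -> list:
    """Routes each chain to the matching search; other chains pass through."""
    processed = []
    for chain in chains:
        if isinstance(chain, ProteinChain):
            processed.append(run_protein_search(chain))
        elif isinstance(chain, RnaChain):
            processed.append(run_rna_search(chain))
        else:
            # DNA chains and ligands need no MSA or template search.
            processed.append(chain)
    return processed


def fake_search(chain):
    # Placeholder for a real search: attaches a query-only MSA.
    return dataclasses.replace(chain, unpaired_msa=f">query\n{chain.sequence}\n")


chains = [ProteinChain("A", "MKF"), RnaChain("B", "GGCU"), Ligand("C", ("ATP",))]
result = process_chains(chains, fake_search, fake_search)
```

In this sketch the ligand object comes back as the very same instance, which mirrors the "passed through" behaviour described in the list.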
process_protein_chain
def process_protein_chain(
    self,
    chain: folding_input.ProteinChain
) -> folding_input.ProteinChain
Processes a single protein chain to generate MSAs and templates.
chain
folding_input.ProteinChain
required
Protein chain to process.
return
folding_input.ProteinChain
Protein chain with populated unpaired_msa, paired_msa, and templates fields.
MSA Generation:
- UniRef90: 10,000 sequences max, e-value 1e-4
- Mgnify: 5,000 sequences max, e-value 1e-4
- Small BFD: 5,000 sequences max, e-value 1e-4
- UniProt (paired): 50,000 sequences max, e-value 1e-4
Template Search:
- Searches PDB using Hmmsearch with e-value 100
- Keeps at most 4 templates after filtering by release date and quality
- Returns templates with structures and alignments
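The date and count filtering can be illustrated with a small sketch. The TemplateHit record, the hit IDs, and the use of the lowest e-value as the "quality" criterion are all assumptions made for illustration; the real pipeline's hit type and quality filters may differ.

```python
import dataclasses
import datetime


@dataclasses.dataclass(frozen=True)
class TemplateHit:
    """Hypothetical hit record; the real pipeline's hit type may differ."""
    hit_id: str
    release_date: datetime.date
    e_value: float


def select_templates(hits, max_template_date, max_templates=4):
    """Drops hits released after the cutoff, then keeps the best few."""
    allowed = [h for h in hits if h.release_date <= max_template_date]
    # Assumption: a lower e-value stands in for "quality" in this sketch.
    allowed.sort(key=lambda h: h.e_value)
    return allowed[:max_templates]


# Made-up hits for illustration only.
hits = [
    TemplateHit("hit1", datetime.date(2019, 5, 1), 1e-30),
    TemplateHit("hit2", datetime.date(2022, 1, 15), 1e-50),  # after the cutoff
    TemplateHit("hit3", datetime.date(2020, 3, 9), 1e-10),
    TemplateHit("hit4", datetime.date(2018, 7, 2), 1e-20),
    TemplateHit("hit5", datetime.date(2021, 9, 30), 1e-5),   # exactly on the cutoff
    TemplateHit("hit6", datetime.date(2015, 2, 11), 10.0),
]
selected = select_templates(hits, datetime.date(2021, 9, 30))
selected_ids = [h.hit_id for h in selected]
```

Here hit2 is dropped despite having the best e-value because it was released after the cutoff, and hit6 is dropped because only four templates are kept.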
process_rna_chain
def process_rna_chain(
    self,
    chain: folding_input.RnaChain
) -> folding_input.RnaChain
Processes a single RNA chain to generate MSAs.
chain
folding_input.RnaChain
required
RNA chain to process.
return
folding_input.RnaChain
RNA chain with populated unpaired_msa field.
MSA Generation:
- NT-RNA: 10,000 sequences max, e-value 1e-3
- Rfam: 10,000 sequences max, e-value 1e-3
- RNAcentral: 10,000 sequences max, e-value 1e-3
DataPipelineConfig
Configuration dataclass specifying all pipeline settings.
@dataclasses.dataclass(frozen=True, slots=True, kw_only=True)
class DataPipelineConfig:
    # Binary paths
    jackhmmer_binary_path: str
    nhmmer_binary_path: str
    hmmalign_binary_path: str
    hmmsearch_binary_path: str
    hmmbuild_binary_path: str
    # Database paths
    small_bfd_database_path: str
    mgnify_database_path: str
    uniprot_cluster_annot_database_path: str
    uniref90_database_path: str
    ntrna_database_path: str
    rfam_database_path: str
    rna_central_database_path: str
    seqres_database_path: str
    pdb_database_path: str
    # Optional Z-values for sharded databases
    small_bfd_z_value: int | None = None
    mgnify_z_value: int | None = None
    uniprot_cluster_annot_z_value: int | None = None
    uniref90_z_value: int | None = None
    ntrna_z_value: int | None = None
    rfam_z_value: int | None = None
    rna_central_z_value: int | None = None
    # CPU configuration
    jackhmmer_n_cpu: int = 8
    jackhmmer_max_parallel_shards: int | None = None
    nhmmer_n_cpu: int = 8
    nhmmer_max_parallel_shards: int | None = None
    # Template search
    max_template_date: datetime.date
Binary Paths
jackhmmer_binary_path
Path to Jackhmmer binary for protein MSA search.
nhmmer_binary_path
Path to Nhmmer binary for RNA MSA search.
hmmalign_binary_path
Path to Hmmalign binary for aligning hits to query profile.
hmmsearch_binary_path
Path to Hmmsearch binary for template search.
hmmbuild_binary_path
Path to Hmmbuild binary for building HMM profiles.
Database Paths
small_bfd_database_path
Small BFD database path for protein MSA search.
mgnify_database_path
Mgnify database path for protein MSA search.
uniprot_cluster_annot_database_path
UniProt database path for protein paired MSA search.
uniref90_database_path
UniRef90 database path for MSA and template profile construction.
ntrna_database_path
NT-RNA database path for RNA MSA search.
rfam_database_path
Rfam database path for RNA MSA search.
rna_central_database_path
RNAcentral database path for RNA MSA search.
seqres_database_path
PDB sequence database path for template search.
pdb_database_path
PDB mmCIF files directory for template structures.
Z-values
Z-values represent database sizes for E-value calculation and must be set for sharded databases.
small_bfd_z_value
Database size in number of sequences for Small BFD.
mgnify_z_value
Database size in number of sequences for Mgnify.
uniprot_cluster_annot_z_value
Database size in number of sequences for UniProt.
uniref90_z_value
Database size in number of sequences for UniRef90.
ntrna_z_value
Database size in megabases for NT-RNA.
rfam_z_value
Database size in megabases for Rfam.
rna_central_z_value
Database size in megabases for RNAcentral.
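Why a Z-value matters for sharded databases can be shown with a short calculation. The sketch assumes the usual HMMER relation E-value = P-value × Z; the P-value, database size, and shard count are illustrative numbers, not real database statistics.

```python
import math


def e_value(p_value: float, z_value: float) -> float:
    """HMMER-style E-value: expected number of chance hits at this score."""
    return p_value * z_value


p = 1e-12                    # illustrative P-value of a single hit
full_db_size = 100_000_000   # illustrative total number of sequences
num_shards = 20
shard_size = full_db_size // num_shards

# If each shard search used only its own size as Z, the E-value would shrink...
e_shard = e_value(p, shard_size)
# ...whereas passing the full database size as Z gives the unsharded E-value.
e_full = e_value(p, full_db_size)

ratio = e_full / e_shard
```

With these numbers the per-shard E-value is 20 times smaller than the whole-database value, so a hit that would miss an e-value threshold on the full database could pass it on a shard. Setting the Z-value to the full database size keeps the threshold meaningful.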
CPU Configuration
jackhmmer_n_cpu
Number of CPUs for Jackhmmer. Going above 8 provides diminishing returns.
jackhmmer_max_parallel_shards
Maximum number of shards searched in parallel by Jackhmmer. If None, one Jackhmmer instance is run per shard.
nhmmer_n_cpu
Number of CPUs for Nhmmer. Going above 8 provides diminishing returns.
nhmmer_max_parallel_shards
Maximum number of shards searched in parallel by Nhmmer. If None, one Nhmmer instance is run per shard.
Template Configuration
max_template_date
Latest allowed template release date. Templates released after this date are filtered out.
Internal Functions
_get_protein_msa_and_templates
Cached function to avoid re-running MSA tools for identical sequences in homomers.
@functools.cache
def _get_protein_msa_and_templates(
    sequence: str,
    run_template_search: bool,
    uniref90_msa_config: msa_config.RunConfig,
    mgnify_msa_config: msa_config.RunConfig,
    small_bfd_msa_config: msa_config.RunConfig,
    uniprot_msa_config: msa_config.RunConfig,
    templates_config: msa_config.TemplatesConfig,
    pdb_database_path: str,
) -> tuple[msa.Msa, msa.Msa, templates_lib.Templates]
Returns unpaired MSA, paired MSA, and templates for a protein sequence.
_get_protein_templates
Cached function for template search only.
@functools.cache
def _get_protein_templates(
    sequence: str,
    input_msa_a3m: str,
    run_template_search: bool,
    templates_config: msa_config.TemplatesConfig,
    pdb_database_path: str,
) -> templates_lib.Templates
Searches for templates using provided MSA.
_get_rna_msa
Cached function for RNA MSA generation.
@functools.cache
def _get_rna_msa(
    sequence: str,
    nt_rna_msa_config: msa_config.NhmmerConfig,
    rfam_msa_config: msa_config.NhmmerConfig,
    rnacentral_msa_config: msa_config.NhmmerConfig,
) -> msa.Msa
Generates and deduplicates RNA MSA from three databases.
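Merging and deduplicating per-database results might look roughly like the sketch below. The helper names and the rule "keep the first occurrence of each aligned sequence" are assumptions for illustration; the real msa.Msa class may deduplicate differently.

```python
def parse_a3m(a3m: str) -> list:
    """Parses A3M text into (header, sequence) pairs."""
    entries = []
    header, chunks = None, []
    for line in a3m.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries


def merge_and_deduplicate(a3ms: list) -> str:
    """Concatenates several A3M strings, keeping each sequence only once."""
    seen = set()
    merged = []
    for a3m in a3ms:
        for header, sequence in parse_a3m(a3m):
            if sequence in seen:
                continue  # Same aligned sequence already kept from an earlier hit.
            seen.add(sequence)
            merged.append(f">{header}\n{sequence}\n")
    return "".join(merged)


# Made-up per-database results; every one starts with the query itself.
nt_rna = ">query\nGGCU\n>nt_hit\nGG-U\n"
rfam = ">query\nGGCU\n>rfam_hit\nGG-U\n>rfam_hit2\nGaGCU\n"
rnacentral = ">query\nGGCU\n"
merged_msa = merge_and_deduplicate([nt_rna, rfam, rnacentral])
```

Because the query is the first record of the first input, it stays the first record of the merged MSA, and the repeated query rows from the other two databases are dropped.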
Usage Examples
Basic Pipeline Usage
import datetime

from alphafold3.data import pipeline
from alphafold3.common import folding_input

# Create configuration
config = pipeline.DataPipelineConfig(
    # Binary paths
    jackhmmer_binary_path='/usr/bin/jackhmmer',
    nhmmer_binary_path='/usr/bin/nhmmer',
    hmmalign_binary_path='/usr/bin/hmmalign',
    hmmsearch_binary_path='/usr/bin/hmmsearch',
    hmmbuild_binary_path='/usr/bin/hmmbuild',
    # Database paths
    small_bfd_database_path='/databases/bfd/bfd.fasta',
    mgnify_database_path='/databases/mgnify/mgy_clusters.fa',
    uniprot_cluster_annot_database_path='/databases/uniprot/uniprot.fa',
    uniref90_database_path='/databases/uniref90/uniref90.fa',
    ntrna_database_path='/databases/ntrna/nt_rna.fasta',
    rfam_database_path='/databases/rfam/rfam.fasta',
    rna_central_database_path='/databases/rnacentral/rnacentral.fasta',
    seqres_database_path='/databases/pdb/seqres.fasta',
    pdb_database_path='/databases/pdb/mmcif_files',
    # Settings
    jackhmmer_n_cpu=8,
    nhmmer_n_cpu=8,
    max_template_date=datetime.date(2021, 9, 30),
)

# Initialize pipeline
data_pipeline = pipeline.DataPipeline(config)

# Create input
protein = folding_input.ProteinChain(
    id='A',
    sequence='MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA',
    ptms=[]
)
fold_input = folding_input.Input(
    name='my_protein',
    chains=[protein],
    rng_seeds=[42]
)

# Run pipeline
processed_input = data_pipeline.process(fold_input)

# Access results
processed_chain = processed_input.protein_chains[0]
print(f"Unpaired MSA depth: {processed_chain.unpaired_msa.count('>')}")
print(f"Paired MSA depth: {processed_chain.paired_msa.count('>')}")
print(f"Templates found: {len(processed_chain.templates)}")
Processing Individual Chains
# Process only protein chain
protein_chain = folding_input.ProteinChain(
    id='A',
    sequence='MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA',
    ptms=[]
)
processed_protein = data_pipeline.process_protein_chain(protein_chain)

# Process only RNA chain
rna_chain = folding_input.RnaChain(
    id='B',
    sequence='GGGGCUAUAGCUCAGCGGUAGAGCAGUGGAUUGAAAUCCAUUGUGUCGCUGGUUCGAUUCCGGUUAGUCUCCA',
    modifications=[]
)
processed_rna = data_pipeline.process_rna_chain(rna_chain)
Custom MSA (Skip Pipeline)
# Provide custom MSA to skip pipeline
protein_with_msa = folding_input.ProteinChain(
    id='A',
    sequence='MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA',
    ptms=[],
    unpaired_msa='>query\nMKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA\n>seq1\nMKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA\n',
    paired_msa='',  # Empty paired MSA
    templates=[]  # No templates
)

# Pipeline will skip this chain
fold_input = folding_input.Input(
    name='custom_msa',
    chains=[protein_with_msa],
    rng_seeds=[42]
)

# No processing will occur
processed = data_pipeline.process(fold_input)
Multi-Chain Complex
# Create complex with protein and RNA
protein = folding_input.ProteinChain(
    id='A',
    sequence='MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA',
    ptms=[]
)
rna = folding_input.RnaChain(
    id='B',
    sequence='GGGGCUAUAGCUCAGCGGUAGAGCAGUGGAUUGAAAUCCAUUGUGUCGCUGGUUCGAUUCCGGUUAGUCUCCA',
    modifications=[]
)
ligand = folding_input.Ligand(
    id='C',
    ccd_ids=['ATP']
)
fold_input = folding_input.Input(
    name='complex',
    chains=[protein, rna, ligand],
    rng_seeds=[42]
)

# Pipeline processes protein and RNA, passes through ligand
processed = data_pipeline.process(fold_input)
Sharded Database Configuration
# For sharded databases, specify Z-values
config = pipeline.DataPipelineConfig(
    # ... other paths ...
    # Sharded database with Z-value
    small_bfd_database_path='/databases/bfd/bfd_{000..049}.fasta',
    small_bfd_z_value=138_515_945,  # Total sequences across all shards
    uniref90_database_path='/databases/uniref90/uniref90_{000..019}.fasta',
    uniref90_z_value=103_000_000,
    # Parallel shard processing
    jackhmmer_max_parallel_shards=4,  # Process 4 shards at once
    max_template_date=datetime.date(2021, 9, 30),
)
MSAs are returned in A3M format:
>query
MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA
>seq1/1-38
MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA
>seq2/1-37
MKFLK-SLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA
>seq3/1-39
MKFLKFSLLTAVLLSVVFAFSSCGDDDDgTYPYDVPDYA
- First sequence is the query (uppercase, no gaps)
- Subsequent sequences are hits (lowercase = insertion, "-" = deletion)
- Headers include the sequence ID and alignment range
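The A3M conventions above can be checked with a few lines of code. This is a minimal sketch with made-up sequences and hypothetical helper names; it assumes headers contain no ">" character other than the leading one.

```python
def read_a3m(a3m: str) -> list:
    """Splits A3M text into (header, sequence) pairs."""
    records = []
    for block in a3m.split(">")[1:]:
        header, *sequence_lines = block.splitlines()
        sequence = "".join(line.strip() for line in sequence_lines)
        records.append((header.strip(), sequence))
    return records


def strip_insertions(sequence: str) -> str:
    """Removes lowercase insertions, leaving only match and gap columns."""
    return "".join(c for c in sequence if not c.islower())


a3m = (
    ">query\nMKFLK\n"
    ">hit1/1-5\nMKFLK\n"
    ">hit2/1-4\nMK-LK\n"     # one deletion relative to the query
    ">hit3/1-7\nMKaaFLK\n"   # two inserted residues (lowercase)
)
records = read_a3m(a3m)
query_length = len(records[0][1])
aligned_rows = [strip_insertions(sequence) for _, sequence in records]
msa_depth = len(records)
```

After insertions are removed, every row has the same number of columns as the query, which is a quick sanity check for a hand-written custom MSA.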
Templates are returned as folding_input.Template objects:
template = folding_input.Template(
    mmcif='<mmCIF string>',
    query_to_template_map={
        1: 10,  # Query residue 1 -> Template residue 10
        2: 11,  # Query residue 2 -> Template residue 11
        3: 12,  # Query residue 3 -> Template residue 12
        # ... etc
    }
)
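To make the role of query_to_template_map concrete, the sketch below projects a template sequence onto query positions. It treats keys and values as 0-based positions, which is an assumption; check the indexing convention of the AlphaFold 3 version in use. The helper names and sequences are hypothetical.

```python
def project_template(
    query_length: int,
    template_sequence: str,
    query_to_template_map: dict,
) -> str:
    """Builds a query-length row with template residues at mapped positions."""
    row = ["-"] * query_length
    for query_index, template_index in query_to_template_map.items():
        row[query_index] = template_sequence[template_index]
    return "".join(row)


def template_coverage(query_length: int, query_to_template_map: dict) -> float:
    """Fraction of query positions that have a mapped template residue."""
    return len(query_to_template_map) / query_length


# Made-up data: query positions 1..3 map to template positions 2..4.
mapping = {1: 2, 2: 3, 3: 4}
template_sequence = "GGACDEFGG"
projected = project_template(6, template_sequence, mapping)
coverage = template_coverage(6, mapping)
```

Unmapped query positions stay as "-" in the projected row, so the coverage value gives a rough idea of how much of the query a template can inform.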
Caching
Functions are decorated with @functools.cache to avoid redundant searches:
- Identical sequences in homomers are processed only once
- Cache is per-Python-process (not persistent across runs)
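The effect of functools.cache on a homomer can be demonstrated with a stand-in search function. The function below is a placeholder that records its calls rather than running any MSA tool; note that functools.cache requires all arguments to be hashable.

```python
import functools

search_calls = []


@functools.cache
def cached_search(sequence: str) -> str:
    """Stand-in for an expensive MSA search, cached on the sequence."""
    search_calls.append(sequence)
    return f">query\n{sequence}\n"


# A homotrimer: three chains share one sequence.
homotrimer = {"A": "MKFLK", "B": "MKFLK", "C": "MKFLK"}
msas = {chain_id: cached_search(seq) for chain_id, seq in homotrimer.items()}

num_real_searches = len(search_calls)
cache_hits = cached_search.cache_info().hits
```

Only the first chain triggers the search; the other two are served from the in-memory cache, which is discarded when the Python process exits.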
Parallelization
MSA tools run in parallel using ThreadPoolExecutor:
- 4 protein databases searched simultaneously
- 3 RNA databases searched simultaneously
- Template search can run concurrently with MSA
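Searching several databases at once with ThreadPoolExecutor follows a common submit-then-collect pattern, sketched below. The fake_search function is a placeholder that sleeps instead of launching a search binary, and the database labels are used only as dictionary keys.

```python
import time
from concurrent import futures


def fake_search(database: str, sequence: str) -> str:
    """Placeholder for one database search; sleeps instead of running a tool."""
    time.sleep(0.1)
    return f">query\n{sequence}\n>{database}_hit\n{sequence}\n"


databases = ["uniref90", "mgnify", "small_bfd", "uniprot"]
sequence = "MKFLK"

with futures.ThreadPoolExecutor(max_workers=len(databases)) as executor:
    # Submit every search first so they all start before any result is awaited.
    submitted = {
        db: executor.submit(fake_search, db, sequence) for db in databases
    }
    results = {db: future.result() for db, future in submitted.items()}
```

Threads are a reasonable fit for this kind of work when the heavy lifting happens in external processes, since the Python threads then spend most of their time waiting.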
Timing
Typical processing times (8 CPUs):
- Protein MSA: 5-30 minutes depending on databases
- RNA MSA: 3-15 minutes depending on databases
- Template search: 1-5 minutes
- Total for protein: 10-35 minutes
Database Sizes
Recommended database sizes (2021-2022 versions):
- Small BFD: ~138M sequences, ~2.5 TB
- Mgnify: ~125M sequences, ~120 GB
- UniRef90: ~103M sequences, ~35 GB
- UniProt: ~224M sequences, ~80 GB
- NT-RNA: ~47 MB
- Rfam: ~13 MB
- RNAcentral: ~42 MB