pipeline.py

Overview

The pipeline.py module runs MSA (Multiple Sequence Alignment) generation and template search tools for AlphaFold 3. It processes protein and RNA chains to generate evolutionary information and structural templates needed for structure prediction.

DataPipeline Class

Main class that orchestrates MSA generation and template search.

class DataPipeline:
    def __init__(self, data_pipeline_config: DataPipelineConfig)

data_pipeline_config (DataPipelineConfig, required): Configuration specifying database paths, binary paths, and search parameters.

Methods

process

Main method to process a fold input through the data pipeline.

def process(
    self, 
    fold_input: folding_input.Input
) -> folding_input.Input

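fold_input (folding_input.Input, required): Input containing the chains to process. Leave the MSA and template fields unset (None) on chains that should be searched; chains that already carry this data are skipped (see Custom MSA (Skip Pipeline)).

return (folding_input.Input): New Input with MSAs and templates populated for all chains.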

Processing Logic:

Protein chains: Runs Jackhmmer for MSA, Hmmsearch for templates
RNA chains: Runs Nhmmer for MSA
DNA chains: No processing (passed through)
Ligands: No processing (passed through)
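
The dispatch described above can be sketched as follows. This is a minimal, self-contained illustration rather than the module's actual code: the chain classes are hypothetical stand-ins for the folding_input types, and tools_for_chain is an invented helper that only reports which tools would run.

```python
import dataclasses


# Hypothetical stand-ins for the folding_input chain types.
@dataclasses.dataclass(frozen=True)
class ProteinChain:
    id: str
    sequence: str


@dataclasses.dataclass(frozen=True)
class RnaChain:
    id: str
    sequence: str


@dataclasses.dataclass(frozen=True)
class DnaChain:
    id: str
    sequence: str


@dataclasses.dataclass(frozen=True)
class Ligand:
    id: str
    ccd_ids: tuple[str, ...]


def tools_for_chain(chain) -> list[str]:
    """Returns the search tools that would run for a chain (illustrative)."""
    if isinstance(chain, ProteinChain):
        return ['jackhmmer', 'hmmsearch']  # MSA search, then template search
    if isinstance(chain, RnaChain):
        return ['nhmmer']  # MSA search only
    return []  # DNA chains and ligands are passed through unchanged


chains = [
    ProteinChain('A', 'MKFLKF'),
    RnaChain('B', 'GGGGCUAUAG'),
    DnaChain('C', 'GATTACA'),
    Ligand('D', ('ATP',)),
]
plan = {chain.id: tools_for_chain(chain) for chain in chains}
print(plan)
```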

process_protein_chain

def process_protein_chain(
    self, 
    chain: folding_input.ProteinChain
) -> folding_input.ProteinChain

Processes a single protein chain to generate MSAs and templates.

chain (folding_input.ProteinChain, required): Protein chain to process.

return (folding_input.ProteinChain): Protein chain with populated unpaired_msa, paired_msa, and templates fields.

MSA Generation:

UniRef90: 10,000 sequences max, e-value 1e-4
Mgnify: 5,000 sequences max, e-value 1e-4
Small BFD: 5,000 sequences max, e-value 1e-4
UniProt (paired): 50,000 sequences max, e-value 1e-4
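
To make the effect of these limits concrete, the sketch below stores them in a small table and applies them to a list of hits. It is an assumption-laden illustration: SearchLimits, PROTEIN_LIMITS, and apply_limits are hypothetical names, the hits are made-up (name, e_value) pairs, and the real pipeline may apply the thresholds inside the search tool rather than afterwards.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class SearchLimits:
    """Hypothetical container for one database's search limits."""
    max_sequences: int
    e_value: float


# Values copied from the list above; the keys are illustrative labels.
PROTEIN_LIMITS = {
    'uniref90': SearchLimits(max_sequences=10_000, e_value=1e-4),
    'mgnify': SearchLimits(max_sequences=5_000, e_value=1e-4),
    'small_bfd': SearchLimits(max_sequences=5_000, e_value=1e-4),
    'uniprot': SearchLimits(max_sequences=50_000, e_value=1e-4),
}


def apply_limits(hits, limits):
    """Keeps hits within the E-value threshold, best first, capped in number."""
    kept = sorted(
        (hit for hit in hits if hit[1] <= limits.e_value),
        key=lambda hit: hit[1],
    )
    return kept[: limits.max_sequences]


# Made-up hits: (sequence name, E-value).
hits = [('a', 1e-10), ('b', 0.5), ('c', 1e-5)]
filtered = apply_limits(hits, PROTEIN_LIMITS['uniref90'])
print(filtered)
```

Hit 'b' is dropped because its E-value exceeds the 1e-4 threshold.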

Template Search:

Searches PDB using Hmmsearch with e-value 100
Filters to max 4 templates by date and quality
Returns templates with structures and alignments
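
The date cutoff and the cap of four templates can be sketched as below. This is not the library's filter: TemplateHit and filter_templates are hypothetical, the PDB IDs and dates are invented, and ranking by E-value is used here only as a stand-in for the "quality" criterion mentioned above.

```python
import dataclasses
import datetime


@dataclasses.dataclass(frozen=True)
class TemplateHit:
    """Hypothetical template hit; field names are assumptions."""
    pdb_id: str
    release_date: datetime.date
    e_value: float


def filter_templates(hits, max_template_date, max_templates=4):
    """Drops hits released after the cutoff, then keeps the best few."""
    allowed = [h for h in hits if h.release_date <= max_template_date]
    allowed.sort(key=lambda h: h.e_value)  # assumed quality proxy
    return allowed[:max_templates]


# Invented hits for illustration only.
hits = [
    TemplateHit('1abc', datetime.date(2019, 5, 1), 1e-30),
    TemplateHit('2def', datetime.date(2022, 1, 15), 1e-50),  # after cutoff
    TemplateHit('3ghi', datetime.date(2020, 3, 9), 1e-10),
    TemplateHit('4jkl', datetime.date(2018, 7, 4), 1e-20),
    TemplateHit('5mno', datetime.date(2021, 9, 30), 1e-5),
    TemplateHit('6pqr', datetime.date(2015, 1, 1), 1e-40),
]
selected = filter_templates(hits, datetime.date(2021, 9, 30))
print([h.pdb_id for h in selected])
```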

process_rna_chain

def process_rna_chain(
    self, 
    chain: folding_input.RnaChain
) -> folding_input.RnaChain

Processes a single RNA chain to generate MSAs.

chain (folding_input.RnaChain, required): RNA chain to process.

return (folding_input.RnaChain): RNA chain with populated unpaired_msa field.

MSA Generation:

NT-RNA: 10,000 sequences max, e-value 1e-3
Rfam: 10,000 sequences max, e-value 1e-3
RNAcentral: 10,000 sequences max, e-value 1e-3

DataPipelineConfig

Configuration dataclass specifying all pipeline settings.

@dataclasses.dataclass(frozen=True, slots=True, kw_only=True)
class DataPipelineConfig:
    # Binary paths
    jackhmmer_binary_path: str
    nhmmer_binary_path: str
    hmmalign_binary_path: str
    hmmsearch_binary_path: str
    hmmbuild_binary_path: str
    
    # Database paths
    small_bfd_database_path: str
    mgnify_database_path: str
    uniprot_cluster_annot_database_path: str
    uniref90_database_path: str
    ntrna_database_path: str
    rfam_database_path: str
    rna_central_database_path: str
    seqres_database_path: str
    pdb_database_path: str
    
    # Optional Z-values for sharded databases
    small_bfd_z_value: int | None = None
    mgnify_z_value: int | None = None
    uniprot_cluster_annot_z_value: int | None = None
    uniref90_z_value: int | None = None
    ntrna_z_value: int | None = None
    rfam_z_value: int | None = None
    rna_central_z_value: int | None = None
    
    # CPU configuration
    jackhmmer_n_cpu: int = 8
    jackhmmer_max_parallel_shards: int | None = None
    nhmmer_n_cpu: int = 8
    nhmmer_max_parallel_shards: int | None = None
    
    # Template search
    max_template_date: datetime.date

Binary Paths

jackhmmer_binary_path (str, required): Path to the Jackhmmer binary for protein MSA search.
nhmmer_binary_path (str, required): Path to the Nhmmer binary for RNA MSA search.
hmmalign_binary_path (str, required): Path to the Hmmalign binary for aligning hits to the query profile.
hmmsearch_binary_path (str, required): Path to the Hmmsearch binary for template search.
hmmbuild_binary_path (str, required): Path to the Hmmbuild binary for building HMM profiles.

Database Paths

small_bfd_database_path (str, required): Small BFD database path for protein MSA search.
mgnify_database_path (str, required): Mgnify database path for protein MSA search.
uniprot_cluster_annot_database_path (str, required): UniProt database path for protein paired MSA search.
uniref90_database_path (str, required): UniRef90 database path for MSA and template profile construction.
ntrna_database_path (str, required): NT-RNA database path for RNA MSA search.
rfam_database_path (str, required): Rfam database path for RNA MSA search.
rna_central_database_path (str, required): RNAcentral database path for RNA MSA search.
seqres_database_path (str, required): PDB sequence database path for template search.
pdb_database_path (str, required): PDB mmCIF files directory for template structures.

Z-values

Z-values represent database sizes for E-value calculation and must be set for sharded databases.

small_bfd_z_value (int | None, default: None): Database size in number of sequences for Small BFD.
mgnify_z_value (int | None, default: None): Database size in number of sequences for Mgnify.
uniprot_cluster_annot_z_value (int | None, default: None): Database size in number of sequences for UniProt.
uniref90_z_value (int | None, default: None): Database size in number of sequences for UniRef90.
ntrna_z_value (int | None, default: None): Database size in megabases for NT-RNA.
rfam_z_value (int | None, default: None): Database size in megabases for Rfam.
rna_central_z_value (int | None, default: None): Database size in megabases for RNAcentral.
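
Why the Z-value matters for sharded databases can be shown with a little arithmetic. HMMER scales a hit's P-value by the search-space size Z to obtain its E-value, so a search over a single shard that is unaware of the full database size reports E-values that are too small. The sketch below assumes an even 50-way split of the Small BFD sequence count used elsewhere on this page; the function name and the P-value are illustrative.

```python
def e_value(p_value: float, z_value: int) -> float:
    """E-value as the P-value scaled by the search-space size Z."""
    return p_value * z_value


p_value = 1e-12           # made-up per-comparison P-value of one hit
total_z = 138_515_945     # sequences across the whole database
n_shards = 50             # assumed even split
shard_z = total_z // n_shards

e_full = e_value(p_value, total_z)    # what an unsharded search reports
e_shard = e_value(p_value, shard_z)   # what one shard reports without Z set

print(f'full database: {e_full:.3e}')
print(f'single shard:  {e_shard:.3e}')
print(f'ratio:         {e_full / e_shard:.2f}')
```

Setting the Z-value to the full database size makes every shard report E-values on the same scale as an unsharded search, so the same E-value threshold selects the same hits.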

CPU Configuration

jackhmmer_n_cpu (int, default: 8): Number of CPUs for Jackhmmer. Going above 8 provides diminishing returns.
jackhmmer_max_parallel_shards (int | None, default: None): Maximum parallel shards for Jackhmmer. If None, one instance per shard.
nhmmer_n_cpu (int, default: 8): Number of CPUs for Nhmmer. Going above 8 provides diminishing returns.
nhmmer_max_parallel_shards (int | None, default: None): Maximum parallel shards for Nhmmer. If None, one instance per shard.

Template Configuration

max_template_date (datetime.date, required): Latest allowed template release date. Templates released after this date are filtered out.

Internal Functions

_get_protein_msa_and_templates

Cached function to avoid re-running MSA tools for identical sequences in homomers.

@functools.cache
def _get_protein_msa_and_templates(
    sequence: str,
    run_template_search: bool,
    uniref90_msa_config: msa_config.RunConfig,
    mgnify_msa_config: msa_config.RunConfig,
    small_bfd_msa_config: msa_config.RunConfig,
    uniprot_msa_config: msa_config.RunConfig,
    templates_config: msa_config.TemplatesConfig,
    pdb_database_path: str,
) -> tuple[msa.Msa, msa.Msa, templates_lib.Templates]

Returns unpaired MSA, paired MSA, and templates for a protein sequence.

_get_protein_templates

Cached function for template search only.

@functools.cache
def _get_protein_templates(
    sequence: str,
    input_msa_a3m: str,
    run_template_search: bool,
    templates_config: msa_config.TemplatesConfig,
    pdb_database_path: str,
) -> templates_lib.Templates

Searches for templates using provided MSA.

_get_rna_msa

Cached function for RNA MSA generation.

@functools.cache
def _get_rna_msa(
    sequence: str,
    nt_rna_msa_config: msa_config.NhmmerConfig,
    rfam_msa_config: msa_config.NhmmerConfig,
    rnacentral_msa_config: msa_config.NhmmerConfig,
) -> msa.Msa

Generates and deduplicates RNA MSA from three databases.
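
A minimal sketch of merging per-database results and dropping duplicates is given below. It assumes each MSA is a list of (name, aligned_sequence) pairs and treats two rows as duplicates when their aligned sequences are identical; the real module may use a different representation and criterion. The function name and the sequences are invented.

```python
def merge_and_deduplicate(
    query: str, *msas: list[tuple[str, str]]
) -> list[tuple[str, str]]:
    """Concatenates MSAs in order, keeping the first copy of each sequence."""
    seen = {query}
    merged = [('query', query)]
    for msa in msas:
        for name, sequence in msa:
            if sequence not in seen:
                seen.add(sequence)
                merged.append((name, sequence))
    return merged


# Made-up hits from three databases.
nt_rna = [('nt1', 'GGGA'), ('nt2', 'GGGC')]
rfam = [('rf1', 'GGGC'), ('rf2', 'GGGU')]
rnacentral = [('rc1', 'GGGG'), ('rc2', 'GGGU')]

merged = merge_and_deduplicate('GGGG', nt_rna, rfam, rnacentral)
print(merged)
```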

Usage Examples

Basic Pipeline Usage

import datetime
from alphafold3.data import pipeline
from alphafold3.common import folding_input

# Create configuration
config = pipeline.DataPipelineConfig(
    # Binary paths
    jackhmmer_binary_path='/usr/bin/jackhmmer',
    nhmmer_binary_path='/usr/bin/nhmmer',
    hmmalign_binary_path='/usr/bin/hmmalign',
    hmmsearch_binary_path='/usr/bin/hmmsearch',
    hmmbuild_binary_path='/usr/bin/hmmbuild',
    
    # Database paths
    small_bfd_database_path='/databases/bfd/bfd.fasta',
    mgnify_database_path='/databases/mgnify/mgy_clusters.fa',
    uniprot_cluster_annot_database_path='/databases/uniprot/uniprot.fa',
    uniref90_database_path='/databases/uniref90/uniref90.fa',
    ntrna_database_path='/databases/ntrna/nt_rna.fasta',
    rfam_database_path='/databases/rfam/rfam.fasta',
    rna_central_database_path='/databases/rnacentral/rnacentral.fasta',
    seqres_database_path='/databases/pdb/seqres.fasta',
    pdb_database_path='/databases/pdb/mmcif_files',
    
    # Settings
    jackhmmer_n_cpu=8,
    nhmmer_n_cpu=8,
    max_template_date=datetime.date(2021, 9, 30),
)

# Initialize pipeline
data_pipeline = pipeline.DataPipeline(config)

# Create input
protein = folding_input.ProteinChain(
    id='A',
    sequence='MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA',
    ptms=[]
)

fold_input = folding_input.Input(
    name='my_protein',
    chains=[protein],
    rng_seeds=[42]
)

# Run pipeline
processed_input = data_pipeline.process(fold_input)

# Access results
processed_chain = processed_input.protein_chains[0]
print(f"Unpaired MSA depth: {processed_chain.unpaired_msa.count('>')}")
print(f"Paired MSA depth: {processed_chain.paired_msa.count('>')}")
print(f"Templates found: {len(processed_chain.templates)}")

Processing Individual Chains

# Process only protein chain
protein_chain = folding_input.ProteinChain(
    id='A',
    sequence='MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA',
    ptms=[]
)

processed_protein = data_pipeline.process_protein_chain(protein_chain)

# Process only RNA chain
rna_chain = folding_input.RnaChain(
    id='B',
    sequence='GGGGCUAUAGCUCAGCGGUAGAGCAGUGGAUUGAAAUCCAUUGUGUCGCUGGUUCGAUUCCGGUUAGUCUCCA',
    modifications=[]
)

processed_rna = data_pipeline.process_rna_chain(rna_chain)

Custom MSA (Skip Pipeline)

# Provide custom MSA to skip pipeline
protein_with_msa = folding_input.ProteinChain(
    id='A',
    sequence='MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA',
    ptms=[],
    unpaired_msa='>query\nMKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA\n>seq1\nMKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA\n',
    paired_msa='',  # Empty paired MSA
    templates=[]     # No templates
)

# Pipeline will skip this chain
fold_input = folding_input.Input(
    name='custom_msa',
    chains=[protein_with_msa],
    rng_seeds=[42]
)

# No processing will occur
processed = data_pipeline.process(fold_input)

Multi-Chain Complex

# Create complex with protein and RNA
protein = folding_input.ProteinChain(
    id='A',
    sequence='MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA',
    ptms=[]
)

rna = folding_input.RnaChain(
    id='B',
    sequence='GGGGCUAUAGCUCAGCGGUAGAGCAGUGGAUUGAAAUCCAUUGUGUCGCUGGUUCGAUUCCGGUUAGUCUCCA',
    modifications=[]
)

ligand = folding_input.Ligand(
    id='C',
    ccd_ids=['ATP']
)

fold_input = folding_input.Input(
    name='complex',
    chains=[protein, rna, ligand],
    rng_seeds=[42]
)

# Pipeline processes protein and RNA, passes through ligand
processed = data_pipeline.process(fold_input)

Sharded Database Configuration

# For sharded databases, specify Z-values
config = pipeline.DataPipelineConfig(
    # ... other paths ...
    
    # Sharded database with Z-value
    small_bfd_database_path='/databases/bfd/bfd_{000..049}.fasta',
    small_bfd_z_value=138_515_945,  # Total sequences across all shards
    
    uniref90_database_path='/databases/uniref90/uniref90_{000..019}.fasta',
    uniref90_z_value=103_000_000,
    
    # Parallel shard processing
    jackhmmer_max_parallel_shards=4,  # Process 4 shards at once
    
    max_template_date=datetime.date(2021, 9, 30),
)
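
For protein databases the Z-value is the total number of sequences across all shards, which can be obtained by counting FASTA header lines. The sketch below does this for two tiny temporary shard files; count_fasta_sequences and the shard file names are hypothetical. For the RNA databases the Z-value is expressed in megabases instead, so a sequence count does not apply there.

```python
import pathlib
import tempfile


def count_fasta_sequences(paths) -> int:
    """Counts FASTA records (lines starting with '>') across files."""
    total = 0
    for path in paths:
        with open(path) as fasta_file:
            total += sum(1 for line in fasta_file if line.startswith('>'))
    return total


with tempfile.TemporaryDirectory() as tmp:
    shard_dir = pathlib.Path(tmp)
    # Two toy shards standing in for a sharded database.
    (shard_dir / 'shard_000.fasta').write_text('>a\nMKV\n>b\nMKL\n')
    (shard_dir / 'shard_001.fasta').write_text('>c\nMKI\n')
    z_value = count_fasta_sequences(sorted(shard_dir.glob('shard_*.fasta')))

print(z_value)  # 3
```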

MSA Format

MSAs are returned in A3M format:

>query
MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA
>seq1/1-38
MKFLKFSLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA
>seq2/1-37
MKFLK-SLLTAVLLSVVFAFSSCGDDDDTYPYDVPDYA
>seq3/1-40
MKFLKFSLLTAVLLSVVFAFSSCGDDDDgsTYPYDVPDYA

First sequence is the query (uppercase, no gaps)
Subsequent sequences are hits (uppercase = aligned to a query position, lowercase = insertion relative to the query, - = deletion)
Headers include the sequence ID and the aligned residue range
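
The conventions above are enough to read an A3M string with a few lines of code. The sketch below is a generic parser written for illustration, not the module's own: parse_a3m and to_match_columns are hypothetical names, and the example alignment is made up.

```python
def parse_a3m(a3m: str) -> list[tuple[str, str]]:
    """Splits an A3M string into (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in a3m.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


def to_match_columns(row: str) -> str:
    """Removes lowercase insertions so the row lines up with the query."""
    return ''.join(char for char in row if not char.islower())


example = '>query\nMKFLKF\n>hit1/1-5\nMKF-KF\n>hit2/1-8\nMKFggLKF\n'
records = parse_a3m(example)
aligned = [to_match_columns(sequence) for _, sequence in records]

print(len(records))  # MSA depth, including the query
print(aligned)
```

After insertions are removed, every row has the same length as the query, which is a quick sanity check for a hand-written MSA.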

Template Format

Templates are returned as folding_input.Template objects:

template = folding_input.Template(
    mmcif='<mmCIF string>',
    query_to_template_map={
        1: 10,   # Query residue 1 -> Template residue 10
        2: 11,   # Query residue 2 -> Template residue 11
        3: 12,   # Query residue 3 -> Template residue 12
        # ... etc
    }
)
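
Two simple checks on a residue mapping of this shape are sketched below: the fraction of query residues covered by the template, and whether the template positions increase along with the query positions. Both helpers are hypothetical, are not part of folding_input, and work on a plain dict.

```python
def template_coverage(
    query_length: int, query_to_template_map: dict[int, int]
) -> float:
    """Fraction of query residues that map to a template residue."""
    return len(query_to_template_map) / query_length


def is_monotonic(query_to_template_map: dict[int, int]) -> bool:
    """True if template positions strictly increase with query positions."""
    pairs = sorted(query_to_template_map.items())
    return all(
        template_a < template_b
        for (_, template_a), (_, template_b) in zip(pairs, pairs[1:])
    )


mapping = {1: 10, 2: 11, 3: 12}
coverage = template_coverage(query_length=38, query_to_template_map=mapping)
print(f'coverage: {coverage:.3f}')
print(f'monotonic: {is_monotonic(mapping)}')
```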

Performance Considerations

Caching

Functions are decorated with @functools.cache to avoid redundant searches:

Identical sequences in homomers are processed only once
Cache is per-Python-process (not persistent across runs)
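
The effect of the cache on a homomer can be demonstrated with a stand-in search function. In the sketch below, search_msa is a hypothetical placeholder for an expensive tool run; the list of calls records how often the body actually executes.

```python
import functools

calls = []  # records every real (uncached) invocation


@functools.cache
def search_msa(sequence: str) -> str:
    """Placeholder for an expensive MSA search keyed by sequence."""
    calls.append(sequence)
    return f'>query\n{sequence}\n'


# Two identical chains, as in a homodimer.
homodimer = ['MKFLKF', 'MKFLKF']
msas = [search_msa(sequence) for sequence in homodimer]

print(len(calls))               # 1: the second chain is a cache hit
print(search_msa.cache_info())
```

Because functools.cache keys on the call arguments, every argument must be hashable, which is one reason to pass frozen configuration objects and plain strings to cached functions.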

Parallelization

MSA tools run in parallel using ThreadPoolExecutor:

4 protein databases searched simultaneously
3 RNA databases searched simultaneously
Template search can run concurrently with MSA
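
Searching several databases at once with a thread pool can be sketched as follows. The run_search function is a hypothetical placeholder that returns immediately; in a real pipeline each call would block on an external tool process, which is why threads are sufficient for this kind of workload.

```python
import concurrent.futures


def run_search(database: str, sequence: str) -> tuple[str, str]:
    """Placeholder for one blocking search against one database."""
    return database, f'>query\n{sequence}\n'


databases = ['uniref90', 'mgnify', 'small_bfd', 'uniprot']

with concurrent.futures.ThreadPoolExecutor(
    max_workers=len(databases)
) as executor:
    # Submit all four searches up front so they run side by side.
    futures = [
        executor.submit(run_search, database, 'MKFLKF')
        for database in databases
    ]
    results = dict(future.result() for future in futures)

print(sorted(results))
```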

Timing

Typical processing times (8 CPUs):

Protein MSA: 5-30 minutes depending on databases
RNA MSA: 3-15 minutes depending on databases
Template search: 1-5 minutes
Total for protein: 10-35 minutes

Database Sizes

Recommended database sizes (2021-2022 versions):

Small BFD: ~138M sequences, ~2.5 TB
Mgnify: ~125M sequences, ~120 GB
UniRef90: ~103M sequences, ~35 GB
UniProt: ~224M sequences, ~80 GB
NT-RNA: ~47 MB
Rfam: ~13 MB
RNAcentral: ~42 MB
