
Overview

The AlphaFold 3 data pipeline transforms raw input sequences into rich feature representations by searching genetic and structural databases. This CPU-intensive stage can be run independently from the GPU inference stage.
The data pipeline is controlled by --run_data_pipeline=true and produces a *_data.json file containing all features needed for inference.

Pipeline Architecture

The data pipeline consists of three parallel search processes:

Entry Point

The data pipeline is orchestrated by src/alphafold3/data/pipeline.py:
# From src/alphafold3/data/pipeline.py:71
def _get_protein_msa_and_templates(
    sequence: str,
    run_template_search: bool,
    uniref90_msa_config: msa_config.RunConfig,
    mgnify_msa_config: msa_config.RunConfig,
    small_bfd_msa_config: msa_config.RunConfig,
    uniprot_msa_config: msa_config.RunConfig,
    templates_config: msa_config.TemplatesConfig,
    pdb_database_path: str,
) -> tuple[msa.Msa, msa.Msa, templates_lib.Templates]:
The pipeline uses @functools.cache to avoid re-running searches for identical sequences in homomers, significantly improving performance for symmetric complexes.
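The effect of memoizing per-sequence searches can be sketched as follows. This is a minimal illustration, not the pipeline's own code: search_chain and the sequences are hypothetical stand-ins for an expensive search, and the approach relies on the arguments being hashable (plain strings here).

```python
import functools

# Counts how many times the stand-in search actually executes.
search_calls = 0


@functools.cache
def search_chain(sequence: str) -> tuple[str, ...]:
    """Hypothetical stand-in for an expensive MSA/template search."""
    global search_calls
    search_calls += 1
    return (sequence,)  # Pretend the only hit is the query itself.


# A homotrimer plus one distinct chain: four chains, two unique sequences.
chains = ["MKTAYIAK", "MKTAYIAK", "MKTAYIAK", "GSHMLEDP"]
results = [search_chain(seq) for seq in chains]

# search_calls is now 2: one real search per unique sequence.
```

The three identical chains share a single search; only the first call for each unique sequence does any work.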

Protein MSA Generation

Overview

Protein MSA generation searches multiple genetic databases to find homologous sequences:
  1. UniRef90 Search: Primary search against clustered UniProt database
  2. MGnify Search: Search metagenomic sequences for additional diversity
  3. Small BFD Search: Search BFD (Big Fantastic Database) for deeper evolutionary coverage
  4. UniProt Search: Paired MSA search for multimer pairing information

Parallel Execution

MSA tools run in parallel using a thread pool:
# From src/alphafold3/data/pipeline.py:85
with futures.ThreadPoolExecutor(max_workers=4) as executor:
    uniref90_msa_future = executor.submit(
        msa.get_msa,
        target_sequence=sequence,
        run_config=uniref90_msa_config,
        chain_poly_type=mmcif_names.PROTEIN_CHAIN,
    )
    mgnify_msa_future = executor.submit(...)
    small_bfd_msa_future = executor.submit(...)
    uniprot_msa_future = executor.submit(...)
This parallelization significantly reduces wall-clock time for the data pipeline.
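The same submit-then-collect pattern can be tried in isolation. The sketch below assumes each search mostly waits on an external tool, which is why threads are sufficient; fake_search and its delay are hypothetical stand-ins, not functions from the AlphaFold 3 codebase.

```python
import time
from concurrent import futures


def fake_search(database: str, sequence: str, delay_s: float = 0.05) -> str:
    """Hypothetical stand-in for one blocking database search."""
    time.sleep(delay_s)  # Simulates waiting on an external search tool.
    return f">query\n{sequence}\n"  # Minimal A3M holding only the query.


databases = ["uniref90", "mgnify", "small_bfd", "uniprot"]
sequence = "ACDEFGHIKLMNPQRSTVWY"

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # Submit every search first so they all run concurrently.
    pending = {db: executor.submit(fake_search, db, sequence) for db in databases}
    # result() blocks until that search finishes and re-raises its exception.
    msas = {db: future.result() for db, future in pending.items()}
```

Collecting results only after all submissions is what lets the four searches overlap; calling result() immediately after each submit would serialize them.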

Jackhmmer Tool

Jackhmmer performs iterative sequence search:
# From src/alphafold3/data/tools/jackhmmer.py
class Jackhmmer:
    """Runs Jackhmmer to search for homologous sequences."""
    
    def query(self, query_seq: str) -> str:
        """Runs jackhmmer search."""
        # Returns results in A3M format
Key parameters:
  • e_value: Statistical significance threshold (default: 0.0001)
  • iterations: Number of search iterations (default: 1)
  • z_value: Database size for E-value calculation
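As a rough illustration of how these parameters map onto the jackhmmer command line, the sketch below assembles an argument list. The helper name, the n_cpu default, and the file paths are assumptions made for this example; the options themselves (-N, -E, -Z, --cpu, -A, -o, --noali) are standard jackhmmer options. The alignment written with -A is in Stockholm format, so a conversion step to A3M is assumed to happen afterwards.

```python
from typing import Optional


def build_jackhmmer_command(
    query_fasta: str,
    database_path: str,
    output_sto: str,
    e_value: float = 1e-4,
    iterations: int = 1,
    z_value: Optional[int] = None,
    n_cpu: int = 8,  # Assumed default for this sketch.
) -> list[str]:
    """Builds a jackhmmer argument list (hypothetical helper)."""
    cmd = [
        "jackhmmer",
        "-o", "/dev/null",        # Discard the human-readable report.
        "-A", output_sto,         # Save the final alignment (Stockholm).
        "--noali",                # Omit alignments from the report.
        "--cpu", str(n_cpu),
        "-N", str(iterations),    # Number of search iterations.
        "-E", str(e_value),       # Reporting E-value threshold.
    ]
    if z_value is not None:
        # Fix the database size used for E-value calculation.
        cmd += ["-Z", str(z_value)]
    # Positional arguments: query file first, then the target database.
    cmd += [query_fasta, database_path]
    return cmd


command = build_jackhmmer_command("query.fasta", "uniref90.fasta", "out.sto")
```

The resulting list can be passed to subprocess.run without a shell, which avoids quoting problems in database paths.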
UniRef90 (uniref90_2022_05):
  • Clustered at 90% sequence identity
  • Fast initial search
  • Good balance of speed and coverage
MGnify (mgy_clusters_2022_05):
  • Metagenomic sequences
  • Diverse environmental samples
  • Complements UniRef with novel sequences
Small BFD (bfd-first_non_consensus_sequences.fasta):
  • Large, diverse sequence database
  • Can be sharded for faster searching
  • Deepest evolutionary coverage
Database paths configured via flags:
--uniref90_database_path='${DB_DIR}/uniref90_2022_05.fasta'
--mgnify_database_path='${DB_DIR}/mgy_clusters_2022_05.fa'
--small_bfd_database_path='${DB_DIR}/bfd-first_non_consensus_sequences.fasta'

MSA Processing

Raw MSA results are processed into a structured format:
# From src/alphafold3/data/msa.py:52
class Msa:
    """Multiple Sequence Alignment container."""
    
    def __init__(
        self,
        query_sequence: str,
        chain_poly_type: str,
        sequences: Sequence[str],
        descriptions: Sequence[str],
        deduplicate: bool = True,
    ):
Processing steps:
  1. Deduplication: Remove identical sequences (ignoring insertions)
  2. Validation: Ensure first sequence matches query
  3. Format conversion: Convert to A3M format (gaps as -, insertions as lowercase)
# From src/alphafold3/data/msa.py:89
import string

# A translation table that removes all lowercase characters. In A3M,
# lowercase letters mark insertions relative to the query.
deletion_table = str.maketrans('', '', string.ascii_lowercase)
sequence_no_deletions = seq.translate(deletion_table)  # seq: one A3M row
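Deduplication that ignores insertions can be sketched with the same translation table. The function name and example rows below are illustrative assumptions, not the Msa class's actual implementation.

```python
import string

# Translation table that deletes every lowercase letter (A3M insertions).
_REMOVE_LOWERCASE = str.maketrans("", "", string.ascii_lowercase)


def deduplicate_a3m_rows(
    sequences: list[str], descriptions: list[str]
) -> tuple[list[str], list[str]]:
    """Keeps the first row for each distinct insertion-free sequence."""
    seen: set[str] = set()
    kept_sequences: list[str] = []
    kept_descriptions: list[str] = []
    for seq, desc in zip(sequences, descriptions):
        key = seq.translate(_REMOVE_LOWERCASE)  # Compare without insertions.
        if key in seen:
            continue
        seen.add(key)
        kept_sequences.append(seq)
        kept_descriptions.append(desc)
    return kept_sequences, kept_descriptions


rows = ["ACDEFG", "ACDEFG", "ACdDEFG", "AC-EFG"]
names = ["query", "hit1", "hit2", "hit3"]
unique_rows, unique_names = deduplicate_a3m_rows(rows, names)
# unique_rows == ["ACDEFG", "AC-EFG"]; hit2 differs only by an insertion.
```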

MSA Pairing for Multimers

When predicting complexes, MSA pairing places sequences from the same organism in the same row across chains:
# From src/alphafold3/model/msa_pairing.py
def pair_msas(
    msas: Sequence[Msa],
    max_hits: int = 10000,
) -> PairedMsa:
    """Pairs MSAs based on organism identifiers."""
Pairing uses UniProt organism IDs extracted from sequence headers. Properly paired MSAs significantly improve multimer prediction quality by preserving co-evolutionary signals.
Pairing strategies:
  • Match sequences by organism ID in headers
  • Preserve relative positioning of paired sequences
  • Insert gaps for unpaired sequences
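The strategies above can be sketched on toy data. This is a simplified illustration under stated assumptions: each hit is an (organism_id, sequence) tuple, only the first hit per organism is used in each chain, and an organism must appear in at least two chains to form a paired row. The function name and the min_chains parameter are hypothetical.

```python
def pair_by_organism(
    chains: list[list[tuple[str, str]]],
    chain_lengths: list[int],
    min_chains: int = 2,
) -> list[list[str]]:
    """Builds paired rows: one sequence (or gap row) per chain."""
    # First hit per organism for each chain; dicts keep insertion order.
    first_hits: list[dict[str, str]] = []
    for hits in chains:
        per_organism: dict[str, str] = {}
        for organism, seq in hits:
            per_organism.setdefault(organism, seq)
        first_hits.append(per_organism)

    # Organisms in order of first appearance, preserving relative order.
    organisms: list[str] = []
    for per_organism in first_hits:
        for organism in per_organism:
            if organism not in organisms:
                organisms.append(organism)

    rows: list[list[str]] = []
    for organism in organisms:
        present = sum(organism in per_organism for per_organism in first_hits)
        if present < min_chains:
            continue  # Nothing to pair this organism with.
        rows.append([
            per_organism.get(organism, "-" * length)  # Gaps where unpaired.
            for per_organism, length in zip(first_hits, chain_lengths)
        ])
    return rows


chain_a = [("HUMAN", "ACDE"), ("MOUSE", "ACDQ"), ("YEAST", "ACNE")]
chain_b = [("MOUSE", "KLM"), ("HUMAN", "KLV"), ("ECOLI", "RLM")]
chain_c = [("HUMAN", "GG")]
paired = pair_by_organism([chain_a, chain_b, chain_c], [4, 3, 2])
# paired == [["ACDE", "KLV", "GG"], ["ACDQ", "KLM", "--"]]
```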
Users can provide custom paired MSAs via the unpairedMsa field to maintain exact control over pairing.

RNA/DNA MSA Generation

RNA and DNA use Nhmmer instead of Jackhmmer:
# From src/alphafold3/data/tools/nhmmer.py
class Nhmmer:
    """Runs Nhmmer to search for homologous nucleic acid sequences."""
Databases:
  • RFam: RNA families database
  • RNACentral: Comprehensive RNA sequence database
  • NT: Nucleotide collection (the configured file, nt_rna_2023_02_23.fa, is an RNA subset)
Configuration:
--rfam_database_path='${DB_DIR}/rfam_14_4.fa'
--rnacentral_database_path='${DB_DIR}/rnacentral_21_0.fa'
--nt_database_path='${DB_DIR}/nt_rna_2023_02_23.fa'

Processing

RNA/DNA MSA processing follows similar steps to protein:
  1. Search genetic databases
  2. Parse results in STOCKHOLM/A3M format
  3. Deduplicate sequences
  4. Validate against query sequence
RNA/DNA MSAs are generally shallower than protein MSAs due to fewer available sequences in databases. This is expected and does not necessarily indicate poor prediction quality.

Template Search

Overview

Template search finds structurally similar proteins in the PDB to provide spatial priors:
# From src/alphafold3/data/pipeline.py:30
def _get_protein_templates(
    sequence: str,
    input_msa_a3m: str,
    run_template_search: bool,
    templates_config: msa_config.TemplatesConfig,
    pdb_database_path: str,
) -> templates_lib.Templates:
    """Searches for templates for a single protein chain."""

Hmmsearch Tool

Template search uses Hmmsearch:
  1. Build HMM Profile: Create Hidden Markov Model from MSA using Hmmbuild
  2. Search PDB: Search profile against PDB sequence database using Hmmsearch
  3. Filter Results: Apply date, identity, and quality filters
  4. Extract Structures: Retrieve and process template structures from PDB

Template Filtering

Templates are filtered based on several criteria:
# From src/alphafold3/data/templates.py
class TemplateFilterConfig:
    max_template_date: datetime.date    # Exclude structures after this date
    max_subsequence_ratio: float = 0.95  # Filter nearly complete sequences
    min_align_ratio: float = 0.1        # Minimum alignment coverage
    max_hits: int = 20                  # Maximum templates to keep
max_template_date is critical for fair benchmarking. Set it to a date earlier than the target structure's release (e.g., the training cutoff date) so that the target itself, and any structure released after it, cannot be used as a template.
Filtering logic:
  1. Date filter: Remove templates released after max_template_date
  2. Self-hit filter: Remove if > 95% of query aligns (likely same protein)
  3. Coverage filter: Require > 10% alignment coverage
  4. Quality sort: Rank by sequence identity and resolution
  5. Top-K selection: Keep best max_hits templates (default: 20)
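The five steps can be sketched as a single filter-sort-truncate function. This follows the description above literally and is not the library's implementation: the Hit fields, the use of one aligned_fraction value for both the self-hit and coverage checks, and the example data are assumptions for illustration.

```python
import dataclasses
import datetime


@dataclasses.dataclass(frozen=True)
class Hit:
    """Hypothetical template hit used only in this sketch."""
    pdb_id: str
    release_date: datetime.date
    aligned_fraction: float   # Fraction of query residues aligned.
    sequence_identity: float  # In [0, 1], higher is better.
    resolution: float         # In angstroms, lower is better.


def filter_hits(
    hits: list[Hit],
    max_template_date: datetime.date,
    max_subsequence_ratio: float = 0.95,
    min_align_ratio: float = 0.1,
    max_hits: int = 20,
) -> list[Hit]:
    kept = [
        h for h in hits
        if h.release_date <= max_template_date          # 1. Date filter.
        and h.aligned_fraction <= max_subsequence_ratio  # 2. Self-hit filter.
        and h.aligned_fraction > min_align_ratio         # 3. Coverage filter.
    ]
    # 4. Quality sort: highest identity first, then best resolution.
    kept.sort(key=lambda h: (-h.sequence_identity, h.resolution))
    return kept[:max_hits]  # 5. Top-K selection.


d = datetime.date
example_hits = [
    Hit("1abc", d(2020, 1, 1), 0.80, 0.60, 2.0),
    Hit("2def", d(2023, 1, 1), 0.80, 0.90, 1.5),  # Released too late.
    Hit("3ghi", d(2019, 6, 1), 0.99, 1.00, 1.2),  # Likely the same protein.
    Hit("4jkl", d(2018, 3, 1), 0.05, 0.40, 2.5),  # Too little coverage.
    Hit("5mno", d(2017, 3, 1), 0.50, 0.60, 1.8),
]
selected = filter_hits(example_hits, max_template_date=d(2021, 9, 30))
selected_ids = [h.pdb_id for h in selected]  # ["5mno", "1abc"]
```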

Template Processing

Selected templates are processed into model inputs:
# From src/alphafold3/data/templates.py
class Templates:
    query_sequence: str
    hits: list[TemplateHit]
    max_template_date: datetime.date
    structure_store: structure_stores.StructureStore
Each template provides:
  • Aligned coordinates: 3D positions for aligned residues
  • Sequence alignment: Mapping from query to template
  • Metadata: Resolution, release date, confidence scores
  • Template features: Distance maps, angles, masks

Structure Store

Templates are retrieved from a local PDB mirror:
# From src/alphafold3/data/structure_stores.py
class StructureStore:
    """Provides access to PDB structures."""
    
    def __init__(self, database_path: str):
        self.database_path = database_path
The structure store:
  • Loads mmCIF files from local PDB mirror
  • Caches parsed structures
  • Handles missing or malformed entries
  • Extracts relevant chains

Template-Free Mode

Templates can be disabled for template-free prediction:
{
  "protein": {
    "id": "A",
    "sequence": "ACDEFGHIKLMNPQRSTVWY",
    "templates": []
  }
}
Or by setting --run_template_search=false in the pipeline.

Ligand Processing

Chemical Component Dictionary (CCD)

Ligands specified by CCD codes are processed directly:
# From src/alphafold3/constants/chemical_components.py
def get_ccd_component(ccd_code: str) -> ChemicalComponent:
    """Retrieves chemical component from CCD."""
CCD provides:
  • Atom names and elements
  • Bond connectivity and orders
  • Ideal coordinates
  • Chemical properties

SMILES Processing

Ligands specified by SMILES are processed via RDKit:
# From src/alphafold3/data/tools/rdkit_utils.py
def smiles_to_mol(smiles: str) -> Chem.Mol:
    """Convert SMILES to RDKit Mol object."""
Processing steps:
  1. Parse SMILES: Convert SMILES string to RDKit Mol object
  2. Generate 3D Conformer: Use RDKit ETKDG algorithm to generate reference geometry
  3. Extract Features: Extract atoms, bonds, charges, and coordinates
  4. Create Ligand Features: Convert to model-compatible feature format
RDKit conformer generation parameters:
# Configurable via flags
--conformer_max_iterations=<N>  # Max iterations for conformer search
If conformer generation fails:
  1. Try increasing --conformer_max_iterations
  2. Use user-provided CCD with ideal coordinates
  3. Model will output NaN confidences for ligand if no coordinates available
Common failure cases:
  • Very flexible molecules
  • Unusual ring systems
  • Macrocycles

User-Provided CCD

Custom ligands can be defined in mmCIF format:
{
  "userCCD": "data_MY-LIG\n_chem_comp.id MY-LIG\n...",
  "sequences": [
    {
      "ligand": {
        "id": "L",
        "ccdCodes": ["MY-LIG"]
      }
    }
  ]
}
This enables:
  • Custom ligands not in standard CCD
  • Covalent bond specifications
  • Reference coordinates for difficult conformers

Feature Merging

After all searches complete, features are merged:
# From src/alphafold3/model/merging_features.py
def merge_chain_features(
    protein_msas: dict[str, Msa],
    rna_msas: dict[str, Msa],
    templates: dict[str, Templates],
    ligand_features: dict[str, LigandFeatures],
) -> MergedFeatures:
Merging process:
  1. Concatenate chains: Combine all chains into single tensors
  2. Align MSAs: Ensure MSA depths match across chains
  3. Pad templates: Standardize template dimensions
  4. Create masks: Track valid positions and features
  5. Compute relative encodings: Chain boundaries, positions
  6. Generate atom layout: Map tokens to atoms

MSA Depth Balancing

MSAs are subsampled/padded to consistent depth:
# Configuration
max_msa_depth = 1024  # Maximum MSA sequences

# If MSA exceeds max_depth: cluster and subsample
# If MSA below max_depth: pad with gaps
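A minimal sketch of bringing an MSA to a fixed depth is shown below. It is a simplification: rows beyond the limit are simply truncated rather than clustered and subsampled, and the function name is hypothetical.

```python
def balance_msa_depth(rows: list[str], max_depth: int) -> list[str]:
    """Truncates or gap-pads aligned rows to exactly max_depth rows."""
    if not rows:
        raise ValueError("MSA must contain at least the query row")
    if len(rows) >= max_depth:
        # Simplification: keep the first rows instead of clustering.
        return rows[:max_depth]
    gap_row = "-" * len(rows[0])  # All-gap row matching the query length.
    return rows + [gap_row] * (max_depth - len(rows))


padded = balance_msa_depth(["ACD", "AC-"], max_depth=4)
# padded == ["ACD", "AC-", "---", "---"]
truncated = balance_msa_depth(["ACD", "AC-", "A-D"], max_depth=2)
# truncated == ["ACD", "AC-"]
```

A mask marking which rows are real would normally accompany the padded rows so that padding does not influence the model.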

Output Format

The merged features are serialized to *_data.json:
{
  "name": "my_protein",
  "sequences": [...],
  "msa_features": {
    "msa": [[...]],  // [num_chains, num_msa, num_tokens]
    "deletion_matrix": [...],
    "descriptions": [...]
  },
  "template_features": {
    "template_aatype": [...],
    "template_all_atom_positions": [...],
    // ... more template features
  },
  "token_features": {...},
  "atom_features": {...}
}
This file contains everything needed for inference, enabling data/inference separation.

Performance Optimization

Parallelization

The pipeline parallelizes across:
  • Multiple databases: UniRef90, MGnify, Small BFD, and UniProt searches run in parallel
  • Multiple chains: Independent searches run concurrently
  • Tool invocations: Jackhmmer instances run in separate threads

Caching

# From src/alphafold3/data/pipeline.py:29
@functools.cache
def _get_protein_templates(...):
    """Searches for templates for a single protein chain."""
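Caching prevents redundant searches for:
  • Homomeric complexes (same sequence repeated)
  • Identical chains in different complexes processed in the same run
Because @functools.cache keeps results in memory only, the cache does not persist across separate invocations of the pipeline. Repeated runs with the same input redo the searches unless precomputed MSAs and templates are supplied.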

Database Sharding

Large databases can be sharded for faster access:
# Sharded database support
--small_bfd_database_path='${DB_DIR}/bfd_shard_*.fasta'
--small_bfd_z_value=123456789  # Total database size
Sharding enables:
  • Faster I/O on parallel filesystems
  • Distributed search across multiple nodes
  • Reduced memory footprint per shard
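The reason a total database size is passed alongside sharded paths can be shown with a small calculation. In HMMER, a per-sequence E-value is the P-value multiplied by the number of target sequences Z, so a shard searched alone would otherwise use only its own size. The shard sizes, P-value, and threshold below are made-up numbers for illustration.

```python
def e_value(p_value: float, z_value: int) -> float:
    """E-value as the expected number of chance hits among Z targets."""
    return p_value * z_value


shard_sizes = [1_000_000, 1_000_000, 500_000]  # Hypothetical shard sizes.
total_size = sum(shard_sizes)
threshold = 1e-4
p_value = 5e-11  # The same hit, scored identically in either setup.

# Using only the shard's own size makes the hit look more significant.
e_with_shard_z = e_value(p_value, shard_sizes[0])
# Using the total size reproduces the unsharded E-value.
e_with_total_z = e_value(p_value, total_size)

accepted_with_shard_z = e_with_shard_z <= threshold
accepted_with_total_z = e_with_total_z <= threshold
```

With these numbers the hit passes the threshold when Z is the shard size but fails when Z is the full database size, which is why the Z value should reflect the whole database to keep results consistent with an unsharded search.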

Error Handling

If MSA search fails (e.g., database unavailable):
  • Pipeline continues with empty MSA (just query sequence)
  • Prediction quality will be reduced
  • Check database paths and permissions
# Graceful degradation
if msa_search_failed:
    logging.warning("MSA search failed, using query only")
    msa = Msa(query_sequence, chain_poly_type, [], [])
If template search fails:
  • Pipeline continues template-free
  • Uses only MSA information
  • Still produces valid predictions
# Template-free fallback
if template_search_failed:
    templates = Templates(query_sequence, hits=[], ...)
If RDKit conformer generation fails:
  • Use ideal coordinates from CCD if available
  • Use reference coordinates if template date allows
  • Output NaN confidences for ligand
  • Coordinates set to (0,0,0) as last resort
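The fallback order for ligand reference coordinates can be sketched as a simple priority list. This is an illustration of the bullets above, not the pipeline's code: the function, its parameter names, and the source labels are hypothetical, and reference_allowed stands in for the date check mentioned above.

```python
from typing import Optional

Coords = list[tuple[float, float, float]]


def pick_ligand_coords(
    num_atoms: int,
    rdkit_conformer: Optional[Coords] = None,
    ccd_ideal: Optional[Coords] = None,
    ccd_reference: Optional[Coords] = None,
    reference_allowed: bool = False,
) -> tuple[str, Coords]:
    """Returns (source_label, coordinates) from the first usable source."""
    candidates = [
        ("rdkit_conformer", rdkit_conformer, True),
        ("ccd_ideal", ccd_ideal, True),
        ("ccd_reference", ccd_reference, reference_allowed),
    ]
    for source, coords, allowed in candidates:
        if coords is not None and allowed:
            return source, list(coords)
    # Last resort: every atom at the origin.
    return "zeros", [(0.0, 0.0, 0.0)] * num_atoms


ideal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
source, coords = pick_ligand_coords(num_atoms=2, ccd_ideal=ideal)
# source == "ccd_ideal"
```

Returning the source label alongside the coordinates makes it easy to log which fallback was used for each ligand.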

Custom MSAs and Templates

Users can provide custom MSAs and templates to skip searches:

Custom MSA

{
  "protein": {
    "id": "A",
    "sequence": "ACDEFG",
    "unpairedMsa": ">query\nACDEFG\n>hit1\nACDEFG\n>hit2\nACDXFG",
    "pairedMsa": "",
    "templates": null
  }
}
Requirements:
  • A3M format (FASTA with gaps and lowercase insertions)
  • First sequence must match query exactly
  • All sequences same length after removing insertions
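These requirements can be checked before submitting a job. The sketch below is a standalone validator written for this page; parse_a3m and check_custom_msa are hypothetical helpers, not part of the AlphaFold 3 API.

```python
import string

# Deletes lowercase letters, which mark insertions in A3M.
_REMOVE_INSERTIONS = str.maketrans("", "", string.ascii_lowercase)


def parse_a3m(a3m: str) -> list[tuple[str, str]]:
    """Parses A3M/FASTA text into (description, sequence) pairs."""
    entries: list[tuple[str, list[str]]] = []
    for line in a3m.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            entries.append((line[1:], []))
        elif entries:
            entries[-1][1].append(line)  # Sequences may span several lines.
    return [(desc, "".join(parts)) for desc, parts in entries]


def check_custom_msa(query: str, a3m: str) -> list[str]:
    """Returns a list of problems; an empty list means all checks passed."""
    rows = parse_a3m(a3m)
    if not rows:
        return ["MSA is empty"]
    problems = []
    if rows[0][1] != query:
        problems.append("first sequence does not match the query")
    for desc, seq in rows:
        aligned = seq.translate(_REMOVE_INSERTIONS)
        if len(aligned) != len(query):
            problems.append(
                f"{desc}: aligned length {len(aligned)} != {len(query)}"
            )
    return problems


msa_text = ">query\nACDEFG\n>hit1\nACDEFG\n>hit2\nACDXFG"
issues = check_custom_msa("ACDEFG", msa_text)  # [] for this example
```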

Custom Templates

{
  "protein": {
    "id": "A",
    "sequence": "ACDEFG",
    "templates": [
      {
        "mmcif": "data_template\n...",
        "queryIndices": [0, 1, 2, 3, 4, 5],
        "templateIndices": [10, 11, 12, 13, 14, 15]
      }
    ]
  }
}
Requirements:
  • Single-chain mmCIF file
  • Query/template index mapping (0-based)
  • Template indices account for unresolved residues
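The index mapping can be sanity-checked with a few lines of Python. This is a sketch under assumptions: template_len is taken to be the number of residues in the template chain including unresolved ones, and the function name is hypothetical.

```python
def check_template_mapping(
    query_len: int,
    template_len: int,
    query_indices: list[int],
    template_indices: list[int],
) -> None:
    """Raises ValueError if the 0-based index mapping is inconsistent."""
    if len(query_indices) != len(template_indices):
        raise ValueError("queryIndices and templateIndices differ in length")
    if len(set(query_indices)) != len(query_indices):
        raise ValueError("a query residue is mapped more than once")
    for q, t in zip(query_indices, template_indices):
        if not 0 <= q < query_len:
            raise ValueError(f"query index {q} is out of range")
        if not 0 <= t < template_len:
            raise ValueError(f"template index {t} is out of range")


# The mapping from the example above, assuming a 20-residue template chain.
check_template_mapping(
    query_len=6,
    template_len=20,
    query_indices=[0, 1, 2, 3, 4, 5],
    template_indices=[10, 11, 12, 13, 14, 15],
)
```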
Custom MSAs/templates enable:
  • Reproducible benchmarking
  • Incorporating proprietary data
  • Testing specific hypotheses
  • Avoiding expensive database searches

Diagnostic Outputs

The pipeline logs detailed information:
logging.info('Getting protein MSAs for sequence %s', sequence)
logging.info('Getting %d protein templates took %.2f seconds', 
             num_templates, elapsed_time)
Key metrics to monitor:
  • MSA depth per chain
  • Number of templates found
  • Search time per database
  • Cache hits for homomers

Next Steps

Inference Pipeline

Learn how features are processed through the neural network

Model Architecture

Understand the network components in detail
