Overview
The AlphaFold 3 data pipeline transforms raw input sequences into rich feature representations by searching genetic and structural databases. This CPU-intensive stage can be run independently from the GPU inference stage.
The data pipeline is controlled by --run_data_pipeline=true and produces a *_data.json file containing all features needed for inference.
Pipeline Architecture
The data pipeline consists of three parallel search processes.
Entry Point
The data pipeline is orchestrated by src/alphafold3/data/pipeline.py.
The pipeline uses @functools.cache to avoid re-running searches for identical sequences, so each distinct sequence in a homomer is searched only once. This significantly improves performance for symmetric complexes.
Protein MSA Generation
Overview
Protein MSA generation searches multiple genetic databases to find homologous sequences.
Parallel Execution
MSA tools run in parallel using a thread pool.
Jackhmmer Tool
Jackhmmer performs iterative sequence search. Its main parameters are:
- e_value: Statistical significance threshold (default: 0.0001)
- iterations: Number of search iterations (default: 1)
- z_value: Database size for E-value calculation
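As a rough illustration of how these parameters could map onto a Jackhmmer invocation, the sketch below assembles a command line. The `JackhmmerConfig` class, the `build_jackhmmer_command` helper and the `n_cpu` default are hypothetical names chosen for this example; only the `jackhmmer` options themselves (`-E`, `-N`, `-Z`, `-A`, `-o`, `--noali`, `--cpu`) come from HMMER. It is not the pipeline's actual wrapper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JackhmmerConfig:
    """Hypothetical container for the parameters listed above."""
    database_path: str
    e_value: float = 0.0001        # -E: report hits at or below this E-value
    iterations: int = 1            # -N: number of search iterations
    z_value: Optional[int] = None  # -Z: database size used for E-values
    n_cpu: int = 8                 # --cpu: worker threads (assumed default)


def build_jackhmmer_command(query_fasta: str, output_sto: str,
                            config: JackhmmerConfig) -> List[str]:
    """Assemble an argv list for the HMMER `jackhmmer` binary."""
    cmd = [
        "jackhmmer",
        "-o", "/dev/null",   # discard the human-readable report
        "-A", output_sto,    # save the final alignment in Stockholm format
        "--noali",           # leave alignments out of the main output
        "-E", str(config.e_value),
        "-N", str(config.iterations),
        "--cpu", str(config.n_cpu),
    ]
    if config.z_value is not None:
        cmd += ["-Z", str(config.z_value)]
    # Positional arguments: query file first, then the sequence database.
    cmd += [query_fasta, config.database_path]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where HMMER is installed.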
Database Configuration
UniRef90 (uniref90_2022_05):
- Clustered at 90% sequence identity
- Fast initial search
- Good balance of speed and coverage
MGnify (mgy_clusters_2022_05):
- Metagenomic sequences
- Diverse environmental samples
- Complements UniRef with novel sequences
BFD (bfd-first_non_consensus_sequences.fasta):
- Large, diverse sequence database
- Can be sharded for faster searching
- Deepest evolutionary coverage
MSA Processing
Raw MSA results are processed into a structured format:
- Deduplication: Remove identical sequences (ignoring insertions)
- Validation: Ensure first sequence matches query
- Format conversion: Convert to A3M format (gaps as -, insertions as lowercase)
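The deduplication and validation steps above can be sketched in a few lines. This is a minimal illustration that assumes the MSA is already in A3M text form; `parse_a3m` and `deduplicate_a3m` are hypothetical helper names, not functions from the AlphaFold 3 code base.

```python
import string
from typing import List, Tuple

# Translation table that deletes lowercase letters (A3M insertions).
_DROP_LOWERCASE = str.maketrans("", "", string.ascii_lowercase)


def parse_a3m(text: str) -> List[Tuple[str, str]]:
    """Parse A3M text into (description, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def deduplicate_a3m(text: str, query: str) -> List[Tuple[str, str]]:
    """Validate the first row, then drop rows that are identical
    once insertions (lowercase letters) are ignored."""
    records = parse_a3m(text)
    if not records or records[0][1] != query:
        raise ValueError("First MSA sequence must match the query.")
    seen = set()
    unique = []
    for description, sequence in records:
        key = sequence.translate(_DROP_LOWERCASE)
        if key not in seen:
            seen.add(key)
            unique.append((description, sequence))
    return unique
```

Keeping the first occurrence of each row preserves the search tool's ranking, since earlier hits are normally the stronger ones.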
MSA Pairing for Multimers
When predicting complexes, MSA pairing ensures that sequences from the same organism are aligned row by row across chains:
- Match sequences by organism ID in headers
- Preserve relative positioning of paired sequences
- Insert gaps for unpaired sequences
Pairing uses UniProt organism IDs extracted from sequence headers. Properly paired MSAs significantly improve multimer prediction quality by preserving co-evolutionary signals.
unpairedMsa field to maintain exact control over pairing.
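To make the pairing idea concrete, here is a simplified sketch. It assumes each chain's hits have already been reduced to `(organism_id, aligned_sequence)` tuples, that every chain has at least one row, and that an organism is paired when it appears in two or more chains. The function name `pair_msas` and these rules are illustrative assumptions, not the pipeline's actual algorithm.

```python
from typing import Dict, List, Tuple

Msa = List[Tuple[str, str]]  # (organism_id, aligned_sequence) per row


def pair_msas(chain_msas: List[Msa]) -> List[List[str]]:
    """Return one list of rows per chain, all with the same depth.

    Row k of every chain comes from the same organism. Chains that
    lack a hit for that organism receive an all-gap row.
    """
    # Keep the first (best-ranked) hit per organism in each chain.
    by_organism: List[Dict[str, str]] = []
    for msa in chain_msas:
        first_hit: Dict[str, str] = {}
        for organism, sequence in msa:
            first_hit.setdefault(organism, sequence)
        by_organism.append(first_hit)

    # Organisms in first-seen order, so output order is deterministic.
    organisms: Dict[str, None] = {}
    for msa in chain_msas:
        for organism, _ in msa:
            organisms.setdefault(organism, None)

    paired: List[List[str]] = [[] for _ in chain_msas]
    for organism in organisms:
        present_in = sum(organism in hits for hits in by_organism)
        if present_in < 2:
            continue  # nothing to pair with; belongs in the unpaired MSA
        for i, hits in enumerate(by_organism):
            gap_row = "-" * len(chain_msas[i][0][1])
            paired[i].append(hits.get(organism, gap_row))
    return paired
```

Because every chain ends up with the same number of rows, the per-chain rows can later be concatenated side by side without losing the organism correspondence.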
RNA/DNA MSA Generation
Nhmmer Search
RNA and DNA use Nhmmer instead of Jackhmmer. The databases searched are:
- Rfam: RNA families database
- RNAcentral: Comprehensive RNA sequence database
- NT: Nucleotide database for DNA
Processing
RNA/DNA MSA processing follows steps similar to those for proteins:
- Search genetic databases
- Parse results in STOCKHOLM/A3M format
- Deduplicate sequences
- Validate against query sequence
RNA/DNA MSAs are generally shallower than protein MSAs due to fewer available sequences in databases. This is expected and does not necessarily indicate poor prediction quality.
Template Search
Overview
Template search finds structurally similar proteins in the PDB to provide spatial priors.
Hmmsearch Tool
Template search uses Hmmsearch.
Template Filtering
Templates are filtered based on several criteria:
- Date filter: Remove templates released after max_template_date
- Self-hit filter: Remove if > 95% of query aligns (likely same protein)
- Coverage filter: Require > 10% alignment coverage
- Quality sort: Rank by sequence identity and resolution
- Top-K selection: Keep best max_hits templates (default: 20)
max_template_date is critical for fair benchmarking. Set it to a date early enough that the target could not have been in the training set (e.g., the training cutoff date).
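The filtering criteria can be expressed compactly in code. The sketch below is an assumption-laden illustration: the `TemplateHit` fields and the `filter_templates` function are hypothetical, and the thresholds simply mirror the numbers quoted in the list above.

```python
import datetime
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TemplateHit:
    """Hypothetical record for one template candidate."""
    pdb_id: str
    release_date: datetime.date
    aligned_fraction: float   # fraction of query residues aligned, 0..1
    sequence_identity: float  # 0..1, higher is better
    resolution: float         # in angstroms, lower is better


def filter_templates(hits: List[TemplateHit],
                     max_template_date: datetime.date,
                     max_hits: int = 20,
                     self_hit_threshold: float = 0.95,
                     min_coverage: float = 0.10) -> List[TemplateHit]:
    """Apply date, self-hit and coverage filters, then rank and truncate."""
    kept = [
        h for h in hits
        if h.release_date <= max_template_date
        and min_coverage < h.aligned_fraction <= self_hit_threshold
    ]
    # Highest identity first; ties broken by better (lower) resolution.
    kept.sort(key=lambda h: (-h.sequence_identity, h.resolution))
    return kept[:max_hits]
```

Applying the cheap filters before sorting keeps the ranking step limited to candidates that could actually be used.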
Template Processing
Selected templates are processed into model inputs:
- Aligned coordinates: 3D positions for aligned residues
- Sequence alignment: Mapping from query to template
- Metadata: Resolution, release date, confidence scores
- Template features: Distance maps, angles, masks
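As one example of a template feature, a distance map with its validity mask can be derived from aligned coordinates. This is a pure-Python sketch for illustration only (the real pipeline works on arrays); `distance_map` is a hypothetical helper, and `None` is used here to mark query positions with no aligned template residue.

```python
import math
from typing import List, Optional, Tuple

Coord = Tuple[float, float, float]


def distance_map(
    coords: List[Optional[Coord]],
) -> Tuple[List[List[float]], List[List[bool]]]:
    """Return pairwise distances and a mask of valid pairs.

    A pair is valid only when both positions have coordinates.
    Invalid entries keep a distance of 0.0 and a mask of False.
    """
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    mask = [[False] * n for _ in range(n)]
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if a is not None and b is not None:
                dist[i][j] = math.dist(a, b)
                mask[i][j] = True
    return dist, mask
```

The mask matters as much as the distances: without it, a zero for a missing residue would be indistinguishable from two atoms at the same position.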
Structure Store
Templates are retrieved from a local PDB mirror. The structure store:
- Loads mmCIF files from local PDB mirror
- Caches parsed structures
- Handles missing or malformed entries
- Extracts relevant chains
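A minimal sketch of such a store is shown below, covering loading, caching and missing-entry handling (parsing and chain extraction are left out). The `StructureStore` class shown here, its `get_mmcif` method and the `<pdb_id>.cif` file naming are assumptions made for the example, not the layout the pipeline requires.

```python
import functools
import pathlib
from typing import Optional


class StructureStore:
    """Illustrative store that serves raw mmCIF text from a directory."""

    def __init__(self, mirror_dir: str, cache_size: int = 128):
        self._dir = pathlib.Path(mirror_dir)
        # Per-instance LRU cache so repeated requests skip the disk.
        self._load = functools.lru_cache(maxsize=cache_size)(self._load_uncached)

    def _load_uncached(self, pdb_id: str) -> Optional[str]:
        path = self._dir / f"{pdb_id}.cif"  # assumed naming scheme
        try:
            return path.read_text()
        except (FileNotFoundError, UnicodeDecodeError):
            # Missing or unreadable entry: report "no structure".
            return None

    def get_mmcif(self, pdb_id: str) -> Optional[str]:
        """Return the mmCIF text for a PDB ID, or None if unavailable."""
        return self._load(pdb_id.lower())
```

Note that a `None` result is cached too, so a missing entry is looked up on disk only once per store instance.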
Template-Free Mode
Templates can be disabled for template-free prediction: set --run_template_search=false in the pipeline.
Ligand Processing
Chemical Component Dictionary (CCD)
Ligands specified by CCD codes are processed directly. The CCD entry provides:
- Atom names and elements
- Bond connectivity and orders
- Ideal coordinates
- Chemical properties
SMILES Processing
Ligands specified by SMILES are processed via RDKit.
Conformer Generation
Reference conformers are generated with RDKit.
If conformer generation fails:
- Try increasing --conformer_max_iterations
- Use user-provided CCD with ideal coordinates
- Model will output NaN confidences for ligand if no coordinates available
Conformer generation can be difficult for:
- Very flexible molecules
- Unusual ring systems
- Macrocycles
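The fallback order described above (retry with more iterations, then ideal coordinates, then no coordinates) can be sketched without depending on RDKit. In the example below, `embed` is a hypothetical callable standing in for whatever conformer generator is used, and the iteration schedule is an arbitrary assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Coords = List[Tuple[float, float, float]]
# Hypothetical signature: (smiles, max_iterations) -> coordinates or None.
Embedder = Callable[[str, int], Optional[Coords]]


def resolve_ligand_coords(
    smiles: str,
    embed: Embedder,
    ideal_coords: Optional[Coords] = None,
    iteration_schedule: Sequence[int] = (100, 1000, 10000),
) -> Tuple[Optional[Coords], str]:
    """Try embedding with growing iteration budgets, then fall back."""
    for max_iterations in iteration_schedule:
        coords = embed(smiles, max_iterations)
        if coords is not None:
            return coords, f"embedded (max_iterations={max_iterations})"
    if ideal_coords is not None:
        # For example, ideal coordinates from a user-provided CCD entry.
        return ideal_coords, "ideal coordinates"
    # The caller should expect NaN confidences for this ligand.
    return None, "no coordinates"
```

Returning the source of the coordinates alongside them makes it easy to log which fallback was taken for each ligand.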
User-Provided CCD
Custom ligands can be defined in mmCIF format. This is useful for:
- Custom ligands not in standard CCD
- Covalent bond specifications
- Reference coordinates for difficult conformers
Feature Merging
After all searches complete, features are merged:
- Concatenate chains: Combine all chains into single tensors
- Align MSAs: Ensure MSA depths match across chains
- Pad templates: Standardize template dimensions
- Create masks: Track valid positions and features
- Compute relative encodings: Chain boundaries, positions
- Generate atom layout: Map tokens to atoms
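A toy version of the first steps (concatenating chains, matching MSA depths, building a mask and a chain index) might look like the following. It operates on plain strings and lists rather than tensors, assumes every chain has at least one MSA row, and uses the hypothetical name `merge_chain_msas`.

```python
from typing import Dict, List, Union

Merged = Dict[str, Union[List[str], List[List[int]], List[int]]]


def merge_chain_msas(chain_msas: List[List[str]], pad_char: str = "-") -> Merged:
    """Concatenate per-chain MSAs into one padded MSA with a mask.

    Chains with fewer rows than the deepest chain are padded with
    all-gap rows, and the mask records which positions are real.
    """
    depth = max(len(msa) for msa in chain_msas)
    rows = [""] * depth
    mask: List[List[int]] = [[] for _ in range(depth)]
    chain_index: List[int] = []
    for chain_id, msa in enumerate(chain_msas):
        length = len(msa[0])  # the first row is the query sequence
        chain_index.extend([chain_id] * length)
        for r in range(depth):
            if r < len(msa):
                rows[r] += msa[r]
                mask[r].extend([1] * length)
            else:
                rows[r] += pad_char * length
                mask[r].extend([0] * length)
    return {"msa": rows, "msa_mask": mask, "chain_index": chain_index}
```

The `chain_index` list is the simplest form of a chain-boundary encoding: two positions belong to the same chain exactly when their entries are equal.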
MSA Depth Balancing
MSAs are subsampled/padded to consistent depth.
Output Format
The merged features are serialized to *_data.json.
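To inspect such a file, a short script can report the MSA depth and template count per entity. The layout assumed here (a top-level `sequences` list whose entries are keyed by entity type, with `id`, `unpairedMsa` and `templates` fields) is an assumption about the JSON structure and should be checked against an actual output file. `summarize_features` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def summarize_features(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarize MSA depth and template count for each entity."""
    summary = []
    for entry in data.get("sequences", []):
        # Each entry is assumed to look like {"protein": {...}}.
        for entity_type, entity in entry.items():
            msa_text = entity.get("unpairedMsa") or ""
            depth = sum(
                1 for line in msa_text.splitlines() if line.startswith(">")
            )
            summary.append({
                "id": entity.get("id"),
                "type": entity_type,
                "msa_depth": depth,
                "num_templates": len(entity.get("templates") or []),
            })
    return summary


def summarize_file(path: str) -> List[Dict[str, Any]]:
    """Load a *_data.json file from disk and summarize it."""
    with open(path) as handle:
        return summarize_features(json.load(handle))
```

Counting header lines is enough for a depth estimate because each A3M record starts with exactly one `>` line.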
Performance Optimization
Parallelization
The pipeline parallelizes across:
- Multiple databases: UniRef90, MGnify, BFD run in parallel
- Multiple chains: Independent searches run concurrently
- Tool invocations: Jackhmmer instances run in separate threads
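The thread-pool pattern behind this can be sketched as follows. The `run_searches_in_parallel` function and the shape of the `searchers` mapping are illustrative assumptions. Threads are a reasonable fit because each search mostly waits on an external process rather than running Python code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_searches_in_parallel(
    sequence: str,
    searchers: Dict[str, Callable[[str], str]],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run one search per database concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the searches overlap in time.
        futures = {
            name: pool.submit(search, sequence)
            for name, search in searchers.items()
        }
        # result() blocks until that search finishes and re-raises errors.
        return {name: future.result() for name, future in futures.items()}
```

In a real setup each callable would wrap a Jackhmmer or Nhmmer invocation for one database.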
Caching
Caching avoids repeated searches for:
- Homomeric complexes (same sequence repeated)
- Multiple runs with same input
- Identical chains in different complexes
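The effect of memoizing a search function on a homomer is easy to demonstrate with `functools.cache` (Python 3.9+). In this sketch, `get_msa` is a stand-in for an expensive database search, and the counter exists only to show how often it really runs.

```python
import functools

search_count = 0  # how many times the "expensive" search actually ran


@functools.cache
def get_msa(sequence: str) -> str:
    """Stand-in for an expensive MSA search, memoized by sequence."""
    global search_count
    search_count += 1
    return f">query\n{sequence}\n"


# Chains A and B are identical, so only two searches are needed.
chains = {"A": "MKV", "B": "MKV", "C": "GSH"}
msas = {chain_id: get_msa(sequence) for chain_id, sequence in chains.items()}
```

`functools.cache` keeps results in process memory, so on its own it does not carry over between separate program runs.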
Database Sharding
Large databases can be sharded for faster access. Benefits:
- Faster I/O on parallel filesystems
- Distributed search across multiple nodes
- Reduced memory footprint per shard
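When shards are searched separately, the per-shard hit lists have to be merged afterwards. The sketch below assumes each shard returns hits already sorted by E-value; `merge_shard_hits` is a hypothetical helper. Because E-values depend on database size, the per-shard searches should all use the full database size (the z_value parameter) for the merged ranking to be meaningful.

```python
import heapq
import itertools
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (e_value, sequence_id)


def merge_shard_hits(shard_hits: Iterable[List[Hit]], max_hits: int) -> List[Hit]:
    """Merge per-shard hit lists (each sorted by E-value) into one ranking."""
    # heapq.merge lazily interleaves sorted inputs, so only max_hits
    # items are ever materialized.
    merged = heapq.merge(*shard_hits)
    return list(itertools.islice(merged, max_hits))
```

A lazy merge avoids concatenating and re-sorting every hit when only the top few are kept.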
Error Handling
MSA Search Failures
If MSA search fails (e.g., database unavailable):
- Pipeline continues with empty MSA (just query sequence)
- Prediction quality will be reduced
- Check database paths and permissions
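The "continue with a query-only MSA" behaviour can be illustrated with a small wrapper. The `msa_or_query_only` function and the choice of exception types are assumptions made for this sketch; the actual pipeline may detect and report failures differently.

```python
import logging
from typing import Callable


def msa_or_query_only(sequence: str, search: Callable[[str], str]) -> str:
    """Run a search; on failure fall back to an MSA holding only the query."""
    try:
        return search(sequence)
    except (OSError, RuntimeError) as error:
        # OSError covers missing files and permission problems.
        logging.warning(
            "MSA search failed (%s); continuing with query-only MSA.", error
        )
        return f">query\n{sequence}\n"
```

Logging the failure matters here: a silently empty MSA would look like a valid but low-quality prediction.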
Template Search Failures
If template search fails:
- Pipeline continues template-free
- Uses only MSA information
- Still produces valid predictions
Ligand Processing Failures
If RDKit conformer generation fails:
- Use ideal coordinates from CCD if available
- Use reference coordinates if template date allows
- Output NaN confidences for ligand
- Coordinates set to (0,0,0) as last resort
Custom MSAs and Templates
Users can provide custom MSAs and templates to skip searches.
Custom MSA
- A3M format (FASTA with gaps and lowercase insertions)
- First sequence must match query exactly
- All sequences same length after removing insertions
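These requirements are simple to check before handing a custom MSA to the pipeline. The validator below is a hypothetical pre-flight check written for this example. It takes the MSA rows as a list of sequence strings (headers already removed).

```python
from typing import List


def validate_custom_a3m(sequences: List[str], query: str) -> None:
    """Raise ValueError if the rows break the custom MSA requirements."""
    if not sequences or sequences[0] != query:
        raise ValueError("First sequence must match the query exactly.")
    for row, sequence in enumerate(sequences):
        # Lowercase letters are insertions and do not occupy a column.
        aligned = "".join(c for c in sequence if not c.islower())
        if len(aligned) != len(query):
            raise ValueError(
                f"Row {row} has {len(aligned)} aligned columns, "
                f"expected {len(query)}."
            )
```

For example, `validate_custom_a3m(["ACDE", "A-DE", "AkC-E"], "ACDE")` passes, because the lowercase `k` is an insertion and is not counted.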
Custom Templates
- Single-chain mmCIF file
- Query/template index mapping (0-based)
- Template indices account for unresolved residues
Custom MSAs/templates enable:
- Reproducible benchmarking
- Incorporating proprietary data
- Testing specific hypotheses
- Avoiding expensive database searches
Diagnostic Outputs
The pipeline logs detailed information:
- MSA depth per chain
- Number of templates found
- Search time per database
- Cache hits for homomers
Next Steps
Inference Pipeline
Learn how features are processed through the neural network
Model Architecture
Understand the network components in detail