Overview
The featurisation module transforms validated folding inputs into feature batches that the AlphaFold 3 model can process. It converts protein, RNA, DNA, and ligand inputs into the numerical representations required for structure prediction.
Functions
validate_fold_input
Validates that a fold input contains all required MSA and template data for featurisation.
The folding input to validate. Must contain MSA and template data for all protein and RNA chains.
ValueError: If any protein chain is missing unpaired MSA, paired MSA, or templates.
ValueError: If any RNA chain is missing unpaired MSA.
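The checks described above can be sketched as follows. This is only an illustration under stated assumptions: `ProteinChain`, `RnaChain`, and `validate_chains` are hypothetical stand-ins, not the module's real types, and the sketch assumes that "missing" means a field left as `None` (so an empty MSA or an empty template list still counts as provided).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ProteinChain:  # hypothetical stand-in for a protein chain record
    id: str
    unpaired_msa: Optional[str] = None
    paired_msa: Optional[str] = None
    templates: Optional[list] = None


@dataclass
class RnaChain:  # hypothetical stand-in for an RNA chain record
    id: str
    unpaired_msa: Optional[str] = None


def validate_chains(
    protein_chains: Sequence[ProteinChain],
    rna_chains: Sequence[RnaChain],
) -> None:
    """Raises ValueError if any required MSA or template field is None."""
    for chain in protein_chains:
        if chain.unpaired_msa is None:
            raise ValueError(f"Protein chain {chain.id} is missing unpaired MSA.")
        if chain.paired_msa is None:
            raise ValueError(f"Protein chain {chain.id} is missing paired MSA.")
        if chain.templates is None:
            raise ValueError(f"Protein chain {chain.id} is missing templates.")
    for chain in rna_chains:
        if chain.unpaired_msa is None:
            raise ValueError(f"RNA chain {chain.id} is missing unpaired MSA.")
```

Failing early in this way keeps incomplete inputs from reaching the more expensive featurisation step.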
featurise_input
Featurises a folding input and returns model-ready feature batches.
The input to featurise. Must be validated before calling this function.
The Chemical Component Dictionary containing ligand and modified residue definitions.
Bucket sizes for padding data to avoid model recompilation. If None, an appropriate bucket size is calculated from the token count. If provided, must be a strictly increasing sequence of at least one integer. An error is raised if the number of tokens exceeds the largest bucket.
Maximum date for using CCD model coordinates as a fallback when RDKit conformer generation fails and ideal coordinates are unavailable. Only applies to components released before this date.
Override for maximum RDKit conformer search iterations.
Whether to deduplicate unpaired MSA against paired MSA. The default matches the AlphaFold 3 paper methodology. Set to False when providing custom paired MSA via the unpaired MSA field to preserve exact sequences.
Whether to print progress messages during featurisation.
A featurised batch for each RNG seed in the input. Each batch contains all features required for model inference.
Pipeline Integration
The featurisation module uses WholePdbPipeline internally to process inputs.
Performance Considerations
Bucketing Strategy
Proper bucket configuration is critical for performance:
- With buckets: Data is padded to the nearest bucket size, enabling model compilation reuse
- Without buckets: The model recompiles for each unique input size, which is significantly slower
- Bucket selection: Use exponentially spaced buckets covering expected input sizes
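The bucketing strategy above can be sketched in a few lines. This is a minimal illustration, not the module's actual implementation: `make_buckets` and `pick_bucket` are hypothetical helper names, the doubling factor is an assumption, and the sketch only covers the case where buckets are provided.

```python
import bisect
from typing import Sequence


def make_buckets(smallest: int, largest: int, factor: float = 2.0) -> list:
    """Returns strictly increasing, exponentially spaced bucket sizes."""
    buckets = [smallest]
    while buckets[-1] < largest:
        # Always grow by at least one so the sequence stays strictly increasing.
        candidate = max(buckets[-1] + 1, int(buckets[-1] * factor))
        buckets.append(min(candidate, largest))
    return buckets


def pick_bucket(num_tokens: int, buckets: Sequence[int]) -> int:
    """Returns the smallest bucket that can hold num_tokens."""
    if not buckets or any(a >= b for a, b in zip(buckets, buckets[1:])):
        raise ValueError("Buckets must be a non-empty, strictly increasing sequence.")
    idx = bisect.bisect_left(buckets, num_tokens)
    if idx == len(buckets):
        raise ValueError(
            f"{num_tokens} tokens exceed the largest bucket ({buckets[-1]})."
        )
    return buckets[idx]
```

With buckets of 256, 512, and 1024, an input of 300 tokens is padded to 512, so every input between 257 and 512 tokens shares one compiled model shape.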
MSA Overlap Resolution
When resolve_msa_overlaps=True (default):
- Unpaired MSA sequences are deduplicated against paired MSA
- Reduces redundancy and computational cost
- Matches published AlphaFold 3 methodology
When resolve_msa_overlaps=False:
- All MSA sequences are kept exactly as provided
- Use when providing custom paired alignments
- Required when manual MSA pairing must be preserved
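The two behaviours can be illustrated with a small sketch. It assumes that deduplication means dropping any unpaired sequence that exactly matches a paired sequence; the real matching rule may differ. `resolve_overlaps` is a hypothetical helper written for this example, not part of the module.

```python
from typing import List, Sequence


def resolve_overlaps(
    unpaired: Sequence[str],
    paired: Sequence[str],
    resolve_msa_overlaps: bool = True,
) -> List[str]:
    """Returns the unpaired MSA rows, optionally deduplicated against paired rows."""
    if not resolve_msa_overlaps:
        # Keep every sequence exactly as provided, in the original order.
        return list(unpaired)
    paired_set = set(paired)
    # Assumption: exact string equality decides whether a row is a duplicate.
    return [seq for seq in unpaired if seq not in paired_set]


unpaired_rows = ["MKVL", "MKAL", "MRVL"]
paired_rows = ["MKAL"]
deduplicated = resolve_overlaps(unpaired_rows, paired_rows)
preserved = resolve_overlaps(unpaired_rows, paired_rows, resolve_msa_overlaps=False)
```

Here `deduplicated` drops the row shared with the paired MSA, while `preserved` keeps all three rows, which is the behaviour needed when the row order encodes a manual pairing.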
Related Modules
- MSA Processing - Multiple sequence alignment handling
- Template Processing - Structural template features