Overview
The features module handles the data-side processing of input features for AlphaFold 3. It provides dataclasses and functions for converting raw input data into model-ready tensors, including MSA processing, template features, token features, and atom layout management.
Core Types
BatchDict
Type alias for feature dictionaries passed to the model.
xnp_ndarray is a union type for NumPy or JAX arrays.
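As an illustration of how a feature dictionary of this kind can relate to a feature dataclass, the sketch below round-trips a small container through a flat name-to-array dictionary. Everything here is an assumption made for the example: `TinyFeatures` and its two fields are hypothetical, nested lists stand in for NumPy or JAX arrays, and the dictionary keys simply reuse the field names, which the real module may not do. Only the method names `as_data_dict` and `from_data_dict` are borrowed from this page.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TinyFeatures:
    """Hypothetical feature container; lists stand in for arrays."""
    rows: list
    mask: list

    def as_data_dict(self):
        """Flatten the fields into a name -> array dictionary."""
        return {f.name: getattr(self, f.name) for f in fields(self)}

    @classmethod
    def from_data_dict(cls, batch):
        """Rebuild the container from the entries it owns in the dictionary."""
        return cls(**{f.name: batch[f.name] for f in fields(cls)})


features = TinyFeatures(rows=[[3, 1, 4]], mask=[[1, 1, 1]])
batch = features.as_data_dict()
print(sorted(batch))  # ['mask', 'rows']
print(TinyFeatures.from_data_dict(batch) == features)  # True
```

Keeping the dictionary flat means several feature groups can be merged into one batch and each dataclass can later pick out only the keys it owns.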
PaddingShapes
Defines padding dimensions for batched model inputs.
Maximum number of tokens (residues + ligand atoms) in the sequence.
Maximum number of MSA rows to include.
Maximum number of chains in the complex.
Maximum number of structural templates.
Maximum number of atoms per token.
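To make the role of these padding dimensions concrete, the sketch below pads a small MSA to a fixed shape and builds the matching mask. It is a minimal illustration, not the module's implementation: the `PaddingSketch` class, the `pad_rows` helper, and the use of nested Python lists in place of NumPy or JAX arrays are all assumptions made for this example, and only two of the five dimensions are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaddingSketch:
    """Hypothetical stand-in holding two of the padding dimensions."""
    num_tokens: int  # maximum number of tokens
    msa_size: int    # maximum number of MSA rows


def pad_rows(rows, shapes, pad_value=0):
    """Pad or truncate a 2D list to (msa_size, num_tokens); also return a mask."""
    padded, mask = [], []
    for r in range(shapes.msa_size):
        # Rows beyond the input are treated as empty and become all padding.
        row = list(rows[r]) if r < len(rows) else []
        row = row[: shapes.num_tokens]
        n_pad = shapes.num_tokens - len(row)
        padded.append(row + [pad_value] * n_pad)
        mask.append([1] * len(row) + [0] * n_pad)
    return padded, mask


shapes = PaddingSketch(num_tokens=4, msa_size=3)
padded, mask = pad_rows([[5, 7, 2], [5, 1, 2]], shapes)
print(padded)  # [[5, 7, 2, 0], [5, 1, 2, 0], [0, 0, 0, 0]]
print(mask)    # [[1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
```

Padding every input to the same shape lets a compiled model reuse one trace across inputs of different sizes, while the mask records which entries are real.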
MSA Features
MSA Dataclass
Contains multiple sequence alignment features.
MSA sequences encoded as integers. Shape: (msa_size, num_tokens)
Binary mask for valid MSA positions. Shape: (msa_size, num_tokens)
Number of deletions at each MSA position. Shape: (msa_size, num_tokens)
Occurrence of each residue type along the sequence, averaged over MSA rows. Shape: (num_tokens, num_residue_types)
Occurrence of deletions along the sequence, averaged over MSA rows. Shape: (num_tokens,)
Total number of MSA alignments (scalar).
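The two per-position averages described for this dataclass can be sketched in a few lines. The example assumes a hypothetical alphabet of three residue types, nested lists instead of NumPy or JAX arrays, and a plain unweighted mean over all rows; the function names `msa_profile` and `deletion_mean` are invented here, and the real module may handle masking or weighting differently.

```python
def msa_profile(rows, num_residue_types):
    """Fraction of MSA rows showing each residue type at each position.

    rows has shape (msa_size, num_tokens); the result has shape
    (num_tokens, num_residue_types).
    """
    msa_size, num_tokens = len(rows), len(rows[0])
    counts = [[0] * num_residue_types for _ in range(num_tokens)]
    for row in rows:
        for pos, residue in enumerate(row):
            counts[pos][residue] += 1
    return [[c / msa_size for c in pos_counts] for pos_counts in counts]


def deletion_mean(deletion_matrix):
    """Mean deletion count per position, averaged over MSA rows."""
    msa_size = len(deletion_matrix)
    return [sum(column) / msa_size for column in zip(*deletion_matrix)]


rows = [[0, 1, 2], [0, 1, 1], [0, 2, 2], [1, 1, 2]]
deletions = [[0, 0, 0], [0, 2, 0], [0, 0, 0], [4, 2, 0]]
print(msa_profile(rows, 3))
# [[0.75, 0.25, 0.0], [0.0, 0.75, 0.25], [0.0, 0.25, 0.75]]
print(deletion_mean(deletions))  # [1.0, 1.0, 0.0]
```

Each row of the profile sums to 1, so it can be read as a per-position distribution over residue types.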
compute_features
Computes MSA features from folding input.
Atom layout containing one representative atom per token.
Token indices for non-flattened standard residues.
Padding dimensions for the output tensors.
Input data containing MSAs for each chain.
Name for logging (typically mmCIF ID).
Maximum number of paired sequences per species.
Whether to deduplicate overlapping sequences in paired MSA.
Methods
index_msa_rows
Subsample MSA rows by indices.
from_data_dict
Create MSA from batch dictionary.
as_data_dict
Convert MSA to batch dictionary.
Template Features
Templates Dataclass
Contains structural template features.
Amino acid type encoded as integers. Shape: (num_templates, num_tokens)
3D coordinates of template atoms. Shape: (num_templates, num_tokens, 24, 3)
Binary mask for valid template atoms. Shape: (num_templates, num_tokens, 24)
compute_features
Computes template features from protein chain templates.
Atom layout with representative atom per token.
Indices for standard (non-flattened) tokens.
Padding dimensions.
Input containing template structures.
Maximum number of templates to use.
Name for logging.
Token Features
TokenFeatures Dataclass
Per-token features including chain identifiers and token types.
Residue index from input structure. Shape: (num_tokens,)
Sequential token index (1-indexed). Shape: (num_tokens,)
Encoded residue/ligand type. Shape: (num_tokens,)
Binary mask for valid tokens. Shape: (num_tokens,)
Total sequence length (scalar).
Asymmetric unit ID for each chain. For A3B2 stoichiometry: 1, 2, 3, 4, 5. Shape: (num_tokens,)
Entity ID grouping identical sequences. For A3B2: 1, 1, 1, 2, 2. Shape: (num_tokens,)
Symmetry ID within entity. For A3B2: 1, 2, 3, 1, 2. Shape: (num_tokens,)
Boolean mask for protein tokens. Shape: (num_tokens,)
Boolean mask for RNA tokens. Shape: (num_tokens,)
Boolean mask for DNA tokens. Shape: (num_tokens,)
Boolean mask for ligand tokens. Shape: (num_tokens,)
Boolean mask for non-standard polymer chains. Shape: (num_tokens,)
Boolean mask for water molecules. Shape: (num_tokens,)
Tokenization
tokenizer
Maps flat atom layout to tokens for the Evoformer.
Flat atom layout containing all atoms to predict.
Chemical components dictionary.
Number of atom slots per token.
Whether to use one token per atom for non-standard residues.
Name for logging (typically mmCIF ID).
- Standard protein residues: 1 token per residue (CA representative atom)
- Standard nucleic residues: 1 token per residue (C1’ representative atom)
- Non-standard polymer residues: 1 token per atom if flatten_non_standard_residues=True
- Ligands: 1 token per atom
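The rules above can be sketched as a small counting function. This is an illustration under stated assumptions, not the tokenizer itself: the `Residue` record, its `kind` labels, and the `count_tokens` helper are hypothetical, the atom counts are illustrative heavy-atom counts, and a non-standard polymer residue is assumed to yield a single token when flattening is disabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residue:
    """Hypothetical record: residue name, category, and atom count."""
    name: str
    kind: str  # "standard_polymer", "non_standard_polymer", or "ligand"
    num_atoms: int


def count_tokens(residues, flatten_non_standard_residues=True):
    """Count tokens under the per-residue / per-atom rules listed above."""
    total = 0
    for res in residues:
        if res.kind == "standard_polymer":
            total += 1  # one token, located at the representative atom
        elif res.kind == "non_standard_polymer":
            # Assumption: one token per residue when flattening is off.
            total += res.num_atoms if flatten_non_standard_residues else 1
        elif res.kind == "ligand":
            total += res.num_atoms  # always one token per atom
        else:
            raise ValueError(f"unknown residue kind: {res.kind}")
    return total


residues = [
    Residue("ALA", "standard_polymer", 5),
    Residue("G", "standard_polymer", 23),
    Residue("MSE", "non_standard_polymer", 8),
    Residue("ATP", "ligand", 31),
]
print(count_tokens(residues))  # 41 (1 + 1 + 8 + 31)
print(count_tokens(residues, flatten_non_standard_residues=False))  # 34
```

The token count, not the residue count, is what has to fit within the maximum number of tokens set by the padding shapes, so ligands and flattened residues can dominate the budget.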
Additional Feature Classes
PredictedStructureInfo
Information for working with predicted structures.
PolymerLigandBondInfo
Information about polymer-ligand bonds.
LigandLigandBondInfo
Information about ligand-ligand bonds.
PseudoBetaInfo
Information for extracting pseudo-beta and equivalent atoms.
- Protein: CB (or CA for glycine)
- Nucleic acids (purines A/G/DA/DG): C4
- Nucleic acids (pyrimidines C/T/U/DC/DT): C2
- Ligands: First atom
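The selection rules above amount to a small lookup. The sketch below is a hedged illustration only: `pseudo_beta_atom_name`, its arguments, and the `is_ligand` flag are hypothetical names rather than the PseudoBetaInfo API, and any residue that is neither a ligand nor a listed nucleotide is assumed to be a protein residue.

```python
PURINES = frozenset({"A", "G", "DA", "DG"})
PYRIMIDINES = frozenset({"C", "T", "U", "DC", "DT"})


def pseudo_beta_atom_name(res_name, atom_names, is_ligand=False):
    """Return the atom name used as the pseudo-beta position for a token."""
    if is_ligand:
        return atom_names[0]  # ligands: first atom
    if res_name in PURINES:
        return "C4"
    if res_name in PYRIMIDINES:
        return "C2"
    # Assumption: everything else is a protein residue.
    return "CA" if res_name == "GLY" else "CB"


print(pseudo_beta_atom_name("ALA", ["N", "CA", "C", "O", "CB"]))  # CB
print(pseudo_beta_atom_name("GLY", ["N", "CA", "C", "O"]))        # CA
print(pseudo_beta_atom_name("DG", []))                            # C4
print(pseudo_beta_atom_name("U", []))                             # C2
print(pseudo_beta_atom_name("HEM", ["FE", "CHA"], is_ligand=True))  # FE
```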
Chains
Chain identification dataclass.
Usage Example
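A minimal, self-contained sketch of how the three chain identifiers described under TokenFeatures relate to one another for the A3B2 case. It does not call the features module: the `chain_ids` helper is hypothetical, it works per chain rather than per token, and it assumes that chains with identical sequences share an entity and that IDs are assigned in order of first appearance, starting at 1.

```python
def chain_ids(chain_sequences):
    """Derive asymmetric unit, entity, and symmetry IDs for a list of chains."""
    entity_of_sequence = {}  # sequence -> entity ID
    copies_seen = {}         # entity ID -> number of copies seen so far
    asym_id, entity_id, sym_id = [], [], []
    for index, sequence in enumerate(chain_sequences, start=1):
        # A new sequence gets the next unused entity ID.
        entity = entity_of_sequence.setdefault(
            sequence, len(entity_of_sequence) + 1
        )
        copies_seen[entity] = copies_seen.get(entity, 0) + 1
        asym_id.append(index)                # unique per chain
        entity_id.append(entity)             # shared by identical chains
        sym_id.append(copies_seen[entity])   # copy number within the entity
    return asym_id, entity_id, sym_id


# A3B2: three copies of one sequence followed by two copies of another.
asym, entity, sym = chain_ids(["MKV", "MKV", "MKV", "GSH", "GSH"])
print(asym)    # [1, 2, 3, 4, 5]
print(entity)  # [1, 1, 1, 2, 2]
print(sym)     # [1, 2, 3, 1, 2]
```

In the per-token features each of these values would be repeated for every token of the corresponding chain, giving arrays of shape (num_tokens,).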
Related
- Inference - Model inference and predictions
- Post-processing - Output file generation