MoleculeEntry
Represents a single molecule with its properties, score, and fingerprint.
Constructor
Parameters
SMILES string representing the molecular structure. Will be canonicalized automatically using RDKit.
Oracle score for the molecule. Typically set after evaluation.
Additional properties to store with the molecule. Accessible via the add_props attribute.
Attributes
Canonicalized SMILES string.
Oracle score for the molecule.
RDKit molecule object (only if SMILES is non-empty).
Morgan fingerprint (radius=2, 2048 bits) for similarity calculations.
List of similar molecules used in prompt construction.
Additional properties passed via kwargs.
Methods
__eq__(other)
Compares molecules based on canonical SMILES.
__lt__(other)
Compares molecules by score (or SMILES if scores are equal). Used for sorting.
__hash__()
Returns hash of the SMILES string. Allows use in sets and as dictionary keys.
__str__() / __repr__()
Returns human-readable representation.
Example Usage
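A minimal sketch of the equality, ordering, and hashing behaviour described above. SketchMolecule is a hypothetical stand-in, not the real MoleculeEntry: it skips RDKit canonicalization and fingerprinting, and its constructor signature is an assumption.

```python
class SketchMolecule:
    """Hypothetical stand-in for MoleculeEntry; no RDKit canonicalization."""

    def __init__(self, smiles, score=0.0):
        self.smiles = smiles  # assumed to be canonical already
        self.score = score

    def __eq__(self, other):
        # Identity is decided by the SMILES string alone; score is ignored.
        return self.smiles == other.smiles

    def __lt__(self, other):
        # Order by score; fall back to SMILES when scores tie.
        if self.score == other.score:
            return self.smiles < other.smiles
        return self.score < other.score

    def __hash__(self):
        # Consistent with __eq__, so instances work in sets and as dict keys.
        return hash(self.smiles)

    def __repr__(self):
        return f"SketchMolecule(smiles={self.smiles!r}, score={self.score})"


ethanol_a = SketchMolecule("CCO", score=0.4)
ethanol_b = SketchMolecule("CCO", score=0.9)
benzene = SketchMolecule("c1ccccc1", score=0.7)

unique = {ethanol_a, ethanol_b, benzene}  # the two ethanol entries collapse into one
ranked = sorted([ethanol_a, benzene], reverse=True)  # best score first
```

Because equality ignores the score, two entries for the same structure count as one in a set even when they were scored differently.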
OptimEntry
Represents an optimization entry containing multiple molecules with a target molecule to generate.
Constructor
Parameters
The target molecule to generate. Can be None initially and set later.
Context molecules used to construct the optimization prompt.
Attributes
Target molecule to generate or that was generated.
Context molecules for the prompt.
Status indicating whether this entry is assigned to training, validation, or neither:
- EntryStatus.none (0): Not yet assigned
- EntryStatus.train (1): Training set
- EntryStatus.valid (2): Validation set
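The three status values can be pictured as a small enumeration. The sketch below reconstructs it from the names and integer values listed above; that it derives from enum.Enum is an assumption.

```python
from enum import Enum


class EntryStatus(Enum):
    # Names and values as listed above; the Enum base class is assumed.
    none = 0   # not yet assigned
    train = 1  # training set
    valid = 2  # validation set


statuses = [EntryStatus.train, EntryStatus.valid, EntryStatus.none, EntryStatus.train]
n_train = sum(status is EntryStatus.train for status in statuses)
```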
Methods
to_prompt(is_generation, include_oracle_score, config, max_score)
Generates a formatted prompt for model training or generation.
Parameters:
is_generation: If True, creates a prompt for generation (ends with [START_SMILES]). If False, creates a training prompt with complete SMILES.
include_oracle_score: Whether to include oracle scores in the prompt (required for rejection sampling).
config: Configuration dictionary with strategy, eos_token, and sim_range keys.
max_score: Maximum score achieved so far, used to sample desired scores during generation.
Returns: str - Formatted prompt string
Example:
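A rough illustration of the is_generation switch only, not the real to_prompt. The function sketch_prompt, the [END_SMILES] tag, the "</s>" end-of-sequence token, and the layout of the context molecules are all assumptions; the only detail taken from the description above is that a generation prompt ends with [START_SMILES] while a training prompt contains the complete SMILES.

```python
def sketch_prompt(context_smiles, target_smiles, is_generation, eos_token="</s>"):
    """Hypothetical illustration of the is_generation switch."""
    # Assumed layout: each context molecule sits in its own tag pair.
    context = "".join(f"[START_SMILES]{s}[END_SMILES]" for s in context_smiles)
    if is_generation:
        # Leave the prompt open so the model continues with a new SMILES.
        return context + "[START_SMILES]"
    # Training: the target is written out in full and the sequence is closed.
    return context + f"[START_SMILES]{target_smiles}[END_SMILES]" + eos_token


generation_prompt = sketch_prompt(["CCO", "CCC"], None, is_generation=True)
training_prompt = sketch_prompt(["CCO", "CCC"], "CCN", is_generation=False)
```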
contains_entry(mol_entry)
Checks if a molecule already exists in the entry (including context and similar molecules).
Parameters:
mol_entry (MoleculeEntry): Molecule to check
Returns: bool - True if the molecule is already present
Pool
Manages a pool of high-scoring optimization entries with diversity filtering and train/validation splitting.
Constructor
Parameters
Maximum number of entries to maintain in the pool.
Fraction of pool entries reserved for validation (0.0 to 1.0).
Attributes
Maximum pool capacity.
Current optimization entries in the pool, sorted by score (descending).
Number of entries reserved for validation: int(size * validation_perc + 1).
Methods
add(entries, diversity_score=1.0)
Adds new entries to the pool with diversity filtering.
Parameters:
New optimization entries to add to the pool.
Maximum Tanimoto similarity threshold. Molecules more similar than this are considered duplicates and filtered out.
- Merges new entries with existing pool
- Sorts by score (descending)
- Removes duplicates and overly similar molecules
- Keeps the top size entries
- Assigns train/validation status to maintain the validation_perc ratio
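The merge, sort, filter, and truncate steps can be sketched in plain Python. This is an illustration under stated assumptions, not the Pool implementation: entries are reduced to hypothetical (score, smiles, fingerprint) tuples, fingerprints are sets of on-bit indices instead of RDKit bit vectors, an entry is dropped when its similarity to an already kept entry is strictly greater than diversity_score, and status assignment is left out. The num_validation helper only evaluates the validation-size formula given for the pool.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def add_to_pool(pool, new_entries, size, diversity_score=1.0):
    """Hypothetical sketch; entries are (score, smiles, fingerprint_set) tuples."""
    # Merge existing and new entries, then rank best-first.
    merged = sorted(pool + new_entries, key=lambda e: e[0], reverse=True)
    kept = []
    for score, smiles, fp in merged:
        duplicate = any(smiles == k[1] for k in kept)
        too_similar = any(tanimoto(fp, k[2]) > diversity_score for k in kept)
        if not duplicate and not too_similar:
            kept.append((score, smiles, fp))
    # Keep only the top `size` survivors.
    return kept[:size]


def num_validation(size, validation_perc):
    # Formula quoted in the attribute description.
    return int(size * validation_perc + 1)


pool = [(0.9, "CCO", {1, 2, 3, 4}), (0.5, "CCC", {7, 8})]
new = [
    (0.8, "CCCO", {1, 2, 3, 5}),     # similarity 0.6 to CCO, above the 0.5 threshold
    (0.7, "CCO", {1, 2, 3, 4}),      # exact duplicate of a pool entry
    (0.6, "c1ccccc1", {10, 11}),     # dissimilar to everything kept
]
result = add_to_pool(pool, new, size=3, diversity_score=0.5)
kept_smiles = [smiles for _, smiles, _ in result]
n_valid = num_validation(100, 0.2)
```

With the default diversity_score of 1.0 no similarity can exceed the threshold in this sketch, so only exact duplicates are removed.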
get_train_valid_entries()
Returns separate lists of training and validation entries.
Returns: Tuple[List[OptimEntry], List[OptimEntry]] - (train_entries, valid_entries)
random_subset(subset_size)
Samples random entries from the pool without replacement.
Parameters:
subset_size (int): Number of entries to sample
Returns: List[OptimEntry] - Random sample (up to pool size)
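Sampling without replacement "up to pool size" could look like the following. The function name and the use of random.sample are assumptions; the point is the cap on the requested size.

```python
import random


def random_subset_sketch(entries, subset_size):
    """Hypothetical sketch of sampling without replacement, capped at the pool size."""
    # random.sample raises ValueError if asked for more items than exist.
    k = min(subset_size, len(entries))
    return random.sample(entries, k)


entries = ["entry_a", "entry_b", "entry_c", "entry_d", "entry_e"]
small = random_subset_sketch(entries, 3)
capped = random_subset_sketch(entries, 10)  # request exceeds the pool size
```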
__len__()
Returns current number of entries in the pool.
Example Usage
Helper Functions
canonicalize(smiles)
Converts a SMILES string to its canonical form using RDKit.
get_morgan_fingerprint(mol)
Generates a Morgan fingerprint for a molecule.
Parameters:
mol (rdkit.Chem.Mol): RDKit molecule object
Returns: rdkit.DataStructs.ExplicitBitVect - Morgan fingerprint (radius=2, 2048 bits)
get_maccs_fingerprint(mol)
Generates a MACCS fingerprint for a molecule.
Parameters:
mol (rdkit.Chem.Mol): RDKit molecule object
Returns: rdkit.DataStructs.ExplicitBitVect - MACCS keys fingerprint
tanimoto_dist_func(fing1, fing2, fingerprint="morgan")
Calculates Tanimoto similarity between two fingerprints.
Parameters:
fing1: First fingerprint
fing2: Second fingerprint
fingerprint (str): Fingerprint type (currently only "morgan" is used)
Returns: float - Tanimoto similarity (0.0 to 1.0)
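For reference, the Tanimoto similarity of two bit vectors is the number of bits set in both divided by the number of bits set in either. The sketch below computes it on bit vectors packed into Python integers; it is an illustration of the formula, not the RDKit-based helper, and the function name is hypothetical.

```python
def tanimoto_bits(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two bit vectors packed into Python ints."""
    shared = bin(fp_a & fp_b).count("1")  # bits set in both
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one
    return shared / total if total else 0.0


a = 0b101101
b = 0b100111
# a & b = 0b100101 (3 bits), a | b = 0b101111 (5 bits)
similarity = tanimoto_bits(a, b)  # 3 / 5
```

Identical fingerprints give 1.0 and fingerprints with no shared bits give 0.0, which matches the documented range.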
set_seed(seed_value)
Sets random seeds for reproducibility across Python's random module, NumPy, and PyTorch.
Parameters:
seed_value (int): Random seed value
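A minimal sketch of seeding the three libraries named above. The function name is hypothetical, and NumPy and PyTorch are treated as optional here so the snippet also runs where they are not installed; the real helper may seed additional state.

```python
import random


def set_seed_sketch(seed_value):
    """Hypothetical sketch: seed random, plus NumPy and PyTorch when available."""
    random.seed(seed_value)
    try:
        import numpy as np
        np.random.seed(seed_value)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed_value)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed_sketch(42)
first = random.uniform(0.0, 1.0)
set_seed_sketch(42)
second = random.uniform(0.0, 1.0)  # same draw as `first` after reseeding
```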
generate_random_number(lower, upper)
Generates a random float between lower and upper bounds.
Parameters:
lower (float): Lower bound (inclusive)
upper (float): Upper bound (inclusive)
Returns: float - Random number in [lower, upper]
create_prompt_with_similars(mol_entry, sim_range=None)
Creates a prompt string with similar molecules for generation.
Parameters:
mol_entry (MoleculeEntry): Molecule entry with similar molecules
sim_range (list[float], optional): [min_sim, max_sim] range for similarity values
Returns: str - Formatted prompt with SIMILAR tags
Source: mol_opt/utils.py:112