Utility Classes

MoleculeEntry

Represents a single molecule with its properties, score, and fingerprint.

Constructor

class MoleculeEntry:
    def __init__(self, smiles, score=0, **kwargs): ...

Parameters

smiles (str, required): SMILES string representing the molecular structure. Will be canonicalized automatically using RDKit.

score (float, default: 0): Oracle score for the molecule. Typically set after evaluation.

**kwargs (dict, default: {}): Additional properties to store with the molecule. Accessible via add_props attribute.

Attributes

smiles (str): Canonicalized SMILES string.

score (float): Oracle score for the molecule.

mol (rdkit.Chem.Mol): RDKit molecule object (only if SMILES is non-empty).

fingerprint (rdkit.DataStructs.ExplicitBitVect): Morgan fingerprint (radius=2, 2048 bits) for similarity calculations.

similar_mol_entries (List[MoleculeEntry]): List of similar molecules used in prompt construction.

add_props (dict): Additional properties passed via kwargs.

Methods

`__eq__(other)`

Compares molecules based on canonical SMILES.

mol1 = MoleculeEntry("CCO")
mol2 = MoleculeEntry("OCC")  # Same molecule, different SMILES
print(mol1 == mol2)  # True (after canonicalization)

`__lt__(other)`

Compares molecules by score (or SMILES if scores are equal). Used for sorting.

mol1 = MoleculeEntry("CCO", score=0.8)
mol2 = MoleculeEntry("CC", score=0.9)
print(mol1 < mol2)  # True

`__hash__()`

Returns hash of the SMILES string. Allows use in sets and as dictionary keys.

mol_set = {MoleculeEntry("CCO"), MoleculeEntry("CC")}
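
The three comparison methods above interact: equality and hashing look only at the SMILES string, while ordering looks at the score first. The sketch below illustrates that contract with a hypothetical stand-in class, `SketchEntry`, which is not part of chemlactica and skips RDKit canonicalization entirely. It is an illustration of the documented behavior, not the library's implementation.

```python
class SketchEntry:
    """Hypothetical stand-in for MoleculeEntry (no RDKit, no canonicalization)."""

    def __init__(self, smiles, score=0.0):
        self.smiles = smiles
        self.score = score

    def __eq__(self, other):
        # Equality looks only at the SMILES string, never at the score.
        return self.smiles == other.smiles

    def __lt__(self, other):
        # Order by score; fall back to the SMILES string to break ties.
        if self.score == other.score:
            return self.smiles < other.smiles
        return self.score < other.score

    def __hash__(self):
        # The hash must agree with __eq__, so it also uses only the SMILES.
        return hash(self.smiles)


entries = [SketchEntry("CCO", 0.8), SketchEntry("CC", 0.9), SketchEntry("CCO", 0.1)]
ranked = sorted(entries, reverse=True)  # best score first
unique = set(entries)                   # the two "CCO" entries collapse into one

print([e.smiles for e in ranked])  # ['CC', 'CCO', 'CCO']
print(len(unique))                 # 2
```

Because the score plays no part in equality, two entries for the same molecule with different scores count as one element in a set or dictionary.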

`__str__()` / `__repr__()`

Returns human-readable representation.

mol = MoleculeEntry("CCO", score=0.85)
print(mol)  # "smiles: CCO, score: 0.85"

Example Usage

from chemlactica.mol_opt.utils import MoleculeEntry

# Create a molecule
mol = MoleculeEntry(
    smiles="CC(=O)Oc1ccccc1C(=O)O",  # Aspirin
    score=0.92
)

print(mol.smiles)       # Canonical SMILES
print(mol.score)        # 0.92
print(mol.fingerprint)  # Morgan fingerprint

# Add similar molecules
similar = MoleculeEntry("CC(=O)OC1=CC=CC=C1")
mol.similar_mol_entries = [similar]

# Additional properties
mol_with_props = MoleculeEntry(
    smiles="CCO",
    logP=0.5,
    mw=46.07
)
print(mol_with_props.add_props)  # {'logP': 0.5, 'mw': 46.07}

OptimEntry

Represents an optimization entry containing multiple molecules with a target molecule to generate.

Constructor

class OptimEntry:
    def __init__(self, last_entry, mol_entries): ...

Parameters

last_entry (MoleculeEntry, required): The target molecule to generate. Can be None initially and set later.

mol_entries (List[MoleculeEntry], required): Context molecules used to construct the optimization prompt.

Attributes

last_entry (MoleculeEntry): Target molecule, either still to be generated or already generated.

mol_entries (List[MoleculeEntry]): Context molecules for the prompt.

entry_status (EntryStatus): Status indicating whether this entry is assigned to training, to validation, or not yet assigned:

EntryStatus.none (0): Not yet assigned
EntryStatus.train (1): Training set
EntryStatus.valid (2): Validation set
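
The three status values listed above can be modeled as a small enumeration. The sketch below uses a hypothetical `SketchStatus` built on Python's `enum.Enum`; whether the real `EntryStatus` is an `Enum` or a plain class of integer constants is an assumption this page does not settle.

```python
from enum import Enum


class SketchStatus(Enum):
    """Hypothetical mirror of the documented EntryStatus values."""

    none = 0   # not yet assigned
    train = 1  # training set
    valid = 2  # validation set


# Look a status up by its integer value and compare by identity.
status = SketchStatus(1)
print(status is SketchStatus.train)  # True
print(status.name, status.value)     # train 1
```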

Methods

`to_prompt(is_generation, include_oracle_score, config, max_score)`

Generates a formatted prompt for model training or generation.

Parameters:

is_generation (bool, required): If True, creates a prompt for generation (ends with [START_SMILES]). If False, creates a training prompt with complete SMILES.

include_oracle_score (bool, required): Whether to include oracle scores in the prompt (required for rejection sampling).

config (dict, required): Configuration dictionary with strategy, eos_token, and sim_range keys.

max_score (float, required): Maximum score achieved so far, used to sample desired scores during generation.

Returns: str - Formatted prompt string

Example:

entry = OptimEntry(
    last_entry=MoleculeEntry("CCO", score=0.8),
    mol_entries=[MoleculeEntry("CC", score=0.7)]
)

config = {
    "strategy": ["rej-sample-v2"],
    "eos_token": "</s>",
    "sim_range": [0.3, 0.7]
}

# Generation prompt
gen_prompt = entry.to_prompt(
    is_generation=True,
    include_oracle_score=True,
    config=config,
    max_score=0.85
)

# Training prompt
train_prompt = entry.to_prompt(
    is_generation=False,
    include_oracle_score=True,
    config=config,
    max_score=0.85
)

`contains_entry(mol_entry)`

Checks if a molecule already exists in the entry (including context and similar molecules).

Parameters:

mol_entry (MoleculeEntry): Molecule to check

Returns: bool - True if molecule is already present

entry = OptimEntry(None, [MoleculeEntry("CCO")])
new_mol = MoleculeEntry("CCO")
print(entry.contains_entry(new_mol))  # True
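
One way to read "including context and similar molecules" is as a two-level search: each molecule held by the entry is checked, and so is each of its similar molecules. The sketch below shows that reading with a hypothetical `Mol` dataclass and a hypothetical `contains_smiles` function. Both are illustrative names, and the actual traversal order inside chemlactica is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Mol:
    """Hypothetical minimal molecule: a SMILES string plus similar molecules."""

    smiles: str
    similar: List["Mol"] = field(default_factory=list)


def contains_smiles(last_entry: Optional[Mol], mol_entries: List[Mol], query: Mol) -> bool:
    """Return True if `query` matches any held molecule or one of its similars."""
    candidates = list(mol_entries)
    if last_entry is not None:
        candidates.append(last_entry)
    for mol in candidates:
        if mol.smiles == query.smiles:
            return True
        if any(s.smiles == query.smiles for s in mol.similar):
            return True
    return False


context = Mol("CCO", similar=[Mol("CCCO")])
print(contains_smiles(None, [context], Mol("CCO")))   # True: direct match
print(contains_smiles(None, [context], Mol("CCCO")))  # True: match among similars
print(contains_smiles(None, [context], Mol("CC")))    # False
```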

Pool

Manages a pool of high-scoring optimization entries with diversity filtering and train/validation splitting.

Constructor

class Pool:
    def __init__(self, size, validation_perc): ...

Parameters

size (int, required): Maximum number of entries to maintain in the pool.

validation_perc (float, required): Fraction of pool entries reserved for validation (0.0 to 1.0).

Attributes

size (int): Maximum pool capacity.

optim_entries (List[OptimEntry]): Current optimization entries in the pool, sorted by score (descending).

num_validation_entries (int): Number of entries reserved for validation: int(size * validation_perc + 1).
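
The formula for num_validation_entries is easiest to check with a few numbers. The sketch below simply evaluates the documented expression in a standalone helper function (a hypothetical wrapper, not a chemlactica function).

```python
def num_validation_entries(size: int, validation_perc: float) -> int:
    """Evaluate the documented expression int(size * validation_perc + 1)."""
    return int(size * validation_perc + 1)


print(num_validation_entries(100, 0.1))  # 11
print(num_validation_entries(50, 0.2))   # 11
print(num_validation_entries(10, 0.0))   # 1
```

Because of the `+ 1`, the expression reserves at least one validation slot even when validation_perc is 0.0, and one more than the plain product whenever the product is a whole number.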

Methods

`add(entries, diversity_score=1.0)`

Adds new entries to the pool with diversity filtering.

Parameters:

entries (List[OptimEntry], required): New optimization entries to add to the pool.

diversity_score (float, default: 1.0): Maximum Tanimoto similarity threshold. Molecules more similar than this are considered duplicates and filtered out.

Behavior:

Merges new entries with existing pool
Sorts by score (descending)
Removes duplicates and overly similar molecules
Keeps top size entries
Assigns train/validation status to maintain validation_perc ratio

pool = Pool(size=100, validation_perc=0.1)

# Add entries
entries = [
    OptimEntry(MoleculeEntry("CCO", score=0.9), []),
    OptimEntry(MoleculeEntry("CC", score=0.8), [])
]
pool.add(entries)

print(len(pool))  # 2
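
The merge, sort, filter, and truncate steps listed under Behavior can be sketched without RDKit. In the sketch below, entries are plain `(smiles, score, fingerprint)` tuples and fingerprints are made-up sets of bit indices. The names `tanimoto` and `add_to_pool` are hypothetical, and the final train/validation assignment step is left out. It shows the filtering idea under those assumptions, not the library's code.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def add_to_pool(pool, new_entries, size, diversity_score=1.0):
    """Merge, sort by score (descending), drop duplicates and near-duplicates, truncate."""
    merged = sorted(pool + new_entries, key=lambda e: e[1], reverse=True)
    kept = []
    for smiles, score, fp in merged:
        duplicate = any(smiles == k[0] for k in kept)
        too_similar = any(tanimoto(fp, k[2]) > diversity_score for k in kept)
        if not duplicate and not too_similar:
            kept.append((smiles, score, fp))
    return kept[:size]


pool = [("CCO", 0.9, frozenset({1, 2, 3}))]
new = [
    ("CCO", 0.9, frozenset({1, 2, 3})),      # exact duplicate
    ("CCCO", 0.8, frozenset({1, 2, 3, 4})),  # similarity 0.75 to CCO
    ("c1ccccc1", 0.7, frozenset({7, 8})),    # similarity 0.0 to the others
]

strict = add_to_pool(pool, new, size=10, diversity_score=0.5)
loose = add_to_pool(pool, new, size=10)  # default threshold 1.0

print([e[0] for e in strict])  # ['CCO', 'c1ccccc1']
print([e[0] for e in loose])   # ['CCO', 'CCCO', 'c1ccccc1']
```

With the default threshold of 1.0 no similarity can exceed the limit, so only exact duplicates are removed. Lowering diversity_score makes the filter progressively stricter.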

`get_train_valid_entries()`

Returns separate lists of training and validation entries.

Returns: Tuple[List[OptimEntry], List[OptimEntry]] - (train_entries, valid_entries)

train, valid = pool.get_train_valid_entries()
print(f"Training: {len(train)}, Validation: {len(valid)}")

`random_subset(subset_size)`

Samples random entries from the pool without replacement.

Parameters:

subset_size (int): Number of entries to sample

Returns: List[OptimEntry] - Random sample (up to pool size)

pool = Pool(size=100, validation_perc=0.1)
# ... add entries ...

# Sample 10 random entries for prompting
subset = pool.random_subset(10)
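
Sampling without replacement, capped at the pool size, maps naturally onto `random.sample` from the standard library. The sketch below is a standalone function over a plain list. That the real method is implemented this way is an assumption.

```python
import random


def random_subset(entries, subset_size):
    """Sample without replacement, never asking for more items than exist."""
    k = min(subset_size, len(entries))
    return random.sample(entries, k)


entries = ["e1", "e2", "e3"]
print(len(random_subset(entries, 2)))   # 2
print(len(random_subset(entries, 10)))  # 3 (capped at the number of entries)
```

Without the cap, `random.sample` raises `ValueError` when asked for more items than the list holds.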

`__len__()`

Returns current number of entries in the pool.

pool = Pool(size=100, validation_perc=0.1)
print(len(pool))  # 0

Example Usage

from chemlactica.mol_opt.utils import Pool, OptimEntry, MoleculeEntry

# Create pool
pool = Pool(size=50, validation_perc=0.2)  # 20% validation

# Generate and add molecules
for i in range(100):
    mol = MoleculeEntry("C" * (i + 1), score=i / 100)
    entry = OptimEntry(mol, [])
    pool.add([entry])

print(f"Pool size: {len(pool)}")  # 50 (max size)

# Get train/valid split
train, valid = pool.get_train_valid_entries()
print(f"Train: {len(train)}, Valid: {len(valid)}")  # ~40, ~10

# Sample for generation
subset = pool.random_subset(5)
for entry in subset:
    print(entry.last_entry)

Helper Functions

`canonicalize(smiles)`

Converts a SMILES string to its canonical form using RDKit.

from chemlactica.mol_opt.utils import canonicalize

smiles = "OCC"
canonical = canonicalize(smiles)
print(canonical)  # "CCO"

`get_morgan_fingerprint(mol)`

Generates a Morgan fingerprint for a molecule.

Parameters:

mol (rdkit.Chem.Mol): RDKit molecule object

Returns: rdkit.DataStructs.ExplicitBitVect - Morgan fingerprint (radius=2, 2048 bits)

from chemlactica.mol_opt.utils import get_morgan_fingerprint
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")
fp = get_morgan_fingerprint(mol)

`get_maccs_fingerprint(mol)`

Generates a MACCS fingerprint for a molecule.

Parameters:

mol (rdkit.Chem.Mol): RDKit molecule object

Returns: rdkit.DataStructs.ExplicitBitVect - MACCS keys fingerprint

from chemlactica.mol_opt.utils import get_maccs_fingerprint
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")
fp = get_maccs_fingerprint(mol)

`tanimoto_dist_func(fing1, fing2, fingerprint="morgan")`

Calculates Tanimoto similarity between two fingerprints. Despite the "dist" in its name, the documented return value is a similarity, not a distance.

Parameters:

fing1: First fingerprint
fing2: Second fingerprint
fingerprint (str): Fingerprint type (currently only "morgan" is used)

Returns: float - Tanimoto similarity (0.0 to 1.0)

from chemlactica.mol_opt.utils import MoleculeEntry, tanimoto_dist_func

mol1 = MoleculeEntry("CCO")
mol2 = MoleculeEntry("CCCO")
similarity = tanimoto_dist_func(mol1.fingerprint, mol2.fingerprint)
print(f"Similarity: {similarity:.3f}")
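
For reference, the standard definition of Tanimoto similarity on bit-vector fingerprints is given below. Here $A$ and $B$ are the sets of on bits in the two fingerprints, $a = |A|$, $b = |B|$, and $c = |A \cap B|$. The symbols are introduced for this note only, and the numbers in the worked line are made up for illustration.

```latex
T(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{c}{a + b - c}

% Worked example: a = 3 on bits, b = 4 on bits, c = 3 shared
T = \frac{3}{3 + 4 - 3} = \frac{3}{4} = 0.75
```

Identical fingerprints give 1.0 and fingerprints with no shared on bits give 0.0, which matches the documented return range.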

`set_seed(seed_value)`

Sets random seeds for reproducibility across random, numpy, and PyTorch.

Parameters:

seed_value (int): Random seed value

from chemlactica.mol_opt.utils import set_seed

set_seed(42)
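
A seeding helper of this kind usually calls the seed function of each random number generator in turn. The sketch below is a hypothetical `seed_everything` that treats NumPy and PyTorch as optional so that it runs without them installed. Which calls the real `set_seed` makes is an assumption.

```python
import random


def seed_everything(seed_value: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed_value)
    try:
        import numpy as np
        np.random.seed(seed_value)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed_value)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]

print(first == second)  # True: the same seed reproduces the same sequence
```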

`generate_random_number(lower, upper)`

Generates a random float between lower and upper bounds.

Parameters:

lower (float): Lower bound (inclusive)
upper (float): Upper bound (inclusive)

Returns: float - Random number in [lower, upper]

from chemlactica.mol_opt.utils import generate_random_number

# Generate random SAS score
sas = generate_random_number(2.0, 3.0)
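
The documented behavior matches what `random.uniform` from the standard library provides. The sketch below wraps it in a hypothetical `random_in_range` helper. Whether chemlactica uses `random.uniform` internally is an assumption.

```python
import random


def random_in_range(lower: float, upper: float) -> float:
    """Draw a float N with lower <= N <= upper."""
    return random.uniform(lower, upper)


samples = [random_in_range(2.0, 3.0) for _ in range(1000)]
print(all(2.0 <= s <= 3.0 for s in samples))  # True
```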

`create_prompt_with_similars(mol_entry, sim_range=None)`

Creates a prompt string with similar molecules for generation.

Parameters:

mol_entry (MoleculeEntry): Molecule entry with similar molecules
sim_range (list[float], optional): [min_sim, max_sim] range for similarity values

Returns: str - Formatted prompt with SIMILAR tags

Source: mol_opt/utils.py:112

from chemlactica.mol_opt.utils import create_prompt_with_similars, MoleculeEntry

mol = MoleculeEntry("CCO", score=0.85)
mol.similar_mol_entries = [
    MoleculeEntry("CCCO", score=0.80),
    MoleculeEntry("CC(C)O", score=0.75)
]

prompt = create_prompt_with_similars(mol, sim_range=[0.4, 0.7])
# Generates prompt with [SIMILAR]CCCO 0.65[/SIMILAR][SIMILAR]CC(C)O 0.58[/SIMILAR]
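
The tag layout in the comment above can be reproduced with plain string formatting. The sketch below takes explicit (SMILES, similarity) pairs and formats them. `format_similar_tags` is a hypothetical helper, and the two-decimal formatting is inferred from the example comment rather than confirmed. How the real function chooses the similarity values when sim_range is given or omitted is not covered here.

```python
def format_similar_tags(pairs):
    """Join (smiles, similarity) pairs into consecutive [SIMILAR] tags."""
    return "".join(f"[SIMILAR]{smiles} {sim:.2f}[/SIMILAR]" for smiles, sim in pairs)


prompt = format_similar_tags([("CCCO", 0.65), ("CC(C)O", 0.58)])
print(prompt)  # [SIMILAR]CCCO 0.65[/SIMILAR][SIMILAR]CC(C)O 0.58[/SIMILAR]
```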
