The optimize() Function
Located in chemlactica/mol_opt/optimization.py, this is the main entry point. It implements a genetic-like optimization algorithm that iteratively generates and refines molecules based on oracle feedback.
Parameters
- The ChemLactica model (125M, 1.3B, or Chemma-2B)
- The tokenizer corresponding to the model
- Your custom oracle implementing the oracle interface
- A hyperparameter configuration dictionary (loaded from YAML)
- Optional additional properties to include in prompts (e.g., molecular weight, logP)
- An optional function to validate generated SMILES before scoring
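The parameters above might be wired together as in the sketch below. The config keys mirror the hyperparameters named on this page, but the exact key names, the `optimize()` signature, and the toy validator are assumptions, not the verbatim API; check chemlactica/mol_opt/optimization.py and the YAML configs for the real names.

```python
# Hedged sketch: config keys and the optimize() call shape are assumptions.
config = {
    "pool_size": 50,                       # top molecules kept in the pool
    "num_gens_per_iter": 200,              # molecules generated per iteration
    "generation_temperature": [1.0, 1.5],  # linear schedule [start, end]
    "train_tol_level": 3,                  # stale iterations before fine-tuning
    "strategy": ["rej-sample-v2"],         # enables adaptive fine-tuning
}

def validate_smiles(smiles: str) -> bool:
    """Toy pre-scoring filter; a real check would parse with RDKit."""
    return bool(smiles) and smiles.count("(") == smiles.count(")")

# Illustrative call (signature assumed):
# optimize(model, tokenizer, oracle, config,
#          additional_properties=None, validate_smiles=validate_smiles)
```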
Algorithm Workflow
The optimization process follows these steps:
1. Initialize Pool
Create an empty pool to store the top-performing molecules. The pool maintains:
- the top pool_size molecules, sorted by score
- a training/validation split for fine-tuning
- diversity filtering to avoid duplicates
2. Generation Loop
Generate num_gens_per_iter new molecules per iteration:
a) Create prompts from pool molecules. Each prompt includes:
- similar molecules with their Tanimoto similarities
- a desired oracle score (drawn at random between max_score and max_possible_oracle_score)
- additional properties, if specified
b) Generate completions with the model.
c) Parse and validate: extract SMILES between the [START_SMILES] and [END_SMILES] tags.
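The parse step (c) can be sketched with a regular expression over the generated text; the tag names are the ones used above, while the helper itself is illustrative:

```python
import re

# Pull candidate SMILES out of generated text (sketch of step c).
TAG_RE = re.compile(r"\[START_SMILES\](.*?)\[END_SMILES\]", re.DOTALL)

def extract_smiles(generated: str) -> list[str]:
    """Return every tagged SMILES string, stripped of surrounding whitespace."""
    return [m.strip() for m in TAG_RE.findall(generated)]

print(extract_smiles("score 0.9 [START_SMILES]CCO[END_SMILES] junk"))
# -> ['CCO']
```

Each extracted string would then go through the optional SMILES validator before being scored by the oracle.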
3. Update Pool
Add the new molecules to the pool. The pool:
- sorts molecules by score (descending)
- removes duplicates and highly similar molecules
- keeps only the top pool_size molecules
- maintains the train/validation split
4. Adaptive Fine-tuning (Optional)
If using the rej-sample-v2 strategy and there has been no improvement for train_tol_level iterations:
a) Prepare training and validation datasets from the pool.
b) Fine-tune the model.
The model learns to generate molecules that match the oracle's preferences.
5. Adjust Temperature
Increase the sampling temperature over time to encourage exploration: it starts at generation_temperature[0] and increases to generation_temperature[1].
Key Components
Pool Management
The Pool class (mol_opt/utils.py:88) maintains the top molecules.
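A simplified sketch of the pool behavior described on this page is below. The real class at mol_opt/utils.py:88 also tracks the train/validation split and similarity-based filtering; the names and details here are illustrative, not the actual implementation:

```python
# Simplified Pool sketch: sorted by score, deduplicated, capped at pool_size.
class Pool:
    def __init__(self, pool_size: int):
        self.pool_size = pool_size
        self.entries: list[tuple[float, str]] = []  # (score, smiles)

    def add(self, smiles: str, score: float) -> None:
        if any(s == smiles for _, s in self.entries):
            return  # duplicate filtering
        self.entries.append((score, smiles))
        self.entries.sort(key=lambda e: e[0], reverse=True)  # score descending
        del self.entries[self.pool_size:]                    # keep top pool_size

pool = Pool(pool_size=2)
for smi, sc in [("CCO", 0.5), ("CCN", 0.9), ("CCN", 0.9), ("CCC", 0.7)]:
    pool.add(smi, sc)
print(pool.entries)  # -> [(0.9, 'CCN'), (0.7, 'CCC')]
```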
Prompt Construction
The OptimEntry.to_prompt() method creates prompts for both generation and fine-tuning:
- During generation (is_generation=True), the prompt is left open-ended so the model generates a new molecule.
- During fine-tuning (is_generation=False), the complete molecule is included so the model can learn from it.
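The two modes can be illustrated with the toy assembler below. The tag vocabulary and ordering here are assumptions for illustration only; the exact prompt format is defined by OptimEntry.to_prompt() in the source:

```python
# Illustrative prompt assembly in the spirit of OptimEntry.to_prompt();
# tag names and ordering are assumptions, not the verbatim ChemLactica format.
def to_prompt(similars, target_score, is_generation=True, smiles=None):
    parts = [f"[SIMILAR]{s} {sim:.2f}[/SIMILAR]" for s, sim in similars]
    parts.append(f"[PROPERTY]score {target_score:.2f}[/PROPERTY]")
    if is_generation:
        parts.append("[START_SMILES]")  # open-ended: model completes the molecule
    else:
        # full molecule included as a fine-tuning target
        parts.append(f"[START_SMILES]{smiles}[END_SMILES]")
    return "".join(parts)

print(to_prompt([("CCO", 0.85)], 0.95))
```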
Molecule Entry
The MoleculeEntry class represents a single molecule.
Tolerance Mechanism
The algorithm tracks iterations without improvement:- Continued exploration when making progress
- Model adaptation when stuck
- Efficient use of oracle budget
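The mechanism amounts to a stale-iteration counter gated by train_tol_level. A minimal sketch (the class name and reset behavior are assumptions; the real logic lives in the optimization loop):

```python
# Minimal tolerance-counter sketch: signal fine-tuning when progress stalls.
class Tolerance:
    def __init__(self, train_tol_level: int):
        self.level = train_tol_level
        self.best = float("-inf")
        self.stale = 0

    def update(self, best_score_this_iter: float) -> bool:
        """Return True when the model should be fine-tuned (i.e., stuck)."""
        if best_score_this_iter > self.best:
            self.best = best_score_this_iter
            self.stale = 0          # progress: keep exploring
            return False
        self.stale += 1
        if self.stale >= self.level:
            self.stale = 0          # reset after triggering adaptation
            return True
        return False

tol = Tolerance(train_tol_level=2)
print([tol.update(s) for s in [0.5, 0.6, 0.6, 0.6]])  # -> [False, False, False, True]
```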
Generation Temperature Schedule
Temperature increases linearly over the run to encourage exploration.
Diversity Filtering
The pool removes molecules that are too similar to existing entries; adjust diversity_score in the pool to filter more aggressively.
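Diversity filtering reduces to a greedy similarity check against the molecules already kept. In the real pool the comparison uses Tanimoto similarity on molecular fingerprints (RDKit); the stand-in `similarity` function and the 0.95 threshold below are illustrative so the sketch stays dependency-free:

```python
# Greedy diversity-filter sketch; `similarity` stands in for Tanimoto similarity.
def filter_diverse(candidates, similarity, threshold=0.95):
    kept = []
    for smi in candidates:
        # keep a candidate only if it is not too close to anything already kept
        if all(similarity(smi, other) < threshold for other in kept):
            kept.append(smi)
    return kept

# Toy similarity: 1.0 for identical strings, else 0.0 (real code: Tanimoto).
toy_sim = lambda a, b: 1.0 if a == b else 0.0
print(filter_diverse(["CCO", "CCN", "CCO"], toy_sim))  # -> ['CCO', 'CCN']
```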
Algorithm Pseudocode
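A condensed, illustrative sketch of the loop described above (see optimization.py for the exact control flow):

```
pool ← empty Pool(pool_size)
temperature ← generation_temperature[0]
while oracle budget remains:
    prompts ← build prompts from pool (similars + desired score + properties)
    texts ← model.generate(prompts, temperature, n=num_gens_per_iter)
    molecules ← parse SMILES between [START_SMILES]/[END_SMILES], validate
    scores ← oracle(molecules)
    pool.add(molecules, scores)        # sort, dedupe, keep top pool_size
    if rej-sample-v2 enabled and no improvement for train_tol_level iterations:
        fine-tune model on pool's train split, validate on held-out split
    temperature ← linear step toward generation_temperature[1]
return best molecule in pool
```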
Comparison with Other Approaches
vs. Genetic Algorithms
Similarities:
- Maintains a pool of top candidates
- Iterative generation and selection
- Fitness-based ranking
Differences:
- Uses LLM generation instead of crossover/mutation
- Prompts guide generation with similarity constraints
- Optional model fine-tuning for adaptation
vs. Reinforcement Learning
Similarities:
- Oracle acts as a reward function
- Model learns to maximize rewards
- Exploration-exploitation tradeoff
Differences:
- No RL training loop (uses supervised fine-tuning instead)
- Simpler implementation
- Faster convergence on many benchmarks
vs. Bayesian Optimization
Similarities:
- Efficient use of the oracle budget
- Balances exploration and exploitation
Differences:
- No surrogate model
- LLM directly generates candidates
- Better handles discrete molecular space
Performance Characteristics
Memory Usage
- Pool: ~10-100 molecules (negligible)
- Model: 125M (~0.5GB), 1.3B (~5GB), 2B (~8GB)
- Fine-tuning: +2x model size during training
Oracle Calls
- Typical budget: 1K - 10K calls
- Batch size: 200 molecules/iteration
- PMO benchmark: reaches SOTA performance with ~5K calls
Runtime
- Generation: ~1-2 sec/batch (200 molecules)
- Fine-tuning: ~30 sec - 2 min per round
- Total: Minutes to hours depending on oracle complexity
Success Rate
- QED optimization: 99% success (10K calls)
- PMO benchmark: 17.5 avg score
- Docking: 3-4x fewer calls than SOTA
Next Steps
Configure Hyperparameters
Learn how to tune pool size, temperature, and fine-tuning settings
Complete Examples
See full working examples with different oracles