ChemLactica’s molecular optimization framework combines large language models with a genetic-like optimization algorithm to achieve state-of-the-art results on challenging molecular design benchmarks.

How It Works

The optimization process uses the model’s ability to generate molecules conditioned on:
  • Similar molecules with specified Tanimoto similarity scores
  • Desired property values such as synthetic accessibility (SAS), drug-likeness (QED), and molecular weight
  • Oracle scores indicating the quality of generated molecules
The algorithm iteratively:
  1. Generates new molecules from prompts based on high-scoring molecules in the pool
  2. Evaluates them using a custom oracle function
  3. Maintains a pool of top-performing molecules
  4. Fine-tunes the model on successful molecules (optional)
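The four steps above can be sketched as a minimal generate/evaluate/select loop. This is a toy illustration, not the ChemLactica implementation: `generate_candidates` and `oracle` are placeholder callables, and the real algorithm also builds structured prompts from the pool and can optionally fine-tune the model on high scorers.

```python
import heapq

def optimize_loop(generate_candidates, oracle, n_iters=10, pool_size=5):
    """Toy sketch of the generate/evaluate/select loop."""
    pool = []  # min-heap of (score, molecule) pairs
    for _ in range(n_iters):
        # 1. Generate candidates seeded by high-scoring pool members
        seeds = [mol for _, mol in pool]
        candidates = generate_candidates(seeds)
        for mol in candidates:
            if any(mol == m for _, m in pool):
                continue  # skip molecules already in the pool
            # 2. Score the candidate with the oracle
            score = oracle(mol)
            # 3. Keep only the top `pool_size` molecules seen so far
            if len(pool) < pool_size:
                heapq.heappush(pool, (score, mol))
            elif score > pool[0][0]:
                heapq.heapreplace(pool, (score, mol))
    return sorted(pool, reverse=True)
```

Step 4 (optional fine-tuning on successful molecules) would plug in after each iteration; it is covered under the rejection sampling strategy below.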

Benchmark Results

ChemLactica achieves state-of-the-art performance across multiple optimization tasks:

Practical Molecular Optimization (PMO)

  • ChemLactica: 17.5 average score
  • Previous SOTA: 16.2 (Genetic-guided GFlowNets)
The Practical Molecular Optimization benchmark tests optimization across multiple drug-like property objectives.

Docking Optimization

AutoDock Vina Optimization

ChemLactica needs 3-4x fewer oracle calls than Beam Enumeration (the previous SOTA) to generate 100 good molecules.
Optimizing molecular docking scores is crucial for drug discovery. ChemLactica significantly reduces the computational cost while maintaining quality.

QED Optimization

From the RetMol paper benchmark:

  • ChemLactica-125M: 99% success rate with 10K oracle calls
  • RetMol (original): 96% success rate with 50K oracle calls
ChemLactica-125M achieves better results with 5x fewer oracle calls.

Key Features

Custom Oracles

Define any objective function to optimize molecules for your specific use case

Flexible Prompting

Control molecular properties through structured prompts with similarity constraints
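As an illustration of structured prompting, a prompt might be assembled from a seed molecule, a target similarity, and a target property value. The bracketed tag names below are hypothetical placeholders, not the model's actual vocabulary; check the released model card for the exact tags the checkpoints were trained on.

```python
def build_prompt(similar_smiles: str, similarity: float, target_qed: float) -> str:
    # Hypothetical tag names for illustration only; the real prompt
    # vocabulary is defined by the model's training data.
    return (
        f"[SIMILAR]{similar_smiles} {similarity:.2f}[/SIMILAR]"
        f"[QED]{target_qed:.2f}[/QED]"
        "[START_SMILES]"
    )
```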

Adaptive Fine-tuning

Optional rejection sampling strategy that fine-tunes the model during optimization

Efficient Search

Genetic-like algorithm with diversity filtering to explore chemical space effectively
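A minimal sketch of diversity filtering: greedily keep the best-scoring molecules that are not too similar to anything already kept. The `jaccard` function here is a stand-in for illustration; a real implementation would use Tanimoto similarity over molecular fingerprints (e.g. RDKit Morgan fingerprints).

```python
def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard index over character bigrams.
    A real pipeline would use fingerprint-based Tanimoto similarity."""
    big_a = {a[i:i + 2] for i in range(len(a) - 1)}
    big_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = big_a | big_b
    return len(big_a & big_b) / len(union) if union else 1.0

def diversity_filter(scored_mols, max_sim=0.8):
    """Greedily keep high scorers that stay below the similarity cap
    against every molecule already kept."""
    kept = []
    for score, mol in sorted(scored_mols, reverse=True):
        if all(jaccard(mol, m) < max_sim for _, m in kept):
            kept.append((score, mol))
    return kept
```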

Optimization Strategies

ChemLactica supports two optimization strategies:

Default Strategy

Uses the pre-trained model without additional fine-tuning:
strategy: [default]
Pros:
  • Faster iteration times
  • No need for training infrastructure
  • Works well for objectives similar to pre-training data
Cons:
  • May not adapt to highly specialized objectives

Rejection Sampling (rej-sample-v2)

Fine-tunes the model during optimization on high-scoring molecules:
strategy: [rej-sample-v2]

rej_sample_config:
  train_tol_level: 3
  num_train_epochs: 5
  # ... additional config
Pros:
  • Adapts to specific optimization objectives
  • Achieves better results on challenging tasks
  • Model learns to generate molecules matching the oracle
Cons:
  • Requires GPU memory for training
  • Slower iteration times
  • Needs careful hyperparameter tuning
The rej-sample-v2 strategy is recommended for achieving state-of-the-art results on challenging benchmarks. Use default strategy for faster exploration or when GPU memory is limited.
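Conceptually, rejection sampling keeps only the top-scoring generations and uses them as fine-tuning data for the next rounds. A toy sketch of the selection step (the real implementation is driven by the `rej_sample_config` options above and standard language-model fine-tuning; `top_frac` is an illustrative parameter, not an actual config key):

```python
def select_finetune_data(scored_mols, top_frac=0.25):
    """Keep the top fraction of (score, molecule) pairs as
    fine-tuning examples; everything else is rejected."""
    ranked = sorted(scored_mols, key=lambda sm: sm[0], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    return [mol for _, mol in ranked[:k]]
```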

Quick Start

Here’s a minimal example to get started:
from transformers import AutoModelForCausalLM, AutoTokenizer
from chemlactica.mol_opt.optimization import optimize
import yaml

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("yerevann/chemlactica-125m")
tokenizer = AutoTokenizer.from_pretrained("yerevann/chemlactica-125m")

# Load configuration
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Create your oracle (see Oracle Design page)
oracle = YourCustomOracle(max_oracle_calls=1000)
config["max_possible_oracle_score"] = oracle.max_possible_oracle_score

# Run optimization
optimize(model, tokenizer, oracle, config)
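The `YourCustomOracle` placeholder needs to provide at least the attributes the snippet reads (`max_oracle_calls`, `max_possible_oracle_score`). A toy stand-in, assuming the oracle is called with a SMILES string and returns a score; the interface `optimize` actually expects is documented on the Oracle Design page:

```python
class CarbonCountOracle:
    """Toy oracle exposing the attributes used in the Quick Start.
    The scoring rule is a deliberately trivial stand-in."""

    def __init__(self, max_oracle_calls=1000):
        self.max_oracle_calls = max_oracle_calls
        self.max_possible_oracle_score = 1.0
        self.num_calls = 0

    def __call__(self, smiles: str) -> float:
        if self.num_calls >= self.max_oracle_calls:
            raise RuntimeError("oracle call budget exhausted")
        self.num_calls += 1
        # Stand-in objective: reward longer carbon chains, capped at 1.0.
        # A real oracle would compute docking scores, QED, SAS, etc.
        return min(smiles.count("C") / 10.0, 1.0)
```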

Next Steps

Design Custom Oracles

Learn how to implement custom oracle functions for your objectives

Algorithm Details

Understand the optimization algorithm workflow

Configure Hyperparameters

Tune the optimization process for your use case

See Examples

Explore complete working examples

Citation

If you use ChemLactica for molecular optimization, please cite:
@article{chemlactica2024,
  title={Small Molecule Optimization with Large Language Models},
  author={YerevaNN Research},
  year={2024},
  url={https://yerevann.com/papers/small-molecule-optimization-with-large-language-models.pdf}
}
