Introduction to ChemLactica

ChemLactica and Chemma are a family of large language models specifically designed to understand and generate small organic molecules using SMILES notation. These models achieve state-of-the-art results on molecular optimization benchmarks while being accessible and easy to use.

What is ChemLactica?

ChemLactica is a family of models that “understand” small organic molecules (SMILES), their basic properties (molecular weight, QED, SAS, TPSA, CLogP, etc.), and similarities between molecules (Tanimoto over ECFC4).
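To make the similarity measure concrete, here is a minimal sketch of Tanimoto similarity over count-based fingerprints such as ECFC4. It uses the count form of the metric (sum of per-feature minima divided by sum of per-feature maxima). The feature IDs and counts below are made-up toy values, not real fingerprint output; in practice the counts would come from a cheminformatics toolkit. Returning 0.0 for two empty fingerprints is a convention chosen for this sketch.

```python
from collections import Counter


def count_tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two count fingerprints (feature id -> count)."""
    keys = set(fp_a) | set(fp_b)
    shared = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    total = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    # Convention for this sketch: two empty fingerprints have similarity 0.0.
    return shared / total if total else 0.0


# Hypothetical toy fingerprints: hashed substructure id -> number of occurrences.
fp_a = Counter({101: 2, 202: 1, 303: 1})
fp_b = Counter({101: 1, 202: 1, 404: 2})

similarity = count_tanimoto(fp_a, fp_b)  # shared = 2, total = 6
print(round(similarity, 3))
```

A molecule compared with itself scores 1.0, and molecules with no features in common score 0.0.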

Chemlactica-125M

Compact model built on Meta’s Galactica-125M, ideal for resource-constrained environments

Chemlactica-1.3B

Larger model built on Galactica-1.3B for enhanced molecular generation

Chemma-2B

Most powerful model built on Google’s Gemma-2B architecture

Training Dataset

All models trained on 40B tokens covering 100M+ molecules from PubChem

Key features

SMILES-based molecular generation

Generate molecules with specific properties using natural language-like prompts:
prompt = "</s>[SAS]2.25[/SAS][SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR][START_SMILES]"
# Generates a molecule with ~2.25 SAS score and ~0.62 similarity to aspirin
This prompt will generate a molecule that has approximately 2.25 SAS (Synthetic Accessibility Score) and approximately 0.62 Tanimoto similarity to the reference molecule (aspirin in this example).
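Prompts in this format can be assembled programmatically. The helper below is a hypothetical convenience function, not part of the ChemLactica codebase; it simply reproduces the tag layout of the example above. Only the `[SAS]` and `[SIMILAR]` tags appear on this page, so treating other property tags the same way is an assumption.

```python
def build_prompt(properties, similar=None):
    """Assemble a generation prompt from property targets and reference molecules.

    properties: dict mapping a property tag (e.g. "SAS") to its target value.
    similar: optional list of (smiles, tanimoto_similarity) pairs.
    Hypothetical helper; the layout mirrors the example prompt on this page.
    """
    parts = ["</s>"]
    for tag, value in properties.items():
        parts.append(f"[{tag}]{value}[/{tag}]")
    for smiles, similarity in similar or []:
        parts.append(f"[SIMILAR]{smiles} {similarity}[/SIMILAR]")
    # The model continues the text after this tag with a SMILES string.
    parts.append("[START_SMILES]")
    return "".join(parts)


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
prompt = build_prompt({"SAS": 2.25}, similar=[(aspirin, 0.62)])
print(prompt)
```

The resulting string is identical to the hand-written prompt above and can be passed to the model as-is.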

Property prediction

The models can be easily fine-tuned to perform property prediction tasks, achieving ~0.3 RMSE on FreeSolv from MoleculeNet.

State-of-the-art molecular optimization

When wrapped in a genetic-like optimization algorithm, ChemLactica outperforms prior methods on the following molecular optimization benchmarks:
  • Practical Molecular Optimization benchmark: score of 17.5 vs. 16.2 for the previous SOTA (Genetic-guided GFlowNets)
  • Docking optimization with AutoDock Vina: 3-4x fewer oracle calls than Beam Enumeration to generate 100 high-quality molecules
  • QED optimization from the RetMol paper: 99% success rate with 10K oracle calls using Chemlactica-125M (vs. 96% with 50K calls in the original paper)
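The general shape of a genetic-like loop under a fixed oracle budget can be sketched as follows. This is a toy illustration only, not the algorithm from the paper: the `generate` callback stands in for prompting the model with molecules from the current pool, and the string-mutation generator and nitrogen-counting oracle are made-up placeholders so the example runs without any model or chemistry toolkit.

```python
import random


def optimize(oracle, generate, seed_pool, budget, pool_size=4, batch_size=4, max_rounds=1000):
    """Toy genetic-like loop: score candidates, keep the best, propose from them."""
    scored = {}
    for candidate in seed_pool:
        if len(scored) < budget and candidate not in scored:
            scored[candidate] = oracle(candidate)
    best_history = []
    for _ in range(max_rounds):
        if len(scored) >= budget:
            break
        # The pool is the current top candidates; new proposals are derived from it.
        pool = sorted(scored, key=scored.get, reverse=True)[:pool_size]
        for candidate in generate(pool, batch_size):
            if len(scored) >= budget:
                break
            if candidate not in scored:  # never spend an oracle call on a repeat
                scored[candidate] = oracle(candidate)
        best_history.append(max(scored.values()))
    best = max(scored, key=scored.get)
    # len(scored) equals the number of oracle calls, since each candidate is scored once.
    return best, scored[best], len(scored), best_history


# Placeholder problem: strings over "CNO", scored by their fraction of "N".
rng = random.Random(0)


def toy_oracle(candidate):
    return candidate.count("N") / len(candidate)


def toy_generate(pool, n):
    proposals = []
    for _ in range(n):
        child = list(rng.choice(pool))
        child[rng.randrange(len(child))] = rng.choice("CNO")  # single-site mutation
        proposals.append("".join(child))
    return proposals


best, score, calls, history = optimize(toy_oracle, toy_generate, ["CCCCCCCC"], budget=200)
print(best, score, calls)
```

Counting oracle calls explicitly matters because the benchmarks above compare methods by how many calls they need, not only by the final score.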

Get started

Installation

Set up your environment with conda and install dependencies

Quick start

Generate your first optimized molecules in minutes

Paper

Read the full research paper on small molecule optimization with LLMs

GitHub

Explore the source code and contribute to the project

Use cases

ChemLactica is designed for researchers and practitioners working on:
  • Drug discovery: Generate molecules optimized for specific biological targets
  • Materials science: Design molecules with desired physical properties
  • Chemical space exploration: Discover novel molecular structures
  • Property prediction: Train models to predict molecular properties
  • Lead optimization: Refine existing molecules to improve key characteristics

Research

For detailed technical information, benchmarks, and methodology, read the paper, Small Molecule Optimization with Large Language Models. We look forward to the community using these models to solve problems in molecular design.