Introduction to ChemLactica

ChemLactica and Chemma are a family of large language models specifically designed to understand and generate small organic molecules using SMILES notation. These models achieve state-of-the-art results on molecular optimization benchmarks while being accessible and easy to use.

What is ChemLactica?

ChemLactica is a family of models that “understand” small organic molecules written as SMILES strings, their basic properties (molecular weight, QED, SAS, TPSA, CLogP, etc.), and similarities between molecules (Tanimoto similarity over ECFC4 fingerprints).
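To make the similarity measure concrete, here is a minimal sketch of Tanimoto similarity between two count fingerprints. It assumes one common generalization to counts (sum of element-wise minima divided by sum of element-wise maxima). The function name `tanimoto_counts` and the feature ids are illustrative; real ECFC4 fingerprints would come from a cheminformatics toolkit, which this sketch does not use.

```python
from collections import Counter


def tanimoto_counts(fp_a, fp_b):
    """Tanimoto similarity between two count fingerprints (feature id -> count)."""
    keys = set(fp_a) | set(fp_b)
    # Shared counts: element-wise minimum over all features.
    shared = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    # Combined counts: element-wise maximum over all features.
    total = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    # Two empty fingerprints are treated as dissimilar (a convention chosen here).
    return shared / total if total else 0.0


# Toy fingerprints with made-up feature ids, not real ECFC4 output.
fp_a = Counter({101: 2, 205: 1, 310: 3})
fp_b = Counter({101: 1, 205: 1, 999: 2})
similarity = tanimoto_counts(fp_a, fp_b)
```

Identical fingerprints score 1.0 and fingerprints with no shared features score 0.0, so the similarity targets used in prompts fall between those two values.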

Chemlactica-125M

Compact model built on Meta’s Galactica-125M, ideal for resource-constrained environments

Chemlactica-1.3B

Larger model built on Galactica-1.3B for enhanced molecular generation

Chemma-2B

Most powerful model built on Google’s Gemma-2B architecture

Training Dataset

All models are trained on 40B tokens covering 100M+ molecules from PubChem

Key features

SMILES-based molecular generation

Generate molecules with specific properties using natural language-like prompts:

prompt = "</s>[SAS]2.25[/SAS][SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR][START_SMILES]"
# Generates a molecule with ~2.25 SAS score and ~0.62 similarity to aspirin

This prompt asks the model for a molecule with a Synthetic Accessibility Score (SAS) of approximately 2.25 and a Tanimoto similarity of approximately 0.62 to the reference molecule (aspirin in this example).
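Prompts of this shape can be assembled programmatically. The sketch below is a hypothetical helper, not part of the ChemLactica codebase: `build_prompt` and its parameters are names introduced here, and applying the `[TAG]value[/TAG]` pattern to tags other than `SAS` is an assumption based on the example above.

```python
def build_prompt(properties, similar=None):
    """Assemble a generation prompt (hypothetical helper).

    properties: dict mapping a tag name to its target value, e.g. {"SAS": 2.25}
    similar: optional (reference_smiles, tanimoto_target) pair
    """
    parts = ["</s>"]
    for tag, value in properties.items():
        # Assumed pattern: each property is wrapped in matching open/close tags.
        parts.append(f"[{tag}]{value}[/{tag}]")
    if similar is not None:
        ref_smiles, target = similar
        parts.append(f"[SIMILAR]{ref_smiles} {target}[/SIMILAR]")
    # The model continues the text after this tag with a SMILES string.
    parts.append("[START_SMILES]")
    return "".join(parts)


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
prompt = build_prompt({"SAS": 2.25}, similar=(aspirin, 0.62))
```

With these arguments the helper reproduces the example prompt exactly, which makes it easy to vary the SAS target or the reference molecule in a loop.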

Property prediction

The models can be easily fine-tuned to perform property prediction tasks, achieving ~0.3 RMSE on FreeSolv from MoleculeNet.
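For readers unfamiliar with the metric, RMSE (root mean squared error) is the square root of the mean squared difference between predicted and reference values. A minimal sketch follows; the function name `rmse` and the numbers are illustrative only and are not FreeSolv data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sum(squared_errors) / len(predicted))


# Toy values for illustration only.
predicted = [1.0, 2.0, 3.0]
actual = [1.0, 2.0, 5.0]
error = rmse(predicted, actual)
```

Lower is better, and a single large miss (the third value here) dominates the score because errors are squared before averaging.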

State-of-the-art molecular optimization

When wrapped in a genetic-like optimization algorithm, ChemLactica outperforms previous methods on major molecular optimization benchmarks:

Practical Molecular Optimization benchmark: Score of 17.5 vs 16.2 (previous SOTA: Genetic-guided GFlowNets)

Docking optimization with AutoDock Vina: 3-4x fewer oracle calls needed to generate 100 high-quality molecules compared to Beam Enumeration

QED optimization from RetMol paper: 99% success rate with 10K oracle calls using Chemlactica-125M (vs. 96% with 50K calls in the original paper)
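The general shape of a genetic-like loop with a fixed oracle budget can be sketched as follows. This is not the authors' algorithm: `optimize`, `toy_propose`, and `toy_oracle` are hypothetical names, the proposer is a random string mutation standing in for the language model, and the oracle is a toy score rather than a chemical property. The sketch only shows how a pool of top candidates, a generator, and an oracle-call counter fit together.

```python
import random


def optimize(propose, oracle, seeds, budget, rounds=100, pool_size=5, batch_size=4, seed=0):
    """Pool-based loop: propose from the current best, score, keep the top candidates."""
    rng = random.Random(seed)
    scores = {}  # candidate -> score; also a cache, so repeats cost no oracle call

    def evaluate(candidate):
        # Spend an oracle call only on unseen candidates while budget remains.
        if candidate not in scores and len(scores) < budget:
            scores[candidate] = oracle(candidate)

    for candidate in seeds:
        evaluate(candidate)
    for _ in range(rounds):
        if len(scores) >= budget:
            break
        # The pool holds the best candidates found so far.
        pool = sorted(scores, key=scores.get, reverse=True)[:pool_size]
        for _ in range(batch_size):
            evaluate(propose(pool, rng))
    best = sorted(scores, key=scores.get, reverse=True)[:pool_size]
    return best, len(scores)  # len(scores) equals the number of oracle calls


ALPHABET = "CNO"


def toy_propose(pool, rng):
    # Stand-in for the language model: mutate one character of a pool member.
    parent = rng.choice(pool)
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice(ALPHABET) + parent[i + 1:]


def toy_oracle(candidate):
    # Stand-in objective: number of "N" characters (not a chemical property).
    return candidate.count("N")


best, calls = optimize(toy_propose, toy_oracle, seeds=["CCCC"], budget=30)
```

Counting only unique evaluations mirrors how the benchmarks above compare methods by oracle calls: a method that reaches the same score with fewer calls is more sample-efficient.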

Get started

Installation

Set up your environment with conda and install dependencies

Quick start

Generate your first optimized molecules in minutes

Paper

Read the full research paper on small molecule optimization with LLMs

GitHub

Explore the source code and contribute to the project

Use cases

ChemLactica is designed for researchers and practitioners working on:

Drug discovery: Generate molecules optimized for specific biological targets
Materials science: Design molecules with desired physical properties
Chemical space exploration: Discover novel molecular structures
Property prediction: Train models to predict molecular properties
Lead optimization: Refine existing molecules to improve key characteristics

Research

For detailed technical information, benchmarks, and methodology, read the paper: Small Molecule Optimization with Large Language Models. We look forward to the community using these models to solve a range of problems in molecular design.
