Introduction to ChemLactica
ChemLactica and Chemma are a family of large language models designed to understand and generate small organic molecules written in SMILES notation. These models achieve state-of-the-art results on molecular optimization benchmarks while remaining accessible and easy to use.
What is ChemLactica?
ChemLactica is a family of models that “understand” small organic molecules (SMILES), their basic properties (molecular weight, QED, SAS, TPSA, CLogP, etc.), and similarities between molecules (Tanimoto over ECFC4).
Chemlactica-125M
Compact model built on Meta’s Galactica-125M, ideal for resource-constrained environments
Chemlactica-1.3B
Larger model built on Galactica-1.3B for enhanced molecular generation
Chemma-2B
Most powerful model built on Google’s Gemma-2B architecture
Training Dataset
All models trained on 40B tokens covering 100M+ molecules from PubChem
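To make the similarity measure mentioned above more concrete, here is a minimal sketch of a Tanimoto score over count-based fingerprints. ECFC4 is a count-based extended-connectivity fingerprint (diameter 4), so each fingerprint is modeled here as a mapping from feature ID to count. The function name `count_tanimoto`, the toy feature IDs, and the min/max formulation for counts are assumptions made for illustration; the project's own code may compute similarity differently, and real fingerprints would come from a cheminformatics toolkit rather than being written by hand.

```python
from collections import Counter


def count_tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two count fingerprints (feature id -> count).

    Uses the count generalization: sum of per-feature minima divided by
    sum of per-feature maxima.
    """
    keys = set(fp_a) | set(fp_b)
    shared = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    total = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    # Two empty fingerprints are treated as identical (a choice made in this sketch).
    return shared / total if total else 1.0


# Hypothetical fingerprints: the feature IDs and counts are made up.
fp_a = Counter({101: 2, 202: 1, 303: 1})
fp_b = Counter({101: 1, 202: 1, 404: 2})

similarity = count_tanimoto(fp_a, fp_b)
print(f"Tanimoto similarity: {similarity:.3f}")
```

A score of 1.0 means the two fingerprints have identical feature counts, and 0.0 means they share no features.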
Key features
SMILES-based molecular generation
Generate molecules with specific properties using natural language-like prompts.
Property prediction
The models can be easily fine-tuned to perform property prediction tasks, achieving ~0.3 RMSE on FreeSolv from MoleculeNet.
State-of-the-art molecular optimization
When wrapped in a genetic-like optimization algorithm, ChemLactica outperforms prior methods on the major molecular optimization benchmarks:
- Practical Molecular Optimization benchmark: score of 17.5 vs. 16.2 for the previous SOTA (Genetic-guided GFlowNets)
- Docking optimization with AutoDock Vina: 3-4x fewer oracle calls needed to generate 100 high-quality molecules compared to Beam Enumeration
- QED optimization from the RetMol paper: 99% success rate with 10K oracle calls using Chemlactica-125M (vs. 96% with 50K calls in the original paper)
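The benchmark results above are reported in terms of oracle calls, so it may help to see the general shape of a genetic-like loop with a fixed oracle budget. The following is a toy sketch, not the authors' algorithm: `optimize`, `toy_oracle`, and `toy_propose` are hypothetical names, candidates are bit tuples rather than SMILES strings, and `toy_propose` stands in for the step where a language model would generate new molecules from a pool of high-scoring ones.

```python
import random


def optimize(oracle, propose, seeds, budget=200, pool_size=10,
             batch_size=8, max_rounds=1000, seed=0):
    """Keep a pool of the best candidates and propose new ones from it.

    Each unique candidate is scored once, so len(scores) is the number of
    oracle calls, which never exceeds `budget`.
    """
    if not seeds:
        raise ValueError("at least one seed candidate is required")
    rng = random.Random(seed)
    scores = {}

    def evaluate(candidate):
        # Skip candidates we have already scored, and stop at the budget.
        if candidate not in scores and len(scores) < budget:
            scores[candidate] = oracle(candidate)

    for candidate in seeds:
        evaluate(candidate)

    for _ in range(max_rounds):  # guard against proposals that never change
        if len(scores) >= budget:
            break
        pool = sorted(scores, key=scores.get, reverse=True)[:pool_size]
        for _ in range(batch_size):
            parents = rng.sample(pool, k=min(2, len(pool)))
            evaluate(propose(parents, rng))

    best = max(scores, key=scores.get)
    return best, scores[best], len(scores)


def toy_oracle(bits):
    # Stand-in for a property score such as QED: fraction of bits set.
    return sum(bits) / len(bits)


def toy_propose(parents, rng):
    # Stand-in for model generation: uniform crossover plus one bit flip.
    a, b = parents[0], parents[-1]
    child = [rng.choice(pair) for pair in zip(a, b)]
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return tuple(child)


seeds = [(0,) * 12]
best, score, calls = optimize(toy_oracle, toy_propose, seeds, budget=200)
print(f"best score {score:.2f} after {calls} oracle calls")
```

Counting only unique candidates against the budget mirrors how oracle-call limits are typically enforced in benchmarks of this kind, though the exact accounting in any given benchmark may differ.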
Get started
Installation
Set up your environment with conda and install dependencies
Quick start
Generate your first optimized molecules in minutes
Paper
Read the full research paper on small molecule optimization with LLMs
GitHub
Explore the source code and contribute to the project
Use cases
ChemLactica is designed for researchers and practitioners working on:
- Drug discovery: Generate molecules optimized for specific biological targets
- Materials science: Design molecules with desired physical properties
- Chemical space exploration: Discover novel molecular structures
- Property prediction: Train models to predict molecular properties
- Lead optimization: Refine existing molecules to improve key characteristics