Overview
ChemLactica provides a family of large language models specifically designed to understand and generate small organic molecules. These models are trained to comprehend SMILES representations, molecular properties, and relationships between molecules.
Model Family
Chemlactica Models
The Chemlactica models are built on top of Meta’s Galactica architecture, a decoder-only transformer model optimized for scientific knowledge.
Chemlactica-125M
Lightweight model with 125 million parameters, ideal for fast inference and fine-tuning
Chemlactica-1.3B
Larger model with 1.3 billion parameters for enhanced performance on complex tasks
Architecture Details (Chemlactica-125M)
chemlactica/config/config_yamls/galactica_125m_pretrain_config.yaml
The 125M model uses 12 transformer layers with 12 attention heads each, supporting a maximum context length of 2048 tokens.
Chemma Models
Chemma-2B is built on top of Google’s Gemma-2B architecture, offering state-of-the-art performance for molecular tasks.
Chemma-2B
Advanced 2 billion parameter model based on Gemma architecture
Architecture Details (Chemma-2B)
chemlactica/config/config_yamls/gemma_2b_pretrain_config.yaml
Chemma-2B uses 18 transformer layers and a significantly larger vocabulary (256K tokens) compared to Chemlactica models.
Model Capabilities
All models in the ChemLactica family understand:
Molecular Structure
- SMILES notation for small organic molecules
- Canonical and non-canonical SMILES representations
- Molecular graphs and connectivity
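Language models consume SMILES strings as token sequences. The sketch below shows a common regex-based way to split a SMILES string into chemically meaningful tokens; the pattern and helper are illustrative only, and the actual ChemLactica tokenizer ships with the released checkpoints.

```python
import re

# A commonly used SMILES tokenization pattern (illustrative; not the
# tokenizer bundled with the ChemLactica checkpoints).
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|%\d{2}|[BCNOPSFIbcnops]|[=#/\\()+\-.@]|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, and ring-closure tokens."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    # Round-trip check: concatenating the tokens must reproduce the input.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Bracket atoms like `[nH]` are kept as single tokens, which is why the bracket alternative comes first in the pattern.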
Molecular Properties
- QED: Quantitative Estimate of Drug-likeness
- SAS: Synthetic Accessibility Score
- TPSA: Topological Polar Surface Area
- CLogP: Partition coefficient (lipophilicity)
- Molecular Weight: Exact molecular weight
- And many more properties (see Molecular Properties)
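In the training corpus these properties are computed with cheminformatics tooling such as RDKit. As a self-contained illustration of the simplest one, molecular weight can be derived directly from a molecular formula; the atomic-mass table and helper below are illustrative and not part of the ChemLactica codebase.

```python
import re

# Average atomic masses (IUPAC, rounded) for a few common elements.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999,
               "S": 32.06, "P": 30.974, "F": 18.998, "Cl": 35.45}

def molecular_weight(formula: str) -> float:
    """Molecular weight of a simple Hill-notation formula like 'C9H8O4'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[element] * (int(count) if count else 1)
    return total

print(round(molecular_weight("C9H8O4"), 2))  # aspirin, ≈ 180.16
```

For SMILES inputs and the full property set (QED, SAS, TPSA, CLogP, …), RDKit is the practical choice.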
Molecular Similarity
- Tanimoto similarity over ECFC4 (Morgan) fingerprints
- Structure-based similarity comparisons
- Finding related molecules in chemical space
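Tanimoto similarity is the ratio of shared to total fingerprint features between two molecules. A minimal sketch over toy bit sets (real ECFC4/Morgan fingerprints are produced by RDKit; the sets below are made up for illustration):

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy "fingerprints": sets of on-bit indices (illustrative values only).
mol_a = {12, 87, 301, 442}
mol_b = {12, 87, 301, 519, 640}
print(tanimoto(mol_a, mol_b))  # 3 shared bits / 6 total bits = 0.5
```

A Tanimoto score of 1.0 means identical fingerprints; values near 0 indicate structurally unrelated molecules.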
Loading Pre-trained Models
Using Hugging Face Transformers
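A sketch of loading a checkpoint through the standard Transformers API. The hub id and the `[START_SMILES]` prompt tags below are assumptions; check the project README for the released checkpoint names and the exact prompt format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed hub id -- substitute the checkpoint name from the project README.
model_id = "yerevann/chemlactica-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Assumed prompt tag format for conditioning on a molecule.
prompt = "[START_SMILES]CCO[END_SMILES]"
inputs = tokenizer(prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

The 125M model fits comfortably on CPU; the 1.3B and 2B models benefit from a GPU.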
Configuration for Optimization
When using models for molecular optimization:
chemlactica/mol_opt/chemlactica_125m_hparams.yaml
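The general shape of such an hparams file is sketched below; the key names are hypothetical and the authoritative values live in `chemlactica/mol_opt/chemlactica_125m_hparams.yaml`.

```yaml
# Illustrative sketch only -- key names are assumed, not taken from the repo.
checkpoint_path: yerevann/chemlactica-125m  # model used for generation (assumed id)
pool_size: 50                 # candidate molecules kept between iterations
generations_per_iteration: 200  # molecules sampled per optimization step
temperature: 1.0              # sampling temperature for candidate generation
```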
Model Performance
Property Prediction
The models can be fine-tuned for property prediction tasks with strong results:
- FreeSolv: ~0.3 RMSE (state-of-the-art on the MoleculeNet benchmark)
- Fine-tuning requires minimal data and training time
Molecular Optimization
When wrapped in a genetic-like optimization algorithm, the models achieve:
Practical Molecular Optimization
Score of 17.5 vs 16.2 for the previous SOTA, outperforming Genetic-guided GFlowNets
QED Optimization
99% success rate with 10K oracle calls, vs 96% with 50K calls (RetMol paper)
Docking Optimization
For AutoDock Vina docking optimization:
- 3-4x fewer oracle calls needed to generate 100 good molecules, compared to the previous SOTA (Beam Enumeration)
Model Selection Guide
Next Steps
SMILES Format
Learn about molecular representation
Molecular Properties
Explore calculated properties
Training Data
Understand the training corpus
Quick Start
Start using the models