Overview
ChemLactica models understand and can predict a wide range of molecular properties. These properties are encoded using special tags in the training data and can be used to guide molecule generation and optimization.
Core Properties
These are the most commonly used properties in ChemLactica:
QED (Quantitative Estimate of Drug-likeness)
A measure of how drug-like a molecule is, ranging from 0 (least drug-like) to 1 (most drug-like).
chemlactica/generation/rejection_sampling_utils.py
[QED]0.95[/QED]
SAS (Synthetic Accessibility Score)
Estimates how difficult it is to synthesize a molecule, ranging from 1 (easy) to 10 (very difficult).
[SAS]2.25[/SAS]
Example from README:
TPSA (Topological Polar Surface Area)
The surface area of polar atoms in a molecule (Ų), important for predicting drug absorption.
chemlactica/mol_opt/example_run.py
TPSA values under 140 Ų typically indicate good oral bioavailability. Lower TPSA generally means better membrane permeability.
[TPSA]63.06[/TPSA]
CLogP (Partition Coefficient)
Measures lipophilicity as the calculated logarithm of the octanol-water partition coefficient (how readily a molecule dissolves in fats versus water). Important for drug absorption and distribution.
[CLOGP]2.45[/CLOGP]
Molecular Weight
The exact molecular weight in Daltons (Da).
chemlactica/mol_opt/example_run.py
[WEIGHT]325.10[/WEIGHT]
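The five core tags above share one layout, `[NAME]value[/NAME]`. As a minimal sketch of that layout, the helper below wraps a numeric value in a tag pair. The function name `format_property_tag` and the two-decimal rounding are assumptions inferred from the examples on this page, not code taken from the ChemLactica repository.

```python
def format_property_tag(name: str, value: float, decimals: int = 2) -> str:
    """Wrap a numeric value in a [NAME]value[/NAME] tag pair.

    Hypothetical helper: the two-decimal default mirrors the examples
    on this page and is an assumption, not a documented requirement.
    """
    tag = name.upper()
    return f"[{tag}]{value:.{decimals}f}[/{tag}]"


# Example values copied from the tags shown in this section.
core_properties = {
    "QED": 0.95,
    "SAS": 2.25,
    "TPSA": 63.06,
    "CLOGP": 2.45,
    "WEIGHT": 325.10,
}

# Concatenate all tags into one string.
tagged = "".join(format_property_tag(k, v) for k, v in core_properties.items())
print(tagged)
```

Keeping the formatting in one place makes it easier to stay consistent with whatever precision the training data actually uses.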
Structural Properties
These properties describe the molecular structure:
Hydrogen Bond Donors and Acceptors
Number of hydrogen bond donor groups (like -OH, -NH)
Number of hydrogen bond acceptor groups (like =O, -N-)
Atom Counts
Number of non-carbon, non-hydrogen atoms (heteroatoms)
Total number of non-hydrogen atoms
Number of nitrogen and oxygen atoms
Number of NH and OH groups
Bond Properties
Number of rotatable bonds, an indicator of molecular flexibility
Fraction of sp³-hybridized carbons (saturated carbons)
Ring Properties
Detailed information about ring systems in molecules:
Basic Ring Counts
chemlactica/mol_opt/example_run.py
Total number of rings in the molecule
Number of aromatic rings (like benzene)
Number of saturated rings (rings with no double or aromatic bonds)
Number of aliphatic rings (non-aromatic rings, which may still contain double bonds)
Hetero vs Carbocyclic Rings
Aromatic Rings
Saturated Rings
Molecular Similarity
Tanimoto Similarity
ChemLactica uses Tanimoto similarity over Morgan fingerprints (ECFC4) to measure molecular similarity:
chemlactica/mol_opt/utils.py
Morgan fingerprints with radius 2 (also called ECFC4: extended-connectivity count fingerprints with diameter 4) capture the local chemical environment around each atom up to 2 bonds away.
[SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR]
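To make the similarity measure concrete, here is a minimal sketch of a generalized Tanimoto score for count fingerprints, computed as the sum of per-feature minima divided by the sum of per-feature maxima. It uses plain dictionaries in place of real Morgan fingerprints, so the feature keys are made up for illustration, and it is not the implementation in `chemlactica/mol_opt/utils.py`.

```python
from collections import Counter
from typing import Mapping


def count_tanimoto(fp_a: Mapping[int, int], fp_b: Mapping[int, int]) -> float:
    """Generalized Tanimoto for count fingerprints: sum(min) / sum(max)."""
    keys = set(fp_a) | set(fp_b)
    shared = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    total = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    # Two empty fingerprints have nothing in common; report 0.0 by convention.
    return shared / total if total else 0.0


# Toy fingerprints: keys stand in for hashed atom environments (hypothetical).
fp_query = Counter({101: 2, 202: 1, 303: 1})
fp_candidate = Counter({101: 1, 202: 1, 404: 2})

score = count_tanimoto(fp_query, fp_candidate)
print(f"{score:.2f}")
```

In practice the fingerprints would come from a cheminformatics toolkit such as RDKit rather than hand-written counts.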
Similarity Ranges
- 0.0 - 0.3: Very different molecules
- 0.3 - 0.5: Some structural similarity
- 0.5 - 0.7: Moderate similarity
- 0.7 - 0.9: High similarity
- 0.9 - 1.0: Very similar (1.0 = identical)
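The ranges above can be turned into a small lookup, sketched below. The listed ranges share their endpoints (0.3, 0.5, 0.7, 0.9), so this sketch assumes each boundary value belongs to the higher band; that convention and the function name `describe_similarity` are assumptions for illustration.

```python
def describe_similarity(score: float) -> str:
    """Map a Tanimoto score in [0, 1] to the labels used in the list above.

    Assumption: a boundary value (e.g. exactly 0.3) falls in the higher band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"similarity must be in [0, 1], got {score}")
    if score < 0.3:
        return "Very different molecules"
    if score < 0.5:
        return "Some structural similarity"
    if score < 0.7:
        return "Moderate similarity"
    if score < 0.9:
        return "High similarity"
    return "Very similar"


print(describe_similarity(0.62))
```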
Property Tags Reference
Complete list from the codebase:
chemlactica/utils/text_format_utils.py
Custom Properties
You can also define custom properties for fine-tuning and optimization:
Property-Guided Generation
Example: Generate High QED Molecule
chemlactica/generation/rejection_sampling_utils.py
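As a rough illustration of property-guided prompting, the sketch below assembles a prompt that asks for a molecule with a high QED. The `</s>` prefix, the `[START_SMILES]` opening token, and the helper name `build_prompt` are assumptions that this page does not confirm; check them against the repository before relying on them.

```python
from typing import Mapping


def build_prompt(
    properties: Mapping[str, float],
    prefix: str = "</s>",                 # assumed sequence separator
    start_token: str = "[START_SMILES]",  # assumed SMILES opening tag
) -> str:
    """Concatenate property tags, then leave the SMILES open for the model."""
    tags = "".join(f"[{name}]{value:.2f}[/{name}]" for name, value in properties.items())
    return f"{prefix}{tags}{start_token}"


prompt = build_prompt({"QED": 0.95})
print(prompt)
```

The resulting string would be tokenized and passed to the model's generation call, which then continues the text with a SMILES string.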
Example: TPSA + Weight Oracle
From the optimization example:
chemlactica/mol_opt/example_run.py
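The oracle code itself is not reproduced on this page. As a purely hypothetical illustration of how an oracle can combine two properties into one score, the sketch below rewards TPSA up to a cap and rejects molecules above a weight limit. The scoring rule, the 350 Da limit, and the function name are all assumptions, not the logic in `example_run.py`; the property values are passed in directly, whereas a real oracle would compute them from the molecule.

```python
def tpsa_weight_score(
    tpsa: float,
    weight: float,
    tpsa_cap: float = 140.0,    # assumed cap, matching the 140 Å² rule of thumb
    max_weight: float = 350.0,  # assumed weight limit in Da
) -> float:
    """Hypothetical oracle score in [0, 1] combining TPSA and molecular weight."""
    if weight > max_weight:
        return 0.0  # hard constraint: too heavy
    # Clamp TPSA into [0, tpsa_cap], then normalize.
    return min(max(tpsa, 0.0), tpsa_cap) / tpsa_cap


# Score a small batch of (TPSA, weight) pairs.
batch = [(63.06, 325.10), (200.0, 300.0), (90.0, 400.0)]
scores = [tpsa_weight_score(t, w) for t, w in batch]
print(scores)
```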
Property Formatting
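The tags on this page all share the `[TAG]value[/TAG]` layout, so tagged text can be read back with a short regular expression. The sketch below is a minimal parser under that assumption; `parse_property_tags` is a hypothetical helper, not a function from the ChemLactica codebase, and it keeps only the last value if a tag repeats.

```python
import re

# Matches [NAME]...[/NAME], requiring the closing name to equal the opening one.
TAG_PATTERN = re.compile(r"\[([A-Z0-9_]+)\](.*?)\[/\1\]")


def parse_property_tags(text: str) -> dict:
    """Return {tag_name: raw_value_string} for every tag pair found in text."""
    return {name: value for name, value in TAG_PATTERN.findall(text)}


text = "[QED]0.95[/QED][SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR]"
parsed = parse_property_tags(text)

# The SIMILAR value holds a SMILES string and a score separated by a space.
smiles, similarity = parsed["SIMILAR"].rsplit(" ", 1)
print(parsed["QED"], smiles, float(similarity))
```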
Next Steps
- Molecular Optimization: Use properties for optimization
- SMILES Format: Learn about molecular representation
- Model Architectures: Explore the models
- Property Prediction: Fine-tune for predictions