
Overview

ChemLactica models understand and can predict a wide range of molecular properties. These properties are encoded using special tags in the training data and can be used to guide molecule generation and optimization.

Core Properties

These are the most commonly used properties in ChemLactica:

QED (Quantitative Estimate of Drug-likeness)

QED (float, required): A measure of how drug-like a molecule is, ranging from 0 to 1.
chemlactica/generation/rejection_sampling_utils.py
from rdkit.Chem.QED import qed

drug_likeness = qed(mol)
Higher QED values (closer to 1) indicate more drug-like molecules. Typically, values above 0.5 are considered favorable for drug candidates.
Tag format: [QED]0.95[/QED]

SAS (Synthetic Accessibility Score)

SAS (float, required): Estimates how difficult it is to synthesize a molecule, ranging from 1 (easy) to 10 (very difficult).
Tag format: [SAS]2.25[/SAS]
Example from README:
# Generate molecule with SAS around 2.25
prompt = "</s>[SAS]2.25[/SAS][START_SMILES]"
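The prompt above can also be assembled programmatically. The following is a minimal sketch, not code from the ChemLactica repository: `build_prompt` is a hypothetical helper, and passing more than one property in a single prompt is an assumption about usage rather than something the README example shows.

```python
from typing import Mapping


def build_prompt(targets: Mapping[str, float]) -> str:
    """Hypothetical helper: wrap each target value in its property tag.

    Values are written with two decimals, and the prompt ends with
    [START_SMILES] so that the model continues with a molecule.
    """
    parts = ["</s>"]
    for name, value in targets.items():
        parts.append(f"[{name}]{value:.2f}[/{name}]")
    parts.append("[START_SMILES]")
    return "".join(parts)


# Reproduces the README prompt for a SAS target of 2.25
prompt = build_prompt({"SAS": 2.25})
```

Keeping the tag names as dictionary keys means the same helper works for any float-valued tag listed on this page.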

TPSA (Topological Polar Surface Area)

TPSA (float, required): The surface area of polar atoms in a molecule (Ų), important for predicting drug absorption.
chemlactica/mol_opt/example_run.py
from rdkit.Chem import rdMolDescriptors

tpsa = rdMolDescriptors.CalcTPSA(molecule.mol)
TPSA values under 140 Ų typically indicate good oral bioavailability. Lower TPSA generally means better membrane permeability.
Tag format: [TPSA]63.06[/TPSA]

CLogP (Partition Coefficient)

CLOGP (float, required): The calculated logarithm of the octanol-water partition coefficient, a measure of lipophilicity (how well a molecule dissolves in fats vs water). Important for drug absorption and distribution.
Tag format: [CLOGP]2.45[/CLOGP]
Optimal CLogP values for drugs typically range from 0 to 5. Too high indicates poor water solubility; too low indicates poor membrane permeability.
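The rules of thumb quoted in this section (QED above 0.5, TPSA under 140 Ų, CLogP between 0 and 5) can be collected into a simple screening check. This is only an illustrative sketch: `PropertyProfile` and `guideline_flags` are hypothetical names, and the thresholds are the approximate guidelines stated above, not hard limits enforced by ChemLactica.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PropertyProfile:
    """Hypothetical container for already-computed property values."""
    qed: float
    tpsa: float   # in square angstroms
    clogp: float


def guideline_flags(profile: PropertyProfile) -> Dict[str, bool]:
    """Report which of the rule-of-thumb guidelines a molecule meets."""
    return {
        "qed_favorable": profile.qed > 0.5,
        "tpsa_under_140": profile.tpsa < 140.0,
        "clogp_in_0_to_5": 0.0 <= profile.clogp <= 5.0,
    }


# Values taken from the tag examples on this page
flags = guideline_flags(PropertyProfile(qed=0.95, tpsa=63.06, clogp=2.45))
```

Returning one flag per guideline, rather than a single pass/fail value, makes it easy to see which property a candidate misses.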

Molecular Weight

WEIGHT (float, required): The exact (monoisotopic) molecular weight in daltons (Da), as returned by RDKit's CalcExactMolWt.
chemlactica/mol_opt/example_run.py
weight = rdMolDescriptors.CalcExactMolWt(molecule.mol)
Tag format: [WEIGHT]325.10[/WEIGHT]

Structural Properties

These properties describe the molecular structure:

Hydrogen Bond Donors and Acceptors

NUMHDONORS (int): Number of hydrogen bond donor groups (such as -OH, -NH)
NUMHACCEPTORS (int): Number of hydrogen bond acceptor groups (such as =O, -N-)
Tag format:
[NUMHDONORS]2[/NUMHDONORS]
[NUMHACCEPTORS]4[/NUMHACCEPTORS]

Atom Counts

NUMHETEROATOMS (int): Number of heteroatoms (atoms other than carbon and hydrogen)
HEAVYATOMCOUNT (int): Total number of non-hydrogen atoms
NOCOUNT (int): Number of nitrogen and oxygen atoms
NHOHCOUNT (int): Number of NH and OH groups
Tag format:
[NUMHETEROATOMS]5[/NUMHETEROATOMS]
[HEAVYATOMCOUNT]24[/HEAVYATOMCOUNT]

Bond Properties

NUMROTATABLEBONDS (int): Number of rotatable bonds, an indicator of molecular flexibility
FRACTIONCSP3 (float): Fraction of carbon atoms that are sp³ hybridized (saturated carbons)
Tag format:
[NUMROTATABLEBONDS]8[/NUMROTATABLEBONDS]
[FRACTIONCSP3]0.32[/FRACTIONCSP3]

Ring Properties

Detailed information about ring systems in molecules:

Basic Ring Counts

chemlactica/mol_opt/example_run.py
from rdkit.Chem import rdMolDescriptors

num_rings = rdMolDescriptors.CalcNumRings(molecule.mol)
RINGCOUNT (int): Total number of rings in the molecule
NUMAROMATICRINGS (int): Number of aromatic rings (like benzene)
NUMSATURATEDRINGS (int): Number of saturated rings (rings with no double or aromatic bonds)
NUMALIPHATICRINGS (int): Number of aliphatic rings (non-aromatic rings, which may still contain double bonds)

Hetero vs Carbocyclic Rings

NUMAROMATICHETEROCYCLES (int): Aromatic rings containing heteroatoms (like pyridine)
NUMAROMATICCARBOCYCLES (int): Aromatic rings with only carbon atoms (like benzene)
NUMSATURATEDHETEROCYCLES (int): Saturated rings containing heteroatoms (like piperidine)
NUMSATURATEDCARBOCYCLES (int): Saturated rings with only carbon atoms (like cyclohexane)
NUMALIPHATICHETEROCYCLES (int): Aliphatic rings containing heteroatoms
NUMALIPHATICCARBOCYCLES (int): Aliphatic rings with only carbon atoms
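The ring descriptors in this section are related to each other, which allows a quick sanity check on a set of values. The sketch below assumes the usual RDKit-style definitions (every ring is either aromatic or aliphatic, each class splits into heterocycles and carbocycles, and saturated rings are a subset of aliphatic rings). `ring_count_problems` is a hypothetical helper, and the example values are hand-written for a molecule with one benzene ring and one piperidine ring, not computed output.

```python
from typing import Dict, List

RING_KINDS = ("AROMATIC", "SATURATED", "ALIPHATIC")


def ring_count_problems(counts: Dict[str, int]) -> List[str]:
    """Return a list of inconsistencies found in a set of ring descriptors."""
    problems = []
    # Assumed identity: aromatic and aliphatic rings partition all rings
    if counts["NUMAROMATICRINGS"] + counts["NUMALIPHATICRINGS"] != counts["RINGCOUNT"]:
        problems.append("aromatic + aliphatic rings != RINGCOUNT")
    # Assumed identity: each class splits into heterocycles and carbocycles
    for kind in RING_KINDS:
        total = counts[f"NUM{kind}RINGS"]
        split = counts[f"NUM{kind}HETEROCYCLES"] + counts[f"NUM{kind}CARBOCYCLES"]
        if split != total:
            problems.append(f"{kind.lower()} heterocycles + carbocycles != total")
    # Assumed relation: a saturated ring is also counted as aliphatic
    if counts["NUMSATURATEDRINGS"] > counts["NUMALIPHATICRINGS"]:
        problems.append("more saturated rings than aliphatic rings")
    return problems


# Hand-written values: one benzene ring plus one piperidine ring
example = {
    "RINGCOUNT": 2,
    "NUMAROMATICRINGS": 1, "NUMSATURATEDRINGS": 1, "NUMALIPHATICRINGS": 1,
    "NUMAROMATICHETEROCYCLES": 0, "NUMAROMATICCARBOCYCLES": 1,
    "NUMSATURATEDHETEROCYCLES": 1, "NUMSATURATEDCARBOCYCLES": 0,
    "NUMALIPHATICHETEROCYCLES": 1, "NUMALIPHATICCARBOCYCLES": 0,
}
problems = ring_count_problems(example)
```

A check like this can be useful when writing prompts by hand, since contradictory ring tags may be impossible for any molecule to satisfy.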

Molecular Similarity

Tanimoto Similarity

ChemLactica uses Tanimoto similarity over Morgan fingerprints (ECFC4) to measure molecular similarity:
chemlactica/mol_opt/utils.py
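from rdkit import DataStructs
from rdkit.Chem import AllChem

def get_morgan_fingerprint(mol):
    # Radius 2, folded into a 2048-bit vector
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def tanimoto_dist_func(fing1, fing2, fingerprint: str = "morgan"):
    # Despite the name, this returns a similarity in [0, 1], not a distance;
    # the `fingerprint` argument is not used in this excerpt
    return DataStructs.TanimotoSimilarity(fing1, fing2)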
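Morgan fingerprints with radius 2 (diameter 4, the "4" in ECFC4, Extended Connectivity Fingerprints) capture the local chemical environment around each atom up to 2 bonds away. Note that the bit-vector call shown here records the presence or absence of each feature rather than its count.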
Tag format: [SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR]
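To make the similarity value in the tag concrete, the following sketch computes the Tanimoto coefficient by hand on sets of "on" bit indices and formats a `[SIMILAR]` tag. It is a plain-Python illustration, not the RDKit implementation: `tanimoto` and `similar_tag` are hypothetical helpers, and returning 0.0 for two empty fingerprints is a convention chosen for this example.

```python
from typing import Set


def tanimoto(on_bits_a: Set[int], on_bits_b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits in either set."""
    if not on_bits_a and not on_bits_b:
        return 0.0  # convention chosen for this sketch
    shared = len(on_bits_a & on_bits_b)
    return shared / (len(on_bits_a) + len(on_bits_b) - shared)


def similar_tag(smiles: str, similarity: float) -> str:
    """Format a reference molecule and its similarity as a [SIMILAR] tag."""
    return f"[SIMILAR]{smiles} {similarity:.2f}[/SIMILAR]"


# Two toy fingerprints sharing 2 of 4 distinct on-bits
score = tanimoto({1, 2, 3}, {2, 3, 4})
tag = similar_tag("CC(=O)OC1=CC=CC=C1C(=O)O", 0.62)
```

With real molecules, the bit sets would come from the fingerprint function shown above rather than being written by hand.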

Similarity Ranges

  • 0.0 - 0.3: Very different molecules
  • 0.3 - 0.5: Some structural similarity
  • 0.5 - 0.7: Moderate similarity
  • 0.7 - 0.9: High similarity
  • 0.9 - 1.0: Very similar (1.0 = identical)
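The ranges above can be turned into a small lookup. This is a sketch only: `describe_similarity` is a hypothetical helper, and because the listed ranges share their endpoints, it assumes that each boundary value belongs to the higher range.

```python
def describe_similarity(score: float) -> str:
    """Map a Tanimoto similarity to the qualitative ranges listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if score < 0.3:
        return "very different"
    if score < 0.5:
        return "some structural similarity"
    if score < 0.7:
        return "moderate similarity"
    if score < 0.9:
        return "high similarity"
    return "very similar"


label = describe_similarity(0.62)
```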

Property Tags Reference

Excerpt of the tag definitions in the codebase:
chemlactica/utils/text_format_utils.py
SPECIAL_TAGS = {
    "SMILES": {"start": "[START_SMILES]", "end": "[END_SMILES]"},
    "similarity": {"start": "[SIMILAR]", "end": "[/SIMILAR]", "type": float},
    "SAS": {"start": "[SAS]", "end": "[/SAS]", "type": float},
    "WEIGHT": {"start": "[WEIGHT]", "end": "[/WEIGHT]", "type": float},
    "TPSA": {"start": "[TPSA]", "end": "[/TPSA]", "type": float},
    "CLOGP": {"start": "[CLOGP]", "end": "[/CLOGP]", "type": float},
    "QED": {"start": "[QED]", "end": "[/QED]", "type": float},
    "FRACTIONCSP3": {"start": "[FRACTIONCSP3]", "end": "[/FRACTIONCSP3]", "type": float},
    # ... and many more
}
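Because every property uses a matching `[NAME]...[/NAME]` pair, tagged text can be read back with a regular expression. The following sketch is an assumption-laden illustration rather than the project's own parser: `parse_property_tags` is a hypothetical helper, it only handles tags whose names consist of capital letters and digits, and it leaves the `[START_SMILES]`/`[END_SMILES]` pair alone.

```python
import re
from typing import List, Tuple

# Matches [NAME]value[/NAME]; the backreference \1 requires the same name to close
TAG_PATTERN = re.compile(r"\[([A-Z0-9]+)\](.*?)\[/\1\]")


def parse_property_tags(text: str) -> List[Tuple[str, str]]:
    """Return (tag name, raw value) pairs in the order they appear.

    Values stay as strings so the caller can decide how to convert them,
    and repeated tags (for example several [SIMILAR] entries) are all kept.
    """
    return TAG_PATTERN.findall(text)


sample = "[QED]0.95[/QED][TPSA]63.06[/TPSA][START_SMILES]CCO[END_SMILES]"
pairs = parse_property_tags(sample)
```

Returning a list of pairs instead of a dictionary avoids silently dropping a tag that appears more than once.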

Custom Properties

You can also define custom properties for fine-tuning and optimization:
# Generic property tag
"[PROPERTY]oracle_score 0.85[/PROPERTY]"

# Used in rejection sampling for optimization
"[PROPERTY]oracle_score {mol_entry.score:.2f}[/PROPERTY]"
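As an illustration of how the second template might be filled in, the sketch below formats a score with an f-string. `ScoredMolecule` is a hypothetical stand-in for whatever object carries the `score` attribute in the codebase, and `oracle_score` is simply the property name used in the example above.

```python
from dataclasses import dataclass


@dataclass
class ScoredMolecule:
    """Hypothetical stand-in for an entry that carries an oracle score."""
    smiles: str
    score: float


def property_tag(mol_entry: ScoredMolecule, name: str = "oracle_score") -> str:
    """Format a custom property as '[PROPERTY]name value[/PROPERTY]'."""
    return f"[PROPERTY]{name} {mol_entry.score:.2f}[/PROPERTY]"


tag = property_tag(ScoredMolecule(smiles="CCO", score=0.85))
```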

Property-Guided Generation

Example: Generate High QED Molecule

chemlactica/generation/rejection_sampling_utils.py
input_text = f"</s>[QED]{random.uniform(0.9, 0.99):.2f}[/QED][START_SMILES]"

Example: TPSA + Weight Oracle

From the optimization example:
chemlactica/mol_opt/example_run.py
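from typing import List

from rdkit.Chem import rdMolDescriptors

# Assumed import path; MoleculeEntry wraps an RDKit molecule as `.mol`
from chemlactica.mol_opt.utils import MoleculeEntry


class TPSA_Weight_Oracle:
    def __call__(self, molecules: List[MoleculeEntry]):
        oracle_scores = []
        for molecule in molecules:
            try:
                # The score is the TPSA unless a constraint is violated
                tpsa = rdMolDescriptors.CalcTPSA(molecule.mol)
                oracle_score = tpsa
                weight = rdMolDescriptors.CalcExactMolWt(molecule.mol)
                num_rings = rdMolDescriptors.CalcNumRings(molecule.mol)

                # Apply constraints: weight below 350 and at least 2 rings
                if weight >= 350:
                    oracle_score = 0
                if num_rings < 2:
                    oracle_score = 0
            except Exception:
                # Any failure while computing descriptors scores 0
                oracle_score = 0

            oracle_scores.append(oracle_score)
        return oracle_scores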
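The scoring rule inside the oracle can be isolated from the descriptor calculation, which makes the constraints easier to read and test. The sketch below restates that rule on plain numbers; `tpsa_weight_score` is a hypothetical function name, not part of the ChemLactica codebase.

```python
def tpsa_weight_score(tpsa: float, weight: float, num_rings: int) -> float:
    """Return TPSA as the score, or 0 when a constraint is violated.

    Constraints mirror the oracle above: molecular weight must be
    below 350 and the molecule must contain at least 2 rings.
    """
    if weight >= 350 or num_rings < 2:
        return 0.0
    return tpsa


# A molecule with TPSA 63.06, weight 325.10 and 2 rings keeps its TPSA as score
score = tpsa_weight_score(63.06, 325.10, 2)
```

Because the score is the TPSA itself, an optimizer using this oracle is pushed toward higher TPSA among molecules that satisfy both constraints.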

Property Formatting

Float properties are always formatted to 2 decimal places in the training data:
chemlactica/utils/text_format_utils.py
if SPECIAL_TAGS[key].get("type") is float:
    value = "{:.2f}".format(float(value))
    assert len(value.split(".")[-1]) == 2
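The effect of this formatting rule can be seen on a few values. The sketch below wraps the same `"{:.2f}"` formatting in a hypothetical helper, `format_tag_value`; the `is_float` flag stands in for the `"type"` lookup in `SPECIAL_TAGS`.

```python
def format_tag_value(value, is_float: bool) -> str:
    """Format a property value the way the excerpt above does for floats."""
    if is_float:
        text = "{:.2f}".format(float(value))
        # Same check as the excerpt: exactly two digits after the point
        assert len(text.split(".")[-1]) == 2
        return text
    # Non-float properties (for example integer counts) are left as-is
    return str(value)


examples = [
    format_tag_value(0.5, is_float=True),      # short floats are padded
    format_tag_value("325.1", is_float=True),  # strings are converted first
    format_tag_value(63.0612, is_float=True),  # extra digits are rounded away
    format_tag_value(4, is_float=False),       # integer counts are unchanged
]
```

When writing prompts by hand, using the same two-decimal format keeps the input consistent with what the model saw during training.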

Next Steps

Molecular Optimization

Use properties for optimization

SMILES Format

Learn about molecular representation

Model Architectures

Explore the models

Property Prediction

Fine-tune for predictions
