Molecular Properties

Overview

ChemLactica models understand and can predict a wide range of molecular properties. These properties are encoded using special tags in the training data and can be used to guide molecule generation and optimization.

Core Properties

These are the most commonly used properties in ChemLactica:

QED (Quantitative Estimate of Drug-likeness)

QED (float, required): A measure of how drug-like a molecule is, ranging from 0 to 1.

chemlactica/generation/rejection_sampling_utils.py

from rdkit.Chem.QED import qed

drug_likeness = qed(mol)

Higher QED values (closer to 1) indicate more drug-like molecules. Typically, values above 0.5 are considered favorable for drug candidates.

Tag format: [QED]0.95[/QED]

SAS (Synthetic Accessibility Score)

SAS (float, required): Estimates how difficult it is to synthesize a molecule, ranging from 1 (easy) to 10 (very difficult).

Tag format: [SAS]2.25[/SAS]

Example from the README:

# Generate molecule with SAS around 2.25
prompt = "</s>[SAS]2.25[/SAS][START_SMILES]"
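Several property tags can be placed in front of [START_SMILES] in the same way. The helper below is a minimal sketch of that idea, not ChemLactica code: `build_prompt` is a hypothetical name, and it assumes every value is a float written with two decimals and that tags are emitted in the order given.

```python
from typing import Mapping


def build_prompt(properties: Mapping[str, float]) -> str:
    """Concatenate [TAG]value[/TAG] pairs and open a SMILES block."""
    tags = "".join(
        f"[{name}]{value:.2f}[/{name}]" for name, value in properties.items()
    )
    return f"</s>{tags}[START_SMILES]"


# Reproduces the single-property prompt shown above
print(build_prompt({"SAS": 2.25}))

# Two properties in one prompt
print(build_prompt({"QED": 0.9, "TPSA": 63.06}))
```

Keeping the formatting in one place makes it harder to produce a prompt whose number format differs from the training data.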

TPSA (Topological Polar Surface Area)

TPSA (float, required): The surface area of polar atoms in a molecule (Ų), important for predicting drug absorption.

chemlactica/mol_opt/example_run.py

from rdkit.Chem import rdMolDescriptors

tpsa = rdMolDescriptors.CalcTPSA(molecule.mol)

TPSA values under 140 Ų typically indicate good oral bioavailability. Lower TPSA generally means better membrane permeability.

Tag format: [TPSA]63.06[/TPSA]

CLogP (Partition Coefficient)

CLOGP (float, required): Measures lipophilicity (how well a molecule dissolves in fats vs water). Important for drug absorption and distribution.

Tag format: [CLOGP]2.45[/CLOGP]

Optimal CLogP values for drugs typically range from 0 to 5. Too high indicates poor water solubility; too low indicates poor membrane permeability.

Molecular Weight

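WEIGHT (float, required): The exact molecular weight in Daltons (Da), computed from the most abundant isotope of each element.

chemlactica/mol_opt/example_run.py

from rdkit.Chem import rdMolDescriptors

weight = rdMolDescriptors.CalcExactMolWt(molecule.mol)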

Tag format: [WEIGHT]325.10[/WEIGHT]
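The rule-of-thumb thresholds quoted in this section (QED above 0.5, TPSA under 140 Ų, CLogP between 0 and 5) can be collected into a quick screening check. This is an illustrative sketch only: `CoreProperties` and `flag_core_properties` are hypothetical names, the values are assumed to be precomputed, and the cutoffs are heuristics rather than hard limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoreProperties:
    """Precomputed descriptor values for one molecule (hypothetical container)."""
    qed: float
    tpsa: float
    clogp: float


def flag_core_properties(p: CoreProperties) -> List[str]:
    """Return a warning for each heuristic threshold the molecule misses."""
    warnings = []
    if p.qed <= 0.5:
        warnings.append("QED at or below 0.5")
    if p.tpsa >= 140:
        warnings.append("TPSA at or above 140")
    if not 0 <= p.clogp <= 5:
        warnings.append("CLogP outside 0 to 5")
    return warnings


# Values taken from the tag examples in this section: no warnings expected
print(flag_core_properties(CoreProperties(qed=0.95, tpsa=63.06, clogp=2.45)))
```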

Structural Properties

These properties describe the molecular structure:

Hydrogen Bond Donors and Acceptors

NUMHDONORS (int): Number of hydrogen bond donor groups (like -OH, -NH).

NUMHACCEPTORS (int): Number of hydrogen bond acceptor groups (like =O, -N-).

Tag format:

[NUMHDONORS]2[/NUMHDONORS]
[NUMHACCEPTORS]4[/NUMHACCEPTORS]

Atom Counts

NUMHETEROATOMS (int): Number of non-carbon, non-hydrogen atoms (heteroatoms).

HEAVYATOMCOUNT (int): Total number of non-hydrogen atoms.

NOCOUNT (int): Number of nitrogen and oxygen atoms.

NHOHCOUNT (int): Number of NH and OH groups.

Tag format:

[NUMHETEROATOMS]5[/NUMHETEROATOMS]
[HEAVYATOMCOUNT]24[/HEAVYATOMCOUNT]

Bond Properties

NUMROTATABLEBONDS (int): Number of rotatable bonds, an indicator of molecular flexibility.

FRACTIONCSP3 (float): Fraction of sp³ hybridized carbons (saturated carbons).

Tag format:

[NUMROTATABLEBONDS]8[/NUMROTATABLEBONDS]
[FRACTIONCSP3]0.32[/FRACTIONCSP3]

Ring Properties

Detailed information about ring systems in molecules:

Basic Ring Counts

chemlactica/mol_opt/example_run.py

from rdkit.Chem import rdMolDescriptors

num_rings = rdMolDescriptors.CalcNumRings(molecule.mol)

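RINGCOUNT (int): Total number of rings in the molecule.

NUMAROMATICRINGS (int): Number of aromatic rings (like benzene).

NUMSATURATEDRINGS (int): Number of saturated rings, i.e. rings with no double or aromatic bonds (like cyclohexane).

NUMALIPHATICRINGS (int): Number of aliphatic rings, i.e. non-aromatic rings, which may still contain double bonds (like cyclohexene).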

Hetero vs Carbocyclic Rings

Aromatic Rings

NUMAROMATICHETEROCYCLES (int): Aromatic rings containing heteroatoms (like pyridine).

NUMAROMATICCARBOCYCLES (int): Aromatic rings with only carbon atoms (like benzene).

Saturated Rings

NUMSATURATEDHETEROCYCLES (int): Saturated rings containing heteroatoms (like piperidine).

NUMSATURATEDCARBOCYCLES (int): Saturated rings with only carbon atoms (like cyclohexane).

Aliphatic Rings

NUMALIPHATICHETEROCYCLES (int): Aliphatic rings containing heteroatoms.

NUMALIPHATICCARBOCYCLES (int): Aliphatic rings with only carbon atoms.
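Because each ring in a class either contains a heteroatom or consists only of carbon, the heterocycle and carbocycle counts should add up to the class total. The sketch below checks that on a dictionary of tag values. `inconsistent_ring_counts` is a hypothetical helper and the example numbers are made up for illustration (one benzene ring plus one piperidine ring).

```python
from typing import Dict, List

# Class total -> (heterocycle tag, carbocycle tag)
RING_FAMILIES = {
    "NUMAROMATICRINGS": ("NUMAROMATICHETEROCYCLES", "NUMAROMATICCARBOCYCLES"),
    "NUMSATURATEDRINGS": ("NUMSATURATEDHETEROCYCLES", "NUMSATURATEDCARBOCYCLES"),
    "NUMALIPHATICRINGS": ("NUMALIPHATICHETEROCYCLES", "NUMALIPHATICCARBOCYCLES"),
}


def inconsistent_ring_counts(props: Dict[str, int]) -> List[str]:
    """Return the class totals that do not equal hetero + carbo counts."""
    bad = []
    for total, (hetero, carbo) in RING_FAMILIES.items():
        if props[total] != props[hetero] + props[carbo]:
            bad.append(total)
    return bad


# Illustrative values: one benzene ring and one piperidine ring
example = {
    "NUMAROMATICRINGS": 1, "NUMAROMATICHETEROCYCLES": 0, "NUMAROMATICCARBOCYCLES": 1,
    "NUMSATURATEDRINGS": 1, "NUMSATURATEDHETEROCYCLES": 1, "NUMSATURATEDCARBOCYCLES": 0,
    "NUMALIPHATICRINGS": 1, "NUMALIPHATICHETEROCYCLES": 1, "NUMALIPHATICCARBOCYCLES": 0,
}
print(inconsistent_ring_counts(example))
```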

Molecular Similarity

Tanimoto Similarity

ChemLactica uses Tanimoto similarity over Morgan fingerprints (ECFC4) to measure molecular similarity:

chemlactica/mol_opt/utils.py

from rdkit import DataStructs
from rdkit.Chem import AllChem

def get_morgan_fingerprint(mol):
    # Radius 2, folded into a 2048-bit vector
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def tanimoto_dist_func(fing1, fing2, fingerprint: str = "morgan"):
    # Despite the name, this returns a similarity (1.0 = identical), not a distance
    return DataStructs.TanimotoSimilarity(fing1, fing2)

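Morgan fingerprints with radius 2 correspond to extended-connectivity fingerprints of diameter 4 (ECFP4 for the bit-vector form used in the code above, ECFC4 for the count-based form). They capture the local chemical environment around each atom up to 2 bonds away.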

Tag format: [SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR]
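For bit-vector fingerprints, Tanimoto similarity is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below shows that formula on plain Python sets of on-bit indices instead of RDKit objects, then formats the result as a [SIMILAR] tag. `tanimoto` and `similar_tag` are hypothetical names, and the bit sets are made up for illustration.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity |A & B| / |A | B| on sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


def similar_tag(smiles: str, similarity: float) -> str:
    """Format a reference SMILES and its similarity as a [SIMILAR] tag."""
    return f"[SIMILAR]{smiles} {similarity:.2f}[/SIMILAR]"


a = {1, 5, 9, 12}
b = {1, 5, 7, 12, 20}
score = tanimoto(a, b)  # 3 shared bits out of 6 distinct bits
print(similar_tag("CC(=O)OC1=CC=CC=C1C(=O)O", score))
```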

Similarity Ranges

0.0 - 0.3: Very different molecules
0.3 - 0.5: Some structural similarity
0.5 - 0.7: Moderate similarity
0.7 - 0.9: High similarity
0.9 - 1.0: Very similar (1.0 = identical)
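The bands above can be turned into a small lookup. Adjacent bands share their boundary value (0.3 appears in two of them), so this sketch assumes a boundary belongs to the higher band. `similarity_band` is a hypothetical name.

```python
def similarity_band(score: float) -> str:
    """Map a Tanimoto similarity to the descriptive band listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("Tanimoto similarity must lie in [0, 1]")
    if score < 0.3:
        return "very different molecules"
    if score < 0.5:
        return "some structural similarity"
    if score < 0.7:
        return "moderate similarity"
    if score < 0.9:
        return "high similarity"
    return "very similar"


print(similarity_band(0.62))
```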

Property Tags Reference

An excerpt of the tag definitions from the codebase (the full dictionary contains many more entries):

chemlactica/utils/text_format_utils.py

SPECIAL_TAGS = {
    "SMILES": {"start": "[START_SMILES]", "end": "[END_SMILES]"},
    "similarity": {"start": "[SIMILAR]", "end": "[/SIMILAR]", "type": float},
    "SAS": {"start": "[SAS]", "end": "[/SAS]", "type": float},
    "WEIGHT": {"start": "[WEIGHT]", "end": "[/WEIGHT]", "type": float},
    "TPSA": {"start": "[TPSA]", "end": "[/TPSA]", "type": float},
    "CLOGP": {"start": "[CLOGP]", "end": "[/CLOGP]", "type": float},
    "QED": {"start": "[QED]", "end": "[/QED]", "type": float},
    "FRACTIONCSP3": {"start": "[FRACTIONCSP3]", "end": "[/FRACTIONCSP3]", "type": float},
    # ... and many more
}
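Going the other way, tag values can be read back out of a string with a regular expression. The following is a minimal sketch, not ChemLactica's own parser: `parse_float_tags` is a hypothetical helper, and `FLOAT_TAGS` covers only the float tags listed in the excerpt above.

```python
import re
from typing import Dict

# Float-valued tags from the excerpt above (illustrative subset)
FLOAT_TAGS = ("SAS", "WEIGHT", "TPSA", "CLOGP", "QED", "FRACTIONCSP3")


def parse_float_tags(text: str) -> Dict[str, float]:
    """Extract the first value of each known float tag found in text."""
    found = {}
    for name in FLOAT_TAGS:
        match = re.search(rf"\[{name}\](.*?)\[/{name}\]", text)
        if match:
            found[name] = float(match.group(1))
    return found


text = "[QED]0.95[/QED][TPSA]63.06[/TPSA][START_SMILES]CCO[END_SMILES]"
print(parse_float_tags(text))
```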

Custom Properties

You can also define custom properties for fine-tuning and optimization:

# Generic property tag
"[PROPERTY]oracle_score 0.85[/PROPERTY]"

# Used in rejection sampling for optimization
"[PROPERTY]oracle_score {mol_entry.score:.2f}[/PROPERTY]"

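The second template above is filled in from a molecule entry's score. A runnable sketch of that step follows. `ScoredMolecule` is a hypothetical stand-in for whatever entry object carries a `score` attribute, and `oracle_score_tag` is likewise an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class ScoredMolecule:
    """Hypothetical stand-in for an entry that exposes a .score attribute."""
    smiles: str
    score: float


def oracle_score_tag(mol_entry: ScoredMolecule) -> str:
    """Render the generic [PROPERTY] tag with a two-decimal oracle score."""
    return f"[PROPERTY]oracle_score {mol_entry.score:.2f}[/PROPERTY]"


print(oracle_score_tag(ScoredMolecule(smiles="CCO", score=0.85)))
```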
Property-Guided Generation

Example: Generate High QED Molecule

chemlactica/generation/rejection_sampling_utils.py

import random

# Ask for a molecule with a QED sampled uniformly between 0.9 and 0.99
input_text = f"</s>[QED]{random.uniform(0.9, 0.99):.2f}[/QED][START_SMILES]"
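To draw a batch of such prompts reproducibly, the sampling can be wrapped in a function with its own seeded random generator. This is a sketch under stated assumptions: `make_qed_prompts` and its parameters are hypothetical and not part of the ChemLactica codebase.

```python
import random
from typing import List, Optional


def make_qed_prompts(
    n: int, low: float = 0.9, high: float = 0.99, seed: Optional[int] = None
) -> List[str]:
    """Build n prompts, each requesting a QED sampled uniformly in [low, high]."""
    rng = random.Random(seed)  # local generator, leaves the global RNG untouched
    return [
        f"</s>[QED]{rng.uniform(low, high):.2f}[/QED][START_SMILES]"
        for _ in range(n)
    ]


for prompt in make_qed_prompts(3, seed=0):
    print(prompt)
```

Passing the same seed yields the same prompts, which helps when comparing generation runs.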

Example: TPSA + Weight Oracle

From the optimization example:

chemlactica/mol_opt/example_run.py

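from typing import List

from rdkit.Chem import rdMolDescriptors

# MoleculeEntry is ChemLactica's molecule wrapper; its .mol attribute is an RDKit Mol.


class TPSA_Weight_Oracle:
    def __call__(self, molecules: List[MoleculeEntry]):
        oracle_scores = []
        for molecule in molecules:
            try:
                # The score is the TPSA itself
                tpsa = rdMolDescriptors.CalcTPSA(molecule.mol)
                oracle_score = tpsa
                weight = rdMolDescriptors.CalcExactMolWt(molecule.mol)
                num_rings = rdMolDescriptors.CalcNumRings(molecule.mol)

                # Apply constraints: zero the score for molecules of 350 Da or more,
                # or with fewer than two rings
                if weight >= 350:
                    oracle_score = 0
                if num_rings < 2:
                    oracle_score = 0
            except Exception:
                # Any descriptor failure counts as a zero score
                oracle_score = 0

            oracle_scores.append(oracle_score)
        return oracle_scores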
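The scoring rule inside the oracle can be isolated from RDKit so it is easy to test on its own. The sketch below takes precomputed descriptor values. `constrained_tpsa_score` is a hypothetical name, and the inputs are illustrative numbers rather than values computed for a real molecule.

```python
def constrained_tpsa_score(tpsa: float, weight: float, num_rings: int) -> float:
    """Return TPSA as the score unless a constraint is violated.

    Mirrors the oracle above: molecules weighing 350 Da or more, or with
    fewer than two rings, score zero.
    """
    if weight >= 350 or num_rings < 2:
        return 0.0
    return tpsa


print(constrained_tpsa_score(63.06, 325.10, 2))  # passes both constraints
print(constrained_tpsa_score(63.06, 360.00, 2))  # too heavy
print(constrained_tpsa_score(63.06, 325.10, 1))  # too few rings
```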

Property Formatting

Float properties are always formatted to 2 decimal places in the training data:

chemlactica/utils/text_format_utils.py

if SPECIAL_TAGS[key].get("type") is float:
    value = "{:.2f}".format(float(value))
    assert len(value.split(".")[-1]) == 2
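The same rule can be applied when building tags by hand, so that prompts match the number format used in training. The sketch below assumes a small, illustrative type table: `TAG_TYPES` and `format_tag` are hypothetical names, not the codebase's SPECIAL_TAGS dictionary or its formatting function.

```python
# Illustrative subset of tag types; the real SPECIAL_TAGS dictionary is larger
TAG_TYPES = {"QED": float, "TPSA": float, "NUMHDONORS": int}


def format_tag(key: str, value) -> str:
    """Format one property tag, writing floats with exactly two decimals."""
    if TAG_TYPES.get(key) is float:
        text = "{:.2f}".format(float(value))
    else:
        text = str(value)
    return f"[{key}]{text}[/{key}]"


print(format_tag("QED", 0.9))
print(format_tag("NUMHDONORS", 2))
```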

Next Steps

Molecular Optimization: Use properties for optimization

SMILES Format: Learn about molecular representation

Model Architectures: Explore the models

Property Prediction: Fine-tune for predictions
