What is SMILES?
SMILES (Simplified Molecular Input Line Entry System) is a notation that represents molecular structures as linear strings of characters. ChemLactica models use SMILES as the primary representation format for small organic molecules.
Because a SMILES string is plain text, even complex molecular structures can be tokenized and processed by a language model in the same way as ordinary text.
SMILES in ChemLactica
ChemLactica uses special tags to mark SMILES strings in the input:
chemlactica/utils/text_format_utils.py
SPECIAL_TAGS = {
    "SMILES": {"start": "[START_SMILES]", "end": "[END_SMILES]"},
    # ... other tags
}
Example Usage
# Aspirin molecule
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
# Formatted for ChemLactica
formatted = "[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES]"
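A minimal sketch of wrapping a SMILES string in these tags and recovering it again. The helper names `wrap_smiles` and `extract_smiles` are hypothetical and not part of ChemLactica; only the two tag strings are taken from `SPECIAL_TAGS`.

```python
import re

START_TAG = "[START_SMILES]"
END_TAG = "[END_SMILES]"

# Non-greedy match of whatever sits between the two tags.
_SMILES_SPAN = re.compile(re.escape(START_TAG) + r"(.*?)" + re.escape(END_TAG))


def wrap_smiles(smiles: str) -> str:
    """Surround a SMILES string with the start and end tags (hypothetical helper)."""
    return f"{START_TAG}{smiles}{END_TAG}"


def extract_smiles(text: str) -> list[str]:
    """Return every tagged SMILES found in `text`, stripped of surrounding whitespace."""
    return [match.strip() for match in _SMILES_SPAN.findall(text)]


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
formatted = wrap_smiles(aspirin)
print(formatted)
print(extract_smiles(formatted))
```

Stripping whitespace in `extract_smiles` is a design choice of this sketch, so that padded tags such as `[START_SMILES] CCCCN [END_SMILES]` yield a clean string.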
Canonical SMILES
ChemLactica internally converts all SMILES to their canonical form for consistency:
chemlactica/mol_opt/utils.py
from rdkit import Chem
def canonicalize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=True)
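Canonicalization ensures that a given molecule always maps to the same string, regardless of how it was originally written. The canonical form is defined by the toolkit that produces it (here RDKit), so canonical strings from different toolkits are not guaranteed to match.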
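This behaviour can be checked directly with RDKit. The sketch below uses a hypothetical variant, `canonicalize_or_none`, that also guards against unparseable input: `Chem.MolFromSmiles` returns `None` for invalid SMILES, and passing `None` on to `Chem.MolToSmiles` raises an error. The guard is an addition for illustration and is not claimed to be in the ChemLactica source.

```python
from typing import Optional

from rdkit import Chem, RDLogger

# Silence RDKit's parser messages for the deliberately invalid input below.
RDLogger.DisableLog("rdApp.*")


def canonicalize_or_none(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, canonical=True)


# Three spellings of ethanol collapse to one canonical string.
spellings = ["CCO", "OCC", "C(O)C"]
canonical_forms = {canonicalize_or_none(s) for s in spellings}
print(canonical_forms)               # a single-element set

print(canonicalize_or_none("C1CC"))  # ring 1 is never closed, so parsing fails
```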
Why Canonicalization Matters
The same molecule can be written in multiple ways:
# These all represent the same molecule (ethanol)
"CCO" # canonical
"OCC" # alternative
"C(O)C" # another alternative
# After canonicalization, all become:
"CCO"
SMILES Syntax Basics
Atoms
Atoms in the organic subset (B, C, N, O, P, S, F, Cl, Br, I) can be written directly:
C = Carbon
N = Nitrogen
O = Oxygen
Other atoms, and atoms that carry a charge or an isotope label, are written in square brackets:
[Cu] = Copper
[O-] = Oxygen anion
[13C] = Carbon-13 isotope
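The atom rules above can be illustrated with a small regex tokenizer. This is a simplified sketch covering only the symbols discussed on this page (no wildcard atoms or reaction syntax), and the names `SMILES_TOKEN` and `tokenize` are hypothetical. It is not how ChemLactica or RDKit tokenize SMILES.

```python
import re

# Order matters: bracket atoms first, then two-letter atoms, then one-letter atoms.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atom such as [Cu], [O-], [13C]
    r"|Br|Cl"            # two-letter organic-subset atoms
    r"|[BCNOPSFI]"       # one-letter organic-subset atoms
    r"|[bcnops]"         # aromatic atoms
    r"|%\d{2}|\d"        # ring-closure labels
    r"|[=#$:/\\.()-]"    # bonds, dots, and branch parentheses
)


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any character is not recognized."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


print(tokenize("[13C]C(Cl)Br"))  # ['[13C]', 'C', '(', 'Cl', ')', 'Br']
```

Matching `Cl` and `Br` before single letters keeps chlorine from being read as carbon followed by a stray `l`.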
Bonds
# Single bond (implicit)
"CC" # ethane: C-C
# Double bond
"C=C" # ethene: C=C
# Triple bond
"C#C" # ethyne: C≡C
# Aromatic bond
"c1ccccc1" # benzene (aromatic)
Branches
# Parentheses indicate branches
"CC(C)C" # isobutane: has a branch
"CC(C)(C)C" # neopentane: two branches on same carbon
Rings
# Numbers indicate ring closures
"C1CCCCC1" # cyclohexane: 6-membered ring
"C1CC1" # cyclopropane: 3-membered ring
"c1ccccc1" # benzene: aromatic ring
SMILES Examples from ChemLactica
Here are real examples from the codebase:
Simple Molecules
chemlactica/utils/utils.py
# Example from test code
prompt = "[START_SMILES] CCCCN [END_SMILES][CLOGP 0.00][SAS 123][QED]"
With Properties and Similarity
# From README example - generating similar molecule
prompt = """</s>[SAS]2.25[/SAS]
[SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR]
[START_SMILES]"""
# This generates a molecule with:
# - SAS score around 2.25
# - Similarity ~0.62 to aspirin (CC(=O)OC1=CC=CC=C1C(=O)O)
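A prompt of this shape can be assembled programmatically. The function below is a hypothetical sketch, not a ChemLactica API. It assumes that tags are concatenated without line breaks, that property tags come before `[SIMILAR]` tags, and that values are formatted to two decimals.

```python
def build_generation_prompt(
    properties: dict[str, float],
    similar: list[tuple[str, float]],
) -> str:
    """Concatenate property tags, [SIMILAR] tags, and the opening SMILES tag."""
    parts = ["</s>"]
    for name, value in properties.items():
        parts.append(f"[{name}]{value:.2f}[/{name}]")
    for smiles, similarity in similar:
        parts.append(f"[SIMILAR]{smiles} {similarity:.2f}[/SIMILAR]")
    # Leaving [START_SMILES] open asks the model to continue with a molecule.
    parts.append("[START_SMILES]")
    return "".join(parts)


prompt = build_generation_prompt(
    {"SAS": 2.25},
    [("CC(=O)OC1=CC=CC=C1C(=O)O", 0.62)],
)
print(prompt)
```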
Molecular Optimization Context
chemlactica/mol_opt/utils.py
def create_prompt_with_similars(mol_entry: MoleculeEntry, sim_range=None):
    prompt = ""
    for sim_mol_entry in mol_entry.similar_mol_entries:
        if sim_range:
            prompt += f"[SIMILAR]{sim_mol_entry.smiles} {generate_random_number(sim_range[0], sim_range[1]):.2f}[/SIMILAR]"
        else:
            prompt += f"[SIMILAR]{sim_mol_entry.smiles} {tanimoto_dist_func(sim_mol_entry.fingerprint, mol_entry.fingerprint):.2f}[/SIMILAR]"
    return prompt
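The number written into each `[SIMILAR]` tag is a Tanimoto score between two fingerprints. How `tanimoto_dist_func` is implemented is not shown here, so the sketch below is only a stand-in that computes the Tanimoto coefficient over plain Python sets of on-bit indices. The function name and the example bit sets are invented for illustration.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| over sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


query_bits = {1, 2, 3}
neighbour_bits = {2, 3, 4}

similarity = tanimoto(query_bits, neighbour_bits)  # 2 shared bits / 4 total bits
tag = f"[SIMILAR]CCO {similarity:.2f}[/SIMILAR]"
print(tag)  # [SIMILAR]CCO 0.50[/SIMILAR]
```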
Working with SMILES in ChemLactica
Converting SMILES to Molecular Objects
chemlactica/mol_opt/utils.py
from rdkit import Chem
class MoleculeEntry:
    def __init__(self, smiles, score=0, **kwargs):
        self.smiles = smiles
        self.score = score
        if smiles:
            self.smiles = canonicalize(smiles)
            self.mol = Chem.MolFromSmiles(smiles)
            self.fingerprint = get_morgan_fingerprint(self.mol)
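`get_morgan_fingerprint` is not shown on this page. One plausible way to compute such a fingerprint with RDKit is sketched below. The radius of 2 and the 2048-bit size are assumptions chosen for illustration, not values confirmed by the ChemLactica source, and `morgan_fingerprint` is a hypothetical name.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Assumed settings: radius 2, 2048 bits.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def morgan_fingerprint(smiles: str):
    """Parse a SMILES string and return its Morgan bit-vector fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles!r}")
    return generator.GetFingerprint(mol)


aspirin = morgan_fingerprint("CC(=O)OC1=CC=CC=C1C(=O)O")
salicylic_acid = morgan_fingerprint("O=C(O)c1ccccc1O")

print(DataStructs.TanimotoSimilarity(aspirin, aspirin))         # 1.0
print(DataStructs.TanimotoSimilarity(aspirin, salicylic_acid))  # between 0 and 1
```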
Validation
from rdkit import Chem
def is_valid_smiles(smiles):
    """Check if a SMILES string is valid"""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None
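Parsing with RDKit is the authoritative check. As a cheap pre-filter, the branch and ring-closure rules from the syntax basics can be tested without any chemistry toolkit. The function below is a hypothetical sketch. It only verifies that parentheses balance and that every ring-closure label appears in pairs, so a string that passes can still be chemically invalid.

```python
def has_balanced_branches_and_rings(smiles: str) -> bool:
    """Check parenthesis balance and ring-closure pairing, ignoring bracket atoms."""
    depth = 0
    open_rings: set[int] = set()
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            # Digits inside [...] are isotopes, charges, or H counts, not ring labels.
            in_bracket = ch != "]"
        elif ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring-closure label such as %12.
            label = smiles[i + 1 : i + 3]
            if len(label) != 2 or not label.isdigit():
                return False
            open_rings ^= {int(label)}
            i += 2
        elif ch in "0123456789":
            # Toggle the label: first occurrence opens the ring, second closes it.
            open_rings ^= {int(ch)}
        i += 1
    return depth == 0 and not open_rings and not in_bracket


print(has_balanced_branches_and_rings("C1CCCCC1"))  # True
print(has_balanced_branches_and_rings("C1CCCC"))    # False
```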
Advanced SMILES Features
Stereochemistry
# Cis/trans isomerism
"C/C=C/C" # trans-2-butene
"C/C=C\\C" # cis-2-butene (backslash escaped for the Python string; the SMILES itself is C/C=C\C)
# Chiral centers
"C[C@H](O)CC" # (S)-2-butanol
"C[C@@H](O)CC" # (R)-2-butanol
Aromatic Systems
# Lowercase letters indicate aromatic atoms
"c1ccccc1" # benzene
"c1ccc2ccccc2c1" # naphthalene
Common Molecules in SMILES
# Aspirin
"CC(=O)OC1=CC=CC=C1C(=O)O"
# Ibuprofen
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"
# Caffeine
"CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
# Ethanol
"CCO"
# Acetic acid
"CC(=O)O"
# Benzene
"c1ccccc1"
# Glucose
"C(C1C(C(C(C(O1)O)O)O)O)O"
SMILES Resources
RDKit Documentation: official RDKit SMILES guide
Daylight SMILES Theory: original SMILES specification
Next Steps
Molecular Properties: learn about calculated properties
Model Architectures: explore the model family
Training Data: understand the corpus
Generation Guide: generate molecules