SMILES Format

What is SMILES?

SMILES (Simplified Molecular-Input Line-Entry System) is a line notation that encodes a molecular structure as a single string of ASCII characters. ChemLactica models use SMILES as the primary representation for small organic molecules.

Because a SMILES string is plain text, a molecule can be fed to a language model like any other character sequence.

SMILES in ChemLactica

Special Tags

ChemLactica uses special tags to mark SMILES strings in the input:

chemlactica/utils/text_format_utils.py

SPECIAL_TAGS = {
    "SMILES": {"start": "[START_SMILES]", "end": "[END_SMILES]"},
    # ... other tags
}

Example Usage

# Aspirin molecule
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"

# Formatted for ChemLactica
formatted = "[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES]"
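
A minimal sketch of wrapping and unwrapping these tags follows. The helper names `wrap_smiles` and `extract_smiles` are hypothetical and not part of ChemLactica; only the two tag strings are taken from `SPECIAL_TAGS` above.

```python
import re

# Tag strings copied from SPECIAL_TAGS; the helper names are illustrative only.
START_TAG = "[START_SMILES]"
END_TAG = "[END_SMILES]"

# Non-greedy match so that several tagged molecules in one text stay separate.
_TAGGED = re.compile(re.escape(START_TAG) + r"(.*?)" + re.escape(END_TAG))


def wrap_smiles(smiles: str) -> str:
    """Surround a SMILES string with the start and end tags."""
    return f"{START_TAG}{smiles}{END_TAG}"


def extract_smiles(text: str) -> list[str]:
    """Return every tagged SMILES found in text, stripped of whitespace."""
    return [match.strip() for match in _TAGGED.findall(text)]


print(wrap_smiles("CCO"))  # [START_SMILES]CCO[END_SMILES]
```

Extraction is the more useful direction in practice, since generated text usually contains property tags around the molecule.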

Canonical SMILES

ChemLactica internally converts all SMILES to their canonical form for consistency:

chemlactica/mol_opt/utils.py

from rdkit import Chem

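def canonicalize(smiles):
    # MolFromSmiles returns None for unparsable input, in which case
    # MolToSmiles raises; validate the string first if it may be invalid.
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=True)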

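Canonicalization ensures that the same molecule always maps to the same string, regardless of how it was originally written. Canonical forms are toolkit-specific, so strings canonicalized by RDKit should only be compared with other RDKit-canonicalized strings.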

Why Canonicalization Matters

The same molecule can be written in multiple ways:

# These all represent the same molecule (ethanol)
"CCO"           # canonical
"OCC"           # alternative
"C(O)C"         # another alternative

# After canonicalization, all become:
"CCO"

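Because equivalent strings collapse to one canonical form, canonicalization is what makes deduplication reliable. The sketch below assumes a `canonicalize` callable such as the RDKit-based function shown earlier. `dedupe_by_canonical` is a hypothetical helper, and the lookup table in the usage lines only stands in for RDKit so that the snippet runs without it.

```python
from typing import Callable, Iterable


def dedupe_by_canonical(
    smiles_list: Iterable[str],
    canonicalize: Callable[[str], str],
) -> list[str]:
    """Keep one canonical string per distinct molecule, in first-seen order."""
    seen: set[str] = set()
    unique: list[str] = []
    for smiles in smiles_list:
        canonical = canonicalize(smiles)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique


# Stand-in canonicalizer: a hand-written table for the strings above.
# In practice, pass the RDKit-based canonicalize function instead.
toy_table = {"CCO": "CCO", "OCC": "CCO", "C(O)C": "CCO", "CC(=O)O": "CC(=O)O"}
result = dedupe_by_canonical(["OCC", "C(O)C", "CC(=O)O", "CCO"], toy_table.__getitem__)
print(result)  # ['CCO', 'CC(=O)O']
```
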
SMILES Syntax Basics

Atoms

Organic atoms

Atoms in the organic subset (B, C, N, O, P, S, F, Cl, Br, I) can be written directly, without brackets:

C   = Carbon
N   = Nitrogen  
O   = Oxygen

Bracket atoms

All other elements, and any atom that carries a charge or an isotope label, must be written in brackets:

[Cu]   = Copper
[O-]   = Oxygen anion
[13C]  = Carbon-13 isotope
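
One way to see the difference between bare and bracket atoms is to split a SMILES string into tokens. The regular expression below is a simplified sketch, not a full SMILES grammar and not ChemLactica's tokenizer; `tokenize_smiles` is a hypothetical name.

```python
import re

# Order matters: bracket atoms and two-letter symbols must be tried first.
_TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atom, e.g. [Cu], [O-], [13C]
    r"|Br|Cl"            # two-letter organic-subset atoms, before B and C
    r"|[BCNOPSFI]"       # one-letter organic-subset atoms
    r"|[bcnops]"         # aromatic atoms
    r"|%\d{2}"           # two-digit ring-closure label
    r"|\d"               # one-digit ring-closure label
    r"|[-=#$:/\\.()*]"   # bonds, branches, disconnection, wildcard
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, branch and ring tokens."""
    tokens = _TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so check nothing was lost.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


print(tokenize_smiles("[13C]C(Cl)Br"))  # ['[13C]', 'C', '(', 'Cl', ')', 'Br']
```

Note that `Cl` comes out as one token, while `[13C]` stays intact with its isotope label.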

Bonds

# Single bond (implicit)
"CC"      # ethane: C-C

# Double bond
"C=C"     # ethene: C=C

# Triple bond  
"C#C"     # ethyne: C≡C

# Aromatic bonds (implicit between adjacent lowercase atoms)
"c1ccccc1" # benzene

Branches

# Parentheses indicate branches
"CC(C)C"   # isobutane: has a branch
"CC(C)(C)C" # neopentane: two branches on same carbon

Rings

# Matching digits mark the two atoms that are bonded to close a ring
"C1CCCCC1"  # cyclohexane: 6-membered ring
"C1CC1"     # cyclopropane: 3-membered ring
"c1ccccc1"  # benzene: aromatic ring

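Branches and ring closures both come in pairs, which allows a cheap syntactic pre-check before handing a string to a full parser. The function below is a sketch under that assumption; `check_branches_and_rings` is a hypothetical helper, and passing it says nothing about valence or chemical validity.

```python
def check_branches_and_rings(smiles: str) -> bool:
    """Return True if parentheses balance and every ring label is paired."""
    depth = 0
    open_rings: set[int] = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms: digits inside are isotopes, charges or H counts.
            end = smiles.find("]", i)
            if end == -1:
                return False
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring-closure label such as %10.
            label = smiles[i + 1 : i + 3]
            if len(label) != 2 or not label.isdecimal():
                return False
            open_rings ^= {int(label)}
            i += 3
            continue
        elif ch.isdecimal():
            # A label opens a ring on first sight and closes it on the second.
            open_rings ^= {int(ch)}
        i += 1
    return depth == 0 and not open_rings


print(check_branches_and_rings("C1CCCCC1"))  # True
print(check_branches_and_rings("C1CC"))      # False
```
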
SMILES Examples from ChemLactica

Here are real examples from the codebase:

Simple Molecules

chemlactica/utils/utils.py

# Example from test code
prompt = "[START_SMILES] CCCCN [END_SMILES][CLOGP 0.00][SAS 123][QED]"

With Properties and Similarity

# From the README example: generating a similar molecule
# Adjacent string literals are concatenated, so the prompt contains no newlines.
prompt = (
    "</s>[SAS]2.25[/SAS]"
    "[SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR]"
    "[START_SMILES]"
)

# This prompt asks the model for a molecule with:
# - an SAS score around 2.25
# - a similarity of about 0.62 to aspirin (CC(=O)OC1=CC=CC=C1C(=O)O)
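
Prompts of this shape can be assembled programmatically. The sketch below assumes the tag layout of the README example above, with tags joined without separators and values formatted to two decimals. `build_prompt` is a hypothetical helper, not a ChemLactica function.

```python
def build_prompt(
    properties: dict[str, float],
    similars: list[tuple[str, float]],
) -> str:
    """Assemble a generation prompt in the tag layout of the README example."""
    parts = ["</s>"]
    for name, value in properties.items():
        parts.append(f"[{name}]{value:.2f}[/{name}]")
    for smiles, similarity in similars:
        parts.append(f"[SIMILAR]{smiles} {similarity:.2f}[/SIMILAR]")
    # The start tag is left open so the model completes the molecule.
    parts.append("[START_SMILES]")
    return "".join(parts)


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(build_prompt({"SAS": 2.25}, [(aspirin, 0.62)]))
```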

Molecular Optimization Context

chemlactica/mol_opt/utils.py

def create_prompt_with_similars(mol_entry: MoleculeEntry, sim_range=None):
    prompt = ""
    for sim_mol_entry in mol_entry.similar_mol_entries:
        if sim_range:
            prompt += f"[SIMILAR]{sim_mol_entry.smiles} {generate_random_number(sim_range[0], sim_range[1]):.2f}[/SIMILAR]"
        else:
            prompt += f"[SIMILAR]{sim_mol_entry.smiles} {tanimoto_dist_func(sim_mol_entry.fingerprint, mol_entry.fingerprint):.2f}[/SIMILAR]"
    return prompt
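
When no `sim_range` is given, the value written into the tag comes from `tanimoto_dist_func`, which is not defined in this section. Assuming it returns the standard Tanimoto coefficient (shared bits divided by the bits set in either fingerprint), the formula can be sketched on plain sets of on-bit indices. The function name and the set-based fingerprint here are illustrative, not ChemLactica's implementation.

```python
def tanimoto_similarity(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Toy fingerprints: 2 shared bits out of 5 distinct bits overall.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
print(f"{tanimoto_similarity(fp_a, fp_b):.2f}")  # 0.40
```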

Working with SMILES in ChemLactica

Converting SMILES to Molecular Objects

chemlactica/mol_opt/utils.py

from rdkit import Chem

class MoleculeEntry:
    def __init__(self, smiles, score=0, **kwargs):
        self.smiles = smiles
        self.score = score
        if smiles:
            self.smiles = canonicalize(smiles)
            self.mol = Chem.MolFromSmiles(smiles)
            self.fingerprint = get_morgan_fingerprint(self.mol)

Validation

from rdkit import Chem

def is_valid_smiles(smiles):
    """Check if a SMILES string is valid"""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None
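
One edge case is worth guarding against: RDKit typically parses an empty string into a molecule with zero atoms rather than returning None, so the check above may accept blank input (worth confirming on the installed version). A stricter variant could look like the following sketch; `is_nonempty_valid_smiles` is a hypothetical name.

```python
from rdkit import Chem


def is_nonempty_valid_smiles(smiles: str) -> bool:
    """Stricter variant of is_valid_smiles that also rejects blank input."""
    if not isinstance(smiles, str) or not smiles.strip():
        return False
    mol = Chem.MolFromSmiles(smiles)  # None on parse or sanitization failure
    return mol is not None and mol.GetNumAtoms() > 0
```

This is useful when filtering generated text, where an empty completion between the tags is a common failure mode.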

Advanced SMILES Features

Stereochemistry

# Cis/trans isomerism
"C/C=C/C"   # trans-2-butene
"C/C=C\\C"   # cis-2-butene

# Chiral centers  
"C[C@H](O)CC"   # (S)-2-butanol
"C[C@@H](O)CC"  # (R)-2-butanol

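The doubled backslash in the cis example is Python string escaping, not SMILES syntax: the SMILES itself contains a single backslash. A short illustration, assuming the strings are written as Python literals, shows that a raw string avoids the escape.

```python
# Both literals denote the same 7-character SMILES: C/C=C\C
escaped = "C/C=C\\C"
raw = r"C/C=C\C"

print(escaped == raw)  # True
print(raw)             # C/C=C\C
print(len(raw))        # 7
```
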
Aromatic Systems

# Lowercase letters indicate aromatic atoms
"c1ccccc1"      # benzene
"c1ccc2ccccc2c1" # naphthalene

Common Molecules in SMILES

Drug Molecules

# Aspirin
"CC(=O)OC1=CC=CC=C1C(=O)O"

# Ibuprofen  
"CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"

# Caffeine
"CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

Simple Organic Molecules

# Ethanol
"CCO"

# Acetic acid
"CC(=O)O"

# Benzene
"c1ccccc1"

# Glucose (pyranose ring form, stereochemistry omitted)
"C(C1C(C(C(C(O1)O)O)O)O)O"

SMILES Resources

RDKit Documentation

Official RDKit SMILES guide

Daylight SMILES Theory

Original SMILES specification

Next Steps

Molecular Properties

Learn about calculated properties

Model Architectures

Explore the model family

Training Data

Understand the corpus

Generation Guide

Generate molecules
