Prompt Format

Overview

ChemLactica models use a structured prompt format to condition molecule generation on desired properties and structural similarities. Prompts are constructed using special tags that the model learned during pretraining.

Example from the README: the prompt </s>[SAS]2.25[/SAS][SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR][START_SMILES] asks the model for a molecule with an SAS score of about 2.25 and a Tanimoto similarity of about 0.62 to aspirin.

Prompt Structure

A complete prompt follows this pattern:

</s>[PROPERTY_TAG]value[/PROPERTY_TAG]...[SIMILAR]SMILES similarity[/SIMILAR]...[START_SMILES]

Components

EOS Token (</s>): Marks the start of each prompt and, in optimization prompts, separates consecutive molecule entries
Property Tags: Specify desired molecular properties
Similarity Tags: Reference molecules with similarity scores
Start Tag: [START_SMILES] signals the model to begin SMILES generation
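
The four components can be assembled with a few lines of string formatting. The following is a minimal sketch, not code from the ChemLactica repository: build_prompt and EOS_TOKEN are hypothetical names, and placing property tags before SIMILAR tags simply follows the README example above.

```python
from typing import Mapping, Sequence, Tuple

EOS_TOKEN = "</s>"  # the EOS token shown in this document


def build_prompt(
    properties: Mapping[str, float],
    similars: Sequence[Tuple[str, float]] = (),
) -> str:
    """Assemble a generation prompt: EOS, property tags, SIMILAR tags, start tag."""
    parts = [EOS_TOKEN]
    for tag, value in properties.items():
        # The value is inserted as written, e.g. 2.25 becomes "2.25".
        parts.append(f"[{tag}]{value}[/{tag}]")
    for smiles, similarity in similars:
        parts.append(f"[SIMILAR]{smiles} {similarity:.2f}[/SIMILAR]")
    parts.append("[START_SMILES]")
    return "".join(parts)


prompt = build_prompt({"SAS": 2.25}, [("CC(=O)OC1=CC=CC=C1C(=O)O", 0.62)])
print(prompt)
# </s>[SAS]2.25[/SAS][SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.62[/SIMILAR][START_SMILES]
```

Keeping the tag names as dictionary keys means any tag the model knows can be passed without changing the helper.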

Property Tags

Available Properties

ChemLactica models understand these molecular properties from pretraining:

SAS - Synthetic Accessibility Score

Estimates how difficult a molecule is to synthesize.

Range: 1.0 (easy) to 10.0 (very difficult)
Typical values: 2.0-4.0 for drug-like molecules

"</s>[SAS]2.25[/SAS][START_SMILES]"

QED - Quantitative Estimate of Drug-likeness

Measures drug-likeness based on molecular properties.

Range: 0.0 to 1.0
Typical values: >0.7 for drug-like molecules

"</s>[QED]0.9[/QED][START_SMILES]"

TPSA - Topological Polar Surface Area

Surface area of polar atoms, important for membrane permeability.

Range: 0 to ~300 Ų
Typical values: 20-140 Ų for oral bioavailability

"</s>[TPSA]80[/TPSA][START_SMILES]"

CLogP - Calculated LogP

Lipophilicity measure for membrane permeability.

Range: Typically -2 to 6
Typical values: 0-5 for drug-like molecules

"</s>[CLogP]2.5[/CLogP][START_SMILES]"

Molecular Weight

Total molecular weight in Daltons.

Range: Variable
Typical values: 150-500 Da for drug-like molecules

"</s>[MW]350[/MW][START_SMILES]"

Multiple Properties

Combine multiple property tags in a single prompt:

prompt = "</s>[SAS]2.5[/SAS][QED]0.85[/QED][TPSA]75[/TPSA][START_SMILES]"
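
Before combining targets, it can help to check them against the typical ranges listed under Available Properties. This is only a sketch: TYPICAL_RANGES and out_of_range are hypothetical names, not part of ChemLactica, and the bounds are the typical drug-like values quoted above (QED ">0.7" is encoded here as 0.7 to 1.0).

```python
# Typical drug-like ranges copied from the property descriptions in this document.
TYPICAL_RANGES = {
    "SAS": (2.0, 4.0),
    "QED": (0.7, 1.0),
    "TPSA": (20.0, 140.0),
    "CLogP": (0.0, 5.0),
    "MW": (150.0, 500.0),
}


def out_of_range(targets: dict) -> list:
    """Return the tags whose target value falls outside its typical range."""
    flagged = []
    for tag, value in targets.items():
        if tag not in TYPICAL_RANGES:
            continue  # unknown tags are skipped rather than rejected
        low, high = TYPICAL_RANGES[tag]
        if not low <= value <= high:
            flagged.append(tag)
    return flagged


print(out_of_range({"SAS": 2.5, "QED": 0.85, "TPSA": 75}))  # []
print(out_of_range({"QED": 0.3, "MW": 350}))                # ['QED']
```

A flagged tag is not an error; it only signals that the target lies outside the range this document describes as typical.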

Similarity Constraints

SIMILAR Tag Format

Reference molecules to guide structural similarity:

"[SIMILAR]{reference_smiles} {similarity_score}[/SIMILAR]"

reference_smiles: SMILES string of the reference molecule
similarity_score: Tanimoto similarity (0.0 to 1.0) using Morgan fingerprints (ECFC4)
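
For intuition about the similarity score, the sketch below computes Tanimoto similarity over sets of fingerprint features (shared features divided by total distinct features). It is an illustration under stated assumptions: the feature IDs are toy values, tanimoto is a hypothetical helper, and the repository's own fingerprinting (for example, whether counts or bits are used) is not shown in this document and may differ.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint features."""
    if not features_a and not features_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(features_a & features_b)
    return shared / len(features_a | features_b)


# Toy feature IDs standing in for hashed Morgan environments.
ref = {1, 2, 3, 4}
candidate = {3, 4, 5, 6}
print(f"{tanimoto(ref, candidate):.2f}")  # 0.33
```

Two of the six distinct features are shared, giving 2/6, which is the kind of two-decimal value written into a SIMILAR tag.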

Example: Aspirin Analogs

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
prompt = f"</s>[SIMILAR]{aspirin} 0.62[/SIMILAR][START_SMILES]"

This generates molecules with ~0.62 Tanimoto similarity to aspirin.

Multiple Similar Molecules

Include multiple reference molecules:

utils.py:175

def create_prompt_with_similars(mol_entry: MoleculeEntry, sim_range=None):
    """Create SIMILAR tags for reference molecules."""
    prompt = ""
    for sim_mol_entry in mol_entry.similar_mol_entries:
        if sim_range:
            # Use random similarity in range for generation
            prompt += f"[SIMILAR]{sim_mol_entry.smiles} {generate_random_number(sim_range[0], sim_range[1]):.2f}[/SIMILAR]"
        else:
            # Use actual Tanimoto similarity
            prompt += f"[SIMILAR]{sim_mol_entry.smiles} {tanimoto_dist_func(sim_mol_entry.fingerprint, mol_entry.fingerprint):.2f}[/SIMILAR]"
    return prompt

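During optimization, passing a similarity range (e.g., 0.4-0.9) replaces each computed similarity with a random value drawn from that range. This encourages exploration while keeping generated molecules structurally related to the references.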
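
The two branches can be tried without RDKit using stand-in data. The sketch below is not the repository code: SimilarRef and similar_tags are hypothetical, similarities are precomputed rather than derived from fingerprints, and a uniform draw is assumed in place of generate_random_number, whose implementation is not shown here.

```python
import random
from dataclasses import dataclass


@dataclass
class SimilarRef:
    """Hypothetical stand-in for a similar molecule entry."""
    smiles: str
    similarity: float  # precomputed Tanimoto similarity to the target


def similar_tags(refs, sim_range=None, rng=random):
    """Render SIMILAR tags, sampling the score from sim_range when it is given."""
    tags = []
    for ref in refs:
        if sim_range:
            # Assumption: a uniform draw stands in for generate_random_number.
            score = rng.uniform(sim_range[0], sim_range[1])
        else:
            score = ref.similarity
        tags.append(f"[SIMILAR]{ref.smiles} {score:.2f}[/SIMILAR]")
    return "".join(tags)


refs = [
    SimilarRef("CC(=O)OC1=CC=CC=C1C(=O)O", 0.62),
    SimilarRef("CC1=CC=CC=C1", 0.5),
]
print(similar_tags(refs))  # uses the stored similarities
print(similar_tags(refs, sim_range=(0.4, 0.9), rng=random.Random(0)))  # sampled
```

Passing a seeded random.Random instance makes the sampled prompt reproducible, which is convenient when comparing runs.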

Prompt Construction in Optimization

The OptimEntry.to_prompt() method builds complete prompts for optimization:

utils.py:197

class OptimEntry:
    def to_prompt(
        self, 
        is_generation: bool,
        include_oracle_score: bool, 
        config,
        max_score
    ):
        """Construct prompt from molecule entries and configuration."""
        prompt = ""
        
        # Add historical molecules from optimization trajectory
        for mol_entry in self.mol_entries:
            prompt += config["eos_token"]  # "</s>"
            
            # Add similar molecules
            prompt += create_prompt_with_similars(mol_entry=mol_entry)
            
            # Add custom properties
            for prop_name, prop_spec in mol_entry.add_props.items():
                prompt += f"{prop_spec['start_tag']}{prop_spec['value']}{prop_spec['end_tag']}"
            
            # Add oracle score (for rejection sampling)
            if "rej-sample-v2" in config["strategy"]:
                if include_oracle_score:
                    prompt += f"[PROPERTY]oracle_score {mol_entry.score:.2f}[/PROPERTY]"
            
            prompt += f"[START_SMILES]{mol_entry.smiles}[END_SMILES]"
        
        # Add final generation prompt
        prompt += config["eos_token"]
        
        if is_generation:
            # Use similarity range for exploration
            prompt_with_similars = create_prompt_with_similars(
                self.last_entry, 
                sim_range=config["sim_range"]  # e.g., [0.4, 0.9]
            )
        else:
            # Use actual similarities for training
            prompt_with_similars = create_prompt_with_similars(self.last_entry)
        
        prompt += prompt_with_similars
        
        # Add properties for target molecule
        for prop_name, prop_spec in self.last_entry.add_props.items():
            prompt += prop_spec["start_tag"] + prop_spec["infer_value"](self.last_entry) + prop_spec["end_tag"]
        
        # Add desired oracle score for generation
        if "rej-sample-v2" in config["strategy"]:
            if is_generation:
                desired_oracle_score = generate_random_number(
                    max_score, 
                    config["max_possible_oracle_score"]
                )
                oracle_score = desired_oracle_score
            else:
                oracle_score = self.last_entry.score
            
            if include_oracle_score:
                prompt += f"[PROPERTY]oracle_score {oracle_score:.2f}[/PROPERTY]"
        
        if is_generation:
            prompt += "[START_SMILES]"
        else:
            prompt += f"[START_SMILES]{self.last_entry.smiles}[END_SMILES]"
            prompt += config["eos_token"]
        
        return prompt
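
Because each molecule in these prompts is wrapped in [START_SMILES]...[END_SMILES], the SMILES strings can be recovered from a prompt or from decoded model output with a regular expression. This is a hedged sketch: extract_smiles is a hypothetical helper, and it assumes the text contains the closing [END_SMILES] tag, as the training-format prompts above do.

```python
import re

# Non-greedy match so bracket atoms inside a SMILES string do not end the capture early.
SMILES_PATTERN = re.compile(r"\[START_SMILES\](.*?)\[END_SMILES\]")


def extract_smiles(text: str) -> list:
    """Return every SMILES string wrapped in [START_SMILES]...[END_SMILES]."""
    return SMILES_PATTERN.findall(text)


text = (
    "</s>[QED]0.88[/QED][START_SMILES]CC(=O)OC1=CC=C(C=C1)C(=O)O[END_SMILES]"
    "</s>[QED]0.90[/QED][START_SMILES]CC1=CC=CC=C1[END_SMILES]</s>"
)
print(extract_smiles(text))
# ['CC(=O)OC1=CC=C(C=C1)C(=O)O', 'CC1=CC=CC=C1']
```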

Custom Properties

Define custom properties for specific optimization tasks:

example_run.py:98

from rdkit.Chem import rdMolDescriptors

# Define custom property specification
additional_properties = {
    "tpsa": {
        "start_tag": "[TPSA]",
        "end_tag": "[/TPSA]",
        "calculate_value": lambda mol_entry: f"{rdMolDescriptors.CalcTPSA(mol_entry.mol):.2f}",
        "infer_value": lambda mol_entry: f"{rdMolDescriptors.CalcTPSA(mol_entry.mol):.2f}",
        "value": None  # Calculated later
    }
}

# Pass to optimization
optimize(
    model, tokenizer,
    oracle, config,
    additional_properties=additional_properties
)
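
To see how such a specification turns into prompt text, the sketch below renders one property the same way to_prompt() concatenates start_tag, infer_value and end_tag. It is illustrative only: render_property is a hypothetical helper, and the entry is a SimpleNamespace with a precomputed tpsa attribute so the example runs without RDKit.

```python
from types import SimpleNamespace

# Hypothetical spec in the same shape as additional_properties above; the value
# is read from a precomputed attribute instead of being calculated with RDKit.
spec = {
    "start_tag": "[TPSA]",
    "end_tag": "[/TPSA]",
    "infer_value": lambda mol_entry: f"{mol_entry.tpsa:.2f}",
}


def render_property(prop_spec: dict, mol_entry) -> str:
    """Wrap the inferred value in the property's start and end tags."""
    return prop_spec["start_tag"] + prop_spec["infer_value"](mol_entry) + prop_spec["end_tag"]


entry = SimpleNamespace(smiles="CC(=O)OC1=CC=CC=C1C(=O)O", tpsa=63.6)
print(render_property(spec, entry))  # [TPSA]63.60[/TPSA]
```

Note that infer_value must return a string, since the result is concatenated directly with the tags.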

Oracle Score Property

When using rejection sampling (rej-sample-v2 strategy), oracle scores are included:

utils.py:214

if "rej-sample-v2" in config["strategy"]:
    if include_oracle_score:
        prompt += f"[PROPERTY]oracle_score {mol_entry.score:.2f}[/PROPERTY]"

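During fine-tuning, this tag pairs each molecule with its oracle score. At generation time, to_prompt() instead requests a score sampled between max_score and config["max_possible_oracle_score"], which steers the model toward molecules with higher oracle scores.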
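
The generation-time target can be sketched as follows. This is not the repository implementation: desired_oracle_tag is a hypothetical helper, the bounds are example values, and uniform sampling is an assumption standing in for generate_random_number.

```python
import random


def desired_oracle_tag(max_score, max_possible, rng=random):
    """Sample a target score in [max_score, max_possible] and render its tag."""
    # Assumption: uniform sampling stands in for generate_random_number.
    target = rng.uniform(max_score, max_possible)
    return f"[PROPERTY]oracle_score {target:.2f}[/PROPERTY]"


# Example bounds only; real values depend on the oracle being optimized.
tag = desired_oracle_tag(145.5, 150.0, rng=random.Random(0))
print(tag)
```

Sampling above the current maximum asks the model for a molecule that scores at least as well as those seen so far, up to the oracle's stated ceiling.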

Example Prompts

Basic Property-Conditioned Generation

# Generate drug-like molecule
"</s>[QED]0.9[/QED][START_SMILES]"

# Generate easily synthesizable molecule
"</s>[SAS]2.0[/SAS][START_SMILES]"

# Generate with specific TPSA
"</s>[TPSA]80[/TPSA][START_SMILES]"

Similarity-Guided Generation

# Generate aspirin analog
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
f"</s>[SIMILAR]{aspirin} 0.7[/SIMILAR][START_SMILES]"

# Multiple similar molecules
ref1 = "CC(=O)OC1=CC=CC=C1C(=O)O"
ref2 = "CC1=CC=CC=C1"
f"</s>[SIMILAR]{ref1} 0.6[/SIMILAR][SIMILAR]{ref2} 0.5[/SIMILAR][START_SMILES]"

Combined Constraints

# Drug-like aspirin analog with good synthetic accessibility
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
f"</s>[QED]0.85[/QED][SAS]2.5[/SAS][SIMILAR]{aspirin} 0.62[/SIMILAR][START_SMILES]"

Optimization with Rejection Sampling

# Training prompt (is_generation=False, include_oracle_score=True)
training_prompt = (
    "</s>"
    "[SIMILAR]CC(=O)OC1=CC=CC=C1C(=O)O 0.65[/SIMILAR]"
    "[QED]0.88[/QED]"
    "[PROPERTY]oracle_score 145.50[/PROPERTY]"
    "[START_SMILES]CC(=O)OC1=CC=C(C=C1)C(=O)O[END_SMILES]"
    "</s>"
)

# Generation prompt (is_generation=True)
generation_prompt = (
    "</s>"
    "[SIMILAR]CC(=O)OC1=CC=C(C=C1)C(=O)O 0.75[/SIMILAR]"
    "[QED]0.90[/QED]"
    "[PROPERTY]oracle_score 150.00[/PROPERTY]"
    "[START_SMILES]"
)

Best Practices

Property Values: Use realistic property value ranges based on your target chemical space
Similarity Scores: Balance exploration (lower similarity) and exploitation (higher similarity)
Multiple References: Include 3-5 similar molecules to guide structural features
Prompt Length: Keep total prompt + generation under 2048 tokens

Always include the EOS token </s> at the start of prompts and the [START_SMILES] tag before generation.
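
These structural rules are easy to check before sending a prompt to the model. The following is a minimal sketch with a hypothetical helper, check_generation_prompt; it inspects only the leading EOS token, the trailing start tag and SIMILAR tag balance, and does not count tokens, since that requires the model's tokenizer.

```python
def check_generation_prompt(prompt: str) -> list:
    """Return a list of problems found in a generation prompt (empty if none)."""
    problems = []
    if not prompt.startswith("</s>"):
        problems.append("missing leading </s>")
    if not prompt.endswith("[START_SMILES]"):
        problems.append("missing trailing [START_SMILES]")
    if prompt.count("[SIMILAR]") != prompt.count("[/SIMILAR]"):
        problems.append("unbalanced [SIMILAR] tags")
    return problems


good = "</s>[QED]0.85[/QED][SIMILAR]CC1=CC=CC=C1 0.50[/SIMILAR][START_SMILES]"
bad = "[QED]0.85[/QED][SIMILAR]CC1=CC=CC=C1 0.50"
print(check_generation_prompt(good))  # []
print(check_generation_prompt(bad))
# ['missing leading </s>', 'missing trailing [START_SMILES]', 'unbalanced [SIMILAR] tags']
```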

Next Steps

Molecule Generation

Sampling Strategies
