Overview

ChemLactica uses a specialized tokenizer designed for molecular SMILES strings and chemical property annotations. The tokenizer includes:
  • Standard BPE (Byte Pair Encoding) for SMILES
  • Special tokens for molecular properties
  • Custom tokens for chemical descriptors

Tokenizer Architecture

Available Tokenizers

ChemLactica provides several pre-configured tokenizers:
chemlactica/tokenizer/
├── ChemLacticaTokenizer/       # Standard tokenizer
├── ChemLacticaTokenizer66/     # Extended tokenizer (66 special tokens)
├── chemllama2-tokenizer/       # LLaMA2-based tokenizer
├── GemmaTokenizer/             # Gemma-based tokenizer
├── Mistral-7B-v0.1Tokenizer/   # Mistral-based tokenizer
└── galactica-125m/             # Galactica tokenizer
ChemLacticaTokenizer66 is recommended for most tasks as it includes comprehensive chemical property tokens.

Configuration

Tokenizer configuration is defined in tokenizer_config.json:
tokenizer_config.json
{
  "clean_up_tokenization_spaces": true,
  "model_max_length": 1000000000000000019884624838656,
  "tokenizer_class": "PreTrainedTokenizerFast"
}
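The enormous model_max_length is the Hugging Face VERY_LARGE_INTEGER sentinel (int(1e30)), meaning the tokenizer itself imposes no length limit. Enforce limits at call time instead; a one-line sketch, assuming a loaded tokenizer as shown later on this page:
tokens = tokenizer(text, truncation=True, max_length=2048)  # the tokenizer never truncates on its own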

Special Tokens

ChemLactica uses an extensive set of special tokens for chemical annotations:

Core Tokens

{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>"
}
  • <s>: Beginning of sequence token
  • </s>: End of sequence token (also used as separator)
  • <pad>: Padding token for batch processing
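You can confirm these mappings directly; a minimal check (the exact IDs come from the tokenizer files, so treat the printed values, not this sketch, as authoritative):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "chemlactica/tokenizer/ChemLacticaTokenizer66"
)

# Each special token resolves to a single, stable ID
print(tokenizer.bos_token, tokenizer.bos_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)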

Molecular Structure Tokens

[
  "[START_SMILES]",
  "[END_SMILES]",
  "[SIMILAR]",
  "[/SIMILAR]",
  "[SYNONYM]",
  "[/SYNONYM]",
  "[RELATED]",
  "[/RELATED]"
]
Usage:
</s>[SIMILAR]CCO 0.95[/SIMILAR][START_SMILES]CC(C)O[END_SMILES]</s>
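Each bracketed tag is a single token rather than a run of characters, which you can verify with the tokenizer loaded above (the exact SMILES splits depend on the learned BPE merges):
tokens = tokenizer.tokenize("[START_SMILES]CCO[END_SMILES]")
print(tokens)
# Expected shape: ['[START_SMILES]', <SMILES subtokens>, '[END_SMILES]']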

Property Tokens

Chemical properties are enclosed in specific tokens:
[
  "[PROPERTY]", "[/PROPERTY]",
  "[QED]", "[/QED]",
  "[SAS]", "[/SAS]",
  "[WEIGHT]", "[/WEIGHT]",
  "[TPSA]", "[/TPSA]",
  "[CLOGP]", "[/CLOGP]",
  "[NUMHDONORS]", "[/NUMHDONORS]",
  "[NUMHACCEPTORS]", "[/NUMHACCEPTORS]",
  "[NUMHETEROATOMS]", "[/NUMHETEROATOMS]",
  "[NUMROTATABLEBONDS]", "[/NUMROTATABLEBONDS]",
  "[NOCOUNT]", "[/NOCOUNT]",
  "[NHOHCOUNT]", "[/NHOHCOUNT]",
  "[RINGCOUNT]", "[/RINGCOUNT]",
  "[HEAVYATOMCOUNT]", "[/HEAVYATOMCOUNT]",
  "[FRACTIONCSP3]", "[/FRACTIONCSP3]",
  "[NUMAROMATICRINGS]", "[/NUMAROMATICRINGS]",
  "[NUMSATURATEDRINGS]", "[/NUMSATURATEDRINGS]",
  "[NUMAROMATICHETEROCYCLES]", "[/NUMAROMATICHETEROCYCLES]",
  "[NUMAROMATICCARBOCYCLES]", "[/NUMAROMATICCARBOCYCLES]",
  "[NUMSATURATEDHETEROCYCLES]", "[/NUMSATURATEDHETEROCYCLES]",
  "[NUMSATURATEDCARBOCYCLES]", "[/NUMSATURATEDCARBOCYCLES]",
  "[NUMALIPHATICRINGS]", "[/NUMALIPHATICRINGS]",
  "[NUMALIPHATICHETEROCYCLES]", "[/NUMALIPHATICHETEROCYCLES]",
  "[NUMALIPHATICCARBOCYCLES]", "[/NUMALIPHATICCARBOCYCLES]"
]
Common Properties:

Property   Type    Description                                     Example
QED        float   Quantitative Estimate of Drug-likeness (0-1)   [QED]0.87[/QED]
SAS        float   Synthetic Accessibility Score (1-10)           [SAS]3.2[/SAS]
WEIGHT     float   Molecular weight in g/mol                      [WEIGHT]324.5[/WEIGHT]
TPSA       float   Topological Polar Surface Area                 [TPSA]45.3[/TPSA]
CLOGP      float   Calculated LogP (lipophilicity)                [CLOGP]2.3[/CLOGP]
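These names correspond to standard RDKit descriptors, so prompt values can be computed locally. A sketch (assumes rdkit is installed; SAS needs the sascorer module from RDKit's contrib directory and is omitted, and matching the exact training-time computation is an assumption):
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

mol = Chem.MolFromSmiles("CC(C)Oc1ccccc1")
properties = {
    "qed": QED.qed(mol),                # drug-likeness, 0-1
    "weight": Descriptors.MolWt(mol),   # molecular weight, g/mol
    "tpsa": Descriptors.TPSA(mol),      # topological polar surface area
    "clogp": Descriptors.MolLogP(mol),  # calculated LogP
}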

Custom Property Tokens

For domain-specific properties:
[
  "[IUPAC]", "[/IUPAC]",
  "[VAR_NAME]", "[/VAR_NAME]",
  "[VAR_DESC]", "[/VAR_DESC]",
  "[VAR_VAL]", "[/VAR_VAL]",
  "[ASSAY_NAME]", "[/ASSAY_NAME]",
  "[ASSAY_DESC]", "[/ASSAY_DESC]"
]
Example:
[PROPERTY]activity 0.92[/PROPERTY]
[ASSAY_NAME]EGFR_binding[/ASSAY_NAME]

Loading the Tokenizer

Basic Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "chemlactica/tokenizer/ChemLacticaTokenizer66"
)

# Tokenize a SMILES string
text = "</s>[START_SMILES]CCO[END_SMILES]</s>"
tokens = tokenizer(text, return_tensors="pt")

print(f"Input IDs: {tokens['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens['input_ids'][0])}")

With Padding

# For batch processing
tokenizer = AutoTokenizer.from_pretrained(
    "chemlactica/tokenizer/ChemLacticaTokenizer66",
    padding_side="left"  # Important for generation
)

smiles_list = [
    "</s>[START_SMILES]CCO[END_SMILES]</s>",
    "</s>[START_SMILES]c1ccccc1[END_SMILES]</s>"
]

tokens = tokenizer(
    smiles_list,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)
Always use padding_side="left" for causal language model generation to ensure proper attention masking.

Prompt Formatting

Training Format

ChemLactica uses specific prompt formats for training:
def format_training_sample(smiles, similar_smiles, properties):
    prompt = "</s>"
    
    # Add similar molecules
    for sim_smiles, similarity in similar_smiles:
        prompt += f"[SIMILAR]{sim_smiles} {similarity:.2f}[/SIMILAR]"
    
    # Add properties
    for prop_name, prop_value in properties.items():
        prompt += f"[{prop_name.upper()}]{prop_value:.2f}[/{prop_name.upper()}]"
    
    # Add target SMILES
    prompt += f"[START_SMILES]{smiles}[END_SMILES]</s>"
    
    return prompt

# Example
sample = format_training_sample(
    smiles="CC(C)Oc1ccccc1",
    similar_smiles=[("CCOc1ccccc1", 0.92), ("Cc1ccccc1", 0.78)],
    properties={"qed": 0.87, "sas": 2.1}
)
print(sample)
Output:
</s>[SIMILAR]CCOc1ccccc1 0.92[/SIMILAR][SIMILAR]Cc1ccccc1 0.78[/SIMILAR][QED]0.87[/QED][SAS]2.10[/SAS][START_SMILES]CC(C)Oc1ccccc1[END_SMILES]</s>

Generation Format

For molecule generation, omit the target SMILES and end the prompt with an opening [START_SMILES] tag for the model to complete:
def format_generation_prompt(similar_smiles, desired_properties):
    prompt = "</s>"
    
    # Add similar molecules
    for sim_smiles, similarity in similar_smiles:
        prompt += f"[SIMILAR]{sim_smiles} {similarity:.2f}[/SIMILAR]"
    
    # Add desired properties
    for prop_name, prop_value in desired_properties.items():
        prompt += f"[{prop_name.upper()}]{prop_value:.2f}[/{prop_name.upper()}]"
    
    # Start generation
    prompt += "[START_SMILES]"
    
    return prompt

# Example
prompt = format_generation_prompt(
    similar_smiles=[("CCO", 0.85)],
    desired_properties={"qed": 0.90, "weight": 250.0}
)
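From here, generation is standard transformers usage; a sketch assuming a loaded model and tokenizer (the sampling settings are illustrative, and stopping on [END_SMILES] is a convenience, not a requirement):
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    # Stop as soon as the closing SMILES tag is produced
    eos_token_id=tokenizer.convert_tokens_to_ids("[END_SMILES]"),
)
print(tokenizer.decode(output[0]))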

Tokenizer Configuration

Model Config

from dataclasses import dataclass

@dataclass
class ModelConfig:
    block_size: int = 2048
    vocab_size: int = 50000
    separator_token: str = "</s>"
    separator_token_id: int = 2
    tokenizer_path: str = "chemlactica/tokenizer/ChemLacticaTokenizer66"
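The hard-coded separator_token_id should match the actual tokenizer, in line with the best practices below; a quick consistency check using the dataclass above:
config = ModelConfig()
tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_path)

# Fail fast if the config and tokenizer disagree about the separator
assert tokenizer.convert_tokens_to_ids(config.separator_token) == config.separator_token_id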

Resizing Token Embeddings

When using a pretrained model with a custom tokenizer:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint")
tokenizer = AutoTokenizer.from_pretrained("chemlactica/tokenizer/ChemLacticaTokenizer66")

# Resize embeddings to match tokenizer
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
Use pad_to_multiple_of=8 for optimal GPU performance with tensor cores.
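After resizing, a quick sanity check that the embedding matrix covers the vocabulary (get_input_embeddings is standard transformers API; rows may exceed len(tokenizer) because of the multiple-of-8 padding):
# Embedding rows >= vocabulary size, with up to 7 padding rows
assert model.get_input_embeddings().weight.shape[0] >= len(tokenizer)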

Advanced Usage

Custom Property Tags

Add custom properties for your domain:
# Add custom property to prompt
def add_custom_property(base_prompt, property_name, property_value):
    # Use generic PROPERTY tag
    return base_prompt + f"[PROPERTY]{property_name} {property_value:.2f}[/PROPERTY]"

prompt = "</s>[SIMILAR]CCO 0.95[/SIMILAR]"
prompt = add_custom_property(prompt, "binding_affinity", 8.5)
prompt = add_custom_property(prompt, "selectivity", 0.92)
prompt += "[START_SMILES]"

Response Template for SFT

For supervised fine-tuning, mask the prompt and only compute loss on the response:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("chemlactica/tokenizer/ChemLacticaTokenizer66")

# Define where the response starts; encode without special tokens so the
# template matches the token IDs as they appear mid-sequence
response_template = tokenizer.encode("[PROPERTY]activity", add_special_tokens=False)

collator = DataCollatorForCompletionOnlyLM(
    response_template,
    tokenizer=tokenizer
)

# Use with SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    data_collator=collator,
    ...
)

Extracting Generated SMILES

def extract_smiles(generated_text):
    """Extract the last generated SMILES from decoded model output."""
    start_tag = "[START_SMILES]"
    end_tag = "[END_SMILES]"
    
    start_idx = generated_text.rfind(start_tag)
    if start_idx == -1:
        return None
    
    # Look for the closing tag after the opening tag so an earlier
    # [END_SMILES] in the prompt cannot be matched by mistake
    end_idx = generated_text.find(end_tag, start_idx)
    if end_idx == -1:
        return None
    
    return generated_text[start_idx + len(start_tag):end_idx].strip()

# Example
output = model.generate(input_ids, max_new_tokens=128)
decoded = tokenizer.decode(output[0])
smiles = extract_smiles(decoded)

if smiles:
    print(f"Generated SMILES: {smiles}")
else:
    print("No valid SMILES found")

Tokenizer Statistics

Vocabulary Size

tokenizer = AutoTokenizer.from_pretrained("chemlactica/tokenizer/ChemLacticaTokenizer66")

print(f"Vocabulary size: {len(tokenizer)}")
print(f"Number of special tokens: {len(tokenizer.all_special_tokens)}")
print(f"Special tokens: {tokenizer.all_special_tokens[:10]}...")  # First 10

Token Length Analysis

import numpy as np

def analyze_tokenization(smiles_list, tokenizer):
    lengths = []
    for smiles in smiles_list:
        tokens = tokenizer(smiles, return_tensors="pt")
        lengths.append(len(tokens['input_ids'][0]))
    
    print(f"Average token length: {np.mean(lengths):.2f}")
    print(f"Max token length: {np.max(lengths)}")
    print(f"Min token length: {np.min(lengths)}")
    
    return lengths

# Test on dataset
smiles_list = [...]
lengths = analyze_tokenization(smiles_list, tokenizer)

Best Practices

  • Always use the same tokenizer for training and inference
  • Save tokenizer configuration with model checkpoints
  • Verify special token IDs match between tokenizer and model
  • Monitor token lengths in your dataset
  • Set appropriate max_seq_length for SFT (typically 512-2048)
  • Use truncation=True to handle long sequences
  • Consider sequence length distribution when batching
  • Use specific property tokens when available (e.g., [QED] instead of [PROPERTY]qed)
  • Keep property values formatted consistently (2 decimal places)
  • Always wrap SMILES in [START_SMILES]...[END_SMILES]
  • Use </s> as both EOS and separator token
  • Use padding_side="left" for generation tasks
  • Use padding_side="right" for classification tasks
  • Enable pad_to_multiple_of=8 for optimal GPU utilization
  • Set padding=True for batch processing
The tokenizer is critical for model performance. Using the wrong tokenizer or configuration can lead to poor results or errors.

Next Steps

Custom Oracles

Build oracles using property tokens

Property Prediction

Fine-tune models with property annotations
