Training Corpus
All ChemLactica and Chemma models are trained on a comprehensive corpus derived from PubChem, the world's largest collection of freely accessible chemical information.
Corpus size: 40 billion tokens, covering 100M+ molecules.
Dataset on Hugging Face: yerevann/PubChemForLM (publicly available).
JSONL Structure
The training data is stored in JSONL (JSON Lines) format, with each line representing one molecular entry:
chemlactica/get_dataset.py
training_data_files = glob.glob(training_data_dir + "/*.jsonl")
dataset = IterableDataset.from_generator(
    samples_generator,
    gen_kwargs={
        "files": training_data_files,
        "shared_jsonl_files": shared_jsonl_files,
    },
)
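The `samples_generator` callable is defined elsewhere in the repo. A minimal sketch of what such a generator might look like, assuming one JSON entry per line (the yielded field name is illustrative; the real generator also tracks per-file read offsets via `shared_jsonl_files` so training can resume mid-file):

```python
import json
import os
import tempfile

def samples_generator(files, shared_jsonl_files=None):
    # Stream one molecule entry per JSONL line across all shards.
    for path in files:
        with open(path, "r") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield {"entry": json.loads(line)}

# Tiny demo shard, written to a temp dir for illustration:
shard = os.path.join(tempfile.mkdtemp(), "shard_000.jsonl")
with open(shard, "w") as f:
    f.write('{"SMILES": "CCO"}\n{"SMILES": "CCN"}\n')

samples = list(samples_generator([shard]))
print(len(samples))  # 2
```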
Example Entry
Each molecule in the corpus contains multiple fields:
{
  "CID": 523129,
  "SMILES": "CCCCCCOC(=O)NC1=CC=C(C=C1)N=NC2=CC=CC=C2",
  "synonyms": [{"name": "p-Phenylazo carbanilic acid, n-hexyl ester"}],
  "related": [
    {"SMILES": "CCCCOC(=O)NC1=CC=C(C=C1)N", "similarity": 0.7}
  ],
  "SAS": 2.0,
  "WEIGHT": 325.1,
  "TPSA": 63.06823,
  "CLOGP": 6.231230123,
  "QED": 0.46,
  "NUMHDONORS": 1,
  "NUMHACCEPTORS": 4,
  "NUMHETEROATOMS": 5,
  "NUMROTATABLEBONDS": 8,
  "NOCOUNT": 5,
  "NHOHCOUNT": 1,
  "RINGCOUNT": 2,
  "HEAVYATOMCOUNT": 24,
  "FRACTIONCSP3": 0.32,
  "NUMAROMATICRINGS": 2,
  "NUMSATURATEDRINGS": 0,
  "IUPAC": "hexyl N-(4-phenyldiazenylphenyl)carbamate",
  "experimental": [
    {
      "PROPERTY_NAME": "Kovats Retention Index",
      "PROPERTY_VALUE": "Semi-standard non-polar: 4303"
    }
  ]
}
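Each line parses with the standard `json` module. For example, with an abridged version of the entry above:

```python
import json

# One (abridged) corpus line; real entries carry the full field set shown above.
line = ('{"CID": 523129, "SMILES": "CCCCCCOC(=O)NC1=CC=C(C=C1)N=NC2=CC=CC=C2", '
        '"QED": 0.46, "related": [{"SMILES": "CCCCOC(=O)NC1=CC=C(C=C1)N", "similarity": 0.7}]}')

entry = json.loads(line)
print(entry["CID"])                       # 523129
print(entry["related"][0]["similarity"])  # 0.7
```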
Data Processing
Text Formatting
Raw JSON data is converted to formatted text strings for training:
chemlactica/utils/text_format_utils.py
def generate_formatted_string(compound_json, rng, model_config):
    key_value_pairs = []
    # SMILES may appear first (50% chance)
    key = "SMILES"
    value = compound_json.get(key, "")
    if rng.integers(2) == 0:
        if value:
            key_value_pairs.append(format_key_value(key, value, rng))
            del compound_json[key]
    # Shuffle remaining keys for variety
    keys = list(compound_json.keys())
    rng.shuffle(keys)
    for key in keys:
        key_value_pairs.append(format_key_value(key, compound_json[key], rng))
    compound_formatted_string = "".join(key_value_pairs) + model_config.separator_token
    return compound_formatted_string
The order of properties is randomized during training to prevent the model from learning order-dependent patterns.
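The effect can be sketched with a seeded NumPy generator. This is a simplified stand-in for the formatting code, not the repo's implementation: the tag scheme is reduced to generic `[KEY]…[/KEY]` pairs, and a fixed seed is used only so the demo is reproducible.

```python
import numpy as np

def format_entry(entry, rng):
    # Simplified formatter: the SMILES tag leads with 50% probability,
    # and the remaining property order is reshuffled on every call.
    entry = dict(entry)  # avoid mutating the caller's dict
    smiles_tag = f"[START_SMILES]{entry.pop('SMILES')}[END_SMILES]"
    parts = [f"[{k}]{v}[/{k}]" for k, v in entry.items()]
    rng.shuffle(parts)
    if rng.integers(2) == 0:
        parts.insert(0, smiles_tag)
    else:
        parts.append(smiles_tag)
    return "".join(parts)

rng = np.random.default_rng(42)
entry = {"SMILES": "CCO", "QED": "0.41", "SAS": "2.00"}
s1 = format_entry(entry, rng)
s2 = format_entry(entry, rng)
print(s1)
print(s2)  # same molecule, potentially different tag order
```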
Different types of data are formatted with special tags:
chemlactica/utils/text_format_utils.py
def format_key_value(key, value, rng):
    formatted_string = ""
    if key == "related":
        # Similar molecules with Tanimoto scores (at most 10 per entry)
        if len(value) > 10:
            value = rng.choice(value, size=10, replace=False, shuffle=False)
        for pair in value:
            rounded_sim = "{:.2f}".format(float(pair["similarity"]))
            formatted_string += f"[SIMILAR]{pair['SMILES']} {rounded_sim}[/SIMILAR]"
    elif key == "experimental":
        # Experimental properties
        for pair in value:
            formatted_string += f"[PROPERTY]{pair['PROPERTY_NAME']} {pair['PROPERTY_VALUE']}[/PROPERTY]"
    elif key == "synonyms":
        # Chemical names
        for val in value:
            formatted_string += f"[SYNONYM]{val['name']}[/SYNONYM]"
    else:
        # Standard properties (QED, SAS, etc.)
        if SPECIAL_TAGS[key].get("type") is float:
            value = "{:.2f}".format(float(value))
        start = SPECIAL_TAGS[key]["start"]
        end = SPECIAL_TAGS[key]["end"]
        return f"{start}{value}{end}"
    return formatted_string
Applied to the example entry above, the formatted training string looks like:
[START_SMILES]CCCCCCOC(=O)NC1=CC=C(C=C1)N=NC2=CC=CC=C2[END_SMILES]
[SAS]2.00[/SAS]
[WEIGHT]325.10[/WEIGHT]
[TPSA]63.07[/TPSA]
[CLOGP]6.23[/CLOGP]
[QED]0.46[/QED]
[SIMILAR]CCCCOC(=O)NC1=CC=C(C=C1)N 0.70[/SIMILAR]
[SYNONYM]p-Phenylazo carbanilic acid, n-hexyl ester[/SYNONYM]
[NUMHDONORS]1[/NUMHDONORS]
[NUMHACCEPTORS]4[/NUMHACCEPTORS]
</s>
Data Types
The corpus supports different data types:
chemlactica/utils/dataset_utils.py
DIR_DATA_TYPES = ["pubchem", "assay", "assay-split", "custom"]
pubchem: Standard PubChem molecular data with computed properties and similar molecules.
assay / assay-split: Experimental assay data with biological activity measurements, formatted with assay tags:
[ASSAY_NAME]Assay Name[/ASSAY_NAME]
[ASSAY_DESC]Description[/ASSAY_DESC]
[VAR_NAME]Variable[/VAR_NAME]
[VAR_VAL]Value[/VAR_VAL]
custom: User-provided molecular data in the same JSONL format.
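An assay entry rendered with these tags might look like the following. The dictionary fields (`name`, `description`, `variables`) and their values are illustrative, not the repo's actual assay schema:

```python
# Build an assay-style training string from illustrative fields.
assay = {
    "name": "Kinase inhibition assay",
    "description": "Half-maximal inhibitory concentration in a cell-free assay",
    "variables": [("IC50", "12 nM")],
}

parts = [
    f"[ASSAY_NAME]{assay['name']}[/ASSAY_NAME]",
    f"[ASSAY_DESC]{assay['description']}[/ASSAY_DESC]",
]
for var_name, var_val in assay["variables"]:
    parts.append(f"[VAR_NAME]{var_name}[/VAR_NAME]")
    parts.append(f"[VAR_VAL]{var_val}[/VAR_VAL]")

assay_string = "".join(parts)
print(assay_string)
```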
Dataset Loading
For Pre-training
chemlactica/get_dataset.py
def get_dataset(
    train_type,
    training_data_dirs,
    valid_data_dir,
    dir_data_types,
    train_config,
    model_config,
    shared_jsonl_files,
    evaluate_only,
    slurm_eval,
    shuffle_buffer_size,
):
    if train_type == "pretrain":
        assert len(training_data_dirs) == len(dir_data_types)
        train_dataset_dict = {}
        for i, (training_data_dir, dir_data_type) in enumerate(
            zip(training_data_dirs, dir_data_types)
        ):
            training_data_files = glob.glob(training_data_dir + "/*.jsonl")
            dataset = IterableDataset.from_generator(
                samples_generator,
                gen_kwargs={
                    "files": training_data_files,
                    "shared_jsonl_files": shared_jsonl_files,
                },
            )
            dataset = process_dataset(
                dataset=dataset,
                train_config=train_config,
                model_config=model_config,
                process_batch_sizes=(50, 50),
                is_eval=False,
                assay=is_assay_split,
            )
For Fine-tuning
if train_type == "sft":
    dataset = load_dataset(training_data_dirs[0])
Tokenization
Custom Tokenizer
ChemLactica uses specialized tokenizers optimized for chemical notation:
chemlactica/utils/utils.py
default_tokenizer_path = "./chemlactica/tokenizer/ChemLacticaTokenizer66"

def get_tokenizer(tokenizer_path):
    return create_tokenizer(tokenizer_path)

def create_tokenizer(tokenizer_path):
    tok = AutoTokenizer.from_pretrained(tokenizer_path)
    tok.add_bos_token = False
    tok.padding_side = "right"
    return tok
Vocabulary Size
ChemLactica: 50,000 tokens, optimized for chemistry.
Chemma: 256,000 tokens, the larger vocabulary inherited from Gemma.
Special Tokens
The tokenizer includes chemistry-specific special tokens:
chemlactica/utils/utils.py
def get_start2end_tags_map(tokenizer_path: str = default_tokenizer_path):
    with open(os.path.join(tokenizer_path, "special_tokens_map.json"), "r") as _f:
        special_tokens_map = json.load(_f)
    additional_tokens = special_tokens_map.get("additional_special_tokens", None)
    n = len(additional_tokens)
    return {
        additional_tokens[i]: additional_tokens[n // 2 + i] for i in range(n // 2)
    } | {"[START_SMILES]": "[END_SMILES]"}
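The pairing logic assumes the first half of `additional_special_tokens` are opening tags and the second half the matching closing tags. A self-contained demo of that mapping on a synthetic `special_tokens_map.json` (the real file ships with the tokenizer; the two tag pairs here are just examples):

```python
import json
import os
import tempfile

# Write a synthetic special_tokens_map.json with two tag pairs.
tok_dir = tempfile.mkdtemp()
with open(os.path.join(tok_dir, "special_tokens_map.json"), "w") as f:
    json.dump({"additional_special_tokens": ["[QED]", "[SAS]", "[/QED]", "[/SAS]"]}, f)

# Reproduce the start-tag -> end-tag mapping from get_start2end_tags_map.
with open(os.path.join(tok_dir, "special_tokens_map.json")) as f:
    tokens = json.load(f)["additional_special_tokens"]

n = len(tokens)
start2end = {tokens[i]: tokens[n // 2 + i] for i in range(n // 2)}
start2end["[START_SMILES]"] = "[END_SMILES]"
print(start2end)  # {'[QED]': '[/QED]', '[SAS]': '[/SAS]', '[START_SMILES]': '[END_SMILES]'}
```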
Data Statistics
Corpus Coverage
100M+ unique molecules from PubChem
40B tokens total
Multiple properties per molecule (avg 10-15 properties)
Similar molecules included for ~70% of compounds
Chemical synonyms and IUPAC names when available
Property Distribution
All molecules include:
SMILES notation (100%)
Core properties: SAS, WEIGHT, TPSA, CLOGP, QED (>95%)
Structural counts (>95%)
Similar molecules with Tanimoto scores (~70%)
Chemical names and synonyms (~60%)
Experimental properties (varies)
Training Configuration
Block Size
block_size: 2048  # Maximum sequence length
The 2048-token context window is sufficient for most molecules and their associated properties. Very large molecules, or entries with many similar molecules, may be truncated.
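The truncation behaviour can be sketched with dummy token ids (the real pipeline operates on tokenizer output; the helper name here is illustrative):

```python
BLOCK_SIZE = 2048

def truncate_to_block(token_ids, block_size=BLOCK_SIZE):
    # Sequences longer than the context window are cut to block_size;
    # shorter ones pass through unchanged.
    return token_ids[:block_size]

long_seq = list(range(3000))   # stand-in for a very large entry's tokens
short_seq = list(range(100))

truncated = truncate_to_block(long_seq)
print(len(truncated), len(truncate_to_block(short_seq)))  # 2048 100
```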
Data Processing Parameters
process_batch_sizes = (50, 50)  # Batch sizes for processing
is_eval = False                 # Training vs evaluation mode
assay = False                   # Standard PubChem vs assay data
Accessing the Dataset
From Hugging Face
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("yerevann/PubChemForLM")

# Stream the dataset for large-scale training
dataset = load_dataset("yerevann/PubChemForLM", streaming=True)
Local JSONL Files
import glob
from datasets import load_dataset

training_data_files = glob.glob("path/to/data/*.jsonl")
dataset = load_dataset(
    "text",
    data_files={"train": training_data_files},
    streaming=True,
)
Data Quality
All SMILES validated with RDKit
Properties computed from validated molecular structures
Similar molecules verified with Tanimoto similarity
Duplicate molecules removed via canonical SMILES
Empty fields removed
Float values rounded to 2 decimal places
Related molecules limited to 10 per entry
Property order randomized for training diversity
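The de-duplication step keys molecules by canonical SMILES. A minimal sketch with a placeholder canonicalizer, since the real pipeline uses RDKit (roughly `Chem.MolToSmiles(Chem.MolFromSmiles(smiles))`):

```python
def canonicalize(smiles):
    # Placeholder: the actual pipeline canonicalizes with RDKit so that
    # different notations of the same molecule map to one key.
    return smiles

def deduplicate(entries):
    # Keep the first entry per canonical SMILES key.
    seen = set()
    unique = []
    for entry in entries:
        key = canonicalize(entry["SMILES"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique

mols = [{"SMILES": "CCO"}, {"SMILES": "CCO"}, {"SMILES": "CCN"}]
deduped = deduplicate(mols)
print(len(deduped))  # 2
```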
Next Steps
Model Architectures Learn about the models
Pre-training Train your own model
Fine-tuning Adapt to your dataset
SMILES Format Understand molecular representation