
Training Corpus

All ChemLactica and Chemma models are trained on a comprehensive corpus derived from PubChem, the world’s largest collection of freely accessible chemical information.

Corpus Size

40 billion tokens, covering 100M+ molecules

Dataset on HuggingFace

yerevann/PubChemForLM (publicly available dataset)

Data Format

JSONL Structure

The training data is stored in JSONL (JSON Lines) format, with each line representing one molecular entry:
chemlactica/get_dataset.py
import glob

from datasets import IterableDataset

training_data_files = glob.glob(training_data_dir + "/*.jsonl")

dataset = IterableDataset.from_generator(
    samples_generator,
    gen_kwargs={
        "files": training_data_files,
        "shared_jsonl_files": shared_jsonl_files,
    },
)
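The generator passed to `IterableDataset.from_generator` is defined elsewhere in the repository; a minimal sketch of what such a generator might look like (the resume-tracking via `shared_jsonl_files` is omitted here, and the exact yielded schema is an assumption):

```python
def samples_generator(files, shared_jsonl_files=None):
    """Yield one raw JSONL line per molecule, file by file.

    Simplified sketch: the real generator also records read offsets in
    shared_jsonl_files so interrupted training runs can resume mid-file.
    """
    for path in files:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield {"text": line}
```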

Example Entry

Each molecule in the corpus contains multiple fields:
{
  "CID": 523129,
  "SMILES": "CCCCCCOC(=O)NC1=CC=C(C=C1)N=NC2=CC=CC=C2",
  "synonyms": [{"name": "p-Phenylazo carbanilic acid, n-hexyl ester"}],
  "related": [
    {"SMILES": "CCCCOC(=O)NC1=CC=C(C=C1)N", "similarity": 0.7}
  ],
  "SAS": 2.0,
  "WEIGHT": 325.1,
  "TPSA": 63.06823,
  "CLOGP": 6.231230123,
  "QED": 0.46,
  "NUMHDONORS": 1,
  "NUMHACCEPTORS": 4,
  "NUMHETEROATOMS": 5,
  "NUMROTATABLEBONDS": 8,
  "NOCOUNT": 5,
  "NHOHCOUNT": 1,
  "RINGCOUNT": 2,
  "HEAVYATOMCOUNT": 24,
  "FRACTIONCSP3": 0.32,
  "NUMAROMATICRINGS": 2,
  "NUMSATURATEDRINGS": 0,
  "IUPAC": "hexyl N-(4-phenyldiazenylphenyl)carbamate",
  "experimental": [
    {
      "PROPERTY_NAME": "Kovats Retention Index",
      "PROPERTY_VALUE": "Semi-standard non-polar: 4303"
    }
  ]
}
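Because each line is a standalone JSON object, an entry can be inspected with the standard library alone; a minimal sketch (the `summarize_entry` helper is illustrative, not part of the repository):

```python
import json

def summarize_entry(line: str) -> dict:
    """Parse one JSONL line and summarize the fields it carries."""
    entry = json.loads(line)
    return {
        "cid": entry.get("CID"),
        "smiles": entry.get("SMILES"),
        "n_fields": len(entry),                     # total top-level fields
        "n_similar": len(entry.get("related", [])),  # similar-molecule count
    }
```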

Data Processing

Text Formatting

Raw JSON data is converted to formatted text strings for training:
chemlactica/utils/text_format_utils.py
def generate_formatted_string(compound_json, rng, model_config):
    key_value_pairs = []
    
    # SMILES may appear first (50% chance)
    key = "SMILES"
    value = compound_json.get(key, "")
    if rng.integers(2) == 0:
        if value:
            key_value_pairs.append(format_key_value(key, value, rng))
            del compound_json[key]
    
    # Shuffle remaining keys for variety
    keys = list(compound_json.keys())
    rng.shuffle(keys)
    
    for key in keys:
        key_value_pairs.append(format_key_value(key, compound_json[key], rng))
    
    compound_formatted_string = "".join(key_value_pairs) + model_config.separator_token
    return compound_formatted_string
The order of properties is randomized during training to prevent the model from learning order-dependent patterns.

Property Formatting

Different types of data are formatted with special tags:
chemlactica/utils/text_format_utils.py
def format_key_value(key, value, rng):
    formatted_string = ""
    if key == "related":
        # Similar molecules with Tanimoto scores (capped at 10 per entry)
        if len(value) > 10:
            value = rng.choice(value, size=10, replace=False, shuffle=False)
        for pair in value:
            rounded_sim = "{:.2f}".format(float(pair["similarity"]))
            formatted_string += f"[SIMILAR]{pair['SMILES']} {rounded_sim}[/SIMILAR]"

    elif key == "experimental":
        # Experimental properties
        for pair in value:
            formatted_string += f"[PROPERTY]{pair['PROPERTY_NAME']} {pair['PROPERTY_VALUE']}[/PROPERTY]"

    elif key == "synonyms":
        # Chemical names
        for val in value:
            formatted_string += f"[SYNONYM]{val['name']}[/SYNONYM]"
    else:
        # Standard properties (QED, SAS, etc.)
        if SPECIAL_TAGS[key].get("type") is float:
            value = "{:.2f}".format(float(value))
        start = SPECIAL_TAGS[key]["start"]
        end = SPECIAL_TAGS[key]["end"]
        formatted_string = f"{start}{value}{end}"
    return formatted_string
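`SPECIAL_TAGS` maps each property key to its wrapping tags. Its full contents live in the repository; a plausible fragment (illustrative entries only) and the scalar-wrapping logic it drives:

```python
# Illustrative fragment -- the real SPECIAL_TAGS is defined in the repository.
SPECIAL_TAGS = {
    "SMILES": {"start": "[START_SMILES]", "end": "[END_SMILES]"},
    "QED":    {"start": "[QED]", "end": "[/QED]", "type": float},
    "SAS":    {"start": "[SAS]", "end": "[/SAS]", "type": float},
    "WEIGHT": {"start": "[WEIGHT]", "end": "[/WEIGHT]", "type": float},
}

def wrap(key, value):
    """Wrap a scalar property the same way the standard-property branch does."""
    if SPECIAL_TAGS[key].get("type") is float:
        value = "{:.2f}".format(float(value))  # floats rounded to 2 decimals
    return f"{SPECIAL_TAGS[key]['start']}{value}{SPECIAL_TAGS[key]['end']}"
```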

Example Formatted Output

[START_SMILES]CCCCCCOC(=O)NC1=CC=C(C=C1)N=NC2=CC=CC=C2[END_SMILES]
[SAS]2.00[/SAS]
[WEIGHT]325.10[/WEIGHT]
[TPSA]63.07[/TPSA]
[CLOGP]6.23[/CLOGP]
[QED]0.46[/QED]
[SIMILAR]CCCCOC(=O)NC1=CC=C(C=C1)N 0.70[/SIMILAR]
[SYNONYM]p-Phenylazo carbanilic acid, n-hexyl ester[/SYNONYM]
[NUMHDONORS]1[/NUMHDONORS]
[NUMHACCEPTORS]4[/NUMHACCEPTORS]
</s>

Data Types

The corpus supports four directory data types:
chemlactica/utils/dataset_utils.py
DIR_DATA_TYPES = ["pubchem", "assay", "assay-split", "custom"]

  • pubchem: standard PubChem molecular data with computed properties and similar molecules.
  • assay / assay-split: experimental assay data with biological activity measurements, rendered with dedicated tags:

[ASSAY_NAME]Assay Name[/ASSAY_NAME]
[ASSAY_DESC]Description[/ASSAY_DESC]
[VAR_NAME]Variable[/VAR_NAME]
[VAR_VAL]Value[/VAR_VAL]

  • custom: user-provided molecular data in the same JSONL format.
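A hedged sketch of how an assay record might be rendered into the assay tags shown above (the field names `name`, `description`, and `variables` are assumptions, not the repository's actual schema):

```python
def format_assay(record):
    """Render an assay record into the tag format shown above (sketch only)."""
    parts = [
        f"[ASSAY_NAME]{record['name']}[/ASSAY_NAME]",
        f"[ASSAY_DESC]{record['description']}[/ASSAY_DESC]",
    ]
    # Each measured variable becomes a name/value tag pair
    for var in record.get("variables", []):
        parts.append(f"[VAR_NAME]{var['name']}[/VAR_NAME]")
        parts.append(f"[VAR_VAL]{var['value']}[/VAR_VAL]")
    return "".join(parts)
```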

Dataset Loading

For Pre-training

chemlactica/get_dataset.py
def get_dataset(
    train_type,
    training_data_dirs,
    valid_data_dir,
    dir_data_types,
    train_config,
    model_config,
    shared_jsonl_files,
    evaluate_only,
    slurm_eval,
    shuffle_buffer_size,
):
    if train_type == "pretrain":
        assert len(training_data_dirs) == len(dir_data_types)
        train_dataset_dict = {}
        
        for i, (training_data_dir, dir_data_type) in enumerate(
            zip(training_data_dirs, dir_data_types)
        ):
            training_data_files = glob.glob(training_data_dir + "/*.jsonl")
            dataset = IterableDataset.from_generator(
                samples_generator,
                gen_kwargs={
                    "files": training_data_files,
                    "shared_jsonl_files": shared_jsonl_files,
                },
            )
            # Assay-split directories get assay-style processing
            is_assay_split = dir_data_type == "assay-split"
            dataset = process_dataset(
                dataset=dataset,
                train_config=train_config,
                model_config=model_config,
                process_batch_sizes=(50, 50),
                is_eval=False,
                assay=is_assay_split,
            )

For Fine-tuning

if train_type == "sft":
    dataset = load_dataset(training_data_dirs[0])

Tokenization

Custom Tokenizer

ChemLactica uses specialized tokenizers optimized for chemical notation:
chemlactica/utils/utils.py
from transformers import AutoTokenizer

default_tokenizer_path = "./chemlactica/tokenizer/ChemLacticaTokenizer66"

def get_tokenizer(tokenizer_path):
    return create_tokenizer(tokenizer_path)

def create_tokenizer(tokenizer_path):
    tok = AutoTokenizer.from_pretrained(tokenizer_path)
    tok.add_bos_token = False
    tok.padding_side = "right"
    return tok

Vocabulary Size

Chemlactica

50,000 tokens, optimized for chemistry

Chemma

256,000 tokens, the larger vocabulary inherited from Gemma

Special Tokens

The tokenizer includes chemistry-specific special tokens:
chemlactica/utils/utils.py
import json
import os

def get_start2end_tags_map(tokenizer_path: str = default_tokenizer_path):
    with open(os.path.join(tokenizer_path, "special_tokens_map.json"), "r") as _f:
        special_tokens_map = json.load(_f)
    additional_tokens = special_tokens_map.get("additional_special_tokens", None)
    n = len(additional_tokens)
    return {
        additional_tokens[i]: additional_tokens[n // 2 + i] for i in range(n // 2)
    } | {"[START_SMILES]": "[END_SMILES]"}
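This mapping assumes `additional_special_tokens` lists all start tags in its first half and the matching end tags, in the same order, in its second half. With a toy token list the pairing logic reduces to:

```python
def start2end(additional_tokens):
    """Pair each start tag with its end tag, assuming starts occupy the
    first half of the list and ends the second half, in the same order."""
    n = len(additional_tokens)
    mapping = {
        additional_tokens[i]: additional_tokens[n // 2 + i]
        for i in range(n // 2)
    }
    # SMILES tags are added explicitly, as in get_start2end_tags_map
    mapping["[START_SMILES]"] = "[END_SMILES]"
    return mapping
```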

Data Statistics

Corpus Coverage

  • 100M+ unique molecules from PubChem
  • 40B tokens total
  • Multiple properties per molecule (avg 10-15 properties)
  • Similar molecules included for ~70% of compounds
  • Chemical synonyms and IUPAC names when available

Property Distribution

All molecules include:
  • SMILES notation (100%)
  • Core properties: SAS, WEIGHT, TPSA, CLOGP, QED (>95%)
  • Structural counts (>95%)
  • Similar molecules with Tanimoto scores (~70%)
  • Chemical names and synonyms (~60%)
  • Experimental properties (varies)

Training Configuration

Block Size

block_size: 2048  # Maximum sequence length
The 2048 token context window is sufficient for most molecules and their associated properties. Very large molecules or those with many similar molecules may be truncated.
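With a fixed block size, tokenized molecule strings are typically concatenated and split into fixed-length blocks; a minimal packing sketch (dropping the incomplete tail block is an assumed, common strategy, not necessarily what the repository does):

```python
def pack_into_blocks(token_ids, block_size=2048):
    """Split a flat token stream into fixed-size blocks, dropping the
    incomplete tail block."""
    n_full = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n_full, block_size)]
```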

Data Processing Parameters

process_batch_sizes=(50, 50)  # Batch sizes for processing
is_eval=False                  # Training vs evaluation mode
assay=False                    # Standard PubChem vs assay data

Accessing the Dataset

From Hugging Face

from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("yerevann/PubChemForLM")

# Load streaming for large-scale training
dataset = load_dataset("yerevann/PubChemForLM", streaming=True)

Local JSONL Files

import glob
from datasets import load_dataset

training_data_files = glob.glob("path/to/data/*.jsonl")
dataset = load_dataset(
    "text",
    data_files={"train": training_data_files},
    streaming=True
)

Data Quality

  • All SMILES validated with RDKit
  • Properties computed from validated molecular structures
  • Similar molecules verified with Tanimoto similarity
  • Duplicate molecules removed via canonical SMILES
  • Empty fields removed
  • Float values rounded to 2 decimal places
  • Related molecules limited to 10 per entry
  • Property order randomized for training diversity

Next Steps

Model Architectures

Learn about the models

Pre-training

Train your own model

Fine-tuning

Adapt to your dataset

SMILES Format

Understand molecular representation
