Overview
Sampling strategies control how ChemLactica models select tokens during generation, balancing high-quality molecules against exploration of diverse chemical space.
Generation Configuration
Generation parameters are specified in the generation_config section of your optimization configuration:
chemlactica_125m_hparams.yaml:15
Key Parameters
Temperature
Sampling temperature for generation. Controls randomness in token selection; higher values increase diversity.
- Range: 0.1 to 2.0
- Lower values (0.5-0.8): More conservative, higher quality
- Medium values (1.0-1.2): Balanced exploration
- Higher values (1.3-1.5): More diverse, creative generation
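To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. It is not ChemLactica's implementation; the function name and the example logits are hypothetical and serve only to show how dividing logits by the temperature sharpens or flattens the token distribution.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
example_logits = [2.0, 1.0, 0.5]
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At lower temperatures the most likely token takes a larger share of the probability mass, which is why low values behave conservatively and high values produce more varied output.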
Dynamic Temperature
During optimization, temperature increases over time to enhance exploration:
optimization.py:169
chemlactica_125m_hparams.yaml:13
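One simple way to picture an increasing temperature is a linear ramp between a start and an end value. The sketch below is an assumption for illustration only: the function name, the default bounds of 1.0 and 1.5, and the linear shape are hypothetical and may differ from what optimization.py actually does.

```python
def scheduled_temperature(step, total_steps, start=1.0, end=1.5):
    """Linearly interpolate the temperature from start to end.

    All names and default values here are illustrative assumptions.
    """
    if total_steps <= 1:
        return start
    # Clamp the progress fraction to [0, 1] so out-of-range steps are safe.
    fraction = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + fraction * (end - start)


# Print the temperature at a few points of a hypothetical 11-step run.
for step in (0, 5, 10):
    print(step, scheduled_temperature(step, total_steps=11))
```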
Max New Tokens
Maximum number of tokens to generate for each molecule.
- Typical SMILES: 50-100 tokens
- Small molecules: 50 tokens
- Drug-like molecules: 100 tokens
- Complex molecules: 150+ tokens
Repetition Penalty
Penalty for repeating tokens. Penalizes tokens that have already been generated to encourage diversity.
- 1.0: No penalty (default)
- 1.1-1.2: Mild penalty for variety
- 1.3+: Strong penalty (may reduce quality)
Repetition penalty is usually kept at 1.0 for molecule generation, as molecular structures naturally contain repeated patterns (rings, functional groups).
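The following sketch shows the rescaling rule commonly used for repetition penalties in Hugging Face-style generation: logits of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative. It assumes ChemLactica relies on that convention; the function name and the example values are hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.0):
    """Return a copy of logits with already-generated tokens made less likely."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive logit and multiplying a negative one both lower it
        # when penalty > 1.0; a penalty of 1.0 leaves the logits unchanged.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


# Hypothetical vocabulary of three tokens; tokens 0 and 1 were already generated.
print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=1.2))
```

With a penalty of 1.0 the function is a no-op, which matches the recommendation above to leave ring and functional-group repeats unpenalized.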
Sampling Mode
Whether to use sampling or greedy decoding.
- true: Sample from probability distribution (diverse outputs)
- false: Always select highest probability token (deterministic)
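The difference between the two modes can be sketched in a few lines. This is an illustrative stand-in rather than the library's decoding loop; pick_token and its do_sample argument are hypothetical names chosen to mirror the true/false setting described above.

```python
import random


def pick_token(probs, do_sample, rng=None):
    """Choose a token index either greedily or by sampling from probs."""
    if not do_sample:
        # Greedy decoding: always take the most probable token.
        return max(range(len(probs)), key=probs.__getitem__)
    rng = rng or random
    # Sampling: draw one index with probability proportional to its weight.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.1, 0.7, 0.2]
print("greedy:", pick_token(probs, do_sample=False))
print("sampled:", [pick_token(probs, do_sample=True) for _ in range(5)])
```

Greedy decoding returns the same index on every call, while sampling can return any index with non-zero probability.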
EOS Token
Token ID that signals end of generation. ChemLactica uses token ID 20 as the EOS token.
Advanced Parameters
Additional parameters available for fine-grained control:
Beam Search
Number of beams for beam search. Set to 1 for sampling.
Number of groups for diverse beam search.
Sequence Control
Number of sequences to return per prompt.
rejection_sampling_configs.py:8
Penalty for generating similar sequences when num_return_sequences > 1.
Penalty for sequence length. Values > 1.0 encourage longer sequences.
Configuration Examples
Conservative Generation (High Quality)
- Property prediction tasks
- When you need valid, realistic molecules
- Early stages of optimization
Balanced Generation (Default)
- General molecule generation
- Most optimization tasks
- Standard use cases
Exploratory Generation (High Diversity)
- Late-stage optimization
- Exploring novel chemical space
- When stuck in local optima
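As a rough illustration, the three profiles above could be written as dictionaries of generation settings. The values are assumptions picked from the ranges given earlier on this page, not the contents of the shipped configuration files, and the key names do_sample and eos_token_id follow the Hugging Face generation convention, which is assumed here.

```python
# Settings shared by every profile (illustrative values, not the shipped config).
BASE_SETTINGS = {
    "max_new_tokens": 100,      # drug-like molecules, per the guidance above
    "repetition_penalty": 1.0,  # no penalty, the stated default
    "do_sample": True,          # assumed key name for the sampling mode
    "eos_token_id": 20,         # EOS token ID stated above
}

# Only the temperature differs between the hypothetical profiles.
PROFILES = {
    "conservative": {**BASE_SETTINGS, "temperature": 0.8},
    "balanced": {**BASE_SETTINGS, "temperature": 1.0},
    "exploratory": {**BASE_SETTINGS, "temperature": 1.3},
}

for name, settings in PROFILES.items():
    print(name, settings["temperature"])
```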
Rejection Sampling Generation
rejection_sampling_configs.py:1
- Optimization with iterative fine-tuning
- Generating many candidates per iteration
- Complex optimization landscapes
Complete Optimization Configuration
Example configuration from chemlactica_125m_hparams.yaml:
chemlactica_125m_hparams.yaml:1
Optimization-Specific Parameters
Similarity Range
Range for random similarity values in prompts during generation.
- Encourages exploration within structural constraints
- Lower bound controls minimum similarity
- Upper bound controls maximum similarity
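Drawing a random similarity target from the configured range might look like the sketch below. The helper name draw_similarity and the example bounds are hypothetical, and how the drawn value is embedded into the prompt is deliberately left out because it depends on the library's prompt format.

```python
import random


def draw_similarity(sim_range, rng=None):
    """Draw a similarity value uniformly between the lower and upper bound."""
    low, high = sim_range
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("sim_range must satisfy 0 <= low <= high <= 1")
    rng = rng or random
    return rng.uniform(low, high)


# Hypothetical range; a wider range lets prompts ask for less similar molecules.
print([round(draw_similarity([0.4, 0.9]), 2) for _ in range(5)])
```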
Generation Batch Size
Number of molecules to generate in parallel per batch. Limited by GPU memory and model size.
Generations Per Iteration
Total unique molecules to generate per optimization iteration.
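The relationship between the batch size and the per-iteration target can be sketched as a loop that keeps requesting batches until enough unique molecules have been collected. Everything here is a hypothetical stand-in: collect_unique, the generate_batch callable, and the fake generator are not part of the library.

```python
import itertools


def collect_unique(generate_batch, batch_size, target_unique, max_batches=100):
    """Call generate_batch repeatedly until target_unique distinct strings exist."""
    unique = []
    seen = set()
    for _ in range(max_batches):
        for smiles in generate_batch(batch_size):
            if smiles not in seen:
                seen.add(smiles)
                unique.append(smiles)
                if len(unique) == target_unique:
                    return unique
    # The batch limit was reached before the target; return what was found.
    return unique


def make_fake_generator():
    """Build a stand-in generator that emits every string twice."""
    counter = itertools.count()

    def generate_batch(n):
        # Each string appears twice so that deduplication has work to do.
        return ["C" * (1 + next(counter) // 2) for _ in range(n)]

    return generate_batch


molecules = collect_unique(make_fake_generator(), batch_size=4, target_unique=5)
print(molecules)
```

The max_batches cap is a design choice for the sketch: it prevents an endless loop when the generator keeps producing duplicates.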
Usage in Code
Loading Configuration
example_run.py:97
Passing to Optimization
example_run.py:108
Tuning Guidelines
Adjust for task
- High quality needed: Lower temperature to 0.8
- More diversity needed: Raise temperature to 1.2-1.4
Common Issues
Low validity rate
Problem: Many invalid SMILES generated
Solutions:
- Lower temperature (try 0.8-0.9)
- Reduce max_new_tokens
- Check prompt format
Insufficient diversity
Problem: Generated molecules too similar
Solutions:
- Increase temperature (try 1.2-1.4)
- Widen sim_range (e.g., [0.3, 0.95])
- Set repetition_penalty to 1.1
Context length exceeded
Problem: Prompts + generation exceed 2048 tokens
Solutions:
- Reduce max_new_tokens
- Limit number of similar molecules in prompt
- Truncate prompt history
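A quick budget check before generation can catch this problem early. The sketch below assumes the 2048-token window mentioned above and takes the prompt length as a plain integer, since counting tokens depends on the tokenizer in use; both helper names are hypothetical.

```python
CONTEXT_WINDOW = 2048  # context length stated in this section


def fits_context(prompt_tokens, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Return True when the prompt plus the generation budget fits the window."""
    return prompt_tokens + max_new_tokens <= context_window


def max_prompt_tokens(max_new_tokens, context_window=CONTEXT_WINDOW):
    """Return the largest prompt length that still leaves room for generation."""
    return max(context_window - max_new_tokens, 0)


print(fits_context(prompt_tokens=1900, max_new_tokens=100))
print(fits_context(prompt_tokens=1900, max_new_tokens=150))
print(max_prompt_tokens(max_new_tokens=100))
```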
OOM errors
Problem: GPU out of memory during generation
Solutions:
- Reduce generation_batch_size
- Use smaller model (125M vs 1.3B)
- Enable gradient checkpointing
Best Practices
Temperature Schedule
Start conservative (0.8-1.0), increase for exploration (1.2-1.5)
Validation
Always validate generated SMILES with RDKit before scoring
Batch Size
Maximize batch size for GPU efficiency without OOM
Token Budget
Monitor token usage to stay within 2048 context window