
Overview

ChemLactica supports pretraining language models on chemical data using the train.py script. The training system is built on top of Hugging Face Transformers with custom features for efficient distributed training.

Quick Start

Step 1: Prepare Your Data

Organize your training data as JSONL files in one or more directories. You’ll need:
  • Training data directories (can be multiple)
  • Validation data directory
  • Data type labels for each directory
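As a concrete sketch of the JSONL layout, assuming each record is a JSON object with a `text` field wrapped in Galactica-style `[START_SMILES]…[END_SMILES]` tags (the exact schema ChemLactica expects may differ; check your preprocessing pipeline):

```python
import json
from pathlib import Path

# Hypothetical layout: one directory per data type, one or more JSONL shards each.
train_dir = Path("data/train/smiles")
train_dir.mkdir(parents=True, exist_ok=True)

records = [
    {"text": "[START_SMILES]CCO[END_SMILES]"},       # ethanol
    {"text": "[START_SMILES]c1ccccc1[END_SMILES]"},  # benzene
]

# One JSON object per line -- the JSONL convention.
with open(train_dir / "part-000.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```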
Step 2: Choose Model Configuration

Select a model configuration from the available options:
  • 125m - 125 million parameter model
  • 1.3b - 1.3 billion parameter model
  • 6.7b - 6.7 billion parameter model
  • mistral7b - Mistral 7B based model
  • llama2 - Llama 2 based model
Step 3: Run Training

Execute the training script with your configuration.

Basic Training Command

python chemlactica/train.py \
  --train_type pretrain \
  --from_pretrained facebook/galactica-125m \
  --model_config 125m \
  --training_data_dirs /path/to/train/data \
  --dir_data_types smiles \
  --valid_data_dir /path/to/valid/data \
  --train_batch_size 16 \
  --eval_steps 500 \
  --save_steps 1000 \
  --max_steps 10000 \
  --checkpoints_root_dir ./checkpoints \
  --experiment_name my_experiment
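Before launching a long run, it can help to sanity-check that every training directory actually contains JSONL files and that directories and data-type labels pair up. A small preflight sketch (this helper is not part of ChemLactica; the demo directory is a throwaway):

```python
from pathlib import Path
import tempfile

def preflight(training_data_dirs, dir_data_types):
    """Fail fast on mismatched labels or empty/missing data directories."""
    if len(training_data_dirs) != len(dir_data_types):
        raise ValueError("need exactly one data-type label per training directory")
    for d in training_data_dirs:
        if not list(Path(d).glob("*.jsonl")):
            raise FileNotFoundError(f"no .jsonl files found in {d}")

# Demo with a throwaway directory containing one JSONL shard
demo = Path(tempfile.mkdtemp())
(demo / "part-000.jsonl").write_text('{"text": "CCO"}\n')
preflight([demo], ["smiles"])  # passes silently
```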

Command-Line Arguments

Required Arguments

--train_type
string
required
Type of training to perform. Options: pretrain, sft, isft, dpo
--from_pretrained
string
required
Path to pretrained model directory or Hugging Face model identifier
--model_config
string
required
Model configuration name (e.g., 125m, 1.3b, llama2)
--training_data_dirs
string[]
required
List of directories containing training data in JSONL format
--dir_data_types
string[]
required
Data type labels for each training directory (same order as training_data_dirs)
--valid_data_dir
string
required
Directory containing validation data
--train_batch_size
integer
required
Training batch size per GPU
--eval_steps
integer
required
Number of training steps between evaluations
--save_steps
integer
required
Number of steps between checkpoint saves
--checkpoints_root_dir
string
required
Root directory for saving model checkpoints

Optional Arguments

--learning_rate
float
Learning rate (defaults to config value if not specified)
--warmup_steps
integer
Number of warmup steps for learning rate scheduler
--max_steps
integer
default:"-1"
Maximum number of training steps (overrides num_train_epochs)
--num_train_epochs
integer
Number of training epochs
--scheduler_max_steps
integer
Number of steps for LR scheduler (defaults to max_steps)
--valid_batch_size
integer
Validation batch size (defaults to train_batch_size)
--shuffle_buffer_size
integer
default:"4"
Buffer size for dataset shuffling
--experiment_name
string
default:"none"
Name for the experiment (used in tracking)
--dataloader_num_workers
integer
default:"0"
Number of dataloader worker processes
--gradient_accumulation_steps
integer
default:"1"
Number of gradient accumulation steps
--gradient_checkpointing
boolean
default:"false"
Enable gradient checkpointing to save memory
--flash_attn
boolean
default:"false"
Use Flash Attention 2 for faster training
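The batch-related arguments interact: the effective batch size per optimizer step is the per-GPU batch size times the gradient accumulation steps times the number of GPUs. A quick back-of-the-envelope check:

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_gpus):
    """Sequences consumed per optimizer step: per-GPU batch x accumulation x GPUs."""
    return train_batch_size * gradient_accumulation_steps * num_gpus

# 16 per GPU, 4 accumulation steps, 8 GPUs -> 512 sequences per optimizer step
print(effective_batch_size(16, 4, 8))  # → 512
```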

Tracking and Debugging

--track
boolean
default:"true"
Enable experiment tracking with Aim
--track_dir
string
Directory for saving tracking data
--profile
boolean
default:"false"
Enable PyTorch profiler for performance analysis
--profile_dir
string
Directory for profiling output
--check_reproducability
boolean
default:"false"
Enable reproducibility checks (for testing only)

Advanced Usage

Multi-GPU Training

ChemLactica uses Accelerate for distributed training. Launch with:
accelerate launch --config_file config/accelerate_config.yaml \
  chemlactica/train.py \
  --train_type pretrain \
  --from_pretrained facebook/galactica-125m \
  --model_config 125m \
  # ... other arguments

Resume from Checkpoint

To resume training from a checkpoint, pass the checkpoint directory path to --from_pretrained:
python chemlactica/train.py \
  --train_type pretrain \
  --from_pretrained ./checkpoints/facebook/galactica-125m/hash123/checkpoint-5000 \
  --model_config 125m \
  # ... other arguments

Multiple Data Sources

You can train on multiple data directories with different types:
python chemlactica/train.py \
  --training_data_dirs /path/to/smiles /path/to/reactions /path/to/properties \
  --dir_data_types smiles reactions properties \
  # ... other arguments
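The data-type labels pair with the directories positionally, so the order of the two lists must match. A quick illustration (paths are the placeholders from the command above):

```python
# --dir_data_types labels pair positionally with --training_data_dirs.
training_data_dirs = ["/path/to/smiles", "/path/to/reactions", "/path/to/properties"]
dir_data_types = ["smiles", "reactions", "properties"]

pairing = dict(zip(training_data_dirs, dir_data_types))
for directory, data_type in pairing.items():
    print(f"{directory} -> {data_type}")
```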

Dynamic Gradient Accumulation

Enable automatic gradient accumulation scheduling:
# In your config YAML
train_config:
  grad_accumulation_scheduler: true
  dynamic_grad_accumulation: true
  grad_accumulation_max: 256
  grad_accumulation_delta_steps: 100
  grad_accumulation_delta_percentage: 0.02
  grad_accumulation_patience: 4000
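One plausible reading of these knobs, sketched as a toy schedule (this is NOT ChemLactica's actual GradientAccumulationScheduler, and the patience term is ignored here): every `delta_steps` training steps, the accumulation factor grows by `delta_percentage`, capped at `grad_accumulation_max`.

```python
import math

def accumulation_at(step, start=1, delta_steps=100,
                    delta_percentage=0.02, maximum=256):
    """Toy schedule: grow accumulation by delta_percentage every delta_steps,
    capped at maximum. Illustrative only -- not the library's implementation."""
    increases = step // delta_steps
    return min(maximum, math.ceil(start * (1 + delta_percentage) ** increases))
```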

Training Features

Custom Callbacks

ChemLactica includes several custom callbacks:
  • WPSCounterCallback: Tracks words per second
  • CustomProgressCallback: Enhanced progress reporting with FLOPS tracking
  • EarlyStoppingCallback: Stops training at specified steps
  • JsonlDatasetResumeCallback: Handles dataset resumption for streaming data
  • ReproducibilityCallback: Validates training reproducibility
  • GradientAccumulationScheduler: Dynamically adjusts gradient accumulation

Memory Optimization

For large models, enable gradient checkpointing to trade extra compute for lower activation memory:
python chemlactica/train.py \
  --gradient_checkpointing \
  # ... other arguments

Checkpoint Management

Checkpoints are saved with the following structure:
checkpoints_root_dir/
└── organization/
    └── model_name/
        └── experiment_hash/
            ├── checkpoint-1000/
            ├── checkpoint-2000/
            └── last/
The save_total_limit configuration controls how many checkpoints are kept:
# From train.py:259
save_total_limit=train_config.save_total_limit  # Default: 4
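Given that layout, resuming from the most recent checkpoint can be automated by picking the highest step number. A convenience sketch (this helper is not part of ChemLactica):

```python
from pathlib import Path

def latest_checkpoint(experiment_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    ckpts = [p for p in Path(experiment_dir).glob("checkpoint-*") if p.is_dir()]
    if not ckpts:
        return None
    return max(ckpts, key=lambda p: int(p.name.split("-")[1]))
```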

Monitoring Training

Aim Tracking

When --track is enabled, training metrics are logged to Aim:
python chemlactica/train.py \
  --track \
  --track_dir ./aim_logs \
  --experiment_name chemistry_pretrain \
  # ... other arguments
View metrics with:
aim up --repo ./aim_logs

Key Metrics

  • Training loss
  • Validation loss
  • Learning rate
  • Words per second (WPS)
  • Gradient norm
  • FLOPS utilization
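The throughput number reported by WPSCounterCallback can be cross-checked by hand: roughly the total tokens processed divided by elapsed wall time. A back-of-the-envelope sketch (the formula here is an approximation, not the callback's exact accounting):

```python
def words_per_second(batch_size, seq_len, steps, elapsed_seconds, num_gpus=1):
    """Approximate throughput: total tokens processed / wall-clock seconds."""
    return batch_size * seq_len * steps * num_gpus / elapsed_seconds

# 16 sequences x 2048 tokens x 100 steps on 1 GPU in 64 s
print(words_per_second(16, 2048, 100, 64))  # → 51200.0
```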

Example Configuration

python chemlactica/train.py \
  --train_type pretrain \
  --from_pretrained facebook/galactica-125m \
  --model_config 125m \
  --training_data_dirs ./data/train \
  --dir_data_types smiles \
  --valid_data_dir ./data/valid \
  --learning_rate 1.4e-3 \
  --warmup_steps 500 \
  --train_batch_size 32 \
  --eval_steps 500 \
  --save_steps 1000 \
  --max_steps 50000 \
  --checkpoints_root_dir ./checkpoints \
  --experiment_name galactica_125m_pretrain

Troubleshooting

Out of Memory

If you encounter OOM errors:
  1. Reduce train_batch_size
  2. Increase gradient_accumulation_steps
  3. Enable --gradient_checkpointing
  4. Use --flash_attn if available
  5. Reduce save_total_limit if checkpoints are exhausting disk space (this does not affect GPU memory)
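Steps 1 and 2 trade off against each other: halving train_batch_size while doubling gradient_accumulation_steps keeps the effective batch size (and thus the optimization behavior) unchanged while cutting peak activation memory. For example:

```python
# Original configuration: 16 x 1 = effective batch of 16 per GPU
old = {"train_batch_size": 16, "gradient_accumulation_steps": 1}
# OOM fix: halve the per-GPU batch, double the accumulation
new = {"train_batch_size": 8, "gradient_accumulation_steps": 2}

def effective(cfg):
    """Per-GPU effective batch: micro-batch x accumulation steps."""
    return cfg["train_batch_size"] * cfg["gradient_accumulation_steps"]

assert effective(old) == effective(new) == 16
```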

Slow Training

  • Enable --flash_attn for 2-3x speedup
  • Increase dataloader_num_workers
  • Adjust shuffle_buffer_size
  • Use mixed precision training (bf16 is enabled by default)

Checkpoint Loading Issues

Make sure the checkpoint path includes the full path to the checkpoint directory, not just the parent directory.

Next Steps

Fine-tuning

Learn how to fine-tune models for specific tasks

Configuration

Explore all training configuration options
