Overview
ChemLactica supports pretraining language models on chemical data using the train.py script. The training system is built on top of Hugging Face Transformers with custom features for efficient distributed training.
Quick Start
Prepare Your Data
Organize your training data in JSONL format in directories. You’ll need:
- Training data directories (can be multiple)
- Validation data directory
- Data type labels for each directory
Choose Model Configuration
Select a model configuration from the available options:
- 125m: 125 million parameter model
- 1.3b: 1.3 billion parameter model
- 6.7b: 6.7 billion parameter model
- mistral7b: Mistral 7B based model
- llama2: Llama 2 based model
Basic Training Command
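A minimal sketch of a pretraining launch, including multiple training data directories with matching data type labels. The exact flag spellings here are assumptions inferred from the argument names listed below, not verified against train.py; check the script's --help output for the authoritative list:

```shell
# Hypothetical invocation; flag spellings are assumptions based on the argument list below
python train.py \
  --train_type pretrain \
  --from_pretrained facebook/galactica-125m \
  --model_config 125m \
  --training_data_dirs /data/pubchem /data/chembl \
  --dir_data_types computed assay \
  --valid_data_dir /data/valid \
  --train_batch_size 16 \
  --eval_steps 500 \
  --save_steps 1000 \
  --checkpoints_root_dir ./checkpoints
```

The data paths, data type labels, and the base model identifier are illustrative placeholders; substitute your own.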
Command-Line Arguments
Required Arguments
- Type of training to perform. Options: pretrain, sft, isft, dpo
- Path to a pretrained model directory or Hugging Face model identifier (--from_pretrained)
- Model configuration name (e.g., 125m, 1.3b, llama2)
- List of directories containing training data in JSONL format (training_data_dirs)
- Data type labels for each training directory (same order as training_data_dirs)
- Directory containing validation data
- Training batch size per GPU (train_batch_size)
- Number of training steps between evaluations
- Number of steps between checkpoint saves
- Root directory for saving model checkpoints
Optional Arguments
- Learning rate (defaults to the config value if not specified)
- Number of warmup steps for the learning rate scheduler
- Maximum number of training steps (max_steps; overrides num_train_epochs)
- Number of training epochs (num_train_epochs)
- Number of steps for the LR scheduler (defaults to max_steps)
- Validation batch size (defaults to train_batch_size)
- Buffer size for dataset shuffling (shuffle_buffer_size)
- Name for the experiment (used in tracking)
- Number of dataloader worker processes (dataloader_num_workers)
- Number of gradient accumulation steps (gradient_accumulation_steps)
- Enable gradient checkpointing to save memory (--gradient_checkpointing)
- Use Flash Attention 2 for faster training (--flash_attn)
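As a sanity check when tuning these knobs: the effective batch size per optimizer update is the per-GPU batch size times the number of GPUs times gradient_accumulation_steps. With illustrative numbers:

```shell
# Effective batch per optimizer step = per-GPU batch x GPUs x accumulation steps
PER_GPU_BATCH=4
NUM_GPUS=8
ACCUM_STEPS=16
echo $((PER_GPU_BATCH * NUM_GPUS * ACCUM_STEPS))  # prints 512
```

Keeping this product constant lets you trade train_batch_size against gradient_accumulation_steps without changing the optimization dynamics.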
Tracking and Debugging
- Enable experiment tracking with Aim (--track)
- Directory for saving tracking data
- Enable PyTorch profiler for performance analysis
- Directory for profiling output
- Enable reproducibility checks (for testing only)
Advanced Usage
Multi-GPU Training
ChemLactica uses Accelerate for distributed training; launch train.py through the accelerate launch command.
Resume from Checkpoint
To resume training from a checkpoint, pass the checkpoint directory path to --from_pretrained.
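Putting the two together, a hedged sketch of a multi-GPU launch that resumes from a saved checkpoint (the process count, checkpoint path, and remaining flags are assumptions):

```shell
# Multi-GPU launch via Accelerate; resume by pointing --from_pretrained at a checkpoint directory
accelerate launch --num_processes 8 train.py \
  --from_pretrained ./checkpoints/my_experiment/checkpoint-1000 \
  --model_config 125m \
  ...
```

accelerate launch reads the rest of its distributed setup from your Accelerate config (accelerate config), so the same command works for single-node and multi-node runs.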
Multiple Data Sources
You can train on multiple data directories with different types by passing several paths, each with a matching data type label.
Dynamic Gradient Accumulation
Enable automatic gradient accumulation scheduling, handled by the GradientAccumulationScheduler callback.
Training Features
Custom Callbacks
ChemLactica includes several custom callbacks:
- WPSCounterCallback: Tracks words per second
- CustomProgressCallback: Enhanced progress reporting with FLOPS tracking
- EarlyStoppingCallback: Stops training at specified steps
- JsonlDatasetResumeCallback: Handles dataset resumption for streaming data
- ReproducibilityCallback: Validates training reproducibility
- GradientAccumulationScheduler: Dynamically adjusts gradient accumulation
Memory Optimization
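The memory-saving options described on this page can be combined. A sketch, assuming the flag spellings used elsewhere in this guide:

```shell
# Trade compute for memory: small per-GPU batch, more accumulation,
# gradient checkpointing, and Flash Attention 2
python train.py \
  --train_batch_size 2 \
  --gradient_accumulation_steps 32 \
  --gradient_checkpointing \
  --flash_attn \
  ...
```

Gradient checkpointing recomputes activations during the backward pass instead of storing them, so it lowers memory use at the cost of extra compute.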
Checkpoint Management
Checkpoints are saved under the checkpoints root directory, and the save_total_limit configuration controls how many checkpoints are kept.
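ChemLactica's exact on-disk layout is not reproduced here; as a rough guide, Hugging Face Trainer checkpoints typically look like the following (names illustrative):

```
checkpoints_root_dir/
└── experiment_name/
    ├── checkpoint-1000/
    │   ├── config.json
    │   ├── model.safetensors
    │   ├── optimizer.pt
    │   ├── scheduler.pt
    │   └── trainer_state.json
    └── checkpoint-2000/
```

With save_total_limit set to 2, for example, only the two most recent checkpoint-* directories are retained and older ones are deleted.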
Monitoring Training
Aim Tracking
When --track is enabled, training metrics are logged to Aim.
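For example, assuming a tracking-directory flag alongside --track (the flag name here is an assumption), you might run training with tracking enabled and then open the Aim UI against that directory:

```shell
# Enable Aim tracking during training (tracking-dir flag name assumed), then browse runs
python train.py --track --track_dir ./aim_logs ...
aim up --repo ./aim_logs
```

aim up serves a local web UI where you can compare runs and plot the metrics listed below.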
Key Metrics
- Training loss
- Validation loss
- Learning rate
- Words per second (WPS)
- Gradient norm
- FLOPS utilization
Example Configurations
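Two hedged sketches, with all flag spellings and values as assumptions: a quick single-GPU 125m debugging run, and a larger llama2 run across 8 GPUs with the memory optimizations enabled.

```shell
# Small 125m debug run on a single GPU
python train.py --model_config 125m --train_batch_size 8 --max_steps 100 ...

# Larger llama2 run across 8 GPUs with memory savings
accelerate launch --num_processes 8 train.py \
  --model_config llama2 \
  --train_batch_size 2 \
  --gradient_accumulation_steps 16 \
  --gradient_checkpointing \
  --flash_attn \
  ...
```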
Troubleshooting
Out of Memory
If you encounter OOM errors:
- Reduce train_batch_size
- Increase gradient_accumulation_steps
- Enable --gradient_checkpointing
- Use --flash_attn if available
- Reduce save_total_limit to save disk space
Slow Training
- Enable --flash_attn for a 2-3x speedup
- Increase dataloader_num_workers
- Adjust shuffle_buffer_size
- Use mixed precision training (bf16 is enabled by default)
Checkpoint Loading Issues
Make sure the checkpoint path is the full path to the checkpoint directory itself, not just its parent directory.
Next Steps
Fine-tuning
Learn how to fine-tune models for specific tasks
Configuration
Explore all training configuration options