## Overview

ChemLactica uses a combination of YAML configuration files and Python dataclasses to configure model architecture and training hyperparameters. This guide covers all available options.
## Configuration Structure

### Configuration Files

Configuration is split into two main components:

```python
model_config, train_config = get_model_train_config(model_config_name)
```

- **Model Config**: architecture parameters (layers, heads, dimensions)
- **Train Config**: training hyperparameters (learning rate, batch size, etc.)
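The loader returns the two dataclasses together. A minimal, dependency-free sketch of how a parsed YAML dict could be mapped onto them (the `configs_from_dict` helper and the trimmed-down dataclasses here are illustrative, not the project's actual loader):

```python
from dataclasses import dataclass, fields

@dataclass
class ModelConfig:
    block_size: int = 2048
    vocab_size: int = 50000

@dataclass
class TrainConfig:
    max_learning_rate: float = 6.0e-4
    warmup_steps: int = 500

def configs_from_dict(raw):
    """Build config dataclasses from a parsed YAML dict, keeping
    dataclass defaults for any keys the file does not set."""
    def build(cls, section):
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in section.items() if k in known})
    return (build(ModelConfig, raw.get("model_config", {})),
            build(TrainConfig, raw.get("train_config", {})))

# e.g. the parsed contents of a pretrain YAML:
raw = {"model_config": {"block_size": 2048},
       "train_config": {"max_learning_rate": 1.4e-3}}
model_cfg, train_cfg = configs_from_dict(raw)
```

Keys absent from the YAML fall back to the dataclass defaults, so a config file only needs to list the values it overrides.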
## Available Model Configurations

### 125M

```yaml
model_config:
  n_heads: 12
  n_layers: 12
  block_size: 2048
  vocab_size: 50000
  separator_token: "</s>"
  separator_token_id: 2
  tokenizer_path: "./chemlactica/tokenizer/ChemLacticaTokenizer66"
```

### 1.3B

```yaml
model_config:
  d_heads: 64
  d_model: 2048
  n_heads: 32
  n_layers: 24
  block_size: 2048
  vocab_size: 50000
```

### 6.7B

```yaml
model_config:
  d_heads: 128
  d_model: 4096
  n_heads: 32
  n_layers: 32
  block_size: 2048
  vocab_size: 50000
```

### Mistral 7B

```yaml
model_config:
  vocab_size: 32000
  block_size: 2048
  hidden_size: 4096
  intermediate_size: 14336
  num_hidden_layers: 32
  num_attention_heads: 32
  num_key_value_heads: 8
  sliding_window: 512
  rope_theta: 10000.0
```

### Llama 2

```yaml
model_config:
  vocab_size: 32000
  hidden_size: 4096
  intermediate_size: 11008
  num_hidden_layers: 32
  num_attention_heads: 32
  num_key_value_heads: 32
  max_position_embeddings: 4096
  block_size: 2048
  hidden_act: "silu"
```
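As a sanity check on these sizes, the 125M config's parameter count can be roughly reproduced from the YAML above. The 125M config does not list a hidden dimension, so `d_model = 768` (12 heads × 64 per head, the standard Galactica-125M shape) is an assumption here:

```python
def approx_param_count(n_layers, d_model, vocab_size):
    """Rough transformer parameter count: token embeddings plus
    ~12 * d_model^2 per layer (4*d^2 for attention projections,
    8*d^2 for a 4x MLP), ignoring biases, layer norms, and
    positional embeddings."""
    embeddings = vocab_size * d_model
    per_layer = 12 * d_model ** 2
    return embeddings + n_layers * per_layer

total = approx_param_count(n_layers=12, d_model=768, vocab_size=50000)
print(f"{total / 1e6:.0f}M")  # ~123M, consistent with the "125M" name
```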
## Model Configuration

### ModelConfig Dataclass

```python
@dataclass
class ModelConfig:
    block_size: int = 2048
    vocab_size: int = 50000
    separator_token: str = "</s>"
    separator_token_id: int = 2
    tokenizer_path: str = "chemlactica/tokenizer/ChemLacticaTokenizer66"
```

- `block_size`: maximum sequence length the model can process
- `vocab_size`: size of the tokenizer vocabulary
- `separator_token`: token used to separate sequences
- `separator_token_id`: token ID for the separator token
- `tokenizer_path`: path to the tokenizer directory
## Training Configuration

### TrainConfig Dataclass

```python
@dataclass
class TrainConfig:
    adam_beta1: float = 0.9
    adam_beta2: float = 0.95
    batch_size: int = 500000
    dropout_prob: float = 0.1
    eval_step: int = 256
    global_gradient_norm: float = 1.0
    learning_rate_decay: float = 0.1
    max_learning_rate: float = 6.0e-4
    warmup_steps: int = 500
    weight_decay: float = 0.1
    optimizer: str = "adamw_torch"
    lr_scheduler_type: str = "linear"
    bf16: bool = True
    bf16_full_eval: bool = True
    fp16: bool = False
    tf32: bool = True
    evaluation_strategy: str = "steps"
    save_total_limit: int = 4
    grad_accumulation_scheduler: bool = False
    dynamic_grad_accumulation: bool = False
    grad_accumulation_patience: int = 4000
    grad_accumulation_max: int = 256
    grad_accumulation_delta_steps: int = 100
    grad_accumulation_delta_percentage: float = 0.02
```
### Optimizer Configuration

- `optimizer` (string, default `"adamw_torch"`): optimizer to use. Options: `adamw_torch`, `adamw_hf`, `sgd`, etc.
- `adam_beta1` (float, default `0.9`): Adam optimizer beta1 parameter (exponential decay rate for the first moment)
- `adam_beta2` (float, default `0.95`): Adam optimizer beta2 parameter (exponential decay rate for the second moment)
- `weight_decay` (float, default `0.1`): weight decay (L2 regularization) coefficient
- `global_gradient_norm` (float, default `1.0`): maximum gradient norm for gradient clipping
### Learning Rate Configuration

- `max_learning_rate` (float, default `6.0e-4`): maximum learning rate (peak LR after warmup)
- `warmup_steps` (int, default `500`): number of steps to warm the learning rate up from 0 to the maximum
- `lr_scheduler_type` (string, default `"linear"`): learning rate scheduler type. Options: `linear`, `constant_with_warmup`, `cosine`, `polynomial`
- `learning_rate_decay` (float, default `0.1`): learning rate decay factor (used by some schedulers)
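The warmup-then-decay shape these parameters describe can be sketched in a few lines. This is an illustrative reimplementation, not the trainer's scheduler (Hugging Face's `linear` scheduler decays to 0; the `min_lr_ratio` floor here is an assumption modeling `learning_rate_decay`):

```python
def lr_at_step(step, max_lr=6.0e-4, warmup_steps=500,
               total_steps=10_000, min_lr_ratio=0.1):
    """Linear warmup from 0 to max_lr over warmup_steps, then linear
    decay down to min_lr_ratio * max_lr by total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * (1.0 - (1.0 - min_lr_ratio) * min(frac, 1.0))

print(lr_at_step(0))       # 0.0 (start of warmup)
print(lr_at_step(500))     # 0.0006 (peak LR at end of warmup)
print(lr_at_step(10_000))  # ~6e-5 (decayed to 10% of the peak)
```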
### Precision Configuration

- `bf16` (bool, default `True`): use bfloat16 mixed-precision training
- `bf16_full_eval` (bool, default `True`): use bfloat16 for evaluation as well
- `fp16` (bool, default `False`): use float16 mixed precision (alternative to bf16)
- `tf32` (bool, default `True`): use TF32 precision on Ampere GPUs for matmuls
### Evaluation and Checkpointing

- `evaluation_strategy` (string, default `"steps"`): when to run evaluation. Options: `no`, `steps`, `epoch`
- `eval_step` (int, default `256`): number of steps between evaluations (when `evaluation_strategy="steps"`)
- `save_total_limit` (int, default `4`): maximum number of checkpoints to keep (older ones are deleted)
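`save_total_limit` behaves like a rolling window over checkpoint directories. A minimal sketch of that pruning behavior (the `checkpoint-<step>` naming follows the Hugging Face convention; the function itself is illustrative, not the Trainer's implementation):

```python
def prune_checkpoints(checkpoint_dirs, save_total_limit=4):
    """Return only the newest `save_total_limit` checkpoints.
    `checkpoint_dirs` are directory names like 'checkpoint-1000'."""
    by_step = sorted(checkpoint_dirs, key=lambda d: int(d.rsplit("-", 1)[1]))
    return by_step[-save_total_limit:]

kept = prune_checkpoints(
    [f"checkpoint-{s}" for s in (500, 1000, 1500, 2000, 2500, 3000)],
    save_total_limit=4,
)
# kept == ['checkpoint-1500', 'checkpoint-2000',
#          'checkpoint-2500', 'checkpoint-3000']
```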
### Gradient Accumulation

- `grad_accumulation_scheduler` (bool, default `False`): enable dynamic gradient accumulation scheduling
- `dynamic_grad_accumulation` (bool, default `False`): allow gradient accumulation steps to change during training
- `grad_accumulation_patience` (int, default `4000`): steps to wait before adjusting gradient accumulation
- `grad_accumulation_max` (int, default `256`): maximum gradient accumulation steps (when dynamic)
- `grad_accumulation_delta_steps` (int, default `100`): steps between gradient accumulation adjustments
- `grad_accumulation_delta_percentage` (float, default `0.02`): percentage change in gradient accumulation per adjustment
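How these knobs interact can be sketched as a simple scheduler. This is an illustrative model of the behavior the parameter names suggest, not the project's actual scheduling code:

```python
def next_grad_accum(step, current, patience=4000, delta_steps=100,
                    delta_percentage=0.02, max_accum=256):
    """Illustrative dynamic schedule: leave the accumulation count
    unchanged until `patience` steps have passed, then grow it by
    `delta_percentage` (at least +1) every `delta_steps` steps,
    capped at `max_accum`."""
    if step < patience or step % delta_steps != 0:
        return current
    grown = max(current + 1, round(current * (1 + delta_percentage)))
    return min(grown, max_accum)

print(next_grad_accum(3999, 8))    # 8   (still within patience)
print(next_grad_accum(4000, 8))    # 9   (first adjustment)
print(next_grad_accum(4000, 256))  # 256 (capped at grad_accumulation_max)
```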
### Other Parameters

- `batch_size` (int, default `500000`): global batch size (used for calculating the effective batch size)
- `dropout_prob` (float, default `0.1`): dropout probability for regularization
## Model-Specific Configurations

### 125M Configuration

```yaml
# config/config_yamls/galactica_125m_pretrain_config.yaml
train_config:
  adam_beta1: 0.9
  adam_beta2: 0.95
  batch_size: 500000
  global_gradient_norm: 1.0
  max_learning_rate: 1.4e-3
  warmup_steps: 500
  weight_decay: 0.1
  bf16: true
  evaluation_strategy: "no"
  save_total_limit: 8
```

Recommended for:

- Quick experimentation
- Single-GPU training
- Limited compute resources
### 1.3B Configuration

```yaml
train_config:
  adam_beta1: 0.9
  adam_beta2: 0.95
  global_gradient_norm: 1.0
  max_learning_rate: 1.4e-3
  warmup_steps: 500
  weight_decay: 0.1
  bf16: true
```

Recommended for:

- Multi-GPU training (2-4 GPUs)
- Production use cases
- A good balance of performance and resources
### Mistral 7B Configuration

```yaml
train_config:
  max_learning_rate: 5.0e-4
  warmup_steps: 2000
  global_gradient_norm: 1.0
  weight_decay: 0.1
  adam_beta1: 0.9
  adam_beta2: 0.95
```

Recommended for:

- Multi-GPU training (4-8 GPUs)
- Large-scale pretraining
- State-of-the-art performance
### Llama 2 Configuration

```yaml
train_config:
  max_learning_rate: 3.0e-5
  warmup_steps: 500
  global_gradient_norm: 0.1  # Lower gradient clipping
  weight_decay: 0.1
  adam_beta1: 0.9
  adam_beta2: 0.95
```

Recommended for:

- Fine-tuning tasks
- Leveraging pretrained Llama 2 weights
- Chemistry-specific adaptations
### SFT Configuration

```yaml
# config/config_yamls/galactica_125m_sft_config.yaml
train_config:
  max_learning_rate: 1.0e-4  # Much lower than pretraining
  warmup_steps: 0            # Typically no warmup
  evaluation_strategy: "steps"
  save_total_limit: 4
```
### SFTTrainConfig

```python
@dataclass
class SFTTrainConfig:
    packing: bool = False
    max_seq_length: int = 64
    neftune_noise_alpha: int = 10
```

- `packing`: pack multiple short samples into single sequences
- `max_seq_length`: maximum sequence length for SFT samples
- `neftune_noise_alpha`: noise level for NEFTune (embedding perturbation)
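NEFTune adds uniform noise to the input embeddings during fine-tuning, scaled by alpha / sqrt(L·d) where L is the sequence length and d the embedding dimension (the scaling rule from the NEFTune paper). A dependency-free sketch of the perturbation, not the trainer's implementation:

```python
import math
import random

def neftune_perturb(embeddings, alpha=10.0):
    """Add NEFTune-style noise to a [seq_len][dim] embedding matrix.
    Each element gets Uniform(-1, 1) noise scaled by
    alpha / sqrt(seq_len * dim)."""
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + random.uniform(-1.0, 1.0) * scale for x in row]
            for row in embeddings]

emb = [[0.0] * 64 for _ in range(64)]  # 64 tokens, 64-dim embeddings
noisy = neftune_perturb(emb, alpha=10.0)
# every perturbation is bounded by alpha / sqrt(64 * 64) = 10/64
```

The scale shrinks with sequence length, so long sequences are perturbed less per element, which is why a single `neftune_noise_alpha` works across sample lengths.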
### CustomArguments

Extends Hugging Face `TrainingArguments` with ChemLactica-specific options:

```python
@dataclass
class CustomArguments(TrainingArguments):
    slurm_eval: bool = field(
        default=False,
        metadata={"help": "Whether to run eval via slurm job."},
    )
    command: str = field(default=None)
    experiment_name: str = field(default=None)
    tokenizer_path: str = field(
        default="chemlactica/tokenizer/ChemLacticaTokenizer66"
    )
```
### Key Training Arguments

From `train.py:217-264`:

```python
training_args = CustomArguments(
    command=command,
    slurm_eval=slurm_eval,
    experiment_name=experiment_name,
    tokenizer_path=model_config.tokenizer_path,
    do_train=not evaluate_only,
    output_dir=checkpoints_dir,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=valid_batch_size,
    log_on_each_node=True,
    bf16=train_config.bf16,
    bf16_full_eval=train_config.bf16_full_eval,
    fp16=train_config.fp16,
    tf32=train_config.tf32,
    logging_dir=track_dir,
    learning_rate=learning_rate or train_config.max_learning_rate,
    weight_decay=train_config.weight_decay,
    adam_beta1=train_config.adam_beta1,
    adam_beta2=train_config.adam_beta2,
    warmup_steps=warmup_steps or train_config.warmup_steps,
    max_grad_norm=train_config.global_gradient_norm,
    evaluation_strategy=train_config.evaluation_strategy,
    max_steps=scheduler_max_steps,
    num_train_epochs=num_train_epochs,
    eval_steps=eval_steps,
    save_steps=save_steps,
    dataloader_drop_last=True,
    dataloader_pin_memory=True,
    dispatch_batches=False,
    dataloader_num_workers=dataloader_num_workers,
    logging_steps=1,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_total_limit=train_config.save_total_limit,
    resume_from_checkpoint=resume_from_checkpoint,
    lr_scheduler_type=train_config.lr_scheduler_type,
    optim=train_config.optimizer,
)
```
## Accelerate Configuration

For distributed training, ChemLactica uses Accelerate:

```yaml
# config/accelerate_config.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: FSDP  # or DDP, MULTI_GPU
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4  # Number of GPUs
use_cpu: false
```
### FSDP Configuration

```python
# From custom_trainer.py:77-81
fsdp_plugin.limit_all_gathers = self.args.fsdp_config.get(
    "limit_all_gathers", fsdp_plugin.limit_all_gathers
)
fsdp_plugin.activation_checkpointing = self.args.fsdp_config.get(
    "activation_checkpointing", fsdp_plugin.activation_checkpointing
)
```
## Configuration Examples

### Small-Scale Pretraining

```yaml
train_config:
  max_learning_rate: 1.4e-3
  warmup_steps: 500
  global_gradient_norm: 1.0
  weight_decay: 0.1
  bf16: true
  evaluation_strategy: "steps"
  save_total_limit: 4
```

### Large-Scale Pretraining

```yaml
train_config:
  max_learning_rate: 5.0e-4
  warmup_steps: 2000
  global_gradient_norm: 1.0
  weight_decay: 0.1
  bf16: true
  evaluation_strategy: "steps"
  grad_accumulation_scheduler: true
  dynamic_grad_accumulation: true
```

### Fine-Tuning

```yaml
train_config:
  max_learning_rate: 1.0e-4
  warmup_steps: 0
  global_gradient_norm: 1.0
  weight_decay: 0.1
  bf16: true
  evaluation_strategy: "steps"
  save_total_limit: 4
```
## Best Practices

### Learning Rate

Pretraining:

- 125M: 1.4e-3
- 1.3B: 1.4e-3
- 6.7B: 1.2e-4
- Mistral 7B: 5.0e-4
- Llama 2: 3.0e-5

Fine-tuning:

- 10-100x lower than pretraining
- Typical range: 1e-5 to 1e-4

### Warmup Steps

- Pretraining: 500-2000 steps
- SFT: 0-100 steps
- Rule of thumb: ~1-5% of total steps
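The 1-5% rule of thumb is easy to turn into a helper (the step counts below are made-up examples, not values from the configs):

```python
def warmup_from_total(total_steps, fraction=0.03):
    """Pick warmup steps as a fraction (~1-5%) of total training steps."""
    return max(1, round(total_steps * fraction))

# 20,000 total steps at 3% -> 600 warmup steps, and at 2.5% -> 500,
# consistent with the 500-2000 range used for pretraining
print(warmup_from_total(20_000))         # 600
print(warmup_from_total(20_000, 0.025))  # 500
```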
### Batch Size

- Larger batches give more stable training
- Use gradient accumulation if GPU memory is limited
- Effective batch size = per_device_batch_size × num_gpus × grad_accum_steps
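For example, the effective-batch-size formula works out as follows (the GPU count and batch sizes here are made-up illustration values):

```python
def effective_batch_size(per_device_batch_size, num_gpus, grad_accum_steps):
    """Effective (global) batch size, in samples per optimizer step."""
    return per_device_batch_size * num_gpus * grad_accum_steps

# 16 samples/GPU on 4 GPUs with 8 accumulation steps -> 512 samples/step
print(effective_batch_size(16, 4, 8))  # 512
```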
### Gradient Clipping

- The default of 1.0 works well for most cases
- Use a lower value (0.1) for Llama 2 fine-tuning
- Monitor gradient norms during training

### Precision

- BF16 is recommended for Ampere GPUs (A100, A6000)
- FP16 for older GPUs (V100)
- TF32 provides a free speedup on Ampere
## Environment Variables

```python
# From train.py:53
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "caching_allocator"
```

```shell
# Authentication for private models
export HF_TOKEN="your_huggingface_token"
```
## Next Steps

- **Pretraining**: start pretraining with these configs
- **Fine-tuning**: apply configs to fine-tuning
- **Rejection Sampling**: configure rejection sampling