
Overview

ChemLactica uses a combination of YAML configuration files and Python dataclasses to configure model architecture and training hyperparameters. This guide covers all available options.

Configuration Structure

Configuration Files

Configuration is split into two main components:
  • Model Config: Architecture parameters (layers, heads, dimensions)
  • Train Config: Training hyperparameters (learning rate, batch size, etc.)
Both are loaded together by name:
model_config, train_config = get_model_train_config(model_config_name)
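A minimal usage sketch (the config name and import path are illustrative, not confirmed by the codebase):

# Sketch: loading both configs by name and reading one field from each.
# "galactica_125m" is a hypothetical name; see the model-specific
# configurations below for the options that actually ship.
model_config, train_config = get_model_train_config("galactica_125m")
print(model_config.block_size, train_config.max_learning_rate)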

Available Model Configurations

model_config:
  n_heads: 12
  n_layers: 12
  block_size: 2048
  vocab_size: 50000
  separator_token: "</s>"
  separator_token_id: 2
  tokenizer_path: "./chemlactica/tokenizer/ChemLacticaTokenizer66"
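A quick sketch of reading this YAML directly, assuming PyYAML is installed (the file path is illustrative):

import yaml

# Sketch: load the model_config section of a config YAML.
with open("config/config_yamls/galactica_125m_pretrain_config.yaml") as f:
    cfg = yaml.safe_load(f)["model_config"]
print(cfg["n_layers"], cfg["n_heads"], cfg["block_size"])  # 12 12 2048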

Model Configuration

ModelConfig Dataclass

@dataclass
class ModelConfig:
    block_size: int = 2048
    vocab_size: int = 50000
    separator_token: str = "</s>"
    separator_token_id: int = 2
    tokenizer_path: str = "chemlactica/tokenizer/ChemLacticaTokenizer66"

block_size (integer, default: 2048): Maximum sequence length the model can process.
vocab_size (integer, default: 50000): Size of the tokenizer vocabulary.
separator_token (string, default: "</s>"): Token used to separate sequences.
separator_token_id (integer, default: 2): Token ID for the separator token.
tokenizer_path (string): Path to the tokenizer directory.
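Because ModelConfig is a plain dataclass, individual fields can be overridden without touching the YAML; a minimal sketch:

from dataclasses import replace

# Sketch: override one field, keep the remaining defaults.
base = ModelConfig()
small = replace(base, block_size=1024)
assert small.vocab_size == 50000  # untouched default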

Training Configuration

TrainConfig Dataclass

@dataclass
class TrainConfig:
    adam_beta1: float = 0.9
    adam_beta2: float = 0.95
    batch_size: int = 500000
    dropout_prob: float = 0.1
    eval_step: int = 256
    global_gradient_norm: float = 1.0
    learning_rate_decay: float = 0.1
    max_learning_rate: float = 6.0e-4
    warmup_steps: int = 500
    weight_decay: float = 0.1
    optimizer: str = "adamw_torch"
    lr_scheduler_type: str = "linear"
    bf16: bool = True
    bf16_full_eval: bool = True
    fp16: bool = False
    tf32: bool = True
    evaluation_strategy: str = "steps"
    save_total_limit: int = 4
    grad_accumulation_scheduler: bool = False
    dynamic_grad_accumulation: bool = False
    grad_accumulation_patience: int = 4000
    grad_accumulation_max: int = 256
    grad_accumulation_delta_steps: int = 100
    grad_accumulation_delta_percentage: float = 0.02

Optimizer Configuration

optimizer (string, default: "adamw_torch"): Optimizer to use. Options: adamw_torch, adamw_hf, sgd, etc.
adam_beta1 (float, default: 0.9): Adam beta1 parameter (exponential decay rate for the first moment).
adam_beta2 (float, default: 0.95): Adam beta2 parameter (exponential decay rate for the second moment).
weight_decay (float, default: 0.1): Weight decay (L2 regularization) coefficient.
global_gradient_norm (float, default: 1.0): Maximum gradient norm for gradient clipping.
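For reference, these fields map onto a standard PyTorch AdamW plus gradient clipping; a minimal sketch (in ChemLactica the Hugging Face Trainer wires this up internally):

import torch

# Sketch: the optimizer fields expressed as plain PyTorch calls.
model = torch.nn.Linear(8, 8)  # stand-in model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6.0e-4,            # max_learning_rate
    betas=(0.9, 0.95),    # adam_beta1, adam_beta2
    weight_decay=0.1,     # weight_decay
)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # global_gradient_norm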

Learning Rate Configuration

max_learning_rate (float, default: 6.0e-4): Peak learning rate reached at the end of warmup.
warmup_steps (integer, default: 500): Number of steps over which the learning rate warms up from 0 to the maximum.
lr_scheduler_type (string, default: "linear"): Learning rate scheduler type. Options: linear, constant_with_warmup, cosine, polynomial.
learning_rate_decay (float, default: 0.1): Learning rate decay factor (used by some schedulers).
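A minimal sketch of the resulting schedule using transformers.get_scheduler (`optimizer` is the AdamW instance from the previous sketch; the total step count is illustrative):

from transformers import get_scheduler

# Sketch: warmup followed by linear decay, built from the fields above.
lr_scheduler = get_scheduler(
    name="linear",              # lr_scheduler_type
    optimizer=optimizer,
    num_warmup_steps=500,       # warmup_steps
    num_training_steps=10_000,  # illustrative total step count
)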

Precision Configuration

bf16 (boolean, default: true): Use bfloat16 mixed-precision training.
bf16_full_eval (boolean, default: true): Use bfloat16 for evaluation as well.
fp16 (boolean, default: false): Use float16 mixed precision (alternative to bf16).
tf32 (boolean, default: true): Use TF32 precision for matmuls on Ampere GPUs.
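These flags correspond to the following PyTorch-level switches when set up by hand (the Hugging Face Trainer derives them from the config; shown here only to make the semantics concrete):

import torch

# Sketch: what tf32 and bf16 mean at the PyTorch level.
torch.backends.cuda.matmul.allow_tf32 = True  # tf32: faster matmuls on Ampere
torch.backends.cudnn.allow_tf32 = True
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # bf16
    pass  # forward/backward pass goes here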

Evaluation and Checkpointing

evaluation_strategy (string, default: "steps"): When to run evaluation. Options: no, steps, epoch.
eval_step (integer, default: 256): Number of steps between evaluations (when evaluation_strategy="steps").
save_total_limit (integer, default: 4): Maximum number of checkpoints to keep (older ones are deleted).

Gradient Accumulation

grad_accumulation_scheduler (boolean, default: false): Enable dynamic gradient accumulation scheduling.
dynamic_grad_accumulation (boolean, default: false): Allow gradient accumulation steps to change during training.
grad_accumulation_max (integer, default: 256): Maximum gradient accumulation steps (when dynamic).
grad_accumulation_delta_steps (integer, default: 100): Steps between gradient accumulation adjustments.
grad_accumulation_delta_percentage (float, default: 0.02): Percentage change in gradient accumulation per adjustment.
grad_accumulation_patience (integer, default: 4000): Steps to wait before adjusting gradient accumulation.
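The exact scheduling logic lives in ChemLactica's trainer; the following is only a hypothetical sketch of how these knobs could interact:

# Hypothetical sketch: after a patience window, grow the accumulation
# steps by a small percentage every delta_steps, capped at a maximum.
# ChemLactica's actual implementation may differ.
def next_grad_accum(step: int, current: int, patience: int = 4000,
                    delta_steps: int = 100, delta_pct: float = 0.02,
                    max_accum: int = 256) -> int:
    if step < patience or step % delta_steps != 0:
        return current
    return min(max_accum, max(current + 1, round(current * (1 + delta_pct))))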

Other Parameters

batch_size (integer, default: 500000): Global batch size (used for calculating the effective batch size).
dropout_prob (float, default: 0.1): Dropout probability for regularization.

Model-Specific Configurations

125M Configuration

# config/config_yamls/galactica_125m_pretrain_config.yaml
train_config:
  adam_beta1: 0.9
  adam_beta2: 0.95
  batch_size: 500000
  global_gradient_norm: 1.0
  max_learning_rate: 1.4e-3
  warmup_steps: 500
  weight_decay: 0.1
  bf16: true
  evaluation_strategy: "no"
  save_total_limit: 8
Recommended for:
  • Quick experimentation
  • Single GPU training
  • Limited compute resources

1.3B Configuration

train_config:
  adam_beta1: 0.9
  adam_beta2: 0.95
  global_gradient_norm: 1.0
  max_learning_rate: 1.4e-3
  warmup_steps: 500
  weight_decay: 0.1
  bf16: true
Recommended for:
  • Multi-GPU training (2-4 GPUs)
  • Production use cases
  • Good balance of performance and resources

Mistral 7B Configuration

train_config:
  max_learning_rate: 5.0e-4
  warmup_steps: 2000
  global_gradient_norm: 1.0
  weight_decay: 0.1
  adam_beta1: 0.9
  adam_beta2: 0.95
Recommended for:
  • Multi-GPU training (4-8 GPUs)
  • Large-scale pretraining
  • State-of-the-art performance

Llama 2 Configuration

train_config:
  max_learning_rate: 3.0e-5
  warmup_steps: 500
  global_gradient_norm: 0.1  # Lower gradient clipping
  weight_decay: 0.1
  adam_beta1: 0.9
  adam_beta2: 0.95
Recommended for:
  • Fine-tuning tasks
  • Leveraging pretrained Llama 2 weights
  • Chemistry-specific adaptations

SFT Configuration

# config/config_yamls/galactica_125m_sft_config.yaml
train_config:
  max_learning_rate: 1.0e-4  # Much lower than pretraining
  warmup_steps: 0             # No warmup typically
  evaluation_strategy: "steps"
  save_total_limit: 4

SFTTrainConfig

@dataclass
class SFTTrainConfig:
    packing: bool = False
    max_seq_length: int = 64
    neftune_noise_alpha: int = 10

packing (boolean, default: false): Pack multiple short samples into single sequences.
max_seq_length (integer, default: 64): Maximum sequence length for SFT samples.
neftune_noise_alpha (integer, default: 10): Noise level for NEFTune (embedding perturbation).
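These fields line up with the options TRL's SFTTrainer exposes; a minimal sketch (`model` and `dataset` are assumed, and newer trl releases expect these options on SFTConfig rather than the trainer constructor):

from trl import SFTTrainer

# Sketch: forwarding SFTTrainConfig fields to TRL's SFTTrainer.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    packing=False,             # SFTTrainConfig.packing
    max_seq_length=64,         # SFTTrainConfig.max_seq_length
    neftune_noise_alpha=10,    # SFTTrainConfig.neftune_noise_alpha
)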

CustomArguments

Extends Hugging Face TrainingArguments with ChemLactica-specific options:
@dataclass
class CustomArguments(TrainingArguments):
    slurm_eval: bool = field(
        default=False, 
        metadata={"help": "Whether to run eval via slurm job."}
    )
    command: str = field(default=None)
    experiment_name: str = field(default=None)
    tokenizer_path: str = field(
        default="chemlactica/tokenizer/ChemLacticaTokenizer66"
    )

Key Training Arguments

From train.py:217-264:
training_args = CustomArguments(
    command=command,
    slurm_eval=slurm_eval,
    experiment_name=experiment_name,
    tokenizer_path=model_config.tokenizer_path,
    do_train=not evaluate_only,
    output_dir=checkpoints_dir,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=valid_batch_size,
    log_on_each_node=True,
    bf16=train_config.bf16,
    bf16_full_eval=train_config.bf16_full_eval,
    fp16=train_config.fp16,
    tf32=train_config.tf32,
    logging_dir=track_dir,
    learning_rate=learning_rate or train_config.max_learning_rate,
    weight_decay=train_config.weight_decay,
    adam_beta1=train_config.adam_beta1,
    adam_beta2=train_config.adam_beta2,
    warmup_steps=warmup_steps or train_config.warmup_steps,
    max_grad_norm=train_config.global_gradient_norm,
    evaluation_strategy=train_config.evaluation_strategy,
    max_steps=scheduler_max_steps,
    num_train_epochs=num_train_epochs,
    eval_steps=eval_steps,
    save_steps=save_steps,
    dataloader_drop_last=True,
    dataloader_pin_memory=True,
    dispatch_batches=False,
    dataloader_num_workers=dataloader_num_workers,
    logging_steps=1,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    gradient_accumulation_steps=gradient_accumulation_steps,
    save_total_limit=train_config.save_total_limit,
    resume_from_checkpoint=resume_from_checkpoint,
    lr_scheduler_type=train_config.lr_scheduler_type,
    optim=train_config.optimizer,
)

Accelerate Configuration

For distributed training, ChemLactica uses Accelerate:
# config/accelerate_config.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: FSDP  # or MULTI_GPU for plain DDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4  # Number of GPUs
use_cpu: false
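This file is typically supplied at launch time via the standard Accelerate CLI, e.g. accelerate launch --config_file config/accelerate_config.yaml train.py ... (the exact entrypoint and arguments depend on the training script).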

FSDP Configuration

# From custom_trainer.py:77-81
fsdp_plugin.limit_all_gathers = self.args.fsdp_config.get(
    "limit_all_gathers", fsdp_plugin.limit_all_gathers
)
fsdp_plugin.activation_checkpointing = self.args.fsdp_config.get(
    "activation_checkpointing", fsdp_plugin.activation_checkpointing
)

Configuration Examples

Small-Scale Pretraining

train_config:
  max_learning_rate: 1.4e-3
  warmup_steps: 500
  global_gradient_norm: 1.0
  weight_decay: 0.1
  bf16: true
  evaluation_strategy: "steps"
  save_total_limit: 4

Large-Scale Pretraining

train_config:
  max_learning_rate: 5.0e-4
  warmup_steps: 2000
  global_gradient_norm: 1.0
  weight_decay: 0.1
  bf16: true
  evaluation_strategy: "steps"
  grad_accumulation_scheduler: true
  dynamic_grad_accumulation: true

Fine-Tuning

train_config:
  max_learning_rate: 1.0e-4
  warmup_steps: 0
  global_gradient_norm: 1.0
  weight_decay: 0.1
  bf16: true
  evaluation_strategy: "steps"
  save_total_limit: 4

Best Practices

Learning rate:
  Pretraining:
    • 125M: 1.4e-3
    • 1.3B: 1.4e-3
    • 6.7B: 1.2e-4
    • Mistral 7B: 5.0e-4
    • Llama 2: 3.0e-5
  Fine-tuning:
    • 10-100x lower than pretraining
    • Typical range: 1e-5 to 1e-4
Warmup:
  • Pretraining: 500-2000 steps
  • SFT: 0-100 steps
  • Rule of thumb: ~1-5% of total steps
Batch size:
  • Larger batches give more stable training
  • Use gradient accumulation if GPU memory is limited
  • Effective batch size = per_device_batch_size × num_gpus × grad_accum_steps (worked example below)
Gradient clipping:
  • Default of 1.0 works well for most cases
  • Lower (0.1) for Llama 2 fine-tuning
  • Monitor gradient norms during training
Precision:
  • BF16 recommended for Ampere GPUs (A100, A6000)
  • FP16 for older GPUs (V100)
  • TF32 provides a free speedup on Ampere
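A quick worked example of the effective batch size rule above (all values illustrative):

# Worked example: effective batch size = per_device × gpus × accumulation.
per_device_train_batch_size = 16
num_gpus = 4
gradient_accumulation_steps = 8
print(per_device_train_batch_size * num_gpus * gradient_accumulation_steps)  # 512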

Environment Variables

# From train.py:53
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "caching_allocator"
# Authentication for private models
export HF_TOKEN="your_huggingface_token"

Next Steps

Pretraining

Start pretraining with these configs

Fine-tuning

Apply configs to fine-tuning

Rejection Sampling

Configure rejection sampling
