TrainingConfig

Hyperparameters and bookkeeping for a single training or finetuning run. Parameters capture common training heuristics from GPT-style scaling (Kaplan et al., 2020), such as gradient accumulation, mixed precision, and logging cadence.

Required parameters
Name of the training run for identification and logging.
Name or path of the dataset to use for training.
Name or path of the tokenizer (e.g., “gpt2”, “meta-llama/Llama-2-7b-hf”).
Directory to save checkpoints, logs, and artifacts. Created automatically if it doesn’t exist.
Global batch size across all gradient accumulation steps. Must be positive.
Batch size per forward/backward pass. Must be positive and must not exceed batch_size.
Number of micro-batches to accumulate before updating weights. Must be positive.
Peak learning rate for the optimizer. Must be positive.
Maximum number of training steps (optimizer updates). Must be positive.
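The three batch-size parameters above are linked: the global batch_size is typically the micro-batch size multiplied by the number of accumulation steps. A minimal sketch of that arithmetic, assuming this common convention (the helper name is illustrative, not part of the library):

```python
def accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Derive gradient_accumulation_steps from the two batch sizes,
    assuming batch_size is an exact multiple of micro_batch_size."""
    if batch_size <= 0 or micro_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    if micro_batch_size > batch_size:
        raise ValueError("micro_batch_size cannot exceed batch_size")
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size must divide evenly into micro-batches")
    return batch_size // micro_batch_size

# e.g. a global batch of 64 built from micro-batches of 8 accumulates 8 steps
print(accumulation_steps(64, 8))  # -> 8
```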
Optional parameters
Number of warmup steps for learning rate schedule. Must be non-negative.
Weight decay coefficient for AdamW optimizer. Must be non-negative.
Maximum gradient norm for gradient clipping. Must be positive.
Evaluate on validation set every N steps. Must be non-negative.
Save checkpoint every N steps. Must be non-negative.
Log training metrics every N steps. Must be non-negative.
Random seed for reproducibility. Set to None for non-deterministic training.
Mixed precision training dtype. “bf16” recommended for modern GPUs.
Trade compute for memory by recomputing activations during backward pass.
Use torch.compile, which can give a significant speedup on modern GPUs (requires PyTorch 2.0+).
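To illustrate how warmup_steps, learning_rate, and max_steps typically interact: the sketch below assumes a linear warmup to the peak rate followed by a linear decay to zero, which is one common schedule; the exact schedule used here is not specified above.

```python
def lr_at_step(step: int, peak_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to peak_lr over warmup_steps, then linear decay to
    zero by max_steps. Assumed for illustration only."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(max_steps - warmup_steps, 1)
    progress = (step - warmup_steps) / remaining
    return peak_lr * max(1.0 - progress, 0.0)
```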
Example
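The original example appears to have been lost in extraction. A minimal construction sketch is below; it defines a stand-in dataclass so it runs standalone, and field names not confirmed by the validation rules (run_name, dataset, tokenizer, seed, gradient_checkpointing, compile) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingConfig:
    # Stand-in definition so this example is self-contained; fields marked
    # "assumed" are illustrative names, not confirmed by the reference.
    run_name: str                          # assumed name
    dataset: str                           # assumed name
    tokenizer: str                         # assumed name
    output_dir: str
    batch_size: int
    micro_batch_size: int
    gradient_accumulation_steps: int
    learning_rate: float
    max_steps: int
    warmup_steps: int = 0
    weight_decay: float = 0.0
    max_grad_norm: float = 1.0
    eval_every: int = 0
    save_every: int = 0
    log_every: int = 0
    seed: Optional[int] = 42               # assumed name
    mixed_precision: str = "bf16"
    gradient_checkpointing: bool = False   # assumed name
    compile: bool = False                  # assumed name

config = TrainingConfig(
    run_name="gpt2-finetune-demo",
    dataset="wikitext",
    tokenizer="gpt2",
    output_dir="runs/gpt2-finetune-demo",
    batch_size=64,
    micro_batch_size=8,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    max_steps=1000,
    warmup_steps=100,
)
print(config.mixed_precision)  # defaults to "bf16"
```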
Validation rules
- batch_size, micro_batch_size, gradient_accumulation_steps, and max_steps must be positive
- micro_batch_size cannot exceed batch_size
- warmup_steps, eval_every, save_every, and log_every must be non-negative
- learning_rate and max_grad_norm must be positive
- weight_decay must be non-negative
- mixed_precision must be one of "bf16", "fp16", or "fp32"
- output_dir is automatically created if it doesn't exist
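The rules above can be sketched as a standalone validation function. This is an illustration of the checks, not the library's actual implementation; it assumes a config object exposing attributes named as in the rules.

```python
import os

def validate(cfg) -> None:
    """Apply the validation rules listed above to a config-like object."""
    positive = {
        "batch_size": cfg.batch_size,
        "micro_batch_size": cfg.micro_batch_size,
        "gradient_accumulation_steps": cfg.gradient_accumulation_steps,
        "max_steps": cfg.max_steps,
        "learning_rate": cfg.learning_rate,
        "max_grad_norm": cfg.max_grad_norm,
    }
    for name, value in positive.items():
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")

    non_negative = {
        "warmup_steps": cfg.warmup_steps,
        "eval_every": cfg.eval_every,
        "save_every": cfg.save_every,
        "log_every": cfg.log_every,
        "weight_decay": cfg.weight_decay,
    }
    for name, value in non_negative.items():
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value}")

    if cfg.micro_batch_size > cfg.batch_size:
        raise ValueError("micro_batch_size cannot exceed batch_size")
    if cfg.mixed_precision not in ("bf16", "fp16", "fp32"):
        raise ValueError("mixed_precision must be 'bf16', 'fp16', or 'fp32'")

    # output_dir is created automatically if it doesn't exist
    os.makedirs(cfg.output_dir, exist_ok=True)
```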