## Overview
The pretraining stage teaches the model basic language understanding, grammar, facts, and reasoning patterns by training it to predict the next token in sequences from diverse text sources.

## Supported datasets

Modern LLM supports several high-quality text corpora for pretraining:

| Dataset | Size | Description |
|---|---|---|
| wikitext-2-raw-v1 | 2M tokens | Small, high-quality Wikipedia articles (Merity et al., 2016) |
| wikitext-103-raw-v1 | 103M tokens | Larger WikiText corpus with 28K+ Wikipedia articles |
| roneneldan/TinyStories | ~25M tokens | Simple stories for small models (Eldan & Li, 2023) |
| openwebtext | ~8B tokens | Reddit-curated web content, GPT-2 training set |
| wikipedia | ~4B tokens | Full Wikipedia dump (20231101.en) |
Datasets are automatically downloaded from Hugging Face on first use and cached locally.
## Usage

### Using the pipeline runner (recommended)

The easiest way to run pretraining is through the unified pipeline script, `scripts/run_pipeline.py`.

### Direct script usage

You can also use the standalone pretraining script, `scripts/pretrain.py`.

## Configuration
### Config presets

Pretraining hyperparameters are defined in the pipeline config presets. The `local` preset uses only WikiText-2 for faster training; the `gpu` preset uses multiple large datasets for better quality.

### Multi-dataset training

You can train on multiple datasets simultaneously. The datasets are concatenated so the model sees examples from all sources. A dataset entry can be suffixed with `:N`, where N is the maximum number of examples to take from that dataset.
### Hyperparameter tuning

Key hyperparameters that affect pretraining quality:

#### Learning rate (`pretrain_lr`)

- Default: `3e-4` works well for most model sizes
- Larger models (>1B params) may need lower LR (`1e-4` to `2e-4`)
- Smaller models can handle higher LR (`5e-4` to `1e-3`)

#### Batch size (`pretrain_batch_size`)

- Default: `64` for local, `128` for GPU
- Larger batches = more stable gradients but slower iteration
- Use gradient accumulation if GPU memory is limited

#### Warmup steps (`pretrain_warmup_steps`)

- Default: `500` steps
- Linear warmup from 0 to `pretrain_lr` over the first N steps
- Prevents early training instability

#### Weight decay (`0.1`)

- Standard L2 regularization
- Prevents overfitting on small datasets
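The warmup-then-constant schedule described above can be sketched as follows (a minimal illustration assuming the defaults listed here, not the project's actual scheduler code):

```python
def warmup_lr(step: int, base_lr: float = 3e-4, warmup_steps: int = 500) -> float:
    """Linear warmup from 0 to base_lr over warmup_steps, then constant,
    matching the 'linear warmup + constant LR' schedule described above."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


# Halfway through warmup the LR is half the base rate; after warmup it is flat:
print(warmup_lr(250))   # → 0.00015
print(warmup_lr(2000))  # → 0.0003
```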
## Training details

### Optimization

The pretraining implementation uses:

- Optimizer: AdamW with β₁=0.9, β₂=0.999
- Learning rate schedule: Linear warmup + constant LR
- Gradient accumulation: Automatic, based on batch_size / micro_batch_size
- Mixed precision: BF16 on supported GPUs, FP32 fallback
- Gradient clipping: Max norm 1.0
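The automatic gradient accumulation above can be illustrated with a small sketch (the function name and the micro-batch value of 16 are assumptions for the example, not values from the project):

```python
def accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Number of micro-batches whose gradients are accumulated per
    optimizer step, so the effective batch size equals `batch_size`
    (assumes the batch size divides evenly)."""
    assert batch_size % micro_batch_size == 0
    return batch_size // micro_batch_size


# With the GPU preset's batch size of 128 and a hypothetical micro-batch
# of 16, gradients are accumulated over 8 forward/backward passes:
print(accumulation_steps(128, 16))  # → 8
```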
### Loss function

The model is trained with the standard causal language modeling loss: cross-entropy on next-token prediction.

### Evaluation
During training, the model is evaluated on a held-out validation set every `eval_every` steps (default: 500). The evaluation metric is perplexity, the exponential of the mean cross-entropy loss.
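As a concrete sketch (pure Python, not the project's implementation), the causal LM loss is the mean negative log-probability assigned to each correct next token, and perplexity is its exponential:

```python
import math

def causal_lm_loss(next_token_probs: list[float]) -> float:
    """Mean negative log-likelihood of the correct next token at each
    position: the standard causal language modeling cross-entropy."""
    return -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)

def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)


# If the model assigns probability 0.25 to every correct next token,
# the loss is ln(4) and the perplexity is exactly 4:
loss = causal_lm_loss([0.25, 0.25, 0.25])
print(round(perplexity(loss), 6))  # → 4.0
```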
## Checkpoints

The pretraining stage saves checkpoints at regular intervals:

- Regular checkpoints: Every `save_every` steps (default: 2000)
  - Saved as `<run_name>-pretrain_step{N}.pt`
  - Includes model state, optimizer state, config, step counter
- Final checkpoint: At the end of training
  - Saved as `<run_name>-pretrain_final.pt`
  - Used as input for the SFT stage

Checkpoints are standard PyTorch files and can be loaded with `torch.load`.
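A minimal round-trip sketch of saving and loading such a checkpoint follows. The key names (`model_state`, `optimizer_state`, `config`, `step`) are assumptions inferred from the fields listed above; the project's actual checkpoint layout may differ.

```python
import os
import tempfile

import torch

# Hypothetical checkpoint layout, inferred from the fields listed above
# (model state, optimizer state, config, step counter).
checkpoint = {
    "model_state": {"w": torch.zeros(2, 2)},
    "optimizer_state": {},
    "config": {"d_model": 256, "n_layers": 4},
    "step": 2000,
}

path = os.path.join(tempfile.mkdtemp(), "demo-pretrain_step2000.pt")
torch.save(checkpoint, path)

# The checkpoint holds only tensors, dicts, and primitives, so it loads
# fine under torch.load's safe defaults.
loaded = torch.load(path)
print(loaded["step"])  # → 2000
```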
## Monitoring

Training progress is logged to:

- Console output: Real-time progress bar with loss/perplexity
- Log file: `experiments/runs/<run_name>/training.log`
- Checkpoints: Model states saved at regular intervals
## Implementation details

The pretraining implementation is located at:

- `src/modern_llm/training/train_lm.py:run_training()` - Main training loop
- `scripts/pretrain.py` - CLI wrapper
- `scripts/run_pipeline.py:run_pretrain()` - Pipeline integration
The `run_training()` function:
- Loads tokenizer and datasets
- Initializes model from scratch
- Sets up optimizer and LR scheduler
- Runs training loop with evaluation
- Saves final checkpoint
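The steps above can be sketched as a minimal, framework-free skeleton (every name here is illustrative; it is not the project's actual `run_training()` code):

```python
def run_training(max_steps: int, eval_every: int, save_every: int) -> list:
    """Illustrative skeleton of the pretraining loop: train step by step,
    evaluate periodically, checkpoint at intervals, save a final checkpoint."""
    events = []
    for step in range(1, max_steps + 1):
        # forward pass, loss, backward pass, optimizer step would happen here
        if step % eval_every == 0:
            events.append(("eval", step))     # held-out perplexity
        if step % save_every == 0:
            events.append(("save", step))     # <run_name>-pretrain_step{N}.pt
    events.append(("save_final", max_steps))  # <run_name>-pretrain_final.pt
    return events


events = run_training(max_steps=10, eval_every=5, save_every=10)
print(events)  # → [('eval', 5), ('eval', 10), ('save', 10), ('save_final', 10)]
```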
Dataset loading:

- Fetches each dataset from Hugging Face
- Tokenizes with padding/truncation
- Concatenates into single dataset
- Returns PyTorch Dataset object
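These dataset-loading steps can be illustrated with a toy whitespace tokenizer (a stand-in for the real tokenizer; `tokenize` and `build_dataset` are hypothetical names used only for this sketch):

```python
def tokenize(text: str, max_seq_len: int, pad_id: int = 0) -> list[int]:
    """Toy whitespace 'tokenizer': hash words to ids, then truncate or
    right-pad to max_seq_len. Stand-in for the real tokenizer."""
    ids = [hash(w) % 50000 for w in text.split()][:max_seq_len]
    return ids + [pad_id] * (max_seq_len - len(ids))

def build_dataset(sources: list[list[str]], max_seq_len: int) -> list[list[int]]:
    """Tokenize every example from every source and concatenate the
    results into one flat dataset, as described above."""
    return [tokenize(t, max_seq_len) for source in sources for t in source]


# Two toy 'datasets' concatenated into one, each example padded/truncated:
data = build_dataset([["a b c", "d e"], ["f"]], max_seq_len=4)
print(len(data), len(data[0]))  # → 3 4
```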
See `src/modern_llm/training/train_lm.py:100-236` for the full implementation.
## Performance tips

### Reduce memory usage

- Lower `micro_batch_size` (increases gradient accumulation)
- Enable gradient checkpointing (trades compute for memory)
- Use a smaller model (reduce `d_model`, `n_layers`)
- Use shorter sequences (`max_seq_len`)

### Speed up training

- Increase `micro_batch_size` if the GPU has headroom
- Use multiple GPUs with DDP (not yet supported)
- Enable Flash Attention (requires `use_attention_sinks=False`)
- Use bf16 mixed precision on Ampere+ GPUs

### Improve quality

- Train longer (increase `pretrain_max_steps`)
- Use larger/more diverse datasets
- Increase model size (if compute allows)
- Lower the learning rate for stability

### Debug training issues

- Check for NaN losses (reduce LR or enable gradient clipping)
- Verify dataset loading (check the first few examples)
- Monitor the perplexity trend (it should decrease over time)
- Inspect generated samples (use `generate_text()` after training)
## Next steps

After pretraining completes:

- Verify the checkpoint exists at `experiments/runs/<run_name>/pretrain_final.pt`
- Run SFT using this checkpoint
- Or continue with the full pipeline

### Supervised fine-tuning

Learn how to instruction-tune your pretrained model.