## Prerequisites
Before you begin, ensure you have:

- Python 3.9 or higher
- Git installed
- (Optional) a CUDA-capable GPU for faster training
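You can confirm the first two prerequisites from a terminal; the `nvidia-smi` check only succeeds when an NVIDIA driver is installed:

```shell
python --version   # should report 3.9 or newer
git --version
nvidia-smi         # optional: lists any CUDA-capable GPUs the driver can see
```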
## 5-minute smoke test
Run a quick smoke test to verify your installation and see the training pipeline in action.

## Using pre-trained checkpoints
If you want to skip training, you can use our pre-trained 253M-parameter model.

### Verify checkpoints
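One way to sanity-check downloaded files before running anything; the file names and the presence of a checksum manifest are assumptions on my part, not something this guide specifies:

```shell
ls -lh checkpoints/          # the pretrain/SFT/DPO checkpoints should all be present
sha256sum checkpoints/*.pt   # compare against the published checksums, if any
```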
### Run evaluation
Compare the model against GPT-2 on WikiText-2:

| Model | Parameters | Perplexity |
|---|---|---|
| GPT-2 | 124M | 40.64 |
| Modern LLM (pretrain) | 253M | 27.03 |
| Modern LLM (SFT) | 253M | 34.14 |
| Modern LLM (DPO) | 253M | 34.32 |
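A hypothetical invocation that would produce one row of this table; the script name and flags are placeholders, not the repo's actual CLI:

```shell
python scripts/evaluate.py \
  --checkpoint checkpoints/pretrain.pt \
  --dataset wikitext-2 \
  --metric perplexity
```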
### Generate text
Try text generation with different checkpoints.

### Run math benchmark
Evaluate the model on GSM8K grade-school math problems with verifier reranking.

## Training from scratch
If you want to train your own model, choose one of:

- Full pipeline
- Individual stages
- Custom config
Run all training stages sequentially. This takes approximately 24 hours on an RTX 3060 and produces checkpoints for all stages.
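As a sketch, the four stages might be run back to back like this; the script names are placeholders, so substitute the repo's actual entry points:

```shell
python scripts/pretrain.py       --config configs/local.yaml
python scripts/sft.py            --config configs/local.yaml
python scripts/dpo.py            --config configs/local.yaml
python scripts/train_verifier.py --config configs/local.yaml
```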
## Config presets
Choose a preset based on your hardware:

| Preset | Hardware | Duration | Model size | Training tokens |
|---|---|---|---|---|
| `local-smoke` | Any (CPU/GPU) | ~5 min | 25M params | 200K tokens |
| `local` | RTX 3060 12GB | ~24 hours | 253M params | 600M tokens |
| `gpu-smoke` | A100/H100 | ~2 min | 25M params | 200K tokens |
| `gpu` | A100 40GB | ~8 hours | 768M params | 2B tokens |
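Assuming config files live under a `configs/` directory named after the presets (an assumption, not something the table states), selecting one might look like:

```shell
python scripts/pretrain.py --config configs/local-smoke.yaml
```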
## Common issues
### `ImportError: No module named 'modern_llm'`
Make sure you've installed the package and activated your virtual environment. You can also install it in editable mode.
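For example, from the repository root (the virtual-environment path is an assumption; use wherever you created yours):

```shell
source .venv/bin/activate   # activate your virtual environment first
pip install -e .            # editable install of the package in the current directory
```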
### CUDA out of memory error
Reduce the batch size in your configuration. Alternatively, use the smoke test config, which has minimal memory requirements.
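For example, as a config fragment (the key names here are illustrative; match them to whatever your config file actually uses):

```yaml
training:
  batch_size: 8          # halve until the OOM goes away (e.g. 16 -> 8)
  grad_accum_steps: 2    # double to keep the effective batch size unchanged
```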
### Checkpoint files not found
Checkpoints are saved in the `checkpoints/` directory. If you haven't trained yet, you'll need to either:

- Train your own model using the pipeline
- Download pre-trained checkpoints (if available)
### Slow training on CPU
CPU training is significantly slower than GPU training. For the smoke test:
- CPU: ~5-10 minutes
- GPU (RTX 3060): ~2-3 minutes
## Next steps
- **Architecture**: learn about RoPE, RMSNorm, SwiGLU, and attention sinks
- **Training pipeline**: deep dive into the pretrain → SFT → DPO → verifier workflow
- **Configuration**: customize model architecture and training hyperparameters
- **API reference**: explore the complete API documentation