Get Modern LLM running in minutes with our smoke test, or dive straight into using pre-trained checkpoints for evaluation and text generation.

Prerequisites

Before you begin, ensure you have:
  • Python 3.9 or higher
  • Git installed
  • (Optional) CUDA-capable GPU for faster training

5-minute smoke test

Run a quick smoke test to verify your installation and see the training pipeline in action:
1. Clone the repository

git clone https://github.com/AymanMahfuz27/modern_llm.git
cd modern_llm
2. Set up a virtual environment

python -m venv .venv
source .venv/bin/activate
3. Install dependencies

pip install -r requirements.txt
4. Verify installation

python -c "from modern_llm.models import ModernDecoderLM; print('✓ Installation verified')"
You should see: ✓ Installation verified
5. Run the smoke test

python scripts/run_pipeline.py --config local-smoke --stage all
This runs a minimal version of the full pipeline (pretrain → SFT → DPO → verifier) in ~5 minutes.
The smoke test uses tiny model and data sizes so it can verify that every stage works end to end. For real training, use the local or gpu config presets.
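The staged layout (pretrain → SFT → DPO → verifier) can be pictured as a simple orchestration loop, where each stage starts from the previous stage's checkpoint. This is an illustrative sketch, not the actual run_pipeline.py internals; the runner function and checkpoint naming below are placeholders:

```python
# Illustrative sketch of a staged training pipeline (not the real run_pipeline.py).
from typing import Dict, List, Optional

def run_stage(name: str, prev_checkpoint: Optional[str]) -> str:
    """Placeholder stage runner: pretend to train and return a checkpoint path."""
    print(f"running {name} (starting from {prev_checkpoint})")
    return f"checkpoints/{name}_best.pt"

def run_all(stages: List[str]) -> Dict[str, str]:
    """Run stages in order, feeding each one the previous stage's checkpoint."""
    checkpoints: Dict[str, str] = {}
    prev: Optional[str] = None
    for stage in stages:
        prev = run_stage(stage, prev)
        checkpoints[stage] = prev
    return checkpoints

paths = run_all(["pretrain", "sft", "dpo", "verifier"])
print(paths["verifier"])  # checkpoints/verifier_best.pt
```

Because each stage consumes its predecessor's checkpoint, running `--stage all` is equivalent to running the stages one at a time in order.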

Using pre-trained checkpoints

If you want to skip training and use our pre-trained 253M parameter model:

Verify checkpoints

python scripts/verify_checkpoints.py
This checks that all checkpoint files load correctly and displays model configurations.
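Conceptually, checkpoint verification amounts to loading each file and confirming the expected entries are present before anything tries to restore model weights. A minimal sketch of that kind of check (the key names here are assumptions for illustration, not the script's actual checkpoint schema):

```python
# Minimal sketch of checkpoint sanity-checking (key names are illustrative).
REQUIRED_KEYS = {"model_state", "config", "tokenizer"}

def check_checkpoint(checkpoint: dict) -> list:
    """Return a sorted list of missing entries; empty means the checkpoint looks complete."""
    return sorted(REQUIRED_KEYS - checkpoint.keys())

ckpt = {"model_state": {}, "config": {"n_layers": 12}, "tokenizer": object()}
missing = check_checkpoint(ckpt)
print("OK" if not missing else f"missing: {missing}")  # OK
```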

Run evaluation

Compare the model against GPT-2 on WikiText-2:
python scripts/evaluate_and_compare.py
Expected results:
Model                  Parameters  Perplexity
GPT-2                  124M        40.64
Modern LLM (pretrain)  253M        27.03
Modern LLM (SFT)       253M        34.14
Modern LLM (DPO)       253M        34.32
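Perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better. A quick illustration of the relationship:

```python
import math

def perplexity(token_nlls: list) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns every token probability 1/27 has perplexity 27,
# roughly the pretrain checkpoint's WikiText-2 score above.
nlls = [math.log(27.0)] * 100
print(round(perplexity(nlls), 2))  # 27.0
```

Intuitively, a perplexity of 27 means the model is, on average, as uncertain as if it were choosing uniformly among 27 tokens at each step.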

Generate text

Try text generation with different checkpoints:
from pathlib import Path
from modern_llm.training import generate_text
from modern_llm.utils.checkpointing import load_checkpoint

# Load pretrained checkpoint
checkpoint_path = Path("checkpoints/pretrain_best.pt")
model, tokenizer = load_checkpoint(checkpoint_path, device="cuda")

# Generate text
prompt = "The future of artificial intelligence is"
generated = generate_text(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_new_tokens=50,
    temperature=0.8,
    top_k=40
)
print(generated)
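The temperature and top_k parameters control how sampling works: logits are divided by the temperature, then sampling is restricted to the k highest-scoring tokens. Here is a self-contained sketch of that scheme, independent of the project's actual generate_text implementation:

```python
import math
import random

def sample_top_k(logits: dict, temperature: float = 0.8, k: int = 40, rng=None) -> str:
    """Sample one token: keep the k highest logits, apply temperature-scaled softmax."""
    rng = rng or random.Random(0)
    # Keep only the k highest-scoring tokens.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Temperature-scaled softmax over the survivors (subtract max for stability).
    scaled = [value / temperature for _, value in top]
    m = max(scaled)
    weights = [math.exp(v - m) for v in scaled]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights)[0]

logits = {"intelligence": 3.1, "is": 1.2, "banana": -4.0}
print(sample_top_k(logits, temperature=0.8, k=2))  # never "banana": top-k cuts it
```

Lower temperatures sharpen the distribution toward the highest-logit token; smaller k values exclude more of the tail outright.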

Run math benchmark

Evaluate the model on GSM8K grade-school math problems with verifier reranking:
python scripts/benchmark_gsm8k.py
This generates multiple candidate solutions and uses the verifier to select the best one.
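This is best-of-n reranking: sample several candidate solutions, score each with the verifier, and keep the highest-scoring one. A minimal sketch with a stand-in scorer (the real script's generation and verifier interfaces will differ):

```python
from typing import Callable, List

def rerank(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate the scorer rates highest (best-of-n selection)."""
    return max(candidates, key=score)

# Stand-in verifier: prefer solutions that end with a numeric final answer.
def toy_score(solution: str) -> float:
    return float(solution.split()[-1].isdigit())

candidates = [
    "18 - 3 = 15, so the answer is fifteen",
    "18 - 3 = 15, so the answer is 15",
]
print(rerank(candidates, toy_score))  # picks the candidate ending in "15"
```

The benefit over greedy decoding is that the verifier only has to recognize a good solution among n samples, which is usually easier than generating one on the first try.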

Training from scratch

If you want to train your own model:
Run all training stages sequentially:
python scripts/run_pipeline.py --config local --stage all
This takes approximately 24 hours on an RTX 3060 and produces checkpoints for all stages.
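As a rough sanity check on pacing, the local preset's 600M training tokens over ~24 hours works out to a sustained throughput of about 7K tokens per second:

```python
# Back-of-the-envelope throughput for the local preset.
tokens = 600_000_000
seconds = 24 * 3600
print(f"{tokens / seconds:,.0f} tokens/sec")  # ≈ 6,944 tokens/sec
```

If your run is pacing far below that, check that training is actually using the GPU.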

Config presets

Choose a preset based on your hardware:
Preset       Hardware       Duration   Model size   Training tokens
local-smoke  Any (CPU/GPU)  ~5 min     25M params   200K tokens
local        RTX 3060 12GB  ~24 hours  253M params  600M tokens
gpu-smoke    A100/H100      ~2 min     25M params   200K tokens
gpu          A100 40GB      ~8 hours   768M params  2B tokens
The gpu preset requires significant compute resources. Start with local or local-smoke for experimentation.

Common issues

Import errors

Make sure you’ve installed the package and activated your virtual environment:
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
You can also install in editable mode:
pip install -e .
Out-of-memory errors

Reduce the batch size in your configuration:
config = PipelineConfig(
    hardware_batch_size=4,           # Reduce from default 8
    hardware_gradient_accumulation_steps=4  # Increase to maintain effective batch size
)
Or use the smoke test config which has minimal memory requirements.
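The trade-off above keeps the effective batch size constant: effective batch = per-device batch × gradient accumulation steps, so halving one and doubling the other leaves gradient updates seeing the same number of examples:

```python
def effective_batch(per_device_batch: int, grad_accum_steps: int) -> int:
    """Examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps

# Halving the per-device batch and doubling accumulation preserves the
# effective batch size, while peak memory tracks the per-device batch.
assert effective_batch(8, 2) == effective_batch(4, 4) == 16
print(effective_batch(4, 4))  # 16
```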
Missing checkpoints

Checkpoints are saved in the checkpoints/ directory. If you haven’t trained yet, you’ll need to either:
  1. Train your own model using the pipeline
  2. Download pre-trained checkpoints (if available)
Check the checkpoint directory:
ls -lh checkpoints/
Slow training on CPU

CPU training is significantly slower than GPU training. For the smoke test:
  • CPU: ~5-10 minutes
  • GPU (RTX 3060): ~2-3 minutes
Consider using Google Colab or other cloud GPU providers for faster training.

Next steps

Architecture

Learn about RoPE, RMSNorm, SwiGLU, and attention sinks

Training pipeline

Deep dive into pretrain → SFT → DPO → verifier workflow

Configuration

Customize model architecture and training hyperparameters

API reference

Explore the complete API documentation
